SCache is a shuffle cache/daemon integrated with the local Hadoop + Spark trees in this workspace.
The previous (upstream-style) README is archived as README.old.md.
SCache now depends on these repositories being present at the following paths:
- `${HOME}/hadoop/`: Hadoop
- `${HOME}/spark-3.5/`: Spark
- `${HOME}/spark-apps/`: Spark apps
- `${HOME}/spark-apps/HiBench-7.1.1/`: HiBench suite
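Before building, the paths above can be sanity-checked. This is an illustrative sketch (the loop and output format are not part of SCache):

```shell
# Check that the sibling repositories SCache expects are present.
missing=0
for d in "$HOME/hadoop" "$HOME/spark-3.5" "$HOME/spark-apps" "$HOME/spark-apps/HiBench-7.1.1"; do
  if [ -d "$d" ]; then
    echo "OK      $d"
  else
    echo "MISSING $d"
    missing=$((missing + 1))
  fi
done
echo "$missing path(s) missing"
```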
Build SCache:

```shell
cd ${HOME}/SCache
sbt publishM2   # publishes org.scache to ~/.m2 (needed by spark and hadoop)
sbt assembly    # fat jar for deployment
```

Artifacts:

- `target/scala-2.13/scache_2.13-0.1.0-SNAPSHOT.jar`
- `target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar`
Build Hadoop:

```shell
cd ${HOME}/hadoop
mvn -DskipTests -Pdist -Dtar package

# For a faster build
mvn package -T 1C -Pdist -DskipTests -Dtar -Dmaven.javadoc.skip=true -Denforcer.skip=true
```

Build Spark:

```shell
cd $HOME/spark-3.5

# Verified
./build/sbt -Phadoop-3 -Pscala-2.13 package

# Not verified yet
./dev/make-distribution.sh -DskipTests
```

- Configure cluster hosts in `conf/slaves` and settings in `conf/scache.conf`.
- Distribute SCache to the cluster and start it:

  ```shell
  cd $HOME/SCache
  sbin/copy-dir.sh
  sbin/start-scache.sh
  ```

  Stop:

  ```shell
  cd $HOME/SCache
  sbin/stop-scache.sh
  ```

- Put `target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar` on the YARN classpath (for example, copy it to `$HADOOP_HOME/share/hadoop/yarn/lib/` on every node).
- Set the following in `$HADOOP_HOME/etc/hadoop/mapred-site.xml`:

  ```properties
  mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.MapTask$ScacheOutputBuffer
  mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.ScacheShuffle
  mapreduce.scache.home=$HOME/SCache
  ```
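The same three settings can be expressed as `mapred-site.xml` `<property>` entries. This sketch writes a standalone fragment to `./mapred-site-scache.xml` (a file name chosen here for illustration) whose entries you would merge into the real `mapred-site.xml`:

```shell
# Write the SCache properties as a mapred-site.xml-style fragment.
# $HOME expands at generation time; the \$ keeps the inner-class name literal.
cat > mapred-site-scache.xml <<EOF
<configuration>
  <property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.MapTask\$ScacheOutputBuffer</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapreduce.task.reduce.ScacheShuffle</value>
  </property>
  <property>
    <name>mapreduce.scache.home</name>
    <value>$HOME/SCache</value>
  </property>
</configuration>
EOF
```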
- Make the SCache jar visible to drivers/executors (either copy it to `$SPARK_HOME/jars/` or set `spark.scache.jars`).
- Set (for example in `$SPARK_HOME/conf/spark-defaults.conf`):

  ```properties
  spark.scache.enable true
  spark.scache.home $HOME/SCache
  spark.scache.jars $HOME/SCache/target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar
  spark.shuffle.useOldFetchProtocol true
  ```
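One way to apply these settings is with a heredoc. This is only a sketch: it appends to `./conf-demo/spark-defaults.conf` unless `SPARK_CONF_DIR` is pointed at your real `$SPARK_HOME/conf`:

```shell
# Append the SCache settings; SPARK_CONF_DIR defaults to a demo directory.
SPARK_CONF_DIR="${SPARK_CONF_DIR:-./conf-demo}"
mkdir -p "$SPARK_CONF_DIR"
cat >> "$SPARK_CONF_DIR/spark-defaults.conf" <<EOF
spark.scache.enable true
spark.scache.home $HOME/SCache
spark.scache.jars $HOME/SCache/target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar
spark.shuffle.useOldFetchProtocol true
EOF
```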
SCache can exchange shuffle block bytes between Spark's in-process daemon and the node-local
`ScacheClient` via a single shared mmap pool file (blocks are addressed by offset/length). This is
configured by setting `scache.daemon.ipc.backend=pool` in `conf/scache.conf` together with a pool
path, such as a DAX-mounted file.
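If the pool file does not already exist (an assumption here; the daemon may create it itself on startup), it can be pre-sized as a sparse file, which consumes memory only as pages are actually written:

```shell
# Pre-create a 1 GiB sparse pool file on tmpfs.
POOL="${POOL:-/dev/shm/scache-ipc.pool}"
truncate -s 1G "$POOL"
ls -lh "$POOL"
```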
This quick check verifies that two independent mmaps of the same pool file observe each other's
writes correctly (i.e., that the basic shared-memory mechanism works):

```shell
cd $HOME/SCache
sbt "runMain org.scache.deploy.PoolIpcSelfTest --path /dev/shm/scache-ipc.pool --size 1g --chunk 256m"
```

- Standalone Spark scripts and HiBench are in `$HOME/spark-apps/` (see `$HOME/spark-apps/README.md`).