# SCache

SCache is a shuffle cache/daemon integrated with the local Hadoop + Spark trees in this workspace.

The previous (upstream-style) README is archived as `README.old.md`.

## Dependencies (local git repos)

SCache now depends on these repositories being present at the following paths:

## Build

### Build SCache

```shell
cd ${HOME}/SCache
sbt publishM2   # publishes org.scache to ~/.m2 (needed by Spark and Hadoop)
sbt assembly    # builds the fat jar for deployment
```

Artifacts:

- `target/scala-2.13/scache_2.13-0.1.0-SNAPSHOT.jar`
- `target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar`
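Downstream Maven builds (the local Hadoop and Spark trees) can pick up the `publishM2` output from `~/.m2` with a dependency entry along these lines. The coordinates are inferred from the jar name and the `org.scache` group mentioned above, so double-check them against what `sbt publishM2` actually prints:

```xml
<!-- Coordinates inferred from target/scala-2.13/scache_2.13-0.1.0-SNAPSHOT.jar -->
<dependency>
  <groupId>org.scache</groupId>
  <artifactId>scache_2.13</artifactId>
  <version>0.1.0-SNAPSHOT</version>
</dependency>
```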

### Build Hadoop

```shell
cd ${HOME}/hadoop
mvn -DskipTests -Pdist -Dtar package
# For a faster build:
mvn package -T 1C -Pdist -DskipTests -Dtar -Dmaven.javadoc.skip=true -Denforcer.skip=true
```

### Build Spark 3.5

```shell
cd $HOME/spark-3.5
# Verified:
./build/sbt -Phadoop-3 -Pscala-2.13 package
# Not verified yet:
./dev/make-distribution.sh -DskipTests
```

## Deploy / Run

### Start SCache

1. Configure cluster hosts in `conf/slaves` and settings in `conf/scache.conf`.
2. Distribute SCache to the cluster and start it:

```shell
cd $HOME/SCache
sbin/copy-dir.sh
sbin/start-scache.sh
```

Stop:

```shell
cd $HOME/SCache
sbin/stop-scache.sh
```

### Enable in Hadoop MapReduce

- Put `target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar` on the YARN classpath (for example, copy it to `$HADOOP_HOME/share/hadoop/yarn/lib/` on every node).
- Set the following in `$HADOOP_HOME/etc/hadoop/mapred-site.xml`:

```properties
mapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.MapTask$ScacheOutputBuffer
mapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.ScacheShuffle
mapreduce.scache.home=$HOME/SCache
```
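In `mapred-site.xml` these settings take the usual Hadoop `<property>` form. Note that Hadoop does not expand shell variables like `$HOME` in XML values, so `mapreduce.scache.home` should be an absolute path (the path below is a placeholder):

```xml
<!-- $HADOOP_HOME/etc/hadoop/mapred-site.xml -->
<configuration>
  <property>
    <name>mapreduce.job.map.output.collector.class</name>
    <value>org.apache.hadoop.mapred.MapTask$ScacheOutputBuffer</value>
  </property>
  <property>
    <name>mapreduce.job.reduce.shuffle.consumer.plugin.class</name>
    <value>org.apache.hadoop.mapreduce.task.reduce.ScacheShuffle</value>
  </property>
  <property>
    <name>mapreduce.scache.home</name>
    <!-- placeholder: use the absolute path of your SCache checkout -->
    <value>/home/youruser/SCache</value>
  </property>
</configuration>
```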

### Enable in Spark

- Make the SCache jar visible to drivers and executors (either copy it to `$SPARK_HOME/jars/` or set `spark.scache.jars`).
- Set (for example in `$SPARK_HOME/conf/spark-defaults.conf`):

```properties
spark.scache.enable true
spark.scache.home $HOME/SCache
spark.scache.jars $HOME/SCache/target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar
spark.shuffle.useOldFetchProtocol true
```
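For one-off runs, the same settings can be passed per job via `spark-submit --conf` instead of editing `spark-defaults.conf`; a sketch, where `<main-class>` and `<application-jar>` are placeholders for your application:

```shell
spark-submit \
  --conf spark.scache.enable=true \
  --conf spark.scache.home=$HOME/SCache \
  --conf spark.scache.jars=$HOME/SCache/target/scala-2.13/SCache-assembly-0.1.0-SNAPSHOT.jar \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --class <main-class> <application-jar>
```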

## IPC Pool Backend (mmap)

SCache can exchange shuffle-block bytes between Spark's in-process daemon and the node-local ScacheClient via a single shared mmap pool file, with blocks addressed by (offset, length). This is configured by setting `scache.daemon.ipc.backend=pool` in `conf/scache.conf` together with a pool path such as a DAX-mounted file.
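A minimal sketch of the relevant `conf/scache.conf` entries. The backend key is the one named above; the pool-path key name is an assumption for illustration, so check `conf/scache.conf` for the exact key used by your build:

```properties
# conf/scache.conf — mmap pool IPC backend
scache.daemon.ipc.backend=pool
# Hypothetical key name; a /dev/shm or DAX-mounted file serves as the pool
scache.daemon.ipc.pool.path=/dev/shm/scache-ipc.pool
```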

### Self-test (no RPC)

This quick check verifies that two independent mmaps of the same pool file observe each other's writes correctly (i.e., that the basic shared-memory mechanism works):

```shell
cd $HOME/SCache
sbt "runMain org.scache.deploy.PoolIpcSelfTest --path /dev/shm/scache-ipc.pool --size 1g --chunk 256m"
```
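The mechanism the self-test exercises can be illustrated in a few lines of Python: map the same file twice, write a block at an (offset, length) through one mapping, and read it back through the other. This is only an analogy for what `PoolIpcSelfTest` checks, not its implementation:

```python
import mmap
import os
import tempfile

# Stand-in for the shared pool file (the real one might live on /dev/shm or a DAX mount).
fd_a, path = tempfile.mkstemp(suffix=".pool")
os.ftruncate(fd_a, 4096)
fd_b = os.open(path, os.O_RDWR)

# Two independent mappings of the same file, as the Spark-side daemon and the
# node-local ScacheClient would each create.
writer = mmap.mmap(fd_a, 4096)
reader = mmap.mmap(fd_b, 4096)

# Publish a "block" at a given (offset, length) through one mapping...
offset, payload = 128, b"shuffle-block-bytes"
writer[offset:offset + len(payload)] = payload

# ...and read it back through the other, independent mapping.
got = bytes(reader[offset:offset + len(payload)])
print(got)  # b'shuffle-block-bytes'

writer.close()
reader.close()
os.close(fd_a)
os.close(fd_b)
os.unlink(path)
```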

## Workloads / benchmarks

- Standalone Spark scripts and HiBench live in `$HOME/spark-apps/` (see `$HOME/spark-apps/README.md`).

## About

A distributed memory cache system for shuffle in map-reduce.
