This is a demonstration of supervised and unsupervised machine learning techniques in Spark
This is Workshop 11, Spark Machine Learning, one of a workshop series given as part of the Big Data Engineering for Analytics module. The module fulfills a requirement for the Engineering Big Data certificate issued by NUS-ISS
I have translated the original Python code to Scala
git clone https://github.com/frenoid/tour-of-spark.git
- src/main/scala/com/normanlimxk/SparkML contains the ML code
- src/main/resources contains data grouped by ML algorithm
- build.sbt lists the project dependencies, similar to pom.xml in Maven
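As a rough illustration of what such a dependency list can look like, here is a minimal build.sbt sketch. It is not copied from the repository: the project name and the exact choice of Spark modules are assumptions, while the Scala and Spark versions are the ones stated in this README. The "provided" scope is assumed because the IntelliJ instructions mention adding "provided" dependencies to the classpath.

```scala
// build.sbt (illustrative sketch, not the repository's actual file)
name := "tour-of-spark"        // assumption: named after the repository
scalaVersion := "2.12.10"      // Scala version stated in this README

libraryDependencies ++= Seq(
  // "provided" means the Spark runtime supplies these jars when the job runs,
  // so they are not bundled into the project jar
  "org.apache.spark" %% "spark-sql"   % "3.0.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.0.3" % "provided"
)
```

The `%%` operator appends the Scala binary version (here `_2.12`) to the artifact name, which is why the Scala and Spark versions need to be compatible.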
You have 2 options to run the Spark job
- Compile and run on a Spark cluster
- Use IntelliJ IDEA (Recommended)
Do this if you have a Spark cluster that you can submit jobs to with spark-submit
Take note of these versions. See also build.sbt
scala = 2.12.10
spark = 3.0.3
sbt = 1.6.1
Use sbt to compile the code and package it into a jar
sbt package
The jar file will be in target/scala-2.12
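Use spark-submit to submit the Spark job. Because the project contains several runnable classes, pass the one you want with --class
spark-submit --class {your-main-class} {your-jar-file}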
Install IntelliJ IDEA and use it to open the build.sbt file as a project
IntelliJ will resolve the dependencies listed in build.sbt
Go to Run > Edit Configurations > Modify options and enable Add dependencies with "provided" scope to classpath
Go to Run > Run and choose the class you want to run
The data was provided by Dr LIU FAN from NUS-ISS
Each class under src/main/scala/com/normanlimxk/SparkML contains an example of one machine learning method
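To give a feel for the shape such a class might take, here is a minimal sketch of an unsupervised example using k-means clustering. It is not taken from the repository: the object name `KMeansSketch`, the helper `clusterAssignments`, and the inline toy data are all hypothetical, and the real classes read their data from src/main/resources instead.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Hypothetical example object; not one of the classes in this repository
object KMeansSketch {

  // Clusters four toy points into two groups and returns (x, clusterId) pairs sorted by x
  def clusterAssignments(spark: SparkSession): Array[(Double, Int)] = {
    import spark.implicits._

    // Made-up data: two points near the origin and two near (9, 9)
    val points = Seq((0.0, 0.0), (0.1, 0.2), (9.0, 9.1), (9.2, 8.9)).toDF("x", "y")

    // Spark ML estimators expect the features packed into a single vector column
    val features = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
      .transform(points)

    val model = new KMeans()
      .setK(2)
      .setSeed(1L)
      .setFeaturesCol("features")
      .fit(features)

    // transform() adds an integer "prediction" column holding the cluster id
    model.transform(features)
      .select("x", "prediction")
      .collect()
      .map(row => (row.getDouble(0), row.getInt(1)))
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // local[*] suits a run inside the IDE; drop .master(...) when using spark-submit
    val spark = SparkSession.builder()
      .appName("KMeansSketch")
      .master("local[*]")
      .getOrCreate()

    clusterAssignments(spark).foreach { case (x, cluster) =>
      println(s"x=$x cluster=$cluster")
    }
    spark.stop()
  }
}
```

A supervised example would follow the same pattern: assemble a feature vector, fit an estimator (for instance a classifier) on labelled data, then call transform to obtain predictions.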
The project uses the spark-sbt.g8 template from MrPowers