A demonstration of the Spark Machine Learning Library (MLlib) on popular datasets

frenoid/spark-machine-learning

spark-machine-learning

This is a demonstration of supervised and unsupervised machine learning techniques in Spark

This is Workshop 11, Spark Machine Learning, one of a series of workshops given as part of the Big Data Engineering for Analytics module, which fulfills a requirement for the Engineering Big Data certificate issued by NUS-ISS.

I have translated the original Python code to Scala

Getting started

Clone the repo

git clone https://github.com/frenoid/spark-machine-learning.git

Structure

  1. src/main/scala/com/normanlimxk/SparkML contains the ML code
  2. src/main/resources contains data grouped by ML algorithm
  3. build.sbt lists the dependencies, similar to pom.xml in Maven

Running the Spark job

There are two ways to run the Spark job

  1. Compile and run on a Spark cluster
  2. Use IntelliJ (Recommended)

(Option 1) Compile and run on a Spark cluster

Do this if you have a Spark cluster to spark-submit to.
Take note of these versions (see also build.sbt):

scala = 2.12.10
spark = 3.0.3
sbt = 1.6.1
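
For orientation, the versions above might appear in a build.sbt roughly as follows. This is only a sketch, not the repository's actual file: the project name and the choice of the spark-sql and spark-mllib artifacts are assumptions, and the "provided" scope is inferred from the IntelliJ classpath setting described under Option 2.

```scala
// build.sbt (illustrative sketch; check the real file in the repository)
name := "spark-machine-learning" // assumed project name

scalaVersion := "2.12.10"

// "provided" means the Spark cluster supplies these jars at runtime,
// so they are not bundled into the packaged jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.0.3" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.0.3" % "provided"
)
```

The sbt version itself is normally pinned in project/build.properties (sbt.version=1.6.1) rather than in build.sbt.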

Use sbt to package the project into a jar (sbt compile alone only produces class files, not a jar)

sbt package

The jar file will be in target/scala-2.12

Use spark-submit to submit the Spark job, naming the class you want to run

spark-submit --class {your-main-class} {your-jar-file}

(Option 2 RECOMMENDED) Use IntelliJ

Install IntelliJ and use it to open the build.sbt file as a project

IntelliJ will resolve the dependencies listed in build.sbt

Go to Run > Edit Configurations > Modify options > Add dependencies with "provided" scope to classpath

Then use Run > Run and pick the class of your choice

Data

The data was provided by Dr LIU FAN from NUS-ISS

  1. Titanic dataset
  2. Seeds dataset

Structure

Each class under src/main/scala/com/normanlimxk/SparkML contains an example of one machine learning method

  1. Linear Regression
  2. Classification
  3. Clustering
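
To give a feel for what one of these classes might look like, here is a minimal clustering sketch using the Spark ML API. It is not taken from the repository: the object name KMeansSketch, the column names, and the tiny in-memory dataset are all made up for illustration, and the real examples read their data from src/main/resources instead.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical example object; not part of the repository.
object KMeansSketch {

  // Clusters a small made-up dataset into two groups and returns
  // the rows with an added "prediction" column.
  def cluster(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Two well-separated groups of 2D points (illustrative data only).
    val points = Seq(
      (0, 1.0, 1.1), (1, 1.2, 0.9), (2, 0.9, 1.0),
      (3, 8.0, 8.1), (4, 8.2, 7.9), (5, 7.9, 8.0)
    ).toDF("id", "x", "y")

    // Spark ML estimators expect the inputs packed into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")
    val features = assembler.transform(points)

    // Fit k-means with k = 2; the seed makes initialisation repeatable.
    val model = new KMeans().setK(2).setSeed(1L).fit(features)

    model.transform(features).select("id", "x", "y", "prediction")
  }

  def main(args: Array[String]): Unit = {
    // local[*] is for running inside the IDE; drop .master(...) when
    // submitting to a cluster so spark-submit can set it.
    val spark = SparkSession.builder()
      .appName("KMeansSketch")
      .master("local[*]")
      .getOrCreate()
    try cluster(spark).show()
    finally spark.stop()
  }
}
```

The regression and classification examples would follow the same shape: assemble a features column, fit an estimator, then call transform on the fitted model to get predictions.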

The project uses the spark-sbt.g8 template from MrPowers
