Using MMLSpark to classify adult income level

This sample shows how mmlspark simplifies model training by implementing a binary classifier on the popular Adult Census dataset, first with the open-source mmlspark Spark package and then with standard Spark ML constructs for comparison.

mmlspark vs. Spark ML

As a quick comparison, here is the one-line training code using mmlspark:

model = TrainClassifier(model=LogisticRegression(regParam=reg), labelCol=" income", numFeatures=256).fit(train)

And here is the equivalent code in standard Spark ML:

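# import the Spark ML classes used below
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# create a new Logistic Regression model (reg is the regularization rate)
lr = LogisticRegression(regParam=reg)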

# string-index and one-hot encode the education column
si1 = StringIndexer(inputCol=' education', outputCol='ed')
ohe1 = OneHotEncoder(inputCol='ed', outputCol='ed-encoded')

# string-index and one-hot encode the marital-status column
si2 = StringIndexer(inputCol=' marital-status', outputCol='ms')
ohe2 = OneHotEncoder(inputCol='ms', outputCol='ms-encoded')

# string-index the label column into a column named "label"
si3 = StringIndexer(inputCol=' income', outputCol='label')

# assemble the encoded feature columns into a column named "features"
assembler = VectorAssembler(inputCols=['ed-encoded', 'ms-encoded', ' hours-per-week'], outputCol="features")

# put together the pipeline
pipe = Pipeline(stages=[si1, ohe1, si2, ohe2, si3, assembler, lr])

# train the model
model = pipe.fit(train)
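To make the manual feature-engineering stages above more concrete, here is a minimal pure-Python sketch of what string-indexing, one-hot encoding, and vector assembly do to a row. It is an illustration only, not Spark's implementation: the helper names (string_index, one_hot, assemble) and the toy rows are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch. It follows Spark ML's defaults of indexing the most frequent value first and dropping the last category when one-hot encoding.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to an index, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 list.

    With drop_last=True the final category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


def assemble(*parts):
    """Concatenate encoded vectors and numeric scalars into one feature list."""
    features = []
    for part in parts:
        if isinstance(part, list):
            features.extend(part)
        else:
            features.append(float(part))
    return features


# Toy rows with made-up values (not taken from the Adult Census data file)
rows = [
    {"education": "Bachelors", "marital-status": "Married", "hours-per-week": 40},
    {"education": "HS-grad", "marital-status": "Single", "hours-per-week": 35},
    {"education": "Bachelors", "marital-status": "Married", "hours-per-week": 50},
    {"education": "Masters", "marital-status": "Divorced", "hours-per-week": 45},
]

ed_index = string_index([r["education"] for r in rows])
ms_index = string_index([r["marital-status"] for r in rows])

features = [
    assemble(
        one_hot(ed_index[r["education"]], len(ed_index)),
        one_hot(ms_index[r["marital-status"]], len(ms_index)),
        r["hours-per-week"],
    )
    for r in rows
]

for row_features in features:
    print(row_features)
```

Each categorical column needs its own indexer and encoder, which is why the Spark ML pipeline grows with every feature added, while the mmlspark version stays at one line.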

To learn more about the mmlspark Spark package, visit: http://github.com/azure/mmlspark.

Run this sample:

Run train_mmlspark.py in a local Docker container:

$ az ml experiment submit -c docker train_mmlspark.py 0.1

Create a myvm.compute file that points to a remote VM:

$ az ml computetarget attach --name <myvm> --address <ip address or FQDN> --username <username> --password <pwd> --type remotedocker

Run train_mmlspark.py in a Docker container (with Spark) in a remote VM:

$ az ml experiment submit -c myvm train_mmlspark.py 0.3

Create a myhdi.compute file that points to an HDInsight cluster:

$ az ml computetarget attach --name <myhdi> --address <ip address or FQDN of the head node> --username <username> --password <pwd> --type cluster

Run it in a remote HDInsight cluster:

$ az ml experiment submit -c myhdi train_mmlspark.py 0.5
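Each submit command above ends with a number (0.1, 0.3, 0.5) that is passed to the script as a positional argument. Since the training code uses regParam=reg, this value is presumably the regularization rate. The sketch below shows one way a script could read it. The function name parse_reg and the default value are hypothetical and are not taken from train_mmlspark.py.

```python
DEFAULT_REG = 0.01  # hypothetical default, not taken from the sample scripts


def parse_reg(argv):
    """Return the regularization rate from the first positional argument.

    argv has the same layout as sys.argv: the script name first,
    followed by any arguments given on the command line.
    """
    if len(argv) > 1:
        reg = float(argv[1])
        if reg < 0:
            raise ValueError("regularization rate must be non-negative")
        return reg
    return DEFAULT_REG


# Equivalent to submitting: train_mmlspark.py 0.1
reg = parse_reg(["train_mmlspark.py", "0.1"])
print("Regularization rate: {}".format(reg))
```

In a real script the call would be parse_reg(sys.argv), and the resulting value would be passed on as LogisticRegression(regParam=reg).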
