Apache Spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It is an alternative parallel programming approach that is regularly used in "Big Data" applications. The goal of this assignment is to give you some experience with an alternative to cluster programming with MPI.

Project Requirements.

Spark is commonly used to distribute the analysis of large data sets across a cluster of computers, which makes it simpler to make sense of them. To gain some familiarity and experience with this, you will need to do the following:

Find a large dataset, ideally on a topic that you are curious about or interested in.

An example would be from one of these JSON data sets from the US Government:
- https://catalog.data.gov/dataset?res_format=JSON

Write a simple Spark application (pyspark is recommended) that aggregates the data into a simpler form. For example, take all of the temperature data from the world and reduce it to the average temperature of the earth for each month.
Graph your simplified data and/or write up a short discussion of how you can now make generalizations from this data reduction.
Contrast writing Spark code vs MPI code. Write this contrast up into a short report.

Was it easier or harder to write?
Are the goals the same?
What would be easier with Spark than MPI?
What would be harder with Spark than MPI?
etc.
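The temperature example above can be sketched in pyspark roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the column names `date` and `temperature`, the ISO `YYYY-MM-DD` date format, and the helper names `monthly_average` and `reference_monthly_average` are all hypothetical and will differ for your dataset. The plain-Python helper is only there so you can check the Spark output against a small hand-picked sample.

```python
"""Sketch: average temperature per month with pyspark (hypothetical schema)."""
import sys
from collections import defaultdict


def monthly_average(df, date_col="date", value_col="temperature"):
    """Reduce a Spark DataFrame of readings to one average per calendar month."""
    # Imported here so this file still loads on a machine without pyspark.
    from pyspark.sql import functions as F

    return (
        df.withColumn("month", F.month(F.to_date(F.col(date_col))))
        .groupBy("month")
        .agg(F.avg(value_col).alias("avg_temperature"))
        .orderBy("month")
    )


def reference_monthly_average(rows):
    """Plain-Python version of the same reduction, for checking a small sample.

    rows: iterable of (iso_date_string, temperature) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # month -> [sum, count]
    for date_str, temp in rows:
        month = int(date_str[5:7])  # "YYYY-MM-DD" -> MM
        totals[month][0] += temp
        totals[month][1] += 1
    return {m: s / n for m, (s, n) in sorted(totals.items())}


def main(input_path, output_path):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("MonthlyAverageTemp").getOrCreate()
    # spark.read.json expects one JSON object per line by default; for a
    # single multi-line JSON document, add .option("multiLine", True).
    readings = spark.read.json(input_path)
    result = monthly_average(readings)
    result.show()
    # coalesce(1) is acceptable here because the reduced result is tiny.
    result.coalesce(1).write.mode("overwrite").csv(output_path, header=True)
    spark.stop()


# Run the Spark job only when both paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Assuming the file is saved as `monthly_avg.py` (a hypothetical name), it could be launched with `spark-submit monthly_avg.py readings.json out/`. Note that nothing in the code says how the data is partitioned or how partial sums are combined across workers; Spark decides that, which is a useful point to keep in mind for the MPI comparison.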

Running Project

If you aren't using an NVIDIA Jetson Nano cluster with Apache Spark, you can alternatively use Apache Spark on Amazon EMR or Apache Spark on GCP. There may be other providers of Apache Spark.

Project Deliverables

Submit a tar.gz file to Tyson's turnin system with the following.

Data source being used by your Apache Spark application
Apache Spark code
Report with:
- data results/conclusions from the Spark application
- Contrast of Spark vs MPI
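One way to assemble the tar.gz is with Python's standard `tarfile` module, sketched below. The function name `make_submission` and the file names in the usage note are hypothetical; substitute your own data source, code, and report files.

```python
import tarfile
from pathlib import Path


def make_submission(archive_path, paths):
    """Bundle the given files or directories into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in map(Path, paths):
            # arcname drops leading directories so entries sit at the archive root.
            tar.add(str(p), arcname=p.name)
    return archive_path
```

For example, `make_submission("submission.tar.gz", ["readings.json", "monthly_avg.py", "report.pdf"])` would produce a single archive containing those three entries. If the dataset is too large to bundle comfortably, check with the instructor about how the data source should be provided.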

Evaluation

The assignment value will be broken down by the following:

10 - Data source
40 - Apache Spark code compiles/runs and generates results
25 - Report details on the results/conclusions drawn from the output of your Spark code.
25 - Report conclusions contrasting MPI & Spark.

csuchico-csci551/Spark
