This repository contains a collection of PySpark examples.
```
.
├── data
│   └── input_stream_dir
│       ├── file1.txt
│       └── file2.txt
├── examples
│   ├── airflow
│   │   └── orchestrator.py
│   ├── aws
│   │   ├── emr-models.md
│   │   ├── launch-aws-ems-job.py
│   │   └── run-on-aws-emr-console.py
│   ├── basic
│   │   ├── caching.py
│   │   ├── data-skew.py
│   │   ├── groupby-and-agg.py
│   │   ├── join.py
│   │   ├── spark-sql.py
│   │   ├── spark-ui.py
│   │   ├── streaming-window.py
│   │   ├── structured-streaming.py
│   │   ├── user-defined-functions.py
│   │   ├── write-parquet-csv.py
│   │   └── your-first-pyspark-code.py
│   ├── delta-table
│   │   ├── compact-and-sort.py
│   │   ├── merge.py
│   │   └── usage.py
│   ├── mllib
│   │   ├── classification-pipeline.py
│   │   └── cross-validation-tuning.py
│   ├── performance-tuning
│   │   ├── broadcast-hash-join.py
│   │   ├── partitioning.py
│   │   └── resillient-distributed-data.py
│   └── production
│       └── workflow.md
├── extras
│   └── spark-ui-custom.py
├── README.md
└── requirements.txt
```

The `examples/basic` directory contains basic PySpark examples, such as:
- Reading and writing data
- DataFrame operations
- Spark SQL
- User-defined functions (UDFs)
- Caching
- Spark UI
- Structured Streaming

The `examples/aws` directory contains examples of running PySpark on AWS, such as:
- Running PySpark on EMR
- Launching EMR jobs
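
A common way to launch a PySpark job on EMR is to define a step that runs `spark-submit` through `command-runner.jar`. The sketch below builds such a step as a plain dict; the S3 path and step name are placeholders, and in a real script the dict would be passed to boto3's EMR client, e.g. `client.add_job_flow_steps(JobFlowId=..., Steps=[step])`:

```python
def build_pyspark_step(script_s3_path, step_name="pyspark-example"):
    """Return an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": step_name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets EMR execute arbitrary commands,
            # including spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path],
        },
    }

# Hypothetical bucket and script path, for illustration only.
step = build_pyspark_step("s3://my-bucket/jobs/your-first-pyspark-code.py")
```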

The `examples/delta-table` directory contains examples of using Delta Lake with PySpark, such as:
- Creating and reading Delta tables
- Updating and merging Delta tables
- Compacting and sorting Delta tables
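
A Delta Lake upsert can be expressed as a Spark SQL `MERGE INTO` statement. The sketch below only builds the statement as a string; the table names are illustrative, and actually running it (via `spark.sql(merge_sql)`) requires a SparkSession configured with the delta-spark package:

```python
def build_merge_sql(target="events", source="updates", key="id"):
    """Build a MERGE INTO statement that updates matching rows and inserts new ones."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

merge_sql = build_merge_sql()
```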

The `examples/mllib` directory contains examples of using MLlib with PySpark, such as:
- Classification pipelines
- Cross-validation and tuning

The `examples/performance-tuning` directory contains examples of tuning PySpark performance, such as:
- Broadcast hash join
- Partitioning
- Resilient Distributed Datasets (RDDs)

The `examples/production` directory contains examples of running PySpark in production, such as:
- Workflow management

The `examples/airflow` directory contains examples of using Airflow with PySpark, such as:
- Orchestrating PySpark jobs
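
Orchestration typically means wrapping `spark-submit` in an Airflow task. Below is a hedged sketch of such a DAG; the DAG id, schedule, and connection id are illustrative, and it assumes `apache-airflow` plus the `apache-airflow-providers-apache-spark` package are installed:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# Illustrative DAG: submits one PySpark script once a day.
with DAG(
    dag_id="pyspark_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    run_job = SparkSubmitOperator(
        task_id="run_pyspark_job",
        application="examples/basic/your-first-pyspark-code.py",
        conn_id="spark_default",  # assumes a configured Spark connection
    )
```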

Install the dependencies:

```shell
pip install -r requirements.txt
```

Run an example with `spark-submit`:

```shell
spark-submit examples/basic/your-first-pyspark-code.py
```