luisresende13/pyspark-learning

PySpark Learning

This repository contains a collection of PySpark examples.

Table of Contents

  • Project Structure
  • Examples
  • Running the Examples

Project Structure

.
├── data
│   └── input_stream_dir
│       ├── file1.txt
│       └── file2.txt
├── examples
│   ├── airflow
│   │   └── orchestrator.py
│   ├── aws
│   │   ├── emr-models.md
│   │   ├── launch-aws-emr-job.py
│   │   └── run-on-aws-emr-console.py
│   ├── basic
│   │   ├── caching.py
│   │   ├── data-skew.py
│   │   ├── groupby-and-agg.py
│   │   ├── join.py
│   │   ├── spark-sql.py
│   │   ├── spark-ui.py
│   │   ├── streaming-window.py
│   │   ├── structured-streaming.py
│   │   ├── user-defined-functions.py
│   │   ├── write-parquet-csv.py
│   │   └── your-first-pyspark-code.py
│   ├── delta-table
│   │   ├── compact-and-sort.py
│   │   ├── merge.py
│   │   └── usage.py
│   ├── mllib
│   │   ├── classification-pipeline.py
│   │   └── cross-validation-tuning.py
│   ├── performance-tuning
│   │   ├── broadcast-hash-join.py
│   │   ├── partitioning.py
│   │   └── resilient-distributed-data.py
│   └── production
│       └── workflow.md
├── extras
│   └── spark-ui-custom.py
├── README.md
└── requirements.txt

Examples

Basic

This directory contains basic PySpark examples, such as:

  • Reading and writing data
  • DataFrame operations
  • Spark SQL
  • User-defined functions (UDFs)
  • Caching
  • Spark UI
  • Structured Streaming

AWS

This directory contains examples of how to run PySpark on AWS, such as:

  • Running PySpark on EMR
  • Launching EMR jobs
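Launching an EMR step is typically done through boto3's EMR client. The sketch below only builds the step definition that `add_job_flow_steps` expects; the bucket, script path, and cluster ID are hypothetical placeholders, and the actual API call is left commented out:

```python
# Sketch of the step definition boto3's EMR client expects.
# The bucket, script path, and cluster ID are hypothetical placeholders.
step = {
    "Name": "pyspark-example",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command runner
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://my-bucket/your-first-pyspark-code.py",
        ],
    },
}

# With boto3 installed and AWS credentials configured, the step would be
# submitted roughly like this (not executed here):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])
```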

Delta Table

This directory contains examples of how to use Delta Lake with PySpark, such as:

  • Creating and reading Delta tables
  • Updating and merging Delta tables
  • Compacting and sorting Delta tables

MLlib

This directory contains examples of how to use MLlib with PySpark, such as:

  • Classification pipelines
  • Cross-validation and tuning

Performance Tuning

This directory contains examples of how to tune PySpark performance, such as:

  • Broadcast hash join
  • Partitioning
  • Resilient Distributed Datasets (RDDs)

Production

This directory contains examples of how to run PySpark in production, such as:

  • Workflow management

Airflow

This directory contains examples of how to use Airflow with PySpark, such as:

  • Orchestrating PySpark jobs

Running the Examples

  1. Install the dependencies:

    pip install -r requirements.txt

  2. Run the examples using spark-submit:

    spark-submit examples/basic/your-first-pyspark-code.py
