luisresende13/pyspark-learning

PySpark Learning

This repository contains a collection of PySpark examples.

Table of Contents

  • Project Structure
  • Examples
  • Running the Examples

Project Structure

.
├── data
│   └── input_stream_dir
│       ├── file1.txt
│       └── file2.txt
├── examples
│   ├── airflow
│   │   └── orchestrator.py
│   ├── aws
│   │   ├── emr-models.md
│   │   ├── launch-aws-emr-job.py
│   │   └── run-on-aws-emr-console.py
│   ├── basic
│   │   ├── caching.py
│   │   ├── data-skew.py
│   │   ├── groupby-and-agg.py
│   │   ├── join.py
│   │   ├── spark-sql.py
│   │   ├── spark-ui.py
│   │   ├── streaming-window.py
│   │   ├── structured-streaming.py
│   │   ├── user-defined-functions.py
│   │   ├── write-parquet-csv.py
│   │   └── your-first-pyspark-code.py
│   ├── delta-table
│   │   ├── compact-and-sort.py
│   │   ├── merge.py
│   │   └── usage.py
│   ├── mllib
│   │   ├── classification-pipeline.py
│   │   └── cross-validation-tuning.py
│   ├── performance-tuning
│   │   ├── broadcast-hash-join.py
│   │   ├── partitioning.py
│   │   └── resilient-distributed-data.py
│   └── production
│       └── workflow.md
├── extras
│   └── spark-ui-custom.py
├── README.md
└── requirements.txt

Examples

Basic

This directory contains basic PySpark examples, such as:

  • Reading and writing data
  • DataFrame operations
  • Spark SQL
  • User-defined functions (UDFs)
  • Caching
  • Spark UI
  • Structured Streaming

AWS

This directory contains examples of how to run PySpark on AWS, such as:

  • Running PySpark on EMR
  • Launching EMR jobs
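Launching an EMR step is typically done through boto3's EMR client. The sketch below only builds the step definition that `add_job_flow_steps` expects; the bucket, script path, and cluster ID are hypothetical placeholders, and the actual API call is left commented out:

```python
# Sketch of the step definition boto3's EMR client expects.
# The bucket, script path, and cluster ID are hypothetical placeholders.
step = {
    "Name": "pyspark-example",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # EMR's generic command runner
        "Args": [
            "spark-submit",
            "--deploy-mode", "cluster",
            "s3://my-bucket/your-first-pyspark-code.py",
        ],
    },
}

# With boto3 installed and AWS credentials configured, the step would be
# submitted roughly like this (not executed here):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX", Steps=[step])
```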

Delta Table

This directory contains examples of how to use Delta Lake with PySpark, such as:

  • Creating and reading Delta tables
  • Updating and merging Delta tables
  • Compacting and sorting Delta tables

MLlib

This directory contains examples of how to use MLlib with PySpark, such as:

  • Classification pipelines
  • Cross-validation and tuning

Performance Tuning

This directory contains examples of how to tune PySpark performance, such as:

  • Broadcast hash join
  • Partitioning
  • Resilient Distributed Datasets (RDDs)

Production

This directory contains examples of how to run PySpark in production, such as:

  • Workflow management

Airflow

This directory contains examples of how to use Airflow with PySpark, such as:

  • Orchestrating PySpark jobs

Running the Examples

  1. Install the dependencies:

    pip install -r requirements.txt

  2. Run the examples using spark-submit:

    spark-submit examples/basic/your-first-pyspark-code.py
