Module 6 — Batch Processing with Apache Spark

Part of the DataTalksClub Data Engineering Zoomcamp 2026

Setup

Requirements

Ubuntu 24.04 (or WSL)
Java 17+
Python 3.12+
uv package manager

Install PySpark

uv init
uv add pyspark jupyter ipykernel

# Register the kernel so Jupyter uses the correct environment
uv run python -m ipykernel install --user --name=pyspark-env --display-name "PySpark (uv)"

# Launch Jupyter
uv run jupyter notebook

When opening any notebook, set the kernel to PySpark (uv) via Kernel > Change Kernel.

Verify installation

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("test") \
    .getOrCreate()

print(spark.version)

Project Structure

PySpark/
│
├── 03_test.ipynb             # Verify Spark works, create first SparkSession
├── 04_pyspark.ipynb          # DataFrames, schemas, UDFs using FHVHV taxi data
├── 05_taxi_schema.ipynb      # Define schemas for Yellow & Green taxi, save to Parquet
├── 06_spark_sql.ipynb        # SparkSQL, temp views, union Yellow + Green data
├── 07_groupby_join.ipynb     # GroupBy internals, Sort-Merge vs Broadcast joins
├── 08_rdds.ipynb             # RDD operations, mapPartitions
├── 09_spark_gcs.ipynb        # Connect Spark to Google Cloud Storage
├── HomeWork.ipynb            # 2026 Homework solutions
│
├── data/
│   ├── raw/                  # Downloaded CSV source files
│   ├── pq/                   # Processed Parquet files
│   └── report/               # Query output / reports
│
└── taxi_zone_lookup.csv      # NYC taxi zone lookup table

Data Sources

All data comes from the DataTalksClub NYC TLC mirror and the official NYC TLC website.

# FHVHV (For-Hire High Volume) - used in 04_pyspark.ipynb
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/fhvhv/fhvhv_tripdata_2021-01.csv.gz

# Yellow taxi
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz

# Green taxi
wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/green/green_tripdata_2021-01.csv.gz

# Zone lookup table
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv

# Homework dataset - Yellow November 2025
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet

Notebooks Overview

03_test.ipynb

Verify PySpark is installed and working. Creates a basic SparkSession and runs spark.range(10).

04_pyspark.ipynb

First real look at PySpark DataFrames using FHVHV taxi data. Covers reading CSV with and without explicit schemas, using pandas to sample and infer types, repartitioning, saving to Parquet, and writing User-Defined Functions (UDFs).

05_taxi_schema.ipynb

Defines explicit StructType schemas for both Yellow and Green taxi datasets. Loops over multiple months converting CSV files to Parquet. Explains the difference between Yellow (tpep_) and Green (lpep_) datetime column naming.

06_spark_sql.ipynb

Combines Yellow and Green taxi data using SparkSQL. Renames columns to align both datasets, unions them into a single view, and runs aggregation queries using spark.sql(). Saves a monthly revenue report to Parquet.

07_groupby_join.ipynb

Explores Spark internals. Covers the two-phase GroupBy shuffle (map-side aggregation, network shuffle, reduce-side aggregation), Sort-Merge joins between large tables, and Broadcast joins using the small zone lookup table. Uses df.explain() to inspect execution plans and see where shuffles occur.

08_rdds.ipynb

Works with the low-level RDD API. Covers map, filter, flatMap, reduceByKey, DataFrame to RDD conversion, and mapPartitions for partition-level processing such as loading a model once per partition rather than once per row.

09_spark_gcs.ipynb

Configures Spark with the GCS connector JAR to read and write directly to Google Cloud Storage using gs:// paths. Requires a GCP account and gcloud auth application-default login.

HomeWork.ipynb

Solutions to the Module 6 homework using Yellow taxi November 2025 data.

Key Concepts

Why Parquet over CSV

	CSV	Parquet
Size on disk	Large	~5x smaller
Schema	None (all strings)	Typed
Read speed	Slow	Fast (columnar reads)
Spark preference	No	Yes

GroupBy internals

Every GroupBy triggers a shuffle — data moves across the network so rows with matching keys land on the same partition. This is the most expensive operation in Spark and is visible as an Exchange node in df.explain().

Broadcast Join vs Sort-Merge Join

# Sort-Merge (default) — shuffles both sides across the network
df_large.join(df_medium, on="key")

# Broadcast — copies small table to every executor, no shuffle needed
from pyspark.sql import functions as F
df_large.join(F.broadcast(df_small), on="key")

Use broadcast joins whenever one table is small (under a few hundred MB). The execution plan will show BroadcastHashJoin instead of Exchange.

Spark UI

Available at http://localhost:4040 while a Spark session is running. Shows jobs, stages, tasks, DAG visualisation, and shuffle read/write sizes.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
PySpark		PySpark
README.md		README.md
Spark HomeWork.ipynb		Spark HomeWork.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Module 6 — Batch Processing with Apache Spark

Setup

Requirements

Install PySpark

Verify installation

Project Structure

Data Sources

Notebooks Overview

03_test.ipynb

04_pyspark.ipynb

05_taxi_schema.ipynb

06_spark_sql.ipynb

07_groupby_join.ipynb

08_rdds.ipynb

09_spark_gcs.ipynb

HomeWork.ipynb

Key Concepts

Why Parquet over CSV

GroupBy internals

Broadcast Join vs Sort-Merge Join

Spark UI

Resources

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Module 6 — Batch Processing with Apache Spark

Setup

Requirements

Install PySpark

Verify installation

Project Structure

Data Sources

Notebooks Overview

03_test.ipynb

04_pyspark.ipynb

05_taxi_schema.ipynb

06_spark_sql.ipynb

07_groupby_join.ipynb

08_rdds.ipynb

09_spark_gcs.ipynb

HomeWork.ipynb

Key Concepts

Why Parquet over CSV

GroupBy internals

Broadcast Join vs Sort-Merge Join

Spark UI

Resources

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages