Theme: From Fundamentals to Production Optimization
Core Stack: PySpark, Performance Optimization, Structured Streaming
Target Platform: Databricks Free Edition (or 14-Day Trial)
A hands-on workshop designed to teach essential PySpark skills from fundamentals to production optimization. Participants will learn distributed data processing, performance optimization techniques, and production-ready patterns using real-world datasets (TPC-H and TPC-DS).
Target Audience: Data engineers, data analysts, and developers working with Apache Spark
Prerequisites: Basic Python knowledge, SQL familiarity (optional but helpful)
Platform: Databricks Free Edition (completely free!)
- ✅ Spark fundamentals and lazy evaluation
- ✅ DataFrame operations (filter, join, groupBy, aggregations)
- ✅ Data engineering patterns (cleaning, joining, saving)
- ✅ Intermediate patterns (window functions, caching, partitioning)
- ✅ Production optimization (shuffles, skew, broadcast joins, AQE)
- ✅ Streaming optimization (checkpoints, watermarks, output modes)
- ✅ Advanced PySpark patterns for production
- Process large datasets efficiently
- Diagnose and fix performance bottlenecks
- Optimize Spark jobs using Spark UI
- Apply production-ready patterns
- Build reliable streaming pipelines
- Master advanced PySpark techniques
PySpark-Workshop/
├── README.md                                   # This file
├── WORKSHOP_GUIDE.md                           # Complete workshop guide
├── DATABRICKS_FREE_EDITION_GUIDE.md            # Concise Databricks guide
├── PYSPARK_CHEATSHEET.md                       # Quick reference guide
│
├── Part1_PySpark_Speedrun.ipynb                # Part 1: The PySpark Speedrun (20 min)
├── Part2_Data_Engineer_Toolkit.ipynb           # Part 2: Data Engineer Toolkit (35 min)
├── Part3_Intermediate_Patterns.ipynb           # Part 3: Intermediate Patterns (40 min)
├── Part4_Batch_Processing_Optimization.ipynb   # Part 4: Batch Optimization (20 min)
│
└── BONUS/
    ├── Part6_Stream_Optimization_BONUS.ipynb   # Bonus: Streaming Optimization (20 min)
    └── Part7_Advanced_Pyspark_BONUS.ipynb      # Bonus: Advanced PySpark (40 min)
- Go to databricks.com/try-databricks
- Create a free account (no credit card required!)
- Verify your email address
- Navigate to Compute → Create Cluster
- Name: "workshop-cluster"
- Runtime: Latest LTS ML version (e.g., 14.3 LTS ML)
- Click Create (takes 3-5 minutes to start)
- Download the core workshop notebooks from this repository:
  - Part1_PySpark_Speedrun.ipynb - Spark fundamentals
  - Part2_Data_Engineer_Toolkit.ipynb - Data engineering patterns
  - Part3_Intermediate_Patterns.ipynb - Advanced patterns
  - Part4_Batch_Processing_Optimization.ipynb - Production optimization

  Optional Bonus Notebooks (for advanced learners):
  - Part6_Stream_Optimization_BONUS.ipynb - Streaming production issues
  - Part7_Advanced_Pyspark_BONUS.ipynb - Comprehensive advanced patterns
- In Databricks: Workspace → Your User Folder → Import
- Upload each .ipynb file
- Attach to your cluster (dropdown at top of notebook)
- No data download required! The notebooks use built-in datasets in Databricks
- TPC-H dataset (Parts 1-3): Pre-loaded at samples.tpch.* tables
  - Includes: orders, customer, lineitem, nation
- TPC-DS dataset (Part 4): Pre-loaded at samples.tpcds_sf1.* tables
  - Includes: store_sales, customer, item, date_dim
- Simply run the notebooks - the data is already available!
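For example, inside any notebook attached to your cluster, the sample tables can be read directly by name (a minimal sketch; `spark` is the session object Databricks provides automatically):

```python
# Read the pre-loaded TPC-H tables straight from the samples catalog
orders = spark.table("samples.tpch.orders")
customer = spark.table("samples.tpch.customer")

orders.show(5)          # preview a few rows
print(orders.count())   # row count of the orders table
```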
Full setup instructions: See WORKSHOP_GUIDE.md
Objective: Get everyone writing code and understanding Spark's core "lazy" concept within minutes.
- Module 1.1: First Load (10 mins) - Load and explore data
- Module 1.2: Core API (10 mins) - Essential DataFrame operations
Key Concepts: DataFrame, Transformations (Lazy), Actions (Eager)
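A minimal sketch of the lazy-evaluation idea covered here, using the pre-loaded samples.tpch.orders table (illustrative, not the exact notebook code):

```python
orders = spark.table("samples.tpch.orders")

# Transformations are lazy: Spark only records the plan, nothing runs yet
open_orders = (orders
               .filter(orders.o_orderstatus == "O")
               .select("o_orderkey", "o_totalprice"))

# Actions are eager: each one triggers execution of the accumulated plan
print(open_orders.count())
open_orders.show(5)
```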
Objective: Master the true 80/20 of data engineering: shaping, joining, and aggregating data.
- Module 2.1: Aggregating (10 mins) - Group by order status and calculate order statistics
- Module 2.2: Joining (15 mins) - Combine customer data with orders (left vs inner join)
- Module 2.3: Cleaning & Saving (10 mins) - Handle nulls and save to Delta Lake
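The gist of these three modules, sketched with the sample TPC-H tables (the target Delta table name is a placeholder; the notebooks may use a different one):

```python
from pyspark.sql import functions as F

orders = spark.table("samples.tpch.orders")
customer = spark.table("samples.tpch.customer")

# 2.1 Aggregating: order statistics per order status
status_stats = orders.groupBy("o_orderstatus").agg(
    F.count("*").alias("order_count"),
    F.round(F.avg("o_totalprice"), 2).alias("avg_price"),
)

# 2.2 Joining: a left join keeps customers even if they have no orders
cust_orders = customer.join(orders, customer.c_custkey == orders.o_custkey, "left")

# 2.3 Cleaning & saving: fill nulls introduced by the left join, then write to Delta
cleaned = cust_orders.fillna({"o_totalprice": 0.0})
cleaned.write.format("delta").mode("overwrite").saveAsTable("workshop_customer_orders")
```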
Objective: Master essential patterns that appear in 80% of production PySpark jobs.
- Module 3.1: Window Functions (15 mins) - Ranking, running totals, lag/lead
- Module 3.2: Caching & Persistence (10 mins) - When and how to cache effectively
- Module 3.3: Partitioning Fundamentals (10 mins) - Data layout and partition control
- Module 3.4: Query Plans & explain() (5 mins) - Understanding Spark's execution
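A compact sketch of the patterns in this part (illustrative; note that .cache() and .rdd are unavailable on Serverless compute, as explained later in this README):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

orders = spark.table("samples.tpch.orders")

# 3.1 Window functions: rank each customer's orders and keep a running total
rank_w = Window.partitionBy("o_custkey").orderBy(F.desc("o_totalprice"))
total_w = (Window.partitionBy("o_custkey")
           .orderBy("o_orderdate")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

ranked = (orders
          .withColumn("price_rank", F.row_number().over(rank_w))
          .withColumn("running_total", F.sum("o_totalprice").over(total_w)))

# 3.2 Caching: persist a DataFrame you will reuse several times
ranked.cache()

# 3.3 Partitioning: inspect and control the number of partitions
print(ranked.rdd.getNumPartitions())
repartitioned = ranked.repartition(8, "o_custkey")

# 3.4 Query plans: read what Spark will actually execute
repartitioned.explain(True)
```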
Objective: Identify, diagnose, and fix the most common Spark performance issues.
- Issue #1: Shuffle Explosion & Column Pruning
- Issue #2: Broadcast Joins for Small Dimensions
- Issue #3: Python UDF Performance Killer
- Issue #4: Data Skew - The Silent Killer
- Issue #5: Adaptive Query Execution (AQE)
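A condensed sketch of a few of these fixes against the TPC-DS tables (illustrative; AQE is already on by default in recent Databricks runtimes, and manual configs are not available on Serverless):

```python
from pyspark.sql import functions as F

store_sales = spark.table("samples.tpcds_sf1.store_sales")
item = spark.table("samples.tpcds_sf1.item")

# Issue #1: prune columns early so the shuffle only moves what the query needs
sales_slim = store_sales.select("ss_item_sk", "ss_quantity")
item_slim = item.select("i_item_sk", "i_category")

# Issue #2: hint Spark to broadcast the small dimension and avoid a shuffle join
by_category = (sales_slim
               .join(F.broadcast(item_slim), sales_slim.ss_item_sk == item_slim.i_item_sk)
               .groupBy("i_category")
               .agg(F.sum("ss_quantity").alias("total_quantity")))

# Issue #5: AQE can also split skewed partitions at join time
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

by_category.show()
```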
Objective: Fix critical Structured Streaming production issues.
- Checkpoint management for fault tolerance
- Watermarks to prevent unbounded state growth
- Choosing the right output mode
- Handling small files in streaming sinks
- Idempotent writes with foreachBatch
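A minimal sketch tying these pieces together, using the built-in rate source so it runs anywhere (the checkpoint path and target table name are placeholders; a production sink would typically MERGE for true idempotency):

```python
from pyspark.sql import functions as F

# Illustrative streaming source: generates (timestamp, value) rows
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Watermark: bound how much state the windowed aggregation keeps
windowed = (events
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.window("timestamp", "5 minutes"))
            .count())

def write_batch(batch_df, batch_id):
    # foreachBatch gives full control over the write; plain append here for brevity,
    # in production typically a Delta MERGE keyed on the window for idempotency
    batch_df.write.format("delta").mode("append").saveAsTable("workshop_event_counts")

query = (windowed.writeStream
         .outputMode("update")                                            # emit only changed windows
         .option("checkpointLocation", "/tmp/checkpoints/event_counts")   # fault tolerance
         .foreachBatch(write_batch)
         .start())
```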
Objective: Comprehensive advanced patterns for production data engineering.
- Performance optimization (shuffles, partitions, caching)
- Join strategies and AQE
- Schema hygiene and pushdown
- ML pipelines with proper validation
- Streaming with watermarks and checkpointing
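One small example from this theme, an explicit schema plus filter and column pushdown (the path and schema are hypothetical placeholders):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

# Schema hygiene: declare the schema instead of paying for (and trusting) inference
order_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_date", DateType(), nullable=True),
])

df = spark.read.schema(order_schema).parquet("/path/to/orders")

# Pushdown: the filter and column selection are pushed into the Parquet scan,
# visible in explain() output as PushedFilters / ReadSchema
recent = df.filter(F.col("order_date") >= "2024-01-01").select("order_id", "amount")
recent.explain(True)
```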
- Recap what we built
- The path forward (scaling, reliability, production)
- Final Q&A
Total Time: ~2.5 hours (core workshop) + 1 hour (bonus content)
The core notebooks use the TPC-H (Transaction Processing Performance Council - Benchmark H) dataset, which is built into Databricks and simulates a business data warehouse environment.
orders table:
- o_orderkey: Unique identifier for each order
- o_custkey: Customer key (foreign key to customer table)
- o_orderstatus: Order status (O, F, P)
- o_totalprice: Total price of the order
- o_orderdate: Date of the order
- o_orderpriority: Order priority
- o_clerk: Clerk who processed the order
- o_shippriority: Shipping priority
- o_comment: Order comment
customer table:
- c_custkey: Unique identifier for each customer
- c_name: Customer name
- c_address: Customer address
- c_nationkey: Nation key (foreign key)
- c_phone: Customer phone number
- c_acctbal: Customer account balance
- c_mktsegment: Market segment (AUTOMOBILE, BUILDING, MACHINERY, HOUSEHOLD, FURNITURE)
- c_comment: Customer comment
Data Characteristics:
- Pre-loaded in Databricks as samples.tpch.* tables
- Optimized Parquet format for fast reads
- Realistic business data patterns
- Perfect for learning joins, aggregations, and intermediate patterns
- No download or upload required - ready to use!
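A quick way to explore these tables before the exercises (illustrative):

```python
# Inspect the orders schema and the customer market segments
spark.table("samples.tpch.orders").printSchema()

(spark.table("samples.tpch.customer")
 .groupBy("c_mktsegment")
 .count()
 .show())
```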
Part 4 uses the TPC-DS (Decision Support) dataset, a more complex benchmark perfect for performance testing and optimization exercises.
Key Tables:
- store_sales: Large fact table with sales transactions
- customer: Customer dimension table
- item: Item/product dimension table
- date_dim: Date dimension table
Data Characteristics:
- Pre-loaded in Databricks as samples.tpcds_sf1.* tables
- Scale Factor 1 (~1GB) - perfect for performance demos
- Simulates retail environment with stores, customers, and sales
- Ideal for demonstrating optimization techniques
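A small illustrative look at the fact-to-dimension shape that Part 4 exercises repeatedly:

```python
store_sales = spark.table("samples.tpcds_sf1.store_sales")
date_dim = spark.table("samples.tpcds_sf1.date_dim")

print(store_sales.count())   # size of the fact table at scale factor 1

# Typical star-schema join: fact table keyed to the date dimension
yearly_sales = (store_sales
                .join(date_dim, store_sales.ss_sold_date_sk == date_dim.d_date_sk)
                .groupBy("d_year")
                .count())
yearly_sales.show()
```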
After completing this workshop, you will be able to:
- Understand Spark Fundamentals
  - Explain lazy evaluation and the Catalyst optimizer
  - Describe transformations vs actions
  - Use Spark UI for debugging and performance analysis
- Process Data at Scale
  - Load and transform large datasets efficiently
  - Perform complex joins and aggregations
  - Clean data and save to Delta Lake
  - Use window functions for advanced analytics
- Optimize Spark Performance
  - Diagnose performance bottlenecks using Spark UI
  - Fix shuffle explosion with column pruning
  - Use broadcast joins for small dimensions
  - Mitigate data skew with salting techniques (a sketch follows this list)
  - Enable and leverage Adaptive Query Execution (AQE)
- Master Production Patterns
  - Apply caching strategies effectively
  - Control partitioning for optimal performance
  - Read and interpret query execution plans
  - Handle complex data types (arrays, structs)
- Build Reliable Streaming Pipelines (Bonus)
  - Configure checkpoints for fault tolerance
  - Use watermarks to prevent unbounded state
  - Choose correct output modes
  - Implement idempotent sinks
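A minimal sketch of the salting idea mentioned under performance above; large_df, small_df, join_key, and the salt factor are all hypothetical placeholders (the notebooks may implement this differently):

```python
from pyspark.sql import functions as F

SALT_BUCKETS = 8  # hypothetical; size it to how concentrated the hot keys are

# Large, skewed side: append a random salt so one hot key spreads over many tasks
large_salted = large_df.withColumn(
    "salted_key",
    F.concat_ws("_", F.col("join_key").cast("string"),
                (F.rand() * SALT_BUCKETS).cast("int").cast("string")))

# Small side: replicate every row once per salt value so all sub-keys still match
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("string").alias("salt"))
small_salted = (small_df
                .crossJoin(salts)
                .withColumn("salted_key",
                            F.concat_ws("_", F.col("join_key").cast("string"),
                                        F.col("salt"))))

# The hot keys now land in SALT_BUCKETS partitions instead of one
joined = large_salted.join(small_salted, "salted_key")
```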
- Laptop with modern browser (Chrome/Firefox/Safari)
- Internet connection (stable, for Databricks)
- Databricks account (free - no credit card required!)
- No local installation required!
- Databricks account (free edition works perfectly!)
- No additional setup required - datasets are built into Databricks
- WORKSHOP_GUIDE.md - Complete workshop guide with agenda, timing, and teaching tips
- DATABRICKS_FREE_EDITION_GUIDE.md - Concise guide to Databricks Free Edition
- PYSPARK_CHEATSHEET.md - Quick reference for common PySpark operations
Core Workshop:
- Part1_PySpark_Speedrun.ipynb - Spark fundamentals (20 min)
- Part2_Data_Engineer_Toolkit.ipynb - Data engineering patterns (35 min)
- Part3_Intermediate_Patterns.ipynb - Advanced patterns (40 min)
- Part4_Batch_Processing_Optimization.ipynb - Production optimization (20 min)
Bonus Content:
- Part6_Stream_Optimization_BONUS.ipynb - Streaming production issues (20 min)
- Part7_Advanced_Pyspark_BONUS.ipynb - Comprehensive advanced patterns (40 min)
This workshop is optimized for Databricks Free Edition (completely free!):
- ✅ Full PySpark functionality
- ✅ Single cluster (15GB RAM, 2 cores)
- ✅ 2GB DBFS storage (we'll use cloud storage for datasets)
- ✅ MLflow tracking (limited features)
- ✅ Spark UI for debugging
- Clusters auto-terminate after 2 hours of inactivity (notebook state is saved)
- No job scheduling (manual execution only)
- Single user (no team collaboration)
Perfect for: Learning, prototypes, workshops, personal projects!
See DATABRICKS_FREE_EDITION_GUIDE.md for details.
If you're using Databricks Serverless compute tier (not Free Edition), note that these operations are NOT supported:
- ❌ .cache() and .persist() operations
- ❌ .rdd API access (including .rdd.getNumPartitions())
- ❌ Manual Spark configuration tuning
Why? Serverless automatically optimizes data caching, partitioning, and resource allocation. These manual optimizations are not needed (and not available) on serverless.
Workshop Impact:
- Part 3 (Caching section) - Skip or treat as conceptual learning
- Part 3 (Partitioning checks) - Operations work, but partition counts not visible
- All other content works normally on serverless!
This workshop is open source! Contributions welcome:
- Bug Reports: Open an issue
- Improvements: Submit a pull request
- New Exercises: Add to bonus section
- Translations: Help make it accessible
This workshop is released under the MIT License. Feel free to use, modify, and distribute for educational purposes.
- Ask questions anytime
- Teaching assistants available
- Refer to workshop guide for timing
- GitHub Issues: For bugs or questions
- Databricks Community Forums: community.databricks.com
- Databricks Academy: academy.databricks.com - Free courses
Help us improve! After the workshop, please:
- Complete the feedback survey
- Star this repository if you found it helpful
- Share with colleagues who might benefit
- Suggest topics for future workshops
- Apache Spark Community: For this amazing tool
- Databricks: For Free Edition and documentation
- Workshop Participants: Your engagement makes this worthwhile!
- ✅ Complete pre-workshop setup
- ✅ Review PySpark cheat sheet
- ✅ Read Databricks guide
- ✅ Bring curiosity and questions!
See you at the workshop! Let's build something amazing with Spark!
Happy Learning!