Data Engineering is the field of designing and building systems that collect, store, transform, and deliver data. Data pipelines turn raw data into formats that data analysts and data scientists can work with.
Target Audience: Beginner to intermediate data pipeline practitioners (practical fundamentals)
```text
Data Engineering Overview → Data Modeling → ETL/ELT Concepts
                                                   ↓
      Prefect ← Airflow Advanced ← Airflow Basics ←┘
         ↓
      Spark Basics → PySpark DataFrame → Spark Optimization
                                                   ↓
Practical Project ← Data Quality ← dbt ← Data Lake/Warehouse ← Kafka Streaming
```
| Filename | Topic | Difficulty | Key Content |
|---|---|---|---|
| 01_Data_Engineering_Overview.md | Data Engineering Overview | ⭐ | Roles, pipelines, batch vs streaming, architecture patterns |
| 02_Data_Modeling_Basics.md | Data Modeling Basics | ⭐⭐ | Dimensional modeling, star/snowflake schema, SCD |
| 03_ETL_vs_ELT.md | ETL vs ELT | ⭐⭐ | Traditional ETL, modern ELT, tool comparison, use cases |
| 04_Apache_Airflow_Basics.md | Airflow Basics | ⭐⭐ | Architecture, DAG, Task, Operator, scheduling |
| 05_Airflow_Advanced.md | Airflow Advanced | ⭐⭐⭐ | XCom, dynamic DAG, Sensor, Hook, TaskGroup |
| 06_Prefect_Modern_Orchestration.md | Prefect Modern Orchestration | ⭐⭐ | Flow, Task, Airflow comparison, Deployment |
| 07_Apache_Spark_Basics.md | Apache Spark Basics | ⭐⭐⭐ | Architecture, RDD, cluster modes, installation |
| 08_PySpark_DataFrames.md | PySpark DataFrame | ⭐⭐⭐ | SparkSession, DataFrame, transformations, actions, UDF |
| 09_Spark_SQL_Optimization.md | Spark SQL Optimization | ⭐⭐⭐ | Catalyst, partitioning, caching, join strategies, tuning |
| 10_Kafka_Streaming.md | Kafka Streaming | ⭐⭐⭐ | Kafka overview, Topic, Producer/Consumer, real-time processing |
| 11_Data_Lake_Warehouse.md | Data Lake and Warehouse | ⭐⭐ | Lake, Warehouse, Lakehouse, Delta Lake, Iceberg |
| 12_dbt_Transformation.md | dbt Transformation Tool | ⭐⭐⭐ | Models, sources, tests, documentation, Jinja templates |
| 13_Data_Quality_Governance.md | Data Quality and Governance | ⭐⭐⭐ | Quality dimensions, Great Expectations, catalog, lineage |
| 14_Practical_Pipeline_Project.md | Practical Pipeline Project | ⭐⭐⭐⭐ | E2E design, Airflow+Spark+dbt, quality validation, monitoring |
| 15_Airflow_TaskFlow_API.md | Airflow TaskFlow API | ⭐⭐⭐ | @task decorator, dynamic mapping, TaskGroups, testing |
| 16_Kafka_Streams_ksqlDB.md | Kafka Streams & ksqlDB | ⭐⭐⭐ | Faust stream processing, windowed aggregation, ksqlDB push/pull queries |
| 17_Spark_Structured_Streaming.md | Spark Structured Streaming | ⭐⭐⭐⭐ | Unbounded tables, watermarking, stateful processing, Kafka integration |
| 18_CDC_Debezium.md | CDC with Debezium | ⭐⭐⭐ | Log-based CDC, Debezium+Kafka Connect, event format, outbox pattern |
| 19_Lakehouse_Practical_Patterns.md | Lakehouse Practical Patterns | ⭐⭐⭐⭐ | Medallion architecture, MERGE/SCD2, compaction, time travel, Iceberg |
| 20_Dagster_Asset_Orchestration.md | Dagster Asset Orchestration | ⭐⭐⭐ | Software-defined assets, Dagster vs Airflow, partitioned assets, sensors, dbt integration |
| 21_Data_Versioning_and_Contracts.md | Data Versioning and Data Contracts | ⭐⭐⭐⭐ | lakeFS, DVC comparison, schema contracts, contract testing, data mesh |
| 22_Vector_Storage_and_Indexing.md | Vector Storage and Indexing | ⭐⭐⭐ | FAISS indexes, Milvus, Weaviate, Qdrant, Pinecone, Chroma, benchmarks |
| 23_Production_Vector_Search.md | Production Vector Search | ⭐⭐⭐⭐ | Hybrid search, metadata filtering, reranking, scaling, monitoring, cost optimization |
Total Lessons: 23
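The idea running through the early lessons — moving data through extract, transform, and load stages — can be sketched in plain Python. This is a toy illustration, not code from the lessons; the record fields and values are invented:

```python
# Toy batch ETL pipeline: extract raw records, transform them into an
# analyzable shape, and load the result into a target store (a dict here).

def extract():
    """Extract: pull raw records from a source (hard-coded for the sketch)."""
    return [
        {"user": "alice", "amount": "19.99", "ts": "2024-01-05"},
        {"user": "bob",   "amount": "5.00",  "ts": "2024-01-05"},
        {"user": "alice", "amount": "7.50",  "ts": "2024-01-06"},
    ]

def transform(rows):
    """Transform: cast string amounts to numbers and aggregate per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + float(row["amount"])
    return {user: round(total, 2) for user, total in totals.items()}

def load(totals, target):
    """Load: write the transformed result into the target store."""
    target.update(totals)
    return target

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'alice': 27.49, 'bob': 5.0}
```

Real pipelines replace each stage with the tools in this course: Airflow or Prefect schedules the run, Spark does the transform at scale, and the warehouse is an actual database.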
```bash
# Build complete environment with Docker Compose
# After creating docker-compose.yml file:

# Start Airflow environment
docker compose up -d airflow-webserver airflow-scheduler

# Start Spark environment
docker compose up -d spark-master spark-worker

# Start Kafka environment
docker compose up -d zookeeper kafka
```

```bash
# Apache Airflow
pip install apache-airflow

# PySpark
pip install pyspark

# Kafka Python client
pip install confluent-kafka

# dbt
pip install dbt-core dbt-postgres

# Great Expectations
pip install great_expectations

# Prefect
pip install prefect
```

| Tool | Version |
|---|---|
| Python | 3.9+ |
| Apache Airflow | 2.7+ |
| Apache Spark | 3.4+ |
| Apache Kafka | 3.5+ |
| dbt-core | 1.6+ |
| Prefect | 2.x |
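A quick way to confirm which of these tools are installed is to query package metadata. This is a convenience sketch, not part of the course materials; the `check_versions` helper and the package list are ours:

```python
# Report which course tools are installed and at what version,
# using only the standard library (no tool imports needed).
from importlib import metadata

TOOLS = ["apache-airflow", "pyspark", "confluent-kafka",
         "dbt-core", "great_expectations", "prefect"]

def check_versions(packages=TOOLS):
    """Return {package: version string or None if not installed}."""
    results = {}
    for pkg in packages:
        try:
            results[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            results[pkg] = None  # distribution not installed
    return results

if __name__ == "__main__":
    for pkg, ver in check_versions().items():
        print(f"{pkg}: {ver or 'not installed'}")
```

Run it after the `pip install` steps above to compare your environment against the version table.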
- Data Engineering Overview: Roles, pipeline concepts
- Data Modeling: Dimensional modeling, schema design
- ETL vs ELT: Understanding data processing patterns
- Airflow Basics: Writing DAGs, scheduling
- Airflow Advanced: XCom, Sensor, dynamic DAGs
- Prefect: Modern orchestration tool
- Spark Basics: RDD, cluster architecture
- PySpark DataFrame: Using DataFrame API
- Spark Optimization: Performance tuning, join strategies
- Kafka: Real-time data streaming
- Data Lake/Warehouse: Storage architecture
- dbt: SQL-based data transformation
- Data Quality: Great Expectations, governance
- E2E Pipeline: Airflow + Spark + dbt integration
- Airflow TaskFlow API: Modern Airflow patterns
- Kafka Streams & ksqlDB: Stream processing with SQL
- Spark Structured Streaming: Real-time Spark
- CDC with Debezium: Change data capture
- Lakehouse Patterns: Production Delta Lake / Iceberg
- Dagster: Asset-based orchestration
- Data Versioning & Contracts: lakeFS, data contracts, data mesh
- Vector Storage & Indexing: FAISS, Milvus, Weaviate, Qdrant, Pinecone, Chroma
- Production Vector Search: Hybrid search, reranking, scaling, monitoring
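As a small taste of the SCD handling covered in the data modeling and lakehouse lessons, a Type 2 update can be sketched in plain Python: rather than overwriting a changed attribute, the current dimension row is closed and a new current row is appended. The field names and dates below are invented for illustration:

```python
# Toy SCD Type 2 update: preserve history by closing the current row
# (set valid_to, clear is_current) and appending a new current row.

def scd2_update(dim_rows, key, new_attrs, change_date):
    """Return a new row list with `key`'s current row closed and a fresh one added."""
    updated = []
    for row in dim_rows:
        if row["key"] == key and row["is_current"]:
            # Close the previously current version of this dimension member.
            updated.append({**row, "valid_to": change_date, "is_current": False})
        else:
            updated.append(row)
    # Append the new current version.
    updated.append({
        "key": key, **new_attrs,
        "valid_from": change_date, "valid_to": None, "is_current": True,
    })
    return updated

dim_customer = [
    {"key": "c1", "city": "Seoul", "valid_from": "2023-01-01",
     "valid_to": None, "is_current": True},
]
dim_customer = scd2_update(dim_customer, "c1", {"city": "Busan"}, "2024-06-01")
for row in dim_customer:
    print(row["city"], row["is_current"])
# Seoul False
# Busan True
```

In the lakehouse lessons the same pattern is expressed with a `MERGE` statement against a Delta Lake or Iceberg table instead of a Python list.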
- Apache Airflow Documentation
- Apache Spark Documentation
- Apache Kafka Documentation
- dbt Documentation
- Prefect Documentation
- Great Expectations Documentation
- Data_Analysis/ - NumPy, Pandas basics
- PostgreSQL/ - SQL, database fundamentals
- Docker/ - Container environment setup
- Cloud_Computing/ - Cloud services (S3, BigQuery)