End-to-End Ride-Hailing Data Pipeline using Databricks Lakeflow Spark Declarative Pipelines (SDP)
This project demonstrates a scalable, production-ready data engineering pipeline for processing ride-hailing trip data (inspired by real-world transportation scenarios such as inDrive, Uber, Careem, and Bykea).
It handles massive streaming volumes, unpredictable data arrivals, out-of-order events, and schema changes — delivering fresh, high-quality data ready for analytics, BI, and AI/ML applications such as demand forecasting, dynamic pricing, and anomaly detection.
Built with Databricks Lakeflow Spark Declarative Pipelines (SDP) on a classic Medallion Architecture (Bronze → Silver → Gold), using Auto Loader for dynamic ingestion, native CDC for incremental updates, and declarative flows to minimize code and manual orchestration.
In today's ride-hailing world — explosive growth across cities, millions of daily trips, surging demand for real-time insights on pricing, fleet optimization, and rider behavior — traditional pipelines struggle with:
- Massive streaming volumes while keeping low-latency analytics
- Out-of-order, delayed, or schema-changing data drops into cloud storage (e.g., S3)
- Scaling reliably without constant manual fixes, orchestration headaches, or full reprocessing
- Preparing clean, governed data fast enough for AI/ML models (demand forecasting, dynamic pricing, anomaly detection) amid data quality challenges
This project addresses these modern pain points in a 2025–2026 context.
Medallion Architecture with three layers:
Bronze Layer (Raw Ingestion)
- Trip CSVs (~356K records across streaming batches) + city metadata land in AWS S3 buckets (or any cloud storage).
- Auto Loader dynamically detects new files incrementally — no manual triggers needed.
- Creates streaming tables, injects audit metadata (filename, ingest timestamp).
- Rescues malformed records into a dedicated column, enabling safe schema evolution.
- Supports hybrid batch (dimensions like cities) + streaming (trip facts) modes.
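As an illustration, a bronze streaming table in SDP SQL might look like the sketch below. The bucket path, table name, and options are assumptions for illustration, not the project's actual definitions:

```sql
-- Minimal sketch of a bronze ingestion table; path and names are assumed.
CREATE OR REFRESH STREAMING TABLE bronze_trips
COMMENT "Raw trip CSVs ingested incrementally via Auto Loader"
AS SELECT
  *,
  _metadata.file_name AS source_file,   -- audit metadata: originating file
  current_timestamp() AS ingest_ts      -- audit metadata: ingestion time
FROM cloud_files(
  "s3://your-bucket/trips/",            -- assumed S3 landing path
  "csv",
  map("cloudFiles.inferColumnTypes", "true")
);
-- Malformed rows land in the _rescued_data column instead of failing the
-- pipeline, which is what allows schema drift to be handled gracefully.
```

Because the table is a streaming table, Auto Loader tracks which files it has already processed and picks up only new arrivals on each update.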
Silver Layer (Cleansed & Enriched)
- Declarative cleaning, validation, and quality rules run incrementally.
- Native CDC (AUTO CDC flows) captures changes, handles deduplication, out-of-order events, and deltas — efficient even at scale.
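The CDC step can also be expressed declaratively. A hedged sketch using the AUTO CDC flow syntax follows; the key column (`trip_id`), sequence column (`ingest_ts`), and table names are assumptions, not the project's real schema:

```sql
-- Sketch of a silver CDC flow; column and table names are assumed.
CREATE OR REFRESH STREAMING TABLE silver_trips;

CREATE FLOW silver_trips_cdc AS AUTO CDC INTO silver_trips
FROM STREAM(bronze_trips)
KEYS (trip_id)               -- primary key used for deduplication
SEQUENCE BY ingest_ts        -- resolves out-of-order and late-arriving events
STORED AS SCD TYPE 1;        -- keep only the latest version of each row
```

`SEQUENCE BY` is what makes out-of-order arrivals safe: a change is applied only if its sequence value is newer than the version already stored for that key.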
Gold Layer (Business-Ready)
- Joins fact tables (trips enriched with dates, ratings, revenue) to dimension tables (cities, dates for time-based slicing).
- Produces denormalized, materialized views optimized for fast queries and region-specific analytics.
- Purely declarative ("what" the data should look like, not procedural "how") → ~60% less code, zero manual orchestration
- Built-in auto-retries, checkpoints, scaling, and continuous streaming — triggers automatically as soon as new S3 data arrives
- Truly dynamic & stable pipeline: Handles incremental + CDC natively for near-real-time freshness without custom jobs or brittleness
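A gold-layer materialized view joining trip facts to the city and date dimensions could be sketched like this (all column and table names are illustrative assumptions):

```sql
-- Sketch of a denormalized gold view; the schema is assumed for illustration.
CREATE OR REFRESH MATERIALIZED VIEW gold_city_daily_metrics
AS SELECT
  c.city_name,
  d.calendar_date,
  COUNT(*)      AS total_rides,
  SUM(t.fare)   AS total_revenue,
  AVG(t.rating) AS avg_rating
FROM silver_trips t
JOIN dim_cities c ON t.city_id  = c.city_id
JOIN dim_dates  d ON t.trip_date = d.calendar_date
GROUP BY c.city_name, d.calendar_date;
```

Declaring it as a materialized view lets the pipeline keep it incrementally refreshed as upstream silver tables change, rather than rebuilding it with a custom job.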
A resilient, low-maintenance pipeline that:
- Ingests and processes huge trip streams incrementally
- Stays fresh with near-real-time updates
- Delivers high-quality gold data primed for:
  - AI-driven predictions
  - Data science experiments
  - Demand forecasting
  - Advanced BI/analytics
- Scales seamlessly as ride volumes grow!
    Transportation_DataEngineering_PipeLine/
    ├── project_setup.ipynb          # Initial setup: Unity Catalog, schemas, volumes, sample data upload
    ├── transportation_pipeline/
    │   └── transformations/         # SDP pipeline definitions (flows, tables, materialized views)
    └── README.md                    # This file
- Databricks (Free Edition compatible)
- Lakeflow Spark Declarative Pipelines (SDP)
- Auto Loader (`cloud_files`)
- Delta Lake / Unity Catalog
- AWS S3 (source storage)
- Spark Streaming + CDC
- Medallion Architecture (Bronze, Silver, Gold)
Prerequisites
- Databricks workspace (Free Edition works)
- Unity Catalog enabled
- AWS S3 bucket with sample trip CSVs and city metadata
Setup
Run `project_setup.ipynb` to create catalogs, schemas (bronze/silver/gold), volumes, and upload sample data.
Build & Run Pipeline
- Define SDP flows in the `transformations` folder (or use the Databricks UI Advanced mode)
- Create a Lakeflow Pipeline referencing your source files
- Run in continuous mode for streaming ingestion
Explore
Query gold-layer views for region-specific insights (e.g., rides, revenue, ratings by city/date).
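For example, a region-specific query against a gold view could look like the following; the view and column names here are illustrative assumptions, so adapt them to your actual gold-layer objects:

```sql
-- Illustrative query; replace names with your actual gold-layer objects.
SELECT city_name, calendar_date, total_rides, total_revenue, avg_rating
FROM gold_city_daily_metrics
WHERE city_name = 'Karachi'
ORDER BY calendar_date DESC
LIMIT 30;
```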
Built following best practices from Databricks Lakeflow tutorials (medallion + Auto Loader + CDC patterns). Great for portfolios, interviews, or real-world ride-hailing data operations.
Feel free to fork, open issues, or submit PRs (e.g., add sample data, diagrams, deployment scripts)!
MIT License — free to use, modify, and share.
⭐ Star the repo if this helps your data engineering journey!
#Databricks #DataEngineering #MedallionArchitecture #SparkDeclarativePipelines #AutoLoader #CDC #BigData #AIReadyData #RideHailing