A technical comparison of two modern data‑engineering approaches for processing customer, orders, and product datasets through an MDM‑aligned pipeline.
This repository compares two modern approaches for processing customer, orders, and product datasets using Master Data Management (MDM) principles.
Both pipelines implement the same RAW → BRONZE → SILVER → GOLD logic, but with different technologies:
- Azure Data Factory (ADF) Data Flows
- Databricks PySpark Notebooks
The goal is to show how each platform handles identical transformations, highlight strengths and limitations, and provide a clear architectural comparison for data engineers and architects.
Three sample datasets are used throughout both pipelines:
- customers.csv
- orders.csv
- products.csv
Each dataset contains intentional inconsistencies to demonstrate schema drift handling, deduplication, and identity resolution.
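Schema drift means later files can add, drop, or reorder columns. As a lightweight illustration (outside either platform, with hypothetical file contents), drift‑tolerant ingestion can be sketched in plain Python: collect the superset of columns seen so far and fill missing fields with nulls.

```python
import csv
import io

def read_with_drift(files):
    """Union CSV files whose headers may differ (schema drift).

    Builds the superset of columns across all files and fills
    missing fields with None, mirroring what a drift-tolerant
    ingestion step produces.
    """
    rows, columns = [], []
    for f in files:
        for row in csv.DictReader(f):
            for col in row:
                if col not in columns:
                    columns.append(col)  # extend schema as new columns appear
            rows.append(row)
    # Re-project every row onto the full (drifted) schema
    return columns, [{c: r.get(c) for c in columns} for r in rows]

# Two hypothetical extracts of customers.csv: the second adds an email column
day1 = io.StringIO("customer_id,name\n1,Ana\n")
day2 = io.StringIO("customer_id,name,email\n2,Bo,bo@example.com\n")
cols, rows = read_with_drift([day1, day2])
```

Rows from the first extract simply carry `email = None`, so downstream layers see one stable schema.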
ADF is used to implement the pipeline visually through Data Flows.
Key responsibilities:
- RAW → BRONZE processing
- Schema drift handling
- Column standardization
- Type casting
- Basic deduplication using window logic
- Landing Delta files
This approach demonstrates UI‑driven ETL/ELT with minimal coding.
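The column standardization and type casting that ADF performs visually can be sketched in plain Python. The column names and casting rules below are illustrative, not the repo's actual mapping:

```python
def standardize(record, casts):
    """Normalize column names (snake_case, trimmed) and cast types.

    `casts` maps a standardized column name to a casting function;
    values that fail to cast become None instead of failing the row.
    """
    out = {}
    for key, value in record.items():
        col = key.strip().lower().replace(" ", "_")  # "Customer ID" -> "customer_id"
        cast = casts.get(col)
        if cast is None:
            out[col] = value
        else:
            try:
                out[col] = cast(value)
            except (TypeError, ValueError):
                out[col] = None  # quarantine bad values as nulls
    return out

# Hypothetical raw record with inconsistent headers and string-typed numbers
raw = {"Customer ID": "42", "Order Total ": "19.99", "Notes": "vip"}
clean = standardize(raw, {"customer_id": int, "order_total": float})
```

In ADF the same effect comes from Select and Derived Column transformations; the sketch just makes the rule explicit.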
Databricks implements the same logic using PySpark notebooks.
Key responsibilities:
- BRONZE → SILVER → GOLD transformations
- Window‑based dedupe
- Identity resolution
- Null handling
- Business rule enforcement
- Delta Lake optimization
This approach demonstrates code‑driven ELT with full control and transparency.
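The window‑based dedupe follows the standard PySpark pattern of `row_number()` over a partition ordered by recency, keeping only row number 1. A plain‑Python equivalent (with a hypothetical `updated_at` column) makes the logic concrete:

```python
def dedupe_latest(rows, key, order_by):
    """Keep the most recent row per key.

    Equivalent in spirit to PySpark's
    row_number().over(Window.partitionBy(key).orderBy(desc(order_by)))
    followed by filtering to row_number == 1.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row  # newer record wins
    return list(latest.values())

# Hypothetical orders with a duplicated order_id
orders = [
    {"order_id": 1, "status": "pending", "updated_at": "2024-01-01"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-03"},
    {"order_id": 2, "status": "pending", "updated_at": "2024-01-02"},
]
golden = dedupe_latest(orders, key="order_id", order_by="updated_at")
```

Only the shipped version of order 1 survives, matching latest‑record survivorship.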
Both pipelines enforce core Master Data Management principles:
- Golden-record keys: customer_id, order_id, product_id
- Deduplication: partition-based dedupe with latest-record survivorship (ADF: window + rowNumber; Databricks: advanced window logic)
- Standardization: consistent column naming, type enforcement, and schema drift handling
- Identity resolution: performed in the GOLD layer, combining survivorship, business rules, and validated attributes
- Lineage and auditability: ADF pipeline logs; Databricks Delta Lake versioning (time travel)
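Survivorship in the GOLD layer can go beyond "latest row wins": a common rule is to keep the latest non‑null value per attribute so that nulls in newer records do not erase known data. A sketch of that merge, with hypothetical customer attributes rather than the repo's exact rules:

```python
def merge_golden(records, order_by):
    """Build a golden record from duplicates of one entity.

    Records are applied oldest-to-newest; each non-null value
    overwrites the previous one, so the result keeps the latest
    non-null value per attribute.
    """
    golden = {}
    for rec in sorted(records, key=lambda r: r[order_by]):
        for col, val in rec.items():
            if val is not None:
                golden[col] = val  # nulls never overwrite known values
    return golden

# Hypothetical duplicate customer records from two loads
dupes = [
    {"customer_id": 7, "email": "a@x.com", "phone": None, "updated_at": "2024-01-01"},
    {"customer_id": 7, "email": None, "phone": "555-0100", "updated_at": "2024-02-01"},
]
golden = merge_golden(dupes, order_by="updated_at")
```

The golden record keeps the older email and the newer phone number, which plain latest‑record dedupe would have lost.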
```
├── README.md
│
├── datasets/
│   ├── customers.csv
│   ├── orders.csv
│   └── products.csv
│
├── pipelines/
│   ├── df_mini_screenshot.jpg
│   ├── pipeline.json
│   ├── dataflow_raw_to_bronze.json
│   ├── explanation.md
│   └── pipeline2_screenshot.jpg
│
└── comparison/
    ├── comparison.md
    └── mdm_principles.md
```
Azure Data Factory (ADF)

Strengths:
- Excellent for ingestion
- Handles schema drift
- Visual transformations
- Easy orchestration

Limitations:
- UI inconsistencies
- Harder debugging
- Limited support for complex transformation logic
Databricks (PySpark)

Strengths:
- Full control with PySpark
- Advanced window functions
- Delta Lake optimization
- Ideal for SILVER/GOLD logic

Limitations:
- Requires coding skills
- More setup effort
This repository is designed for:
- data engineers
- architects
- students
- interview preparation
- portfolio demonstration
It shows how two major Azure tools solve the same MDM problem in different ways.
To run the project:
- Upload the RAW CSVs to your storage account
- Import ADF pipeline and Data Flows
- Import Databricks notebooks
- Run each pipeline independently
- Compare BRONZE, SILVER, and GOLD outputs
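Assuming you export each layer to CSV for inspection (file names here are hypothetical), a quick parity check between the two pipelines' outputs can compare row counts and key sets:

```python
import csv
import io

def layer_parity(layer_a, layer_b, key):
    """Compare two exports of the same layer by row count and key set."""
    def keys(f):
        return [row[key] for row in csv.DictReader(f)]
    a, b = keys(layer_a), keys(layer_b)
    return {
        "count_match": len(a) == len(b),
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }

# Hypothetical SILVER exports from each pipeline (in-memory for the demo;
# pass open file handles for real exports)
adf_out = io.StringIO("customer_id\n1\n2\n3\n")
dbx_out = io.StringIO("customer_id\n2\n3\n4\n")
report = layer_parity(adf_out, dbx_out, key="customer_id")
```

Any keys appearing on only one side point at a difference in dedupe or survivorship behavior worth investigating.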
MIT License