A technical comparison of two modern data‑engineering approaches for processing customer, orders, and product datasets through an MDM‑aligned pipeline.
This repository compares two modern approaches for processing customer, orders, and product datasets using Master Data Management (MDM) principles.
Both pipelines implement the same RAW → BRONZE → SILVER → GOLD logic, but with different technologies:
- Azure Data Factory (ADF) Data Flows
- Databricks PySpark Notebooks
The goal is to show how each platform handles identical transformations, highlight strengths and limitations, and provide a clear architectural comparison for data engineers and architects.
Three sample datasets are used throughout both pipelines:
- customers.csv
- orders.csv
- products.csv
Each dataset contains intentional inconsistencies to demonstrate schema drift handling, deduplication, and identity resolution.
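Schema drift means later files can add, drop, or reorder columns. As a lightweight illustration (outside either platform, with hypothetical file contents), drift‑tolerant ingestion can be sketched in plain Python: collect the superset of columns seen so far and fill missing fields with nulls.

```python
import csv
import io

def read_with_drift(files):
    """Union CSV files whose headers may differ (schema drift).

    Builds the superset of columns across all files and fills
    missing fields with None, mirroring what a drift-tolerant
    ingestion step produces.
    """
    rows, columns = [], []
    for f in files:
        for row in csv.DictReader(f):
            for col in row:
                if col not in columns:
                    columns.append(col)  # extend schema as new columns appear
            rows.append(row)
    # Re-project every row onto the full (drifted) schema
    return columns, [{c: r.get(c) for c in columns} for r in rows]

# Two hypothetical extracts of customers.csv: the second adds an email column
day1 = io.StringIO("customer_id,name\n1,Ana\n")
day2 = io.StringIO("customer_id,name,email\n2,Bo,bo@example.com\n")
cols, rows = read_with_drift([day1, day2])
```

Rows from the first extract simply carry `email = None`, so downstream layers see one stable schema.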
ADF is used to implement the pipeline visually through Data Flows.
Key responsibilities:
- RAW → BRONZE processing
- Schema drift handling
- Column standardization
- Type casting
- Basic deduplication using window logic
- Landing Delta files
This approach demonstrates UI‑driven ETL/ELT with minimal coding.
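The column standardization and type casting that ADF performs visually can be sketched in plain Python. The column names and casting rules below are illustrative, not the repo's actual mapping:

```python
def standardize(record, casts):
    """Normalize column names (snake_case, trimmed) and cast types.

    `casts` maps a standardized column name to a casting function;
    values that fail to cast become None instead of failing the row.
    """
    out = {}
    for key, value in record.items():
        col = key.strip().lower().replace(" ", "_")  # "Customer ID" -> "customer_id"
        cast = casts.get(col)
        if cast is None:
            out[col] = value
        else:
            try:
                out[col] = cast(value)
            except (TypeError, ValueError):
                out[col] = None  # quarantine bad values as nulls
    return out

# Hypothetical raw record with inconsistent headers and string-typed numbers
raw = {"Customer ID": "42", "Order Total ": "19.99", "Notes": "vip"}
clean = standardize(raw, {"customer_id": int, "order_total": float})
```

In ADF the same effect comes from Select and Derived Column transformations; the sketch just makes the rule explicit.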
Databricks implements the same logic using PySpark notebooks.
Key responsibilities:
- BRONZE → SILVER → GOLD transformations
- Window‑based dedupe
- Identity resolution
- Null handling
- Business rule enforcement
- Delta Lake optimization
This approach demonstrates code‑driven ELT with full control and transparency.
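The window‑based dedupe follows the standard PySpark pattern of `row_number()` over a partition ordered by recency, keeping only row number 1. A plain‑Python equivalent (with a hypothetical `updated_at` column) makes the logic concrete:

```python
def dedupe_latest(rows, key, order_by):
    """Keep the most recent row per key.

    Equivalent in spirit to PySpark's
    row_number().over(Window.partitionBy(key).orderBy(desc(order_by)))
    followed by filtering to row_number == 1.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row  # newer record wins
    return list(latest.values())

# Hypothetical orders with a duplicated order_id
orders = [
    {"order_id": 1, "status": "pending", "updated_at": "2024-01-01"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-03"},
    {"order_id": 2, "status": "pending", "updated_at": "2024-01-02"},
]
golden = dedupe_latest(orders, key="order_id", order_by="updated_at")
```

Only the shipped version of order 1 survives, matching latest‑record survivorship.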
Both pipelines enforce core Master Data Management principles:
- Golden-record keys: customer_id, order_id, product_id
- Deduplication: partition-based dedupe with latest-record survivorship (ADF: window + rowNumber; Databricks: advanced window logic)
- Standardization: consistent column naming, type enforcement, and schema drift handling
- Identity resolution: performed in the GOLD layer, combining survivorship, business rules, and validated attributes
- Lineage and auditability: ADF pipeline logs; Databricks Delta Lake versioning (time travel)
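Survivorship in the GOLD layer can go beyond "latest row wins": a common rule is to keep the latest non‑null value per attribute so that nulls in newer records do not erase known data. A sketch of that merge, with hypothetical customer attributes rather than the repo's exact rules:

```python
def merge_golden(records, order_by):
    """Build a golden record from duplicates of one entity.

    Records are applied oldest-to-newest; each non-null value
    overwrites the previous one, so the result keeps the latest
    non-null value per attribute.
    """
    golden = {}
    for rec in sorted(records, key=lambda r: r[order_by]):
        for col, val in rec.items():
            if val is not None:
                golden[col] = val  # nulls never overwrite known values
    return golden

# Hypothetical duplicate customer records from two loads
dupes = [
    {"customer_id": 7, "email": "a@x.com", "phone": None, "updated_at": "2024-01-01"},
    {"customer_id": 7, "email": None, "phone": "555-0100", "updated_at": "2024-02-01"},
]
golden = merge_golden(dupes, order_by="updated_at")
```

The golden record keeps the older email and the newer phone number, which plain latest‑record dedupe would have lost.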
```
├── README.md
│
├── datasets/
│   ├── customers.csv
│   ├── orders.csv
│   └── products.csv
│
├── pipelines/
│   ├── df_mini_screenshot.jpg
│   ├── pipeline.json
│   ├── dataflow_raw_to_bronze.json
│   ├── explanation.md
│   └── pipeline2_screenshot.jpg
│
└── comparison/
    ├── comparison.md
    └── mdm_principles.md
```
Azure Data Factory (ADF)

Strengths:
- Excellent for ingestion
- Handles schema drift
- Visual transformations
- Easy orchestration

Limitations:
- UI inconsistencies
- Harder debugging
- Limited support for complex transformation logic
Databricks (PySpark)

Strengths:
- Full control with PySpark
- Advanced window functions
- Delta Lake optimization
- Ideal for SILVER/GOLD logic

Limitations:
- Requires coding skills
- More setup effort
This repository is designed for:
- data engineers
- architects
- students
- interview preparation
- portfolio demonstration
It shows how two major Azure tools solve the same MDM problem in different ways.
To run the project:
- Upload the RAW CSVs to your storage account
- Import ADF pipeline and Data Flows
- Import Databricks notebooks
- Run each pipeline independently
- Compare BRONZE, SILVER, and GOLD outputs
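Assuming you export each layer to CSV for inspection (file names here are hypothetical), a quick parity check between the two pipelines' outputs can compare row counts and key sets:

```python
import csv
import io

def layer_parity(layer_a, layer_b, key):
    """Compare two exports of the same layer by row count and key set."""
    def keys(f):
        return [row[key] for row in csv.DictReader(f)]
    a, b = keys(layer_a), keys(layer_b)
    return {
        "count_match": len(a) == len(b),
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }

# Hypothetical SILVER exports from each pipeline (in-memory for the demo;
# pass open file handles for real exports)
adf_out = io.StringIO("customer_id\n1\n2\n3\n")
dbx_out = io.StringIO("customer_id\n2\n3\n4\n")
report = layer_parity(adf_out, dbx_out, key="customer_id")
```

Any keys appearing on only one side point at a difference in dedupe or survivorship behavior worth investigating.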
MIT License