Comprehensive ETL + Data Cleaning + Dimensional Modeling + SQL Insights + Power BI Dashboard
This project demonstrates a complete Data Engineering workflow, starting from raw scraped Amazon product data and ending with a fully modeled PostgreSQL Data Warehouse, analytical SQL insights, and a Power BI dashboard. The goal is to simulate how a real data engineering pipeline ingests, cleans, transforms, models, loads, and analyzes product data.
```
📦 Amazon-Sales-Data-Engineering-Project
├── README.md
├── .gitignore
├── requirements.txt
│
├── data/
│   ├── processed/
│   │   ├── fact_product_snapshot.csv
│   │   ├── bridge_product_category.csv
│   │   ├── dim_category.csv
│   │   └── dim_product.csv
│   │
│   └── raw/
│       └── amazon.csv
│
├── src/
│   └── pipeline.py
│
├── sql/
│   ├── queries.sql
│   └── create_tables.sql
│
├── notebooks/
│   └── Amazon Sales.ipynb
│
└── reports/
    ├── Amazon Sales Dashboard.pbix
    └── states_report.md
```
🛠 Tech Stack

| Component | Technology |
|---|---|
| Language | Python 3.10 |
| Data Processing | pandas |
| Database | PostgreSQL |
| Visualization | Power BI Desktop |
| Pipeline | Custom Python ETL |
| Documentation | Markdown, Jupyter |
🧹 ETL & Data Cleaning

Performed in `src/pipeline.py` and in the `notebooks/Amazon Sales.ipynb` notebook:

- Loaded the raw data
- Removed duplicates
- Dropped rows missing critical identifiers
- Cleaned and validated numeric fields
- Enforced consistent pricing logic
- Extracted the hierarchical categories
- Computed category depth and `cat_leaf`
- Built the dimensional tables
- Exported the processed CSVs
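The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal sketch, not the exact `pipeline.py` implementation: the column names (`product_id`, `discounted_price`, `actual_price`, `category`) and the `|`-separated category format are assumptions about the raw `amazon.csv`, and the pricing rule shown is one plausible reading of "consistent pricing logic".

```python
import pandas as pd

def clean_products(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the cleaning steps; column names are assumptions."""
    # Remove duplicates and rows missing the critical identifier.
    df = df.drop_duplicates().dropna(subset=["product_id"]).copy()
    # Numeric fields arrive as strings like "₹1,099" -- strip symbols.
    for col in ["discounted_price", "actual_price"]:
        df[col] = (df[col].astype(str)
                          .str.replace("₹", "", regex=False)
                          .str.replace(",", "", regex=False))
        df[col] = pd.to_numeric(df[col], errors="coerce")
    # Assumed pricing rule: a discounted price never exceeds list price.
    df = df[df["discounted_price"] <= df["actual_price"]].copy()
    # Hierarchical categories come as "A|B|C" -- derive depth and leaf.
    parts = df["category"].str.split("|")
    df["category_depth"] = parts.str.len()
    df["cat_leaf"] = parts.str[-1]
    return df
```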
🧱 Dimensional Model

📌 Fact Table: `fact_product_snapshot`
- Stores product snapshot metrics (prices, ratings, …) together with a snapshot date.
📌 Dimension Tables
- `dim_product`: product attributes
- `dim_category`: category hierarchy
- `bridge_product_category`: resolves the many-to-many product-to-category relationship
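Deriving `dim_category` and the bridge table from `|`-separated category paths can be sketched like this; the input columns and the integer surrogate-key scheme are assumptions for illustration, not the exact `pipeline.py` implementation:

```python
import pandas as pd

def build_dimensions(df: pd.DataFrame):
    """Sketch: derive the category dimension and product-category bridge."""
    # One row per (product, category level) pair.
    pairs = (df.assign(category=df["category"].str.split("|"))
               .explode("category")[["product_id", "category"]]
               .drop_duplicates())
    # Assumed surrogate keys: sequential integers per distinct category.
    dim_category = (pairs[["category"]].drop_duplicates()
                    .reset_index(drop=True))
    dim_category["category_id"] = dim_category.index + 1
    # Bridge table maps products to category surrogate keys.
    bridge = pairs.merge(dim_category, on="category")[
        ["product_id", "category_id"]]
    return dim_category, bridge
```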
📊 SQL Insights

All SQL queries and their results are documented in:

- `reports/states_report.md`
- `sql/queries.sql`

Analyses include:
- Best categories
- Discount vs rating
- Price segmentation
- Hidden gems
- Weak categories
- Platform-wide metrics
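To give a flavor of the "discount vs rating" analysis, here is a hypothetical pandas equivalent over a toy snapshot; the real queries live in `sql/queries.sql`, and the column names and figures below are made up for illustration:

```python
import pandas as pd

# Toy snapshot data -- values are illustrative only.
snapshot = pd.DataFrame({
    "cat_leaf": ["Cables", "Cables", "Headphones", "Headphones"],
    "discount_pct": [0.60, 0.50, 0.20, 0.30],
    "rating": [4.1, 4.3, 4.6, 4.4],
})

# Average discount and rating per leaf category, best-rated first.
by_cat = (snapshot.groupby("cat_leaf")
                  .agg(avg_discount=("discount_pct", "mean"),
                       avg_rating=("rating", "mean"))
                  .sort_values("avg_rating", ascending=False))
print(by_cat)
```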
📈 Power BI Dashboard

An interactive Power BI dashboard is available at `reports/Amazon Sales Dashboard.pbix`.
🚀 How to Run

1. Install dependencies: `pip install -r requirements.txt`
2. Run the ETL pipeline: `python src/pipeline.py`
3. Create the schema by running `sql/create_tables.sql` against PostgreSQL
4. Load the processed CSV data into the warehouse
5. Run the analytics in `sql/queries.sql`
🔮 Future Improvements

- Airflow DAG orchestration
- YAML configs
- Cloud data-warehouse migration
- API endpoints
- Unit testing
- More visualizations
✅ Summary

This project demonstrates an end-to-end data engineering workflow: from raw CSV ingestion and rigorous data cleaning, through dimensional modeling in PostgreSQL, to SQL analytics and BI reporting with Power BI.
It can serve as a template for similar retail/e‑commerce analytics projects or as a portfolio piece to showcase practical data engineering skills.