🛒 Amazon Sales — Data Engineering Project

Comprehensive ETL + Data Cleaning + Dimensional Modeling + SQL Insights + Power BI Dashboard

⭐ 1. Project Overview

This project demonstrates a complete Data Engineering workflow, starting from raw scraped Amazon product data and ending with a fully modeled PostgreSQL Data Warehouse, analytical SQL insights, and a Power BI dashboard. The goal is to simulate how a real data engineering pipeline ingests, cleans, transforms, models, loads, and analyzes product data.

📂 2. Project Structure

📦 Amazon-Sales-Data-Engineering-Project

├── README.md
├── .gitignore
├── requirements.txt
│
├── data/
│   ├── processed/
│   │   ├── fact_product_snapshot.csv
│   │   ├── bridge_product_category.csv
│   │   ├── dim_category.csv
│   │   └── dim_product.csv
│   │
│   └── raw/
│       └── amazon.csv
│
├── src/
│   └── pipeline.py
│
├── sql/
│   ├── queries.sql
│   └── create_tables.sql
│
├── notebooks/
│   └── Amazon Sales.ipynb
│
├── reports/
│   ├── Amazon Sales Dashboard.pbix
│   └── states_report.md

🛠 3. Tools & Technologies

Component	Technology
Language	Python 3.10
Data Processing	pandas
Database	PostgreSQL
Visualization	Power BI Desktop
Pipeline	Custom Python ETL
Documentation	Markdown, Jupyter

🧹 4. Data Preparation & Cleaning

The following steps are performed in src/pipeline.py and in the notebook (notebooks/Amazon Sales.ipynb):
Loaded raw data
Removed duplicates
Dropped rows missing critical identifiers
Cleaned and validated numeric fields
Enforced consistent pricing logic
Extracted hierarchical categories
Computed category depth and the leaf category (cat_leaf)
Built dimensional tables
Exported processed CSVs
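
As a rough illustration of the cleaning steps above, here is a minimal, dependency-free sketch. It is not the code in pipeline.py (which uses pandas). The column names (product_id, actual_price, discounted_price, category), the "|" category separator, the currency-prefixed price strings, and the rule that a discounted price may not exceed the actual price are all assumptions made for this example.

```python
import re


def parse_price(raw):
    """Turn a scraped price string such as '₹1,099' into a float, or None if unusable."""
    if raw is None:
        return None
    digits = re.sub(r"[^0-9.]", "", str(raw))  # drop currency symbols and thousands separators
    try:
        return float(digits)
    except ValueError:
        return None


def split_category(path, sep="|"):
    """Split a hierarchical category path into its levels, depth, and leaf."""
    levels = [part.strip() for part in path.split(sep) if part.strip()]
    return {
        "levels": levels,
        "cat_depth": len(levels),
        "cat_leaf": levels[-1] if levels else None,
    }


def clean_rows(rows):
    """Drop rows without an id, duplicates, and rows with invalid or inconsistent prices."""
    seen, cleaned = set(), []
    for row in rows:
        pid = (row.get("product_id") or "").strip()
        if not pid or pid in seen:
            continue  # missing critical identifier, or duplicate
        actual = parse_price(row.get("actual_price"))
        discounted = parse_price(row.get("discounted_price"))
        # Assumed pricing rule: both prices must parse and discounted <= actual.
        if actual is None or discounted is None or discounted > actual:
            continue
        seen.add(pid)
        cleaned.append({
            "product_id": pid,
            "actual_price": actual,
            "discounted_price": discounted,
            **split_category(row.get("category") or ""),
        })
    return cleaned
```

Each cleaned row carries cat_depth and cat_leaf alongside the numeric prices, which is the shape the dimensional tables are then built from.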

🧱 5. Data Warehouse Schema

📌 Fact Table: fact_product_snapshot

Stores per-product snapshot metrics (prices, ratings…) together with the snapshot date.

📌 Dimension Tables

dim_product
dim_category
bridge_product_category (bridge table linking products to their categories)
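
To make the relationships between these tables concrete, the sketch below builds a toy version of the schema in an in-memory SQLite database, used here only as a stand-in for PostgreSQL so the example runs without a server. The column names, keys, and sample rows are hypothetical; the actual definitions live in sql/create_tables.sql.

```python
import sqlite3

# Hypothetical columns; only the four table names come from the project.
DDL = """
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_id   TEXT NOT NULL UNIQUE,
    product_name TEXT
);
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL UNIQUE
);
CREATE TABLE bridge_product_category (
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    category_key INTEGER NOT NULL REFERENCES dim_category(category_key),
    level        INTEGER NOT NULL,
    PRIMARY KEY (product_key, category_key)
);
CREATE TABLE fact_product_snapshot (
    product_key      INTEGER NOT NULL REFERENCES dim_product(product_key),
    snapshot_date    TEXT NOT NULL,
    actual_price     REAL,
    discounted_price REAL,
    rating           REAL,
    PRIMARY KEY (product_key, snapshot_date)
);
"""


def build_demo_warehouse():
    """Create the toy schema and load one made-up product with two category levels."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(DDL)
    conn.execute("INSERT INTO dim_product VALUES (1, 'B01', 'USB cable')")
    conn.executemany(
        "INSERT INTO dim_category VALUES (?, ?)",
        [(1, "Computers&Accessories"), (2, "Cables")],
    )
    conn.executemany(
        "INSERT INTO bridge_product_category VALUES (?, ?, ?)",
        [(1, 1, 1), (1, 2, 2)],
    )
    conn.execute(
        "INSERT INTO fact_product_snapshot VALUES (1, '2024-01-01', 1099.0, 399.0, 4.2)"
    )
    return conn


def categories_with_prices(conn):
    """Join fact -> bridge -> dim_category to list each category level with the price."""
    return conn.execute(
        """
        SELECT c.category_name, f.discounted_price
        FROM fact_product_snapshot AS f
        JOIN bridge_product_category AS b ON b.product_key = f.product_key
        JOIN dim_category AS c ON c.category_key = b.category_key
        ORDER BY b.level
        """
    ).fetchall()
```

Because a product can sit at several levels of the category hierarchy, the join returns one row per category level, which is why category-level aggregates go through the bridge table rather than a single foreign key on the fact table.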

📊 6. SQL Insights & Analytics

All SQL queries and their results are documented in:

reports/states_report.md
sql/queries.sql

Analyses include:
Best categories
Discount vs rating
Price segmentation
Hidden gems
Weak categories
Platform-wide metrics
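
For readers unfamiliar with two of the analyses listed above, here is a small Python sketch of what price segmentation and a "hidden gems" filter might look like. The project implements its analyses in SQL, so this is only an illustration: the segment cut-offs, the rating thresholds, and the rating_count field are hypothetical and may differ from the definitions used in sql/queries.sql.

```python
def price_segment(price, bounds=(500, 2000)):
    """Label a price as 'budget', 'mid', or 'premium' using hypothetical cut-offs."""
    low, high = bounds
    if price < low:
        return "budget"
    if price < high:
        return "mid"
    return "premium"


def hidden_gems(products, min_rating=4.5, max_reviews=100):
    """Return ids of products that are highly rated but have few ratings (assumed definition)."""
    return [
        p["product_id"]
        for p in products
        if p["rating"] >= min_rating and p["rating_count"] <= max_reviews
    ]


# Made-up sample rows, only to show the call shape.
sample = [
    {"product_id": "B01", "rating": 4.7, "rating_count": 42},
    {"product_id": "B02", "rating": 4.8, "rating_count": 12000},
    {"product_id": "B03", "rating": 3.9, "rating_count": 15},
]
gems = hidden_gems(sample)
```

In SQL, the same ideas would typically be expressed as a CASE expression for the segments and a WHERE filter for the gems.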

📈 7. Dashboard

Interactive Power BI dashboard: reports/Amazon Sales Dashboard.pbix

▶️ 8. How to Run the Pipeline

Install dependencies: pip install -r requirements.txt
Run the ETL: python src/pipeline.py
Create the schema: run sql/create_tables.sql against your PostgreSQL database
Load the processed CSVs from data/processed/ into the warehouse
Run the analytics queries: sql/queries.sql
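
The CSV loading step is not spelled out above. One possible approach is psql's \copy meta-command, and the sketch below generates one command per processed file. It assumes that each table is named after its CSV file, that the CSVs have a header row, and that dimensions should be loaded before the bridge and fact tables; none of this is confirmed by sql/create_tables.sql, so adjust as needed.

```python
from pathlib import PurePosixPath

# Assumed load order: dimension tables first, then the tables that reference them.
LOAD_ORDER = [
    "dim_product",
    "dim_category",
    "bridge_product_category",
    "fact_product_snapshot",
]


def copy_commands(processed_dir="data/processed"):
    """Build one psql \\copy command per processed CSV, in dependency order."""
    commands = []
    for table in LOAD_ORDER:
        csv_path = PurePosixPath(processed_dir) / f"{table}.csv"
        commands.append(
            f"\\copy {table} FROM '{csv_path}' WITH (FORMAT csv, HEADER true)"
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be pasted into a psql session.
    for command in copy_commands():
        print(command)
```

The printed commands can be run from a psql session connected to the target database, started from the repository root so the relative paths resolve.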

🚀 9. Improvements

Airflow DAG
YAML configs
Cloud DW migration
API endpoints
Unit testing
More visualizations

🏁 10. Final Thoughts

This project demonstrates an end-to-end data engineering workflow: from raw CSV ingestion and rigorous data cleaning, through dimensional modeling in PostgreSQL, to SQL analytics and BI reporting with Power BI.
It can serve as a template for similar retail/e‑commerce analytics projects or as a portfolio piece to showcase practical data engineering skills.
