🚀 Docker Data Pipeline Project

📌 Overview

This project demonstrates a simple but realistic data engineering pipeline using Docker, PostgreSQL, and Jupyter Notebook. The goal is to simulate a real workflow where raw data is ingested, transformed, loaded into a database, and analyzed — all inside isolated Docker containers.

The project is fully containerized using Docker Compose, making it portable and easy to run on any machine.


📁 Project Structure

docker_data_pipeline/
│
├── data/
│   └── raw/
│       └── [dataset.csv]
│
├── jupyter/
│   └── Dockerfile
│
├── notebooks/
│   ├── etl_pipeline.ipynb
│   └── .ipynb_checkpoints/
│         └── etl_pipeline-checkpoint.ipynb
│
├── postgres/
│   └── Dockerfile
│
└── docker-compose.yml

🔹 /data/raw/

Contains the raw dataset used by the ETL pipeline.

The dataset used in this project is the Online Retail dataset from Kaggle.

🔹 /jupyter/Dockerfile

Builds a Jupyter Notebook environment with all required libraries:

  • pandas
  • psycopg2-binary
  • SQLAlchemy
  • matplotlib
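A minimal Dockerfile along these lines would do the job; the base image and the absence of version pins here are assumptions for illustration, and the actual jupyter/Dockerfile in the repo is authoritative:

```dockerfile
# Assumed base image: the official Jupyter base notebook image
FROM jupyter/base-notebook:latest

# Install the libraries the ETL notebook needs
RUN pip install --no-cache-dir pandas psycopg2-binary SQLAlchemy matplotlib
```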

🔹 /notebooks/etl_pipeline.ipynb

The main notebook that performs:

  • Data loading (from data/raw/)
  • Cleaning & preprocessing
  • Transformations
  • Loading the cleaned dataset into PostgreSQL
  • Basic analysis & visualizations
  • Creating the retail_dw schema, dimension tables, and fact table, and loading them using SQLAlchemy
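The cleaning step can be sketched as below. The input column names (CustomerID, Quantity, UnitPrice) follow the Online Retail dataset; the exact filters the notebook applies may differ, so treat this as an illustrative sketch rather than the notebook's code:

```python
import pandas as pd

def clean_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Basic cleaning for the Online Retail dataset (column names assumed)."""
    out = df.copy()
    # Drop rows without a customer, and cancelled / zero-priced lines
    out = out.dropna(subset=["CustomerID"])
    out = out[out["Quantity"] > 0]
    out = out[out["UnitPrice"] > 0]
    # Derive the line total later used by the fact table
    out["total_price"] = out["Quantity"] * out["UnitPrice"]
    return out

# Tiny illustrative sample (not the real dataset)
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536366"],
    "CustomerID": [17850.0, None, 13047.0],
    "Quantity": [6, -1, 8],
    "UnitPrice": [2.55, 4.25, 1.85],
})
cleaned = clean_retail(sample)
print(len(cleaned), cleaned["total_price"].sum())
```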

🔹 /postgres/Dockerfile

Uses the official postgres:16 image to create a clean PostgreSQL instance.

🔹 docker-compose.yml

Orchestration file that runs the full pipeline:

  • PostgreSQL container
  • Jupyter Notebook container (linked to database)
  • Shared volume for data & notebooks
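A docker-compose.yml covering those pieces might look roughly like this. The credentials match the Database Connection section below; the volume mount paths and port mappings are illustrative assumptions, and the compose file in the repo is authoritative:

```yaml
services:
  postgres:
    build: ./postgres
    environment:
      POSTGRES_USER: retail_user
      POSTGRES_PASSWORD: retail_password
      POSTGRES_DB: retail_db
    ports:
      - "5432:5432"

  jupyter:
    build: ./jupyter
    depends_on:
      - postgres
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/home/jovyan/notebooks   # assumed mount point
      - ./data:/home/jovyan/data             # assumed mount point
```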

🐳 Running the Pipeline

Follow these steps to run the entire project locally.

1️⃣ Build and start containers

docker-compose up --build

This will:

  • Start PostgreSQL database
  • Start Jupyter Notebook server
  • Create shared volumes

2️⃣ Access the Jupyter Notebook

Once the containers are running, open the link printed in the terminal, usually:

http://localhost:8888

Open notebooks/etl_pipeline.ipynb.

3️⃣ Run the ETL pipeline

Inside the notebook:

  • Load the dataset from /data/raw/
  • Apply transformations
  • Load the data into PostgreSQL using SQLAlchemy & psycopg2
  • Query the database and perform visual analysis

🗄️ Database Connection

The Jupyter notebook connects to the PostgreSQL container using:

  • Host: postgres
  • Port: 5432
  • User: retail_user
  • Password: retail_password
  • Database: retail_db

These values are set inside the docker-compose.yml environment.
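With those values, the SQLAlchemy connection URL can be assembled like this (the host is the postgres service name on the Compose network, not localhost):

```python
# Connection settings from docker-compose.yml (see the values above)
DB = {
    "host": "postgres",   # Compose service name, resolvable inside the network
    "port": 5432,
    "user": "retail_user",
    "password": "retail_password",
    "database": "retail_db",
}

# SQLAlchemy URL for the psycopg2 driver; pass it to sqlalchemy.create_engine(url)
url = (
    f"postgresql+psycopg2://{DB['user']}:{DB['password']}"
    f"@{DB['host']}:{DB['port']}/{DB['database']}"
)
print(url)
```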


📊 What This Project Demonstrates

✔ End‑to‑end ETL workflow
✔ Containerized environment for repeatability
✔ Python‑based transformation logic
✔ Structured PostgreSQL data warehouse
✔ Practical, hands‑on data engineering setup


⭐ Data Warehouse Schema (Star)

The cleaned data is modeled into a star schema inside the retail_dw schema:

  • retail_dw.dim_date: one row per day (year, month, day, weekday_name, year_month)
  • retail_dw.dim_product: product attributes (stock_code, description, unit_price)
  • retail_dw.fact_retail_sales: transaction facts linked to the dimensions (invoice_no, date_key, product_key, customer_id, country, quantity, unit_price, total_price)

📈 Analytical Insights

The notebook runs several SQL queries on the star schema, including:

  • Monthly revenue and number of invoices
  • Top-selling products by revenue
  • Revenue by country
  • Sales by weekday

The results are visualized using Matplotlib inside the notebook.

See etl_pipeline.ipynb for the full SQL queries and charts.
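As one example, the monthly revenue / invoice-count aggregation can be expressed in pandas as below. The notebook runs the equivalent SQL against retail_dw; the column names here follow the fact table above, and the sample data is purely illustrative:

```python
import pandas as pd

# Tiny stand-in for fact_retail_sales joined to dim_date (illustrative data)
sales = pd.DataFrame({
    "invoice_no": ["536365", "536365", "537000", "540100"],
    "year_month": ["2010-12", "2010-12", "2010-12", "2011-01"],
    "total_price": [15.30, 20.34, 11.10, 17.85],
})

# Revenue and distinct invoice count per calendar month
monthly = (
    sales.groupby("year_month")
    .agg(revenue=("total_price", "sum"),
         invoices=("invoice_no", "nunique"))
    .reset_index()
)
print(monthly)
```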


📌 Future Improvements

Some optional extensions:

  • Add Airflow or Prefect orchestration
  • Add dbt (Data Build Tool) for modeling
  • Add dashboards using Metabase or PowerBI
  • Automate ingestion from external APIs

🙌 Author

Malak Shehada – Data Science & AI Student. Project built for hands‑on Data Engineering practice.

