This project demonstrates a simple but realistic data engineering pipeline using Docker, PostgreSQL, and Jupyter Notebook. The goal is to simulate a real workflow where raw data is ingested, transformed, loaded into a database, and analyzed — all inside isolated Docker containers.
The project is fully containerized using Docker Compose, making it portable and easy to run on any machine.
```
docker_data_pipeline/
│
├── data/
│   └── raw/
│       └── [dataset.csv]
│
├── jupyter/
│   └── Dockerfile
│
├── notebooks/
│   ├── etl_pipeline.ipynb
│   └── .ipynb_checkpoints/
│       └── etl_pipeline-checkpoint.ipynb
│
├── postgres/
│   └── Dockerfile
│
└── docker-compose.yml
```
The data/raw/ directory contains the raw dataset used by the ETL pipeline.
The dataset used in this project is the Online Retail dataset from Kaggle:
- Source: https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/input
- License: Please review the dataset license and terms of use on Kaggle before using it.
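As a rough sketch of how the raw CSV can be loaded with pandas (the file name under data/raw/ and the encoding are assumptions here, not taken from the project — Kaggle copies of this dataset often need Latin-1 encoding):

```python
from io import StringIO

import pandas as pd

# Tiny inline sample mimicking the Online Retail column layout,
# so the snippet runs without the real file.
sample_csv = StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,8,2010-12-01 08:26,3.39,17850,United Kingdom\n"
)

# For the real file you would use something like (path/encoding hypothetical):
# df = pd.read_csv("data/raw/<dataset>.csv", encoding="ISO-8859-1")
df = pd.read_csv(sample_csv, parse_dates=["InvoiceDate"])
print(df.shape)  # (2, 8)
```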
The jupyter/Dockerfile builds a Jupyter Notebook environment with all required libraries:
- pandas
- psycopg2-binary
- SQLAlchemy
- matplotlib
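A minimal sketch of what such a Dockerfile could look like (the base image and exact instructions are assumptions; the real jupyter/Dockerfile is the source of truth):

```dockerfile
# Illustrative sketch — base image and details are assumed, not from the repo.
FROM jupyter/base-notebook:latest
RUN pip install --no-cache-dir pandas psycopg2-binary SQLAlchemy matplotlib
```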
notebooks/etl_pipeline.ipynb is the main notebook. It performs:
- Data loading from data/raw/
- Cleaning & preprocessing
- Transformations
- Loading the cleaned dataset into PostgreSQL
- Basic analysis & visualizations
- Creating the retail_dw schema, dimension tables, and fact table, and loading them using SQLAlchemy
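A hedged sketch of how the dimension and fact tables might be derived with pandas (column names follow the retail_dw model described in this README; the surrogate-key logic here is illustrative, and the notebook's actual code may differ):

```python
import pandas as pd

# Small cleaned sample using the retail_dw column names.
sales = pd.DataFrame({
    "invoice_no": ["536365", "536365", "536370"],
    "stock_code": ["85123A", "71053", "85123A"],
    "description": ["T-LIGHT HOLDER", "METAL LANTERN", "T-LIGHT HOLDER"],
    "invoice_date": pd.to_datetime(["2010-12-01", "2010-12-01", "2010-12-03"]),
    "customer_id": [17850, 17850, 12583],
    "country": ["United Kingdom", "United Kingdom", "France"],
    "quantity": [6, 8, 6],
    "unit_price": [2.55, 3.39, 2.55],
})

# dim_date: one row per day, keyed by a YYYYMMDD integer (assumed key format)
dim_date = sales[["invoice_date"]].drop_duplicates().reset_index(drop=True)
dim_date["date_key"] = dim_date["invoice_date"].dt.strftime("%Y%m%d").astype(int)
dim_date["year"] = dim_date["invoice_date"].dt.year
dim_date["month"] = dim_date["invoice_date"].dt.month
dim_date["day"] = dim_date["invoice_date"].dt.day
dim_date["weekday_name"] = dim_date["invoice_date"].dt.day_name()
dim_date["year_month"] = dim_date["invoice_date"].dt.strftime("%Y-%m")

# dim_product: one row per stock_code
dim_product = (
    sales[["stock_code", "description", "unit_price"]]
    .drop_duplicates("stock_code")
    .reset_index(drop=True)
)
dim_product["product_key"] = dim_product.index + 1

# fact_retail_sales: transaction facts joined to the surrogate keys
fact = (
    sales
    .merge(dim_date[["invoice_date", "date_key"]], on="invoice_date")
    .merge(dim_product[["stock_code", "product_key"]], on="stock_code")
)
fact["total_price"] = fact["quantity"] * fact["unit_price"]
fact = fact[["invoice_no", "date_key", "product_key", "customer_id",
             "country", "quantity", "unit_price", "total_price"]]
print(len(fact))  # 3 fact rows
```

With a SQLAlchemy engine in hand, each frame could then be written out with `to_sql(..., schema="retail_dw")`.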
The postgres/Dockerfile uses the official postgres:16 image to create a clean PostgreSQL instance.
docker-compose.yml is the orchestration file that runs the full pipeline:
- PostgreSQL container
- Jupyter Notebook container (linked to database)
- Shared volume for data & notebooks
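An illustrative sketch of what the compose file could look like (service names, build paths, and mount points are assumptions based on the layout above; the real docker-compose.yml is the source of truth):

```yaml
# Illustrative sketch only — the actual docker-compose.yml may differ.
services:
  postgres:
    build: ./postgres
    environment:
      POSTGRES_USER: retail_user
      POSTGRES_PASSWORD: retail_password
      POSTGRES_DB: retail_db
    ports:
      - "5432:5432"
  jupyter:
    build: ./jupyter
    depends_on:
      - postgres
    ports:
      - "8888:8888"
    volumes:
      - ./data:/data
      - ./notebooks:/home/jovyan/notebooks
```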
Follow these steps to run the entire project locally.
```bash
docker-compose up --build
```
This will:
- Start the PostgreSQL database
- Start the Jupyter Notebook server
- Create shared volumes
Once the containers are running, open the link printed in the terminal, usually:
http://localhost:8888
Open notebooks/etl_pipeline.ipynb.
Inside the notebook:
- Load the dataset from /data/raw/
- Apply transformations
- Load the data into PostgreSQL using SQLAlchemy & psycopg2
- Query the database and perform visual analysis
The Jupyter notebook connects to the PostgreSQL container using:
- Host: postgres
- Port: 5432
- User: retail_user
- Password: retail_password
- Database: retail_db
These values are set inside the docker-compose.yml environment.
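Putting the settings above together, the connection URL the notebook would hand to SQLAlchemy's `create_engine` looks like this (a sketch; the notebook may build it differently):

```python
# Connection settings as documented above (defined in docker-compose.yml).
settings = {
    "host": "postgres",
    "port": 5432,
    "user": "retail_user",
    "password": "retail_password",
    "database": "retail_db",
}

# SQLAlchemy-style URL for the psycopg2 driver.
url = (
    f"postgresql+psycopg2://{settings['user']}:{settings['password']}"
    f"@{settings['host']}:{settings['port']}/{settings['database']}"
)
print(url)
# With SQLAlchemy installed: engine = create_engine(url)
```

Note that the host is the Compose service name `postgres`, not `localhost`, because the notebook container resolves the database over the Compose network.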
✔ End-to-end ETL workflow
✔ Containerized environment for repeatability
✔ Python-based transformation logic
✔ Structured PostgreSQL data warehouse
✔ Practical, hands-on data engineering setup
The cleaned data is modeled into a star schema inside the retail_dw schema:
- retail_dw.dim_date: one row per day (year, month, day, weekday_name, year_month)
- retail_dw.dim_product: product attributes (stock_code, description, unit_price)
- retail_dw.fact_retail_sales: transaction facts linked to the dimensions (invoice_no, date_key, product_key, customer_id, country, quantity, unit_price, total_price)
The notebook runs several SQL queries on the star schema, including:
- Monthly revenue and number of invoices
- Top-selling products by revenue
- Revenue by country
- Sales by weekday
The results are visualized using Matplotlib inside the notebook.
See etl_pipeline.ipynb for the full SQL queries and charts.
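As a rough illustration of the first aggregation (monthly revenue and invoice counts), here is the same logic in pandas, assuming the fact table has been joined to dim_date's year_month; the authoritative SQL lives in etl_pipeline.ipynb:

```python
import pandas as pd

# Minimal fact-table sample (fact rows joined to dim_date.year_month).
fact = pd.DataFrame({
    "invoice_no": ["536365", "536365", "536370", "537000"],
    "year_month": ["2010-12", "2010-12", "2010-12", "2011-01"],
    "total_price": [15.30, 27.12, 15.30, 9.99],
})

# Revenue and distinct invoice count per month, like a GROUP BY in SQL.
monthly = (
    fact.groupby("year_month")
    .agg(revenue=("total_price", "sum"),
         invoices=("invoice_no", "nunique"))
    .reset_index()
)
print(monthly)
```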
Here are suggested optional extensions:
- Add Airflow or Prefect orchestration
- Add dbt (Data Build Tool) for modeling
- Add dashboards using Metabase or Power BI
- Automate ingestion from external APIs
Malak Shehada – Data Science & AI student. Project built for hands-on data engineering practice.