This project demonstrates a simple but realistic data engineering pipeline using Docker, PostgreSQL, and Jupyter Notebook. The goal is to simulate a real workflow where raw data is ingested, transformed, loaded into a database, and analyzed — all inside isolated Docker containers.
The project is fully containerized using Docker Compose, making it portable and easy to run on any machine.
```
docker_data_pipeline/
│
├── data/
│   └── raw/
│       └── [dataset.csv]
│
├── jupyter/
│   └── Dockerfile
│
├── notebooks/
│   ├── etl_pipeline.ipynb
│   └── .ipynb_checkpoints/
│       └── etl_pipeline-checkpoint.ipynb
│
├── postgres/
│   └── Dockerfile
│
└── docker-compose.yml
```
The data/raw/ directory contains the raw dataset used by the ETL pipeline.
The dataset used in this project is the Online Retail dataset from Kaggle:
- Source: https://www.kaggle.com/code/hellbuoy/online-retail-k-means-hierarchical-clustering/input
- License: Please review the dataset license and terms of use on Kaggle before using it.
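As a rough sketch of how the raw CSV can be loaded with pandas (the file name under data/raw/ and the encoding are assumptions here, not taken from the project — Kaggle copies of this dataset often need Latin-1 encoding):

```python
from io import StringIO

import pandas as pd

# Tiny inline sample mimicking the Online Retail column layout,
# so the snippet runs without the real file.
sample_csv = StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,8,2010-12-01 08:26,3.39,17850,United Kingdom\n"
)

# For the real file you would use something like (path/encoding hypothetical):
# df = pd.read_csv("data/raw/<dataset>.csv", encoding="ISO-8859-1")
df = pd.read_csv(sample_csv, parse_dates=["InvoiceDate"])
print(df.shape)  # (2, 8)
```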
The jupyter/Dockerfile builds a Jupyter Notebook environment with all required libraries:
- pandas
- psycopg2-binary
- SQLAlchemy
- matplotlib
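A minimal sketch of what such a Dockerfile could look like (the base image and exact instructions are assumptions; the real jupyter/Dockerfile is the source of truth):

```dockerfile
# Illustrative sketch — base image and details are assumed, not from the repo.
FROM jupyter/base-notebook:latest
RUN pip install --no-cache-dir pandas psycopg2-binary SQLAlchemy matplotlib
```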
notebooks/etl_pipeline.ipynb is the main notebook. It performs:
- Data loading from data/raw/
- Cleaning & preprocessing
- Transformations
- Loading the cleaned dataset into PostgreSQL
- Basic analysis & visualizations
- Creating the retail_dw schema, dimension tables, and fact table, and loading them using SQLAlchemy
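A hedged sketch of how the dimension and fact tables might be derived with pandas (column names follow the retail_dw model described in this README; the surrogate-key logic here is illustrative, and the notebook's actual code may differ):

```python
import pandas as pd

# Small cleaned sample using the retail_dw column names.
sales = pd.DataFrame({
    "invoice_no": ["536365", "536365", "536370"],
    "stock_code": ["85123A", "71053", "85123A"],
    "description": ["T-LIGHT HOLDER", "METAL LANTERN", "T-LIGHT HOLDER"],
    "invoice_date": pd.to_datetime(["2010-12-01", "2010-12-01", "2010-12-03"]),
    "customer_id": [17850, 17850, 12583],
    "country": ["United Kingdom", "United Kingdom", "France"],
    "quantity": [6, 8, 6],
    "unit_price": [2.55, 3.39, 2.55],
})

# dim_date: one row per day, keyed by a YYYYMMDD integer (assumed key format)
dim_date = sales[["invoice_date"]].drop_duplicates().reset_index(drop=True)
dim_date["date_key"] = dim_date["invoice_date"].dt.strftime("%Y%m%d").astype(int)
dim_date["year"] = dim_date["invoice_date"].dt.year
dim_date["month"] = dim_date["invoice_date"].dt.month
dim_date["day"] = dim_date["invoice_date"].dt.day
dim_date["weekday_name"] = dim_date["invoice_date"].dt.day_name()
dim_date["year_month"] = dim_date["invoice_date"].dt.strftime("%Y-%m")

# dim_product: one row per stock_code
dim_product = (
    sales[["stock_code", "description", "unit_price"]]
    .drop_duplicates("stock_code")
    .reset_index(drop=True)
)
dim_product["product_key"] = dim_product.index + 1

# fact_retail_sales: transaction facts joined to the surrogate keys
fact = (
    sales
    .merge(dim_date[["invoice_date", "date_key"]], on="invoice_date")
    .merge(dim_product[["stock_code", "product_key"]], on="stock_code")
)
fact["total_price"] = fact["quantity"] * fact["unit_price"]
fact = fact[["invoice_no", "date_key", "product_key", "customer_id",
             "country", "quantity", "unit_price", "total_price"]]
print(len(fact))  # 3 fact rows
```

With a SQLAlchemy engine in hand, each frame could then be written out with `to_sql(..., schema="retail_dw")`.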
The postgres/Dockerfile uses the official postgres:16 image to create a clean PostgreSQL instance.
docker-compose.yml is the orchestration file that runs the full pipeline:
- PostgreSQL container
- Jupyter Notebook container (linked to database)
- Shared volume for data & notebooks
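An illustrative sketch of what the compose file could look like (service names, build paths, and mount points are assumptions based on the layout above; the real docker-compose.yml is the source of truth):

```yaml
# Illustrative sketch only — the actual docker-compose.yml may differ.
services:
  postgres:
    build: ./postgres
    environment:
      POSTGRES_USER: retail_user
      POSTGRES_PASSWORD: retail_password
      POSTGRES_DB: retail_db
    ports:
      - "5432:5432"
  jupyter:
    build: ./jupyter
    depends_on:
      - postgres
    ports:
      - "8888:8888"
    volumes:
      - ./data:/data
      - ./notebooks:/home/jovyan/notebooks
```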
Follow these steps to run the entire project locally.
```bash
docker-compose up --build
```
This will:
- Start the PostgreSQL database
- Start the Jupyter Notebook server
- Create shared volumes
Once the containers are running, open the link printed in the terminal, usually:
http://localhost:8888
Open notebooks/etl_pipeline.ipynb.
Inside the notebook:
- Load the dataset from /data/raw/
- Apply transformations
- Load the data into PostgreSQL using SQLAlchemy & psycopg2
- Query the database and perform visual analysis
The Jupyter notebook connects to the PostgreSQL container using:
- Host: postgres
- Port: 5432
- User: retail_user
- Password: retail_password
- Database: retail_db
These values are set inside the docker-compose.yml environment.
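Putting the settings above together, the connection URL the notebook would hand to SQLAlchemy's `create_engine` looks like this (a sketch; the notebook may build it differently):

```python
# Connection settings as documented above (defined in docker-compose.yml).
settings = {
    "host": "postgres",
    "port": 5432,
    "user": "retail_user",
    "password": "retail_password",
    "database": "retail_db",
}

# SQLAlchemy-style URL for the psycopg2 driver.
url = (
    f"postgresql+psycopg2://{settings['user']}:{settings['password']}"
    f"@{settings['host']}:{settings['port']}/{settings['database']}"
)
print(url)
# With SQLAlchemy installed: engine = create_engine(url)
```

Note that the host is the Compose service name `postgres`, not `localhost`, because the notebook container resolves the database over the Compose network.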
✔ End-to-end ETL workflow
✔ Containerized environment for repeatability
✔ Python-based transformation logic
✔ Structured PostgreSQL data warehouse
✔ Practical, hands-on data engineering setup
The cleaned data is modeled into a star schema inside the retail_dw schema:
- retail_dw.dim_date: one row per day (year, month, day, weekday_name, year_month)
- retail_dw.dim_product: product attributes (stock_code, description, unit_price)
- retail_dw.fact_retail_sales: transaction facts linked to the dimensions (invoice_no, date_key, product_key, customer_id, country, quantity, unit_price, total_price)
The notebook runs several SQL queries on the star schema, including:
- Monthly revenue and number of invoices
- Top-selling products by revenue
- Revenue by country
- Sales by weekday
The results are visualized using Matplotlib inside the notebook.
See etl_pipeline.ipynb for the full SQL queries and charts.
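As a rough illustration of the first aggregation (monthly revenue and invoice counts), here is the same logic in pandas, assuming the fact table has been joined to dim_date's year_month; the authoritative SQL lives in etl_pipeline.ipynb:

```python
import pandas as pd

# Minimal fact-table sample (fact rows joined to dim_date.year_month).
fact = pd.DataFrame({
    "invoice_no": ["536365", "536365", "536370", "537000"],
    "year_month": ["2010-12", "2010-12", "2010-12", "2011-01"],
    "total_price": [15.30, 27.12, 15.30, 9.99],
})

# Revenue and distinct invoice count per month, like a GROUP BY in SQL.
monthly = (
    fact.groupby("year_month")
    .agg(revenue=("total_price", "sum"),
         invoices=("invoice_no", "nunique"))
    .reset_index()
)
print(monthly)
```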
Here are suggested optional extensions:
- Add Airflow or Prefect orchestration
- Add dbt (Data Build Tool) for modeling
- Add dashboards using Metabase or Power BI
- Automate ingestion from external APIs
Malak Shehada – Data Science & AI student. Project built for hands-on data engineering practice.