This capstone project represents the final stage of the Kiwilytics Data Engineering course, integrating multiple tools and technologies to build a real-world data pipeline using Apache Airflow.
The goal of this pipeline is to automate the process of aggregating sales data, updating a PostgreSQL data warehouse, and generating insights such as total orders, total revenue per customer, and daily revenue trends.
- ETL Pipeline Orchestration: Managed by an Apache Airflow DAG.
- PostgreSQL Integration: Reads and writes data to a PostgreSQL database via `PostgresHook`.
- Data Aggregation: Computes total orders and total sales per customer.
- Data Visualization: Automatically generates a daily revenue trend plot using Matplotlib.
- Incremental Updates: Uses SQL `ON CONFLICT` logic to upsert customer sales data.
- Automated Execution: Scheduled to run daily using Airflow's scheduler.
| Task ID | Description |
|---|---|
| `create_sales_table` | Creates the sales table in PostgreSQL if it doesn't exist |
| `merge_sales_data` | Aggregates customer order data and updates the sales table |
| `extract_and_plot` | Extracts daily revenue data and generates a revenue trend plot |
| `get_revenue_on_date` | Retrieves total revenue for a specific date |
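The aggregation behind `merge_sales_data` can be sketched in plain Python. The rows below are illustrative; in the real DAG the equivalent computation happens in SQL over the customer and order tables:

```python
from collections import defaultdict

# Illustrative order rows as (customer_id, order_total) pairs; the real
# task derives these from a join over the PostgreSQL source tables.
orders = [
    (1, 120.50),
    (1, 79.50),
    (2, 40.00),
]

# Aggregate total_orders and total_sales per customer.
totals = defaultdict(lambda: {"total_orders": 0, "total_sales": 0.0})
for customer_id, order_total in orders:
    totals[customer_id]["total_orders"] += 1
    totals[customer_id]["total_sales"] += order_total

print(dict(totals))
# {1: {'total_orders': 2, 'total_sales': 200.0}, 2: {'total_orders': 1, 'total_sales': 40.0}}
```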
- Extract customer, order, and product data from PostgreSQL.
- Transform data by calculating `total_orders` and `total_sales`.
- Load results back into the `sales` table using UPSERT logic.
- Visualize the daily revenue over time with Matplotlib.
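The load step's UPSERT pattern can be sketched as below. SQLite (stdlib) is used so the demo is self-contained; PostgreSQL accepts the same `ON CONFLICT ... DO UPDATE` clause, which the DAG executes through `PostgresHook`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        customer_id  INTEGER PRIMARY KEY,
        total_orders INTEGER,
        total_sales  REAL
    )
""")

# Insert a customer's aggregates, or update them if the row already exists.
upsert = """
    INSERT INTO sales (customer_id, total_orders, total_sales)
    VALUES (?, ?, ?)
    ON CONFLICT (customer_id) DO UPDATE SET
        total_orders = excluded.total_orders,
        total_sales  = excluded.total_sales
"""
conn.execute(upsert, (1, 2, 200.0))  # first run inserts
conn.execute(upsert, (1, 3, 275.5))  # a re-run updates in place, no duplicates

row = conn.execute("SELECT * FROM sales").fetchone()
print(row)  # (1, 3, 275.5)
```

Because re-running the statement updates rather than duplicates, the DAG stays idempotent across daily runs.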
- Apache Airflow – Workflow orchestration and scheduling
- PostgreSQL – Data warehouse for storing aggregated results
- Python – Core programming for DAG logic
- Pandas – Data manipulation and transformation
- Matplotlib – Data visualization
- SQL – Data extraction and transformation queries
- Make sure you have Airflow and PostgreSQL installed and running.
- Define a PostgreSQL connection in Airflow named `postgres_conn`.
- Place the `capstone-pipeline.py` file inside your Airflow DAGs directory: `~/airflow/dags/`
- Start the Airflow scheduler and webserver:

  ```bash
  airflow scheduler
  airflow webserver
  ```

- Open the Airflow UI (usually at http://localhost:8080) and trigger the DAG named `capstone_pipeline`.
- A PostgreSQL table named `sales` containing aggregated customer data.
- A revenue trend plot saved locally at `/tmp/daily_revenue_plot.png`.
- Console logs showing the total revenue on the specified target date.
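The plot output can be sketched as follows. The revenue values are synthetic stand-ins (the real `extract_and_plot` task queries them from PostgreSQL first), and the `Agg` backend is assumed so the plot renders headlessly on an Airflow worker:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, as on a scheduler/worker host
import matplotlib.pyplot as plt
from datetime import date

# Synthetic daily revenue rows; the real task pulls these from PostgreSQL.
days = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3)]
revenue = [1250.0, 980.5, 1430.25]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, revenue, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Revenue")
ax.set_title("Daily Revenue Trend")
fig.autofmt_xdate()
fig.savefig("/tmp/daily_revenue_plot.png")
plt.close(fig)
```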
| File Name | Description |
|---|---|
| `capstone-pipeline.py` | Airflow DAG containing all ETL and analysis logic |
| `README.md` | Project documentation and technical overview |