This project showcases an end-to-end data engineering pipeline that extracts, processes, and visualizes English Premier League (EPL) data from Fotmob. The pipeline is built using a modern data stack, including Python, Apache Airflow, Docker, Terraform, Google Cloud Platform (GCP), PySpark, and Streamlit.
The project is designed to provide soccer fans and data analysts with a comprehensive platform to explore EPL data. It features interactive dashboards for visualizing shot maps, comparing player statistics, and analyzing match facts.
- Automated Data Extraction: The pipeline automatically extracts data from Fotmob using a Python script.
- Cloud-Based Data Processing: Data is processed in the cloud using a Dataproc cluster on GCP.
- Dimensionally Modeled Data Warehouse: The processed data is stored in a Supabase PostgreSQL database with a dimensional model.
- Interactive Dashboards: The project features interactive dashboards built with Streamlit for visualizing the data, leveraging SQL views for efficient data retrieval.
The project follows a modern data architecture, with a clear separation of concerns between data extraction, processing, and visualization.
The architecture consists of the following components:
- Data Source: Fotmob provides the raw data for the project.
- Data Extraction: A Python script extracts the data from Fotmob (see the sketch after this list).
- Data Lake: The extracted data is stored in a Google Cloud Storage (GCS) bucket.
- Data Processing: An Apache Spark job running on a Dataproc cluster is used to process the data.
- Data Warehouse: The processed data is stored in a Supabase PostgreSQL database with a dimensional model.
- Data Visualization: A Streamlit application is used to visualize the data, leveraging SQL views for efficient data retrieval.
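
To make the extraction and data-lake components more concrete, here is a rough sketch of what an extract-and-load script could look like. The Fotmob endpoint, file layout, match id, and bucket name are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the extract-and-load step (not the project's actual script).
# Assumes a public Fotmob match-details endpoint and a GCS bucket already provisioned by Terraform.
import json

import requests
from google.cloud import storage  # pip install google-cloud-storage


def extract_match(match_id: int) -> dict:
    """Fetch raw match JSON from Fotmob (endpoint is an assumption for illustration)."""
    url = "https://www.fotmob.com/api/matchDetails"
    response = requests.get(url, params={"matchId": match_id}, timeout=30)
    response.raise_for_status()
    return response.json()


def load_to_gcs(match_data: dict, match_id: int, bucket_name: str) -> None:
    """Write the raw JSON into the data lake as matches/<match_id>.json."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"matches/{match_id}.json")
    blob.upload_from_string(json.dumps(match_data), content_type="application/json")


if __name__ == "__main__":
    match_id = 4193450  # placeholder match id
    load_to_gcs(extract_match(match_id), match_id, "your-gcs-bucket-name")
```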
The data pipeline is orchestrated using Apache Airflow. It consists of a series of tasks that extract, load, and transform the data.
The data pipeline consists of the following steps:
- Extract Data: A Python script extracts the data from Fotmob and saves it as JSON files.
- Load to Staging: The JSON files are loaded into a GCS bucket.
- Process Data: A Dataproc job reads the JSON files from the GCS bucket, processes them, and appends the data to staging tables in the PostgreSQL database (a PySpark sketch follows this list).
- Merge to Dimensions and Facts: The data from the staging tables is then inserted into the dimensionally modeled tables in the PostgreSQL database.
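
The sketch below shows roughly what the Process Data step could look like as a PySpark job. The JSON field names, staging table, and JDBC connection details are assumptions for illustration; the real job lives in the project's Dataproc code and expects the Postgres JDBC driver to be available on the cluster.

```python
# Minimal PySpark sketch of the "Process Data" step (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("epl_fotmob_processing").getOrCreate()

# Read the raw match JSON files that the extraction step dropped into the data lake.
raw_df = spark.read.json("gs://your-gcs-bucket-name/matches/*.json")

# Example transformation: flatten a hypothetical shots array into one row per shot.
shots_df = (
    raw_df
    .select(F.col("matchId").alias("match_id"), F.explode("shots").alias("shot"))
    .select(
        "match_id",
        F.col("shot.playerId").alias("player_id"),
        F.col("shot.expectedGoals").alias("xg"),
        F.col("shot.x").alias("x"),
        F.col("shot.y").alias("y"),
    )
)

# Append to a staging table in the Supabase Postgres warehouse via JDBC.
(
    shots_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://<supabase-host>:5432/postgres")
    .option("dbtable", "staging.shots")
    .option("user", "<db-user>")
    .option("password", "<db-password>")
    .mode("append")
    .save()
)
```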
The Airflow DAG for the pipeline is defined in airflow/dags/dag_script.py.
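
For readers unfamiliar with Airflow, a stripped-down sketch of such a DAG is shown below. The task ids, schedule, and placeholder callables are illustrative and are not the exact contents of dag_script.py.

```python
# Illustrative sketch of an extract -> load -> process -> merge DAG.
# Task ids and callables are placeholders, not the exact contents of dag_script.py.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_data(**context):
    """Placeholder: run the Fotmob extraction script and write JSON files."""


def load_to_staging(**context):
    """Placeholder: upload the JSON files to the GCS bucket."""


def process_data(**context):
    """Placeholder: submit the PySpark job to the Dataproc cluster."""


def merge_to_dims_and_facts(**context):
    """Placeholder: insert staged rows into the dimension and fact tables."""


with DAG(
    dag_id="epl_data_pipeline",
    start_date=datetime(2024, 8, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_data", python_callable=extract_data)
    load = PythonOperator(task_id="load_to_staging", python_callable=load_to_staging)
    process = PythonOperator(task_id="process_data", python_callable=process_data)
    merge = PythonOperator(task_id="merge_to_dims_and_facts", python_callable=merge_to_dims_and_facts)

    extract >> load >> process >> merge
```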
The cloud infrastructure for the project is managed using Terraform. This allows for the automated and repeatable provisioning of the required resources on GCP.
The Terraform configuration files are located in the terraform directory. The main.tf file defines the following resources:
- GCS Bucket: A GCS bucket is created to store the raw data.
- Dataproc Cluster: A Dataproc cluster is created to process the data.
To apply the Terraform configuration, run the following commands:
```bash
cd terraform
terraform init
terraform apply
```

Important: To avoid incurring unnecessary costs, remember to destroy the GCP resources after you are done:

```bash
terraform destroy
```

You can leverage the $300 free credits offered by GCP to run this project.
The project features a multi-page Streamlit application for visualizing the data. The application is organized into several dashboards, each focusing on a different aspect of the data.
The Shot Map Dashboard allows you to visualize the shots taken by a specific player in a given match.
The Player Comparison Pizza Plots allow you to compare the statistics of two players using pizza plots.
The Match Facts Dashboard provides a comprehensive overview of a specific match, including team shot maps, xG race plots, momentum charts, and top player stats.
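
As a rough illustration of how a dashboard page pulls from the SQL views, here is a minimal Streamlit sketch. The view name, columns, and connection handling are assumptions, not the project's actual pages; it also assumes DB_CONNECTION_STRING (from the .env described in the setup section) is exported in the shell in SQLAlchemy's postgresql:// format.

```python
# Minimal Streamlit sketch of a shot-map style page reading from a SQL view
# (view and column names are assumptions for illustration).
import os

import pandas as pd
import sqlalchemy
import streamlit as st


@st.cache_data
def load_shots(match_id: int) -> pd.DataFrame:
    # Assumes DB_CONNECTION_STRING is set in the environment and is a postgresql:// URL.
    engine = sqlalchemy.create_engine(os.environ["DB_CONNECTION_STRING"])
    query = sqlalchemy.text(
        "SELECT player_name, x, y, xg FROM vw_shot_map WHERE match_id = :match_id"
    )
    with engine.connect() as conn:
        return pd.read_sql(query, conn, params={"match_id": match_id})


st.title("Shot Map")
match_id = st.number_input("Match id", min_value=1, value=4193450)
shots = load_shots(int(match_id))
st.dataframe(shots)
st.scatter_chart(shots, x="x", y="y", size="xg")
```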
To set up the project locally, you will need to have Docker and Docker Compose installed.
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/EPL_Fotmob.git
  cd EPL_Fotmob
  ```

- Set up environment variables:

  Create a `.env` file in the root of the project and add the following environment variables. You will need to get the Supabase connection details from your Supabase project settings.

  ```
  GCP_PROJECT_ID=<your-gcp-project-id>
  GCP_REGION=<your-gcp-region>
  GCS_BUCKET_NAME=<your-gcs-bucket-name>
  DB_CONNECTION_STRING=<your-supabase-db-connection-string>
  ```

- Build and run the Docker containers:

  ```bash
  docker-compose up --build
  ```

  This will start the following services:

  - `airflow-webserver`: The Airflow web server.
  - `airflow-scheduler`: The Airflow scheduler.

- Access the Airflow UI:

  Open your web browser and navigate to `http://localhost:8080` to access the Airflow UI. You can trigger the `epl_data_pipeline` DAG to start the data pipeline. This will extract the data, process it, and load it into your Supabase database.

- Create SQL Views:

  After the data pipeline has successfully run, you will need to create the SQL views in your Supabase database. You can do this by running the `CREATE OR REPLACE VIEW` statements from `airflow/dags/sql_create_scripts.sql` in the Supabase SQL editor (a Python alternative is sketched after these steps).

- Run the Streamlit Dashboard:

  Once the views are created, you can run the Streamlit application:

  ```bash
  streamlit run streamlit/app.py
  ```

  Open your web browser and navigate to `http://localhost:8501` to access the Streamlit dashboard.
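
If you prefer to create the views from the command line instead of the Supabase SQL editor, a small script along these lines should work. This is a sketch, assuming psycopg2-binary is installed and that DB_CONNECTION_STRING is a standard libpq-style Postgres connection string.

```python
# Sketch: apply airflow/dags/sql_create_scripts.sql against the Supabase database.
# Assumes psycopg2-binary is installed and DB_CONNECTION_STRING is set in the environment.
import os

import psycopg2

connection_string = os.environ["DB_CONNECTION_STRING"]

with open("airflow/dags/sql_create_scripts.sql") as f:
    sql_script = f.read()

with psycopg2.connect(connection_string) as conn:
    with conn.cursor() as cur:
        cur.execute(sql_script)  # runs the CREATE OR REPLACE VIEW statements
    conn.commit()
```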
- Add more visualizations to the Streamlit dashboard.
- Add support for other soccer leagues.
- Use dbt for the SQL views and add unit tests for the data pipeline.