Near Real-time Road Traffic Monitoring Solution

FlowState is a near real-time road traffic monitoring solution that leverages Apache Spark, Apache Airflow, and Docker to process and analyze traffic data. The project is designed to handle large volumes of data efficiently, providing insights into traffic patterns and conditions. This solution is built to be scalable and robust, making it suitable for real-world applications in traffic management and urban planning.

Data is collected from the Rennes Metropole API, which provides real-time traffic data. The solution processes this data to extract meaningful insights, such as traffic flow and congestion levels, and stores the results in a structured format for further analysis.
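
As an illustration of the kind of insight extracted, a congestion level can be derived by comparing the observed average speed on a road segment with its speed limit. The sketch below is a minimal Scala example; the field names (segmentId, averageSpeed, maxSpeed) are illustrative assumptions, not the actual API schema.

// Minimal sketch of one insight FlowState derives: a congestion label
// computed from the observed average speed vs. the segment's speed limit.
// Field names are assumptions, not the real API schema.
final case class TrafficRecord(segmentId: String, averageSpeed: Double, maxSpeed: Double)

object Congestion {
  // Ratio of observed speed to the speed limit, bucketed into a label.
  def level(r: TrafficRecord): String = {
    val ratio = if (r.maxSpeed > 0) r.averageSpeed / r.maxSpeed else 1.0
    if (ratio >= 0.8) "freeFlow"
    else if (ratio >= 0.5) "heavy"
    else "congested"
  }
  // Example: level(TrafficRecord("seg-42", averageSpeed = 18.0, maxSpeed = 50.0)) == "congested"
}

In the actual pipeline, logic of this kind would run inside the Spark Structured Streaming application.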

For more details, see the Documentation.

Here is the reference architecture of the project: see the Reference Architecture diagram.

Table of Contents

  • Technology Stack
  • Prerequisites
  • Setup

Technology Stack

  • Stream Processing: Apache Spark 4.0.0
  • Orchestration: Apache Airflow 2.6.0
  • Database: PostgreSQL 13
  • UI Framework: Streamlit
  • Build Tool: SBT with Assembly plugin
  • Containerization: Docker & Docker Compose
  • Language: Scala 2.13.16
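
As a rough sketch of how the build setup above fits together, an SBT build using the Assembly plugin might look like the following. Module names, versions, and merge rules are assumptions; the repository's actual build files may differ.

// project/plugins.sbt (sketch)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt (sketch) -- the real settings in the repository may differ
ThisBuild / scalaVersion := "2.13.16"

lazy val root = (project in file("."))
  .settings(
    name := "flowstate",
    // Spark is provided by the Spark container at runtime, so it is not bundled into the fat JAR
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0" % "provided",
    // Drop META-INF duplicates that would otherwise break the assembly
    assembly / assemblyMergeStrategy := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    },
    assembly / assemblyJarName := "flowstate-assembly.jar"
  )

Running sbt assembly then produces a single JAR, which is what the shared/jars folder created during setup is for.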

Prerequisites

Before you begin, ensure you have the following software installed:

  • Docker
  • Docker Compose
  • Git

Setup

  1. Clone the repository:
git clone git@github.com:goamegah/flowstate.git
cd flowstate
  2. Rename the dotenv.txt file to .env:
mv dotenv.txt .env
  3. Create the following folders:
mkdir -p shared/data/transient # for intermediate data loading
mkdir -p shared/data/raw # for raw data loaded from the transient folder
mkdir -p shared/checkpoint # used by Spark for checkpointing
mkdir -p shared/jars # for the Spark application JAR file

The shared directory is a bind mount shared with the Docker containers, so the raw files are visible directly in your IDE.
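
For context on how these folders are used, here is a minimal sketch of a Spark Structured Streaming job that watches the raw folder and checkpoints to the checkpoint folder. The schema, the mount point inside the container, and the console sink are assumptions for illustration; the actual FlowState application differs.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object FlowStateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flowstate-sketch")
      .getOrCreate()

    // Illustrative schema: the real API payload has more fields.
    val schema = StructType(Seq(
      StructField("segmentId", StringType),
      StructField("averageSpeed", DoubleType),
      StructField("observedAt", TimestampType)
    ))

    // File source: picks up JSON files dropped into the raw folder by the
    // Airflow ingestion DAG. Paths assume shared/ is mounted at /opt/shared
    // inside the container (adjust to the actual mount point).
    val raw = spark.readStream
      .schema(schema)
      .json("/opt/shared/data/raw")

    // 5-minute average speed per segment as a simple traffic indicator.
    val avgSpeed = raw
      .withWatermark("observedAt", "10 minutes")
      .groupBy(window(col("observedAt"), "5 minutes"), col("segmentId"))
      .agg(avg("averageSpeed").as("avgSpeed"))

    // The checkpoint folder lets Spark track which files were already read
    // and recover the query after a restart.
    avgSpeed.writeStream
      .outputMode("update")
      .format("console")
      .option("checkpointLocation", "/opt/shared/checkpoint")
      .start()
      .awaitTermination()
  }
}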

  4. Run Docker Compose:
docker compose up -d
  5. Go to the Airflow web UI:
http://localhost:8080

You will need to create a connection to the API with the following parameters:

(see the connection setup screenshots)

This connection is used to check the API's availability.

After setting up the connection, you will see the following three DAGs, which you can run one after another:

  • pl_load_flowstate_raw_files

This DAG loads the raw data from the Rennes Metropole API into the raw folder. It is scheduled to run every minute. (A conceptual sketch of this fetch step is shown after the DAG list below.)

  • pl_run_flowstate_mainapp_dag

This DAG runs the main application that processes the raw data and stores the results in the data warehouse. It is not scheduled to run automatically; you can trigger it manually from the Airflow UI.

  • [Optional] pl_clean_up_flowstate_folders_dag:

This clean-up DAG removes the data from the raw, transient, and checkpoint folders.
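
Conceptually, the ingestion step performed by pl_load_flowstate_raw_files boils down to: call the API, land the snapshot in the transient folder, then move it into the raw folder watched by Spark. The actual DAG is implemented in Airflow; the Scala sketch below is only an illustrative equivalent, and the endpoint and environment variable name are placeholders.

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.{Files, Paths, StandardCopyOption}
import java.time.Instant

object FetchTrafficSnapshot {
  def main(args: Array[String]): Unit = {
    // Hypothetical variable name: in the real setup the API URL lives in the Airflow connection.
    val endpoint = sys.env.getOrElse("FLOWSTATE_API_URL", "https://example.invalid/traffic")

    val client   = HttpClient.newHttpClient()
    val request  = HttpRequest.newBuilder(URI.create(endpoint)).GET().build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    require(response.statusCode() == 200, s"API returned ${response.statusCode()}")

    // Land the snapshot in the transient folder first...
    val fileName  = s"traffic_${Instant.now().toEpochMilli}.json"
    val transient = Paths.get("shared/data/transient", fileName)
    Files.createDirectories(transient.getParent)
    Files.writeString(transient, response.body())

    // ...then publish it to the raw folder that the Spark job watches.
    val raw = Paths.get("shared/data/raw", fileName)
    Files.createDirectories(raw.getParent)
    Files.move(transient, raw, StandardCopyOption.ATOMIC_MOVE)
  }
}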

  6. Check the results in the Streamlit app web UI:
http://localhost:8501

(Streamlit app screenshot)
