Air Quality Forecasting Using Diffusion Modeling and ML Algorithms

This project focuses on predicting PM2.5 concentrations using a combination of classical machine learning models, deep learning architectures (like LSTM), and an advanced Denoising Diffusion Probabilistic Model (DDPM). The goal is to build a cutting-edge forecasting system that leverages the strengths of generative modeling and multivariate time-series engineering.

Project Overview

Forecasting air pollution, especially PM2.5, is critical for public health, urban planning, and climate research.
This project builds a full-stack ML pipeline:
- Preprocessing raw CSV sensor data
- Engineering lag and rolling features
- Training classical regressors (Random Forest, XGBoost)
- Training a sequence-based LSTM
- Designing and implementing a custom DDPM to model time series in a novel way
Visual comparisons and performance benchmarks reveal strengths and limitations across models.

Screenshots of Highlights

Highlights

Multivariate Time Series Forecasting
Custom Sequence Diffusion Model (DDPM)
ML Benchmarking (XGBoost, RF, LSTM)
Feature-Rich Lag Analysis
Clean Modular Codebase

Project Structure

air-quality-diffusion/
│
├── data/ # Preprocessed and scaled input datasets
├── models/ # Saved model weights (.pth, .json, .pkl)
├── notebooks/ # Core notebooks (see below)
├── src/ # Source code
│ └── diffusion_model.py # DDPM architecture and forward pass logic
│
├── requirements.txt # Python dependencies
└── README.md # Project documentation

Notebooks Overview

01_explore_clean_data.ipynb

Loads raw CSV files
Handles missing values, interpolation, and scaling
Concatenates multivariate air quality features across sensors
Output: scaled_data.csv

02_feature_engineering.ipynb

Constructs:
- Lag features (t-1 to t-n)
- Rolling means and standard deviations
- Day of week / time encodings
  Output: Feature-rich dataset for ML modeling

02_dimensionality_rduction.ipynb

Applies PCA to reduce dimensionality of multivariate feature space
Compresses high-dimensional sensor data into a lower-dimensional latent space for modeling
Visualizes variance explained by each principal component and selects optimal number of components
Prepares compressed input features for use in downstream models (e.g., LSTM, XGBoost, Diffusion) Output: Reduced-dimension dataset and PCA transformation object

03_lstm_model.ipynb

Prepares sequential windows for LSTM training
Trains an LSTM model for PM2.5 forecasting
Plots predictions and error curves
Output: RMSE, R², and predicted vs true PM2.5 plots

04_diffusion_modeling.ipynb

Implements a custom Simple1DDiffusionModel
Trains the DDPM to denoise a latent 1D vector representation of PM2.5 sequences
Forecasts via iterative reverse sampling
Benchmarks model predictions against ground truth
Output: DDPM performance metrics and forecast visualizations

05_baseline_models.ipynb

Trains Random Forest and XGBoost regressors on lag feature data
Compares:
- Random Forest
- XGBoost
- LSTM
- DDPM
  Output: Final RMSE, R², and model performance comparison

Tech Stack

Python 3.10+
PyTorch – Deep learning and DDPM modeling
XGBoost – Gradient boosting
Scikit-learn – Baseline metrics and preprocessing
Pandas / NumPy / Matplotlib – Data wrangling and visualization

How the Diffusion Model Works

The DDPM is inspired by generative models like DALL·E and Stable Diffusion but adapted for 1D regression.
Instead of generating images, it learns to reverse noise applied to PM2.5 time series.
Sampling starts with Gaussian noise and refines over 100+ steps using learned denoising logic.
The model forecasts future values by iteratively cleaning noisy sequences.

Future Improvements

Calibrate the diffusion noise schedule (beta_t)
Use a Transformer-based denoiser for better attention over time
Explore multi-step forecasting (not just one time step ahead)
Integrate weather and environmental features (e.g., temperature, humidity)
Deploy as a web app using Streamlit or Flask

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Air Quality Forecasting Using Diffusion Modeling and ML Algorithms

Project Overview

Screenshots of Highlights

Highlights

Project Structure

Notebooks Overview

01_explore_clean_data.ipynb

02_feature_engineering.ipynb

02_dimensionality_rduction.ipynb

03_lstm_model.ipynb

04_diffusion_modeling.ipynb

05_baseline_models.ipynb

Tech Stack

How the Diffusion Model Works

Future Improvements

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
data		data
models		models
notebooks		notebooks
src		src
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt

chadvik88/AirQualityIndex

Folders and files

Latest commit

History

Repository files navigation

Air Quality Forecasting Using Diffusion Modeling and ML Algorithms

Project Overview

Screenshots of Highlights

Highlights

Project Structure

Notebooks Overview

01_explore_clean_data.ipynb

02_feature_engineering.ipynb

02_dimensionality_rduction.ipynb

03_lstm_model.ipynb

04_diffusion_modeling.ipynb

05_baseline_models.ipynb

Tech Stack

How the Diffusion Model Works

Future Improvements

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages