This project focuses on predicting PM2.5 concentrations using a combination of classical machine learning models, deep learning architectures (like LSTM), and an advanced Denoising Diffusion Probabilistic Model (DDPM). The goal is to build a cutting-edge forecasting system that leverages the strengths of generative modeling and multivariate time-series engineering.
- Forecasting air pollution, especially PM2.5, is critical for public health, urban planning, and climate research.
- This project builds a full-stack ML pipeline:
  - Preprocessing raw CSV sensor data
  - Engineering lag and rolling features
  - Training classical regressors (Random Forest, XGBoost)
  - Training a sequence-based LSTM
  - Designing and implementing a custom DDPM to model time series in a novel way
- Visual comparisons and performance benchmarks reveal strengths and limitations across models.
- Multivariate Time Series Forecasting
- Custom Sequence Diffusion Model (DDPM)
- ML Benchmarking (XGBoost, RF, LSTM)
- Feature-Rich Lag Analysis
- Clean Modular Codebase
air-quality-diffusion/
│
├── data/ # Preprocessed and scaled input datasets
├── models/ # Saved model weights (.pth, .json, .pkl)
├── notebooks/ # Core notebooks (see below)
├── src/ # Source code
│ └── diffusion_model.py # DDPM architecture and forward pass logic
│
├── requirements.txt # Python dependencies
└── README.md # Project documentation

- Loads raw CSV files
- Handles missing values, interpolation, and scaling
- Concatenates multivariate air quality features across sensors
Output: scaled_data.csv
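The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`timestamp`, `pm25`, `pm10`), the time-based interpolation, and the min-max scaling are all assumptions, since the text only says that missing values are handled and the data is scaled.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a raw sensor CSV (hypothetical columns and values).
RAW_CSV = """timestamp,pm25,pm10
2024-01-01 00:00,12.0,20.0
2024-01-01 01:00,,22.0
2024-01-01 02:00,18.0,
2024-01-01 03:00,21.0,30.0
"""

def preprocess(csv_source) -> pd.DataFrame:
    """Load a sensor CSV, fill gaps, and min-max scale each column to [0, 1]."""
    df = pd.read_csv(csv_source, parse_dates=["timestamp"], index_col="timestamp")
    df = df.sort_index()
    # Time-aware linear interpolation; any gaps at the edges are filled
    # from the nearest valid value.
    df = df.interpolate(method="time").ffill().bfill()
    # Min-max scaling per column (assumed; the project may use another scaler).
    span = (df.max() - df.min()).replace(0, 1.0)  # avoid division by zero
    return (df - df.min()) / span

scaled = preprocess(io.StringIO(RAW_CSV))
print(scaled)
# To persist the result under the name used in this README:
# scaled.to_csv("scaled_data.csv")
```

Passing a file path instead of the `io.StringIO` object works the same way, because `pd.read_csv` accepts both.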
- Constructs:
  - Lag features (t-1 to t-n)
  - Rolling means and standard deviations
  - Day of week / time encodings
Output: Feature-rich dataset for ML modeling
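As a rough sketch of the feature construction listed above, assuming an hourly `pm25` column with a datetime index: the function name `add_features`, the default lag count, the window size, and the sine/cosine hour encoding are illustrative choices, not taken from the notebook. The rolling statistics are computed on the series shifted by one step, so each row only sees past values. That is also an assumption, made here to avoid target leakage.

```python
import numpy as np
import pandas as pd

def add_features(df, target="pm25", n_lags=3, window=3):
    """Add lag, rolling, and calendar features; drop rows with incomplete history."""
    out = df.copy()
    # Lag features: value at t-1 ... t-n_lags.
    for k in range(1, n_lags + 1):
        out[f"{target}_lag{k}"] = out[target].shift(k)
    # Rolling statistics over strictly past values (shifted by one step).
    past = out[target].shift(1)
    out[f"{target}_roll_mean{window}"] = past.rolling(window).mean()
    out[f"{target}_roll_std{window}"] = past.rolling(window).std()
    # Calendar encodings: day of week plus a cyclical hour-of-day encoding.
    out["dayofweek"] = out.index.dayofweek.to_numpy()
    hour = out.index.hour.to_numpy()
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    return out.dropna()

# Synthetic hourly series 0, 1, ..., 7 for demonstration only.
idx = pd.date_range("2024-01-01", periods=8, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"pm25": np.arange(8, dtype=float)}, index=idx)
features = add_features(demo)
print(features.head())
```

The first `max(n_lags, window)` rows are dropped because their history is incomplete.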
- Applies PCA to reduce dimensionality of multivariate feature space
- Compresses high-dimensional sensor data into a lower-dimensional latent space for modeling
- Visualizes variance explained by each principal component and selects optimal number of components
- Prepares compressed input features for use in downstream models (e.g., LSTM, XGBoost, Diffusion)
Output: Reduced-dimension dataset and PCA transformation object
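One possible way to pick the number of components from the explained-variance curve is shown below, using scikit-learn's `PCA`. The synthetic data, the standardization step, and the 95% cumulative-variance threshold are assumptions for illustration; the notebook may use a different selection rule.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples of 6 correlated "sensor" features
# driven by 2 latent factors plus a little noise.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# PCA is scale-sensitive, so standardize the features first.
X_std = StandardScaler().fit_transform(X)

# Fit with all components to inspect the explained-variance curve.
pca_full = PCA().fit(X_std)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 95% (assumed threshold).
n_components = int(np.searchsorted(cum_var, 0.95) + 1)

# Refit with the chosen size; `pca` is the transformation object to keep for later use.
pca = PCA(n_components=n_components).fit(X_std)
X_reduced = pca.transform(X_std)
print(n_components, X_reduced.shape)
```

`PCA(n_components=0.95)` performs the same selection in one call; the explicit version is shown because it also produces the cumulative-variance curve for plotting.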
- Prepares sequential windows for LSTM training
- Trains an LSTM model for PM2.5 forecasting
- Plots predictions and error curves
Output: RMSE, R², and predicted vs true PM2.5 plots
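The sequential windows mentioned above could be built along these lines. This sketch covers only the windowing step and uses NumPy; the helper name `make_windows`, the window length, and the convention that the target is column 0 are assumptions. The resulting `X` has shape `(batch, seq_len, features)`, which is the layout `torch.nn.LSTM` expects when created with `batch_first=True`.

```python
import numpy as np

def make_windows(series, seq_len, horizon=1):
    """Slice a (time, features) array into overlapping input windows and targets.

    X[i] holds steps i .. i+seq_len-1; y[i] is the target column (assumed to be
    column 0) `horizon` steps after the end of that window.
    """
    series = np.asarray(series, dtype=np.float32)
    if series.ndim == 1:
        series = series[:, None]  # treat a 1D series as a single feature
    n = len(series) - seq_len - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for the requested window and horizon")
    X = np.stack([series[i:i + seq_len] for i in range(n)])
    y = np.array([series[i + seq_len + horizon - 1, 0] for i in range(n)],
                 dtype=np.float32)
    return X, y

# Synthetic series 0..9 for demonstration only.
X, y = make_windows(np.arange(10), seq_len=3)
print(X.shape, y.shape)
```

Splitting into train and test sets should be done chronologically on the windows, since shuffling before the split would leak future values into training.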
- Implements a custom Simple1DDiffusionModel
- Trains the DDPM to denoise a latent 1D vector representation of PM2.5 sequences
- Forecasts via iterative reverse sampling
- Benchmarks model predictions against ground truth
Output: DDPM performance metrics and forecast visualizations
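For context on what "training the DDPM to denoise" involves, here is a minimal NumPy sketch of the standard DDPM forward (noising) process. It is not the project's `Simple1DDiffusionModel`: the linear schedule, its endpoints, and `T = 100` are assumptions, chosen so that the signal is almost fully destroyed by the last step. In the standard formulation, training draws a random step `t`, noises a clean window with `q_sample`, and minimises the mean squared error between the true noise `eps` and the network's prediction.

```python
import numpy as np

T = 100                                  # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.2, T)        # linear noise schedule (assumed endpoints)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)           # cumulative product, written abar_t below

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0, 2 * np.pi, 24))   # stand-in for a scaled PM2.5 window
x_t, eps = q_sample(x0, t=50, rng=rng)

# A denoiser eps_model(x_t, t) would be trained so that
# mean((eps - eps_model(x_t, t)) ** 2) is small across random t and x0.
print(alpha_bar[0], alpha_bar[-1])
```

With these schedule values `alpha_bar[-1]` is close to zero, so `x_T` is effectively pure Gaussian noise, which is what reverse sampling starts from.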
- Trains Random Forest and XGBoost regressors on lag feature data
- Compares:
  - Random Forest
  - XGBoost
  - LSTM
  - DDPM
Output: Final RMSE, R², and model performance comparison
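The two reported metrics can be computed as in the sketch below. The numbers are toy values chosen only to show the calculation, not project results, and the names `model_a` and `model_b` are placeholders. `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` give the same quantities; plain NumPy is used here to make the formulas explicit.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def compare(y_true, predictions):
    """Return (name, RMSE, R2) rows sorted by RMSE, best model first."""
    rows = [(name, rmse(y_true, p), r2(y_true, p)) for name, p in predictions.items()]
    return sorted(rows, key=lambda row: row[1])

# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
toy_predictions = {
    "model_a": [1.0, 2.0, 3.0, 5.0],
    "model_b": [1.0, 2.0, 3.0, 4.0],
}
table = compare(y_true, toy_predictions)
for name, err, score in table:
    print(f"{name:10s} RMSE={err:.3f} R2={score:.3f}")
```

For a fair comparison, every model should be scored on the same held-out time range and on the same scale (for example, after inverting any scaling applied during preprocessing).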
- Python 3.10+
- PyTorch – Deep learning and DDPM modeling
- XGBoost – Gradient boosting
- Scikit-learn – Baseline metrics and preprocessing
- Pandas / NumPy / Matplotlib – Data wrangling and visualization
- The DDPM is inspired by the diffusion models behind image generators such as DALL·E 2 and Stable Diffusion, but adapted for 1D regression.
- Instead of generating images, it learns to reverse noise applied to PM2.5 time series.
- Sampling starts from Gaussian noise and refines the sample over 100+ reverse steps using the learned denoiser.
- The model forecasts future values by iteratively cleaning noisy sequences.
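The iterative sampling described above corresponds to the standard DDPM ancestral sampling loop, sketched here in NumPy. The schedule values, `T = 100`, and the variance choice `sigma_t^2 = beta_t` are assumptions. The `oracle_eps` function is a hypothetical stand-in for a trained noise-prediction network: it cheats by knowing the target sequence, which a real model would not, and is used only so the loop can be run end to end.

```python
import numpy as np

T = 100                                  # number of reverse steps (assumed)
betas = np.linspace(1e-4, 0.2, T)        # linear noise schedule (assumed endpoints)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_sample(eps_model, shape, rng):
    """Start from Gaussian noise and denoise step by step (DDPM ancestral sampling)."""
    x = rng.standard_normal(shape)                       # x_T
    for t in range(T - 1, -1, -1):
        eps_hat = eps_model(x, t)                        # predicted noise at step t
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise is added at every step except the last one.
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

# Stand-in for a scaled PM2.5 window the "model" should recover.
target = np.sin(np.linspace(0, 2 * np.pi, 24))

def oracle_eps(x_t, t):
    """Hypothetical perfect denoiser: returns the exact noise relative to `target`."""
    return (x_t - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

sample = reverse_sample(oracle_eps, target.shape, np.random.default_rng(0))
print(np.max(np.abs(sample - target)))
```

With a perfect noise predictor the loop recovers the target sequence; with a trained network the output is a sample from the learned distribution, so repeated sampling gives a spread of plausible forecasts rather than a single point estimate.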
- Calibrate the diffusion noise schedule (beta_t)
- Use a Transformer-based denoiser for better attention over time
- Explore multi-step forecasting (not just one time step ahead)
- Integrate weather and environmental features (e.g., temperature, humidity)
- Deploy as a web app using Streamlit or Flask



