A modular and incremental ETL (Extract–Transform–Load) pipeline for retrieving, cleaning, transforming, validating, and storing weather data from the Open-Meteo API.
This project pulls historical and forecast weather data for multiple cities, cleans and transforms it, performs quality checks, and stores the results in tidy daily and monthly summary datasets.
It supports incremental loading: each run fetches only the data that is new since the last successful run.
Designed to demonstrate:
- ETL architecture
- Incremental data loading
- Modularized Python design
- Automated state tracking
- Data quality validation
- Writing outputs to CSV and Parquet
Key features:
- Retrieve weather data from Open-Meteo (historical + forecast)
- Normalize & transform the raw response into tidy datasets
- Automatically detect and load only missing date ranges
- Generate daily tidy datasets and monthly aggregated summaries
- Perform quality checks before saving outputs
- Write final datasets to CSV and Parquet
- Maintain and update pipeline state (state.json)
- Modular architecture, easy to extend
modular_weather_etl/
│
├── main.py # Main entry point for running the ETL
├── config.py # Configurations (cities, API settings, file paths)
├── open_meteo.py # API interaction logic
├── weather_transform.py # Transformations: tidy data + monthly summary
├── quality.py # Validation / quality checks
├── writer.py # Writing CSV / Parquet results
├── state.py # Incremental loading state handler
│
└── data/
    ├── daily_weather.csv
    ├── monthly_weather.csv
    └── state.json # Stores last loaded date
open_meteo.py
- Determines which date ranges need to be fetched
- Automatically splits requests into:
  - Historical (archive API)
  - Forecast (forecast API)
- Sends requests and returns raw JSON responses
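
The historical/forecast split described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `split_date_range` and the cutoff rule (dates before `today` go to the archive API, dates from `today` onward go to the forecast API) are assumptions.

```python
from datetime import date, timedelta


def split_date_range(start, end, today):
    """Split the inclusive range [start, end] into archive and forecast parts.

    Assumption: dates strictly before `today` are historical, dates from
    `today` onward are forecast. The real pipeline may use another cutoff.
    Returns (historical, forecast); each is a (start, end) tuple or None.
    """
    if start > end:
        raise ValueError("start must not be after end")

    last_historical = today - timedelta(days=1)
    historical = None
    forecast = None

    if start <= last_historical:
        historical = (start, min(end, last_historical))
    if end >= today:
        forecast = (max(start, today), end)
    return historical, forecast


hist, fc = split_date_range(
    date(2024, 5, 28), date(2024, 6, 3), today=date(2024, 6, 1)
)
print(hist)  # (datetime.date(2024, 5, 28), datetime.date(2024, 5, 31))
print(fc)    # (datetime.date(2024, 6, 1), datetime.date(2024, 6, 3))
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the split deterministic and easy to test.
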
weather_transform.py
- Converts raw JSON into a tidy tabular DataFrame
- Cleans column names & units
- Computes new fields like:
  - Temperature range
  - Precipitation sum
- Builds monthly aggregations
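
To make the transformation steps above concrete, here is a dependency-free sketch. The project itself works with DataFrames; plain dictionaries are used here only to keep the example self-contained. The function names, output column names, and the choice of daily variables (`temperature_2m_max`, `temperature_2m_min`, `precipitation_sum`) are assumptions, as is the sample payload.

```python
from collections import defaultdict


def to_daily_rows(city, payload):
    """Turn a column-oriented 'daily' payload into one dict per day."""
    daily = payload["daily"]
    rows = []
    for day, t_max, t_min, precip in zip(
        daily["time"],
        daily["temperature_2m_max"],
        daily["temperature_2m_min"],
        daily["precipitation_sum"],
    ):
        rows.append({
            "city": city,
            "date": day,
            "temp_max_c": t_max,
            "temp_min_c": t_min,
            "temp_range_c": round(t_max - t_min, 2),  # derived field
            "precip_mm": precip,
        })
    return rows


def monthly_summary(rows):
    """Aggregate daily rows per (city, YYYY-MM)."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["city"], row["date"][:7])].append(row)

    summary = []
    for (city, month), items in sorted(groups.items()):
        summary.append({
            "city": city,
            "month": month,
            "days": len(items),
            "avg_temp_max_c": round(
                sum(r["temp_max_c"] for r in items) / len(items), 2
            ),
            "total_precip_mm": round(sum(r["precip_mm"] for r in items), 2),
        })
    return summary


# Made-up sample payload for illustration only.
sample = {
    "daily": {
        "time": ["2024-01-30", "2024-01-31", "2024-02-01"],
        "temperature_2m_max": [5.0, 7.0, 6.5],
        "temperature_2m_min": [1.0, 2.5, 0.5],
        "precipitation_sum": [0.0, 1.2, 3.0],
    }
}
daily_rows = to_daily_rows("Berlin", sample)
monthly_rows = monthly_summary(daily_rows)
```
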
quality.py
- Ensures no missing required columns
- Ensures no empty dataset
- Ensures valid date formats
- Ensures all required metrics are present
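
The checks listed above might look roughly like this. It is a sketch under assumptions: `check_quality`, the `REQUIRED_COLUMNS` tuple, and the row-of-dicts input shape are hypothetical and do not come from the project.

```python
from datetime import date

# Hypothetical set of required columns; the project's actual list may differ.
REQUIRED_COLUMNS = ("city", "date", "temp_max_c", "temp_min_c", "precip_mm")


def check_quality(rows):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    if not rows:
        return ["dataset is empty"]

    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
            continue

        try:
            date.fromisoformat(row["date"])  # expects YYYY-MM-DD
        except (TypeError, ValueError):
            problems.append(f"row {i}: invalid date {row['date']!r}")

        empty = [c for c in REQUIRED_COLUMNS if row[c] is None]
        if empty:
            problems.append(f"row {i}: null values in {empty}")
    return problems


good = {"city": "Berlin", "date": "2024-01-30",
        "temp_max_c": 5.0, "temp_min_c": 1.0, "precip_mm": 0.0}
bad_date = dict(good, date="2024-13-01")

ok_result = check_quality([good])
bad_result = check_quality([good, bad_date])
```

Returning a list of problems instead of raising on the first failure lets the caller log every issue before deciding whether to skip the write step.
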
writer.py
- Writes:
  - daily_weather.csv
  - monthly_weather.csv
  - Equivalent Parquet files
- Handles incremental merge logic
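
One plausible reading of the incremental merge mentioned above is "combine old and new rows, keyed by city and date, with newer rows winning". The sketch below shows that idea only; `merge_daily_rows` and the `(city, date)` key are assumptions, and the actual file writing (CSV and Parquet) is left out.

```python
def merge_daily_rows(existing, new):
    """Merge two lists of row dicts keyed by (city, date).

    Rows in `new` replace rows in `existing` that share the same key,
    so re-fetched days (for example updated forecasts) are not duplicated.
    The result is sorted by city, then date.
    """
    merged = {(r["city"], r["date"]): r for r in existing}
    for r in new:
        merged[(r["city"], r["date"])] = r
    return [merged[key] for key in sorted(merged)]


existing_rows = [
    {"city": "Berlin", "date": "2024-06-01", "temp_max_c": 20.0},
    {"city": "Berlin", "date": "2024-06-02", "temp_max_c": 21.0},
]
new_rows = [
    {"city": "Berlin", "date": "2024-06-02", "temp_max_c": 22.5},  # overwrites
    {"city": "Berlin", "date": "2024-06-03", "temp_max_c": 19.0},  # appended
]
merged_rows = merge_daily_rows(existing_rows, new_rows)
```
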
state.py
- Saves the last successfully loaded date into state.json
- Ensures next runs only fetch missing data
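
The state handling described above could be implemented along these lines. This is a sketch, not the project's code: the helper names and the `last_loaded_date` JSON key are assumptions, since the actual layout of state.json is not shown here.

```python
import json
import tempfile
from datetime import date, timedelta
from pathlib import Path

STATE_KEY = "last_loaded_date"  # hypothetical key name


def load_last_date(path):
    """Return the last loaded date, or None if no state file exists yet."""
    path = Path(path)
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    return date.fromisoformat(data[STATE_KEY])


def save_last_date(path, last_date):
    """Persist the last successfully loaded date as ISO text."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps({STATE_KEY: last_date.isoformat()}, indent=2),
        encoding="utf-8",
    )


def next_start_date(last_date, default_start):
    """First run starts at default_start; later runs resume the day after."""
    if last_date is None:
        return default_start
    return last_date + timedelta(days=1)


# Demonstration in a temporary directory so no real state is touched.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "data" / "state.json"
    first_run = load_last_date(state_file)          # no file yet
    save_last_date(state_file, date(2024, 6, 1))
    second_run = load_last_date(state_file)

resume_from = next_start_date(second_run, default_start=date(2024, 1, 1))
```

Saving the state only after the outputs have been written successfully keeps a failed run from advancing the date and silently skipping data.
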
git clone https://github.com/sshossen/weather_etl.git
cd modular_weather_etl
pip install -r requirements.txt
python main.py

Running the pipeline produces:
data/daily_weather.csv
data/monthly_weather.csv
data/daily_weather.parquet
data/monthly_weather.parquet
data/state.json