Modular Weather ETL Pipeline

A modular and incremental ETL (Extract–Transform–Load) pipeline for retrieving, cleaning, transforming, validating, and storing weather data from the Open-Meteo API.


🌦️ Project Overview

This project pulls historical and forecast weather data for multiple cities, cleans and transforms it, performs quality checks, and stores the results in tidy daily and monthly summary datasets.

It supports incremental loading, meaning it only fetches new data since the last successful run.

Designed to demonstrate:

  • ETL architecture
  • Incremental data loading
  • Modularized Python design
  • Automated state tracking
  • Data quality validation
  • Writing outputs to CSV and Parquet

🎯 Project Goals

  • Retrieve weather data from Open-Meteo (historical + forecast)
  • Normalize & transform the raw response into tidy datasets
  • Automatically detect and load only missing date ranges
  • Generate daily tidy datasets and monthly aggregated summaries
  • Perform quality checks before saving outputs
  • Write final datasets to CSV and Parquet
  • Maintain and update pipeline state (state.json)
  • Keep the architecture modular and easy to extend

🧱 Project Structure

modular_weather_etl/
│
├── main.py                 # Main entry point for running the ETL
├── config.py               # Configurations (cities, API settings, file paths)
├── open_meteo.py           # API interaction logic
├── weather_transform.py    # Transformations: tidy data + monthly summary
├── quality.py              # Validation / quality checks
├── writer.py               # Writing CSV / Parquet results
├── state.py                # Incremental loading state handler
│
└── data/
    ├── daily_weather.csv
    ├── monthly_weather.csv
    ├── daily_weather.parquet
    ├── monthly_weather.parquet
    └── state.json          # Stores the last loaded date

⚙️ How It Works (Architecture)

1. Extract

open_meteo.py

  • Determines which date ranges need to be fetched
  • Automatically splits requests into:
    • Historical (archive API)
    • Forecast (forecast API)
  • Sends requests and returns raw JSON responses
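
A minimal sketch of this extract step, using the documented Open-Meteo endpoints (the fetch_daily helper and its parameter choices are illustrative, not necessarily the module's actual interface):

import requests

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"
DAILY_VARS = "temperature_2m_max,temperature_2m_min,precipitation_sum"

def fetch_daily(lat: float, lon: float, start: str, end: str, historical: bool) -> dict:
    """Fetch daily weather for one city and date range as raw JSON."""
    url = ARCHIVE_URL if historical else FORECAST_URL
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start,   # ISO dates, e.g. "2024-01-01"
        "end_date": end,
        "daily": DAILY_VARS,
        "timezone": "UTC",
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()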

2. Transform

weather_transform.py

  • Converts raw JSON into a tidy tabular DataFrame
  • Cleans column names & units
  • Computes new fields like:
    • Temperature range
    • Precipitation sum
  • Builds monthly aggregations
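
A compact sketch of the transform step with pandas (the renamed columns and the monthly statistics chosen here are assumptions for illustration):

import pandas as pd

def tidy_daily(raw: dict, city: str) -> pd.DataFrame:
    """Flatten the raw Open-Meteo 'daily' block into one row per date."""
    df = pd.DataFrame(raw["daily"]).rename(columns={
        "time": "date",
        "temperature_2m_max": "temp_max_c",
        "temperature_2m_min": "temp_min_c",
        "precipitation_sum": "precip_mm",
    })
    df["date"] = pd.to_datetime(df["date"])
    df["city"] = city
    df["temp_range_c"] = df["temp_max_c"] - df["temp_min_c"]  # derived field
    return df

def monthly_summary(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate tidy daily rows into per-city monthly statistics."""
    month = daily["date"].dt.to_period("M").rename("month")
    return (daily.groupby(["city", month])
                 .agg(avg_temp_max_c=("temp_max_c", "mean"),
                      avg_temp_min_c=("temp_min_c", "mean"),
                      precip_sum_mm=("precip_mm", "sum"))
                 .reset_index())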

3. Quality Checks

quality.py

  • Verifies that no required columns are missing
  • Rejects empty datasets
  • Validates date formats
  • Confirms that all required metrics are present
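
A sketch of what these checks might look like (the column set and the fail-fast ValueError style are assumptions for illustration):

import pandas as pd

REQUIRED_COLUMNS = ["city", "date", "temp_max_c", "temp_min_c", "precip_mm"]

def validate(df: pd.DataFrame) -> None:
    """Raise ValueError if the dataset fails any basic quality check."""
    if df.empty:
        raise ValueError("dataset is empty")
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise ValueError("'date' column is not a valid datetime")
    if df[REQUIRED_COLUMNS].isna().any().any():
        raise ValueError("required metrics contain nulls")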

4. Load

writer.py

  • Writes:
    • daily_weather.csv
    • monthly_weather.csv
    • Equivalent Parquet files
  • Handles incremental merge logic
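
One plausible shape for the incremental merge (assuming a city/date pair uniquely identifies a row; Parquet output requires pyarrow or fastparquet):

from pathlib import Path
import pandas as pd

def write_incremental(new_rows: pd.DataFrame, csv_path: Path) -> None:
    """Merge new rows into the existing output, dedupe, write CSV + Parquet."""
    if csv_path.exists():
        existing = pd.read_csv(csv_path, parse_dates=["date"])
        combined = pd.concat([existing, new_rows], ignore_index=True)
        # Keep the newest copy when a city/date pair was re-fetched.
        combined = combined.drop_duplicates(subset=["city", "date"], keep="last")
    else:
        combined = new_rows
    combined = combined.sort_values(["city", "date"])
    combined.to_csv(csv_path, index=False)
    combined.to_parquet(csv_path.with_suffix(".parquet"), index=False)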

5. State Tracking

state.py

  • Saves the last successfully loaded date into state.json
  • Ensures next runs only fetch missing data
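
A minimal state-handler sketch (the last_loaded_date key is an assumption; the real state.json schema may differ):

import json
from pathlib import Path

STATE_PATH = Path("data/state.json")

def load_last_date(default: str) -> str:
    """Return the last successfully loaded date, or a default on first run."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())["last_loaded_date"]
    return default

def save_last_date(date: str) -> None:
    """Persist the last successfully loaded date after a clean run."""
    STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
    STATE_PATH.write_text(json.dumps({"last_loaded_date": date}, indent=2))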

🚀 How to Run the Project

1. Clone the repository

git clone https://github.com/sshossen/weather_etl.git
cd weather_etl

2. Install the dependencies

pip install -r requirements.txt

3. Run the ETL pipeline

python main.py

4. Outputs appear in:

data/daily_weather.csv
data/monthly_weather.csv
data/daily_weather.parquet
data/monthly_weather.parquet
data/state.json
