Modular Weather ETL Pipeline

A modular and incremental ETL (Extract–Transform–Load) pipeline for retrieving, cleaning, transforming, validating, and storing weather data from the Open-Meteo API.


🌦️ Project Overview

This project pulls historical and forecast weather data for multiple cities, cleans and transforms it, performs quality checks, and stores the results in tidy daily and monthly summary datasets.

It supports incremental loading, meaning it only fetches new data since the last successful run.

Designed to demonstrate:

  • ETL architecture
  • Incremental data loading
  • Modularized Python design
  • Automated state tracking
  • Data quality validation
  • Writing outputs to CSV and Parquet

🎯 Project Goals

  • Retrieve weather data from Open-Meteo (historical + forecast)
  • Normalize & transform the raw response into tidy datasets
  • Automatically detect and load only missing date ranges
  • Generate daily tidy datasets and monthly aggregated summaries
  • Perform quality checks before saving outputs
  • Write final datasets to CSV and Parquet
  • Maintain and update pipeline state (state.json)
  • Keep the architecture modular and easy to extend

🧱 Project Structure

modular_weather_etl/
│
├── main.py                 # Main entry point for running the ETL
├── config.py               # Configurations (cities, API settings, file paths)
├── open_meteo.py           # API interaction logic
├── weather_transform.py    # Transformations: tidy data + monthly summary
├── quality.py              # Validation / quality checks
├── writer.py               # Writing CSV / Parquet results
├── state.py                # Incremental loading state handler
│
└── data/
    ├── daily_weather.csv
    ├── monthly_weather.csv
    ├── daily_weather.parquet
    ├── monthly_weather.parquet
    └── state.json          # Stores the last loaded date

⚙️ How It Works (Architecture)

1. Extract

open_meteo.py

  • Determines which date ranges need to be fetched
  • Automatically splits requests into:
    • Historical (archive API)
    • Forecast (forecast API)
  • Sends requests and returns raw JSON responses
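
A minimal sketch of this extract step, using the documented Open-Meteo endpoints (the fetch_daily helper and its parameter choices are illustrative, not necessarily the module's actual interface):

import requests

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"
FORECAST_URL = "https://api.open-meteo.com/v1/forecast"
DAILY_VARS = "temperature_2m_max,temperature_2m_min,precipitation_sum"

def fetch_daily(lat: float, lon: float, start: str, end: str, historical: bool) -> dict:
    """Fetch daily weather for one city and date range as raw JSON."""
    url = ARCHIVE_URL if historical else FORECAST_URL
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start,   # ISO dates, e.g. "2024-01-01"
        "end_date": end,
        "daily": DAILY_VARS,
        "timezone": "UTC",
    }
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()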

2. Transform

weather_transform.py

  • Converts raw JSON into a tidy tabular DataFrame
  • Cleans column names & units
  • Computes new fields like:
    • Temperature range
    • Precipitation sum
  • Builds monthly aggregations
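
A compact sketch of the transform step with pandas (the renamed columns and the monthly statistics chosen here are assumptions for illustration):

import pandas as pd

def tidy_daily(raw: dict, city: str) -> pd.DataFrame:
    """Flatten the raw Open-Meteo 'daily' block into one row per date."""
    df = pd.DataFrame(raw["daily"]).rename(columns={
        "time": "date",
        "temperature_2m_max": "temp_max_c",
        "temperature_2m_min": "temp_min_c",
        "precipitation_sum": "precip_mm",
    })
    df["date"] = pd.to_datetime(df["date"])
    df["city"] = city
    df["temp_range_c"] = df["temp_max_c"] - df["temp_min_c"]  # derived field
    return df

def monthly_summary(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate tidy daily rows into per-city monthly statistics."""
    month = daily["date"].dt.to_period("M").rename("month")
    return (daily.groupby(["city", month])
                 .agg(avg_temp_max_c=("temp_max_c", "mean"),
                      avg_temp_min_c=("temp_min_c", "mean"),
                      precip_sum_mm=("precip_mm", "sum"))
                 .reset_index())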

3. Quality Checks

quality.py

  • Verifies that no required columns are missing
  • Rejects empty datasets
  • Validates date formats
  • Confirms that all required metrics are present
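
A sketch of what these checks might look like (the column set and the fail-fast ValueError style are assumptions for illustration):

import pandas as pd

REQUIRED_COLUMNS = ["city", "date", "temp_max_c", "temp_min_c", "precip_mm"]

def validate(df: pd.DataFrame) -> None:
    """Raise ValueError if the dataset fails any basic quality check."""
    if df.empty:
        raise ValueError("dataset is empty")
    missing = set(REQUIRED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    if not pd.api.types.is_datetime64_any_dtype(df["date"]):
        raise ValueError("'date' column is not a valid datetime")
    if df[REQUIRED_COLUMNS].isna().any().any():
        raise ValueError("required metrics contain nulls")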

4. Load

writer.py

  • Writes:
    • daily_weather.csv
    • monthly_weather.csv
    • Equivalent Parquet files
  • Handles incremental merge logic
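
One plausible shape for the incremental merge (assuming a city/date pair uniquely identifies a row; Parquet output requires pyarrow or fastparquet):

from pathlib import Path
import pandas as pd

def write_incremental(new_rows: pd.DataFrame, csv_path: Path) -> None:
    """Merge new rows into the existing output, dedupe, write CSV + Parquet."""
    if csv_path.exists():
        existing = pd.read_csv(csv_path, parse_dates=["date"])
        combined = pd.concat([existing, new_rows], ignore_index=True)
        # Keep the newest copy when a city/date pair was re-fetched.
        combined = combined.drop_duplicates(subset=["city", "date"], keep="last")
    else:
        combined = new_rows
    combined = combined.sort_values(["city", "date"])
    combined.to_csv(csv_path, index=False)
    combined.to_parquet(csv_path.with_suffix(".parquet"), index=False)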

5. State Tracking

state.py

  • Saves the last successfully loaded date into state.json
  • Ensures next runs only fetch missing data
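
A minimal state-handler sketch (the last_loaded_date key is an assumption; the real state.json schema may differ):

import json
from pathlib import Path

STATE_PATH = Path("data/state.json")

def load_last_date(default: str) -> str:
    """Return the last successfully loaded date, or a default on first run."""
    if STATE_PATH.exists():
        return json.loads(STATE_PATH.read_text())["last_loaded_date"]
    return default

def save_last_date(date: str) -> None:
    """Persist the last successfully loaded date after a clean run."""
    STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
    STATE_PATH.write_text(json.dumps({"last_loaded_date": date}, indent=2))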

🚀 How to Run the Project

1. Clone the repository

git clone https://github.com/sshossen/weather_etl.git
cd weather_etl

2. Install the dependencies

pip install -r requirements.txt

3. Run the ETL pipeline

python main.py

4. Outputs appear in:

data/daily_weather.csv
data/monthly_weather.csv
data/daily_weather.parquet
data/monthly_weather.parquet
data/state.json
