A high-performance Rust tool for converting UK Met Office MIDAS weather datasets from BADC-CSV format to optimized Parquet files for efficient analysis.
MIDAS Processor is part of a climate research toolkit designed to process historical UK weather data from the CEDA Archive. It transforms the original BADC-CSV format into modern, optimized Parquet files with significant performance improvements for analytical workloads.
This tool works as part of a complete climate data processing pipeline:
- midas-fetcher: Downloads MIDAS datasets from CEDA
- midas-processor (this tool): Converts BADC-CSV to optimized Parquet
- Analysis tools: Python/R analysis of the resulting Parquet files
MIDAS (Met Office Integrated Data Archive System) contains historical weather observations from 1000+ UK land-based weather stations, spanning from the late 19th century to present day. The datasets include:
- Daily rainfall observations (161k+ files, ~5GB)
- Daily temperature observations (41k+ files, ~2GB)
- Wind observations (12k+ files, ~9GB)
- Solar radiation observations (3k+ files, ~2GB)
The generated Parquet files are optimized for analytical workloads:
- Large row groups (500K rows)
- Smart compression (Snappy, ZSTD, LZ4 options)
- Column statistics for query pruning
- Memory-efficient streaming for large datasets
- Automatic dataset discovery from midas-fetcher cache
- Schema detection and validation
- Interactive dataset selection when run without arguments
- Simple command-line interface with sensible defaults
- Comprehensive progress reporting with file counts and timing
- Verbose mode for debugging and optimization insights
- Rust 1.85+ (uses Rust 2024 edition features)
- 8GB+ RAM recommended for large datasets
To install from source:

```bash
git clone https://github.com/your-org/midas-processor
cd midas-processor
cargo install --path .
```

Or install directly from crates.io:

```bash
cargo install midas-processor
```

To get started:

- Download datasets using midas-fetcher
- Convert with auto-discovery:
```bash
midas-processor
```

This will show available datasets and let you select one interactively.
```bash
# Interactive dataset selection
midas-processor

# Process specific dataset
midas-processor /path/to/uk-daily-rain-obs-202407

# Custom output location
midas-processor --output-path ./analysis/rain_data.parquet

# High compression for archival
midas-processor --compression zstd

# Schema analysis only (no conversion)
midas-processor --discovery-only --verbose

# Combine options
midas-processor /path/to/dataset --compression lz4 --verbose
```

| Option | Description | Default |
|---|---|---|
| `DATASET_PATH` | Path to MIDAS dataset directory (optional) | Auto-discover |
| `--output-path` | Custom output location | `../parquet/{dataset}.parquet` |
| `--compression` | Compression algorithm (snappy/zstd/lz4/none) | `snappy` |
| `--discovery-only` | Analyze schema without converting | `false` |
| `--verbose` | Enable detailed logging | `false` |
- Station-Timestamp Sorting: Data is sorted by `station_id` then `ob_end_time` for optimal query performance
- Large Row Groups: 500K rows per group for better compression and fewer metadata operations
- Column Statistics: Enabled for all columns to allow query engines to skip irrelevant data
- Memory Streaming: Processes datasets larger than available RAM through streaming execution
- Header validation: Ensures BADC-CSV headers are correctly parsed
- Schema consistency: Validates column structures across files
- Error reporting: Detailed error messages with file locations
- Missing data handling: Graceful handling of incomplete or corrupted files
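For readers unfamiliar with BADC-CSV, the toy parser below sketches the structure that header validation works with: metadata records before a `data` marker, then column names and rows, terminated by `end data`. This is a simplified illustration, not the tool's actual parser, and real MIDAS files carry many more metadata records.

```python
import csv
import io

# Minimal hypothetical BADC-CSV file
SAMPLE = """\
Conventions,G,BADC-CSV,1
title,G,daily rainfall
data
station_id,ob_end_time,prcp_amt
00009,2023-01-01,1.2
end data
"""

def split_badc_csv(text):
    """Split a BADC-CSV document into header records and data rows."""
    header, rows, in_data = [], [], False
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue
        if record[0] == "data":
            in_data = True          # everything after this is the data section
        elif record[0] == "end data":
            break                   # trailing marker closes the data section
        elif in_data:
            rows.append(record)
        else:
            header.append(record)
    return header, rows

header, rows = split_badc_csv(SAMPLE)
print(rows[0])  # first data-section record is the column names
```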
```python
import polars as pl

# Fast station-based query
df = (
    pl.scan_parquet("rain_data.parquet")
    .filter(pl.col("station_id") == "00009")
    .collect()
)

# Time range analysis
monthly_avg = (
    pl.scan_parquet("temperature_data.parquet")
    .filter(pl.col("ob_end_time").dt.year() == 2023)
    .group_by(["station_id", pl.col("ob_end_time").dt.month()])
    .agg(pl.col("air_temperature").mean())
    .collect()
)
```

```python
import pandas as pd

# Read with automatic optimization
df = pd.read_parquet("rain_data.parquet")

# Station-specific analysis
station_data = df[df["station_id"] == "00009"]
```

```r
library(arrow)
library(dplyr)

# Lazy evaluation with Arrow
rain_data <- open_dataset("rain_data.parquet")

# Efficient aggregation
monthly_totals <- rain_data %>%
  filter(year(ob_end_time) == 2023) %>%
  group_by(station_id, month = month(ob_end_time)) %>%
  summarise(total_rain = sum(prcp_amt, na.rm = TRUE)) %>%
  collect()
```

**Memory Issues**

```bash
# For very large datasets, ensure sufficient RAM or use streaming
midas-processor --verbose  # Monitor memory usage
```

**Performance Issues**

```bash
# Check if storage is the bottleneck
midas-processor --verbose  # Shows processing rates
```

**Cache Directory Not Found**

```bash
# Ensure midas-fetcher has been run first
ls ~/Library/Application\ Support/midas-fetcher/cache/  # macOS
ls ~/.config/midas-fetcher/cache/                       # Linux
```

- "No MIDAS datasets found in cache": Run midas-fetcher first to download datasets
- "Failed to parse header": BADC-CSV file may be corrupted, check source data
- "Configuration file not found": Dataset type not recognized, check file structure
This project is licensed under the MIT License - see the LICENSE file for details.
See CHANGELOG.md for detailed version history and release notes.
We welcome contributions! Please see our contributing guidelines for details.
```bash
git clone https://github.com/your-org/midas-processor
cd midas-processor
cargo build
cargo test
```

- Use `cargo fmt` for formatting
- Ensure `cargo clippy` passes without warnings
- Add tests for new functionality
- Update documentation for API changes
If you use this tool in your research, please cite:
```bibtex
@software{midas_processor,
  title = {MIDAS Processor: High-Performance Climate Data Processing},
  author = {Richard Lyon},
  year = {2025},
  url = {https://github.com/rjl-climate/midas_processor}
}
```

- Documentation: See the docs/ directory
- Issues: Report bugs via GitHub Issues
- Discussions: Ask questions in GitHub Discussions
- UK Met Office: For providing the MIDAS datasets
- CEDA: For hosting and maintaining the climate data archive
- BADC: For developing the CSV format standards
- Polars Project: For the high-performance DataFrame library enabling fast processing