This repository contains a Rust implementation of the One Billion Row Challenge. The goal is to efficiently parse and process a large dataset with one billion rows, leveraging Rust's performance capabilities.
The challenge is to read a text file of 1 billion rows, each a key-value pair of the form `Station;Temperature`, where the key is a weather station name and the value is a temperature reading from that station. The program must compute the minimum, average, and maximum temperature for each station and write the results to a JSON file (`output.json`), mapping each station name to a string in the format `min/avg/max`.
The input looks like this:

```
...
Juba;9.2
Dar es Salaam;26.1
Honiara;22.4
San Salvador;6.9
Nashville;21.9
Vientiane;29.4
Edinburgh;22
Gaborone;37.2
...
```
The output looks like this:

```
{
...
Kankan=-22.2/26.5/76.4,
Kano=-35.8/26.4/83.6,
Kansas City=-23.0/12.5/45.2,
Karachi=-26.4/26.0/77.2,
Karonga=-23.9/24.4/72.7,
Kathmandu=-42.6/18.3/75.2,
Khartoum=-14.2/29.9/80.4,
Kingston=-34.3/27.4/86.2,
Kinshasa=-17.4/25.3/71.3,
...
}
```
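The per-station aggregation described above can be sketched in Rust as follows. This is a minimal single-threaded illustration, not the repository's optimized code; the `Stats` struct and `aggregate` function are names chosen here for clarity:

```rust
use std::collections::HashMap;

// Running statistics for one weather station.
struct Stats {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

// Aggregate "Station;Temperature" lines into per-station stats.
fn aggregate(input: &str) -> HashMap<&str, Stats> {
    let mut map: HashMap<&str, Stats> = HashMap::new();
    for line in input.lines() {
        let (name, value) = line.split_once(';').expect("malformed line");
        let temp: f64 = value.parse().expect("malformed temperature");
        let s = map.entry(name).or_insert(Stats {
            min: f64::INFINITY,
            max: f64::NEG_INFINITY,
            sum: 0.0,
            count: 0,
        });
        s.min = s.min.min(temp);
        s.max = s.max.max(temp);
        s.sum += temp;
        s.count += 1;
    }
    map
}

fn main() {
    let stats = aggregate("Juba;9.2\nJuba;11.0\nHoniara;22.4");
    let juba = &stats["Juba"];
    // min/avg/max, matching the output format shown above.
    println!(
        "Juba={:.1}/{:.1}/{:.1}",
        juba.min,
        juba.sum / juba.count as f64,
        juba.max
    ); // prints "Juba=9.2/10.1/11.0"
}
```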
| | Cold Cache | Warm Cache |
|---|---|---|
| ⏱️ Performance | 6-10 seconds | 1.8-2.0 seconds |
Latest Benchmarks:
- Calculations only: ~1.814s
- Full challenge: ~1.820s
Benchmarks were run on my MacBook Pro 14" M3 Max with 36GB Unified Memory and 14 CPU cores (10 performance, 4 efficiency).
- SIMD acceleration for ultra-fast parsing (responsible for around 10-20% of the performance gain)
- Multi-threaded processing using `rayon` (responsible for most of the performance gain)
- Optimized HashMap lookups using `hashbrown` and `ahash`
- Optimized temperature parsing by treating each value as an `i16` scaled by 10, avoiding float parsing entirely
- Efficient memory management
- Handles 1 billion+ rows efficiently and quickly
- Benchmark suite with Criterion for accurate performance measurements
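The scaled-integer parsing trick listed above can be sketched like this. It is an illustrative version, not the repository's actual routine, and it assumes every temperature carries exactly one decimal digit (a reading such as `22` would need an extra scaling step); `parse_temp_x10` is a hypothetical name:

```rust
// Parse a temperature like "-26.4" directly into tenths of a degree
// (-264) as an i16, skipping float parsing entirely. Assumes the
// value always has exactly one decimal digit.
fn parse_temp_x10(s: &[u8]) -> i16 {
    let (neg, digits) = match s.first() {
        Some(b'-') => (true, &s[1..]),
        _ => (false, s),
    };
    let mut v: i16 = 0;
    for &b in digits {
        if b != b'.' {
            v = v * 10 + (b - b'0') as i16;
        }
    }
    if neg { -v } else { v }
}

fn main() {
    assert_eq!(parse_temp_x10(b"26.1"), 261);
    assert_eq!(parse_temp_x10(b"-35.8"), -358);
    assert_eq!(parse_temp_x10(b"0.0"), 0);
}
```

Keeping sums in scaled integers also makes the final average a single integer division followed by one conversion back to a decimal string.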
```sh
# Run in release mode
cargo run --release

# Run benchmarks
cargo bench

# Generate sample data (e.g., 1000 rows)
cargo run --example generate 1000

# Generate the full 1B-row challenge data
cargo run --example generate 1000000000
```

Project structure:

- `src/` - Core source code
- `benches/` - Benchmarks
- `examples/` - Data generators
- `1b_measurements.txt` - Input data (ignored in git)
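For reference, the core of a generator like those in `examples/` might look roughly like this. It is a hypothetical sketch: the station list is abbreviated, and deterministic pseudo-temperatures stand in for whatever RNG the real generator uses:

```rust
use std::io::{self, BufWriter, Write};

// Write `rows` lines of "Station;Temperature" data. Deterministic
// pseudo-temperatures keep the sketch dependency-free; a real
// generator would draw from an RNG and a larger station list.
fn write_rows<W: Write>(out: &mut W, rows: usize) -> io::Result<()> {
    let stations = ["Juba", "Honiara", "Nashville", "Gaborone"];
    for i in 0..rows {
        let station = stations[i % stations.len()];
        let t_x10 = ((i * 73) % 700) as i64 - 350; // tenths of a degree
        let sign = if t_x10 < 0 { "-" } else { "" };
        let a = t_x10.abs();
        writeln!(out, "{};{}{}.{}", station, sign, a / 10, a % 10)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    write_rows(&mut out, 5)
}
```

Buffering the writer matters here: at a billion rows, unbuffered `writeln!` calls would dominate the generator's runtime.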
PRs welcome! Please benchmark your changes.
This project is licensed under the MIT License. See the LICENSE file for details.