An ML-powered compiler optimization system that uses machine-learning models (XGBoost and a Transformer-based sequence model) to predict optimal LLVM pass sequences for C programs, specifically targeting RISC-V hardware. The goal: beat the standard optimization levels (-O0/-O1/-O2/-O3) with intelligent, program-specific optimizations!
```bash
# 1. Install LLVM/Clang with RISC-V support (version 18+)
sudo apt install clang llvm llvm-tools

# Verify RISC-V support
llc --version | grep riscv

# 2. Install QEMU for RISC-V emulation
sudo apt install qemu-user qemu-user-static

# 3. Install the RISC-V toolchain (recommended)
sudo apt install gcc-riscv64-linux-gnu g++-riscv64-linux-gnu

# 4. Install Python dependencies
pip install xgboost scikit-learn pandas numpy tqdm

# Or use a virtual environment:
python3 -m venv venv
source venv/bin/activate
pip install xgboost scikit-learn pandas numpy tqdm
```

```bash
# Verify setup
cd tools
chmod +x test_tools.sh run_full_generation.sh
./test_tools.sh

# Test feature extraction
python3 feature_extractor.py ../training_programs/01_insertion_sort.c

# Test pass sequence generation
python3 pass_sequence_generator.py -n 5 -s mixed
```

```bash
# Quick test (10 sequences per program, ~10 min)
./run_full_generation.sh --test

# Full dataset (200 sequences per program, 4-10 hours)
./run_full_generation.sh
```
```bash
# Custom configuration
python3 generate_training_data.py \
    --programs-dir ../training_programs \
    --output-dir ./training_data \
    --num-sequences 200 \
    --strategy mixed \
    --max-workers 4 \
    --baselines
```
Windows:

1. Install LLVM/Clang
   - Download from https://releases.llvm.org/
   - Get version 18+ with RISC-V support
   - Add `C:\Program Files\LLVM\bin` to PATH
2. Install QEMU
   - Download from https://www.qemu.org/download/#windows
   - Install to `C:\Program Files\qemu` and add it to PATH
3. Install Python 3.8+
   - Download from https://www.python.org/downloads/
   - Check "Add Python to PATH" during installation
4. Install Python dependencies:
   ```bat
   pip install xgboost scikit-learn pandas numpy tqdm
   ```
```bat
cd tools

REM Verify setup
python test_tools.py

REM Test feature extraction
python feature_extractor.py ..\training_programs\01_insertion_sort.c

REM Test pass sequence generation
python pass_sequence_generator.py -n 5 -s mixed

REM Quick test
python generate_training_data.py --programs-dir ..\training_programs --output-dir .\training_data -n 10 --strategy mixed

REM Full dataset
python generate_training_data.py --programs-dir ..\training_programs --output-dir .\training_data -n 200 --strategy mixed --max-workers 4 --baselines
```

This project uses machine learning (XGBoost) to learn which LLVM compiler optimization passes work best for different types of programs on the RISC-V architecture. Instead of relying on one-size-fits-all optimization levels like -O2 or -O3, it predicts custom pass sequences tailored to each program's characteristics.
1. Training Data Generation
   - Compile 30+ training programs to RISC-V
   - Try 200+ different pass sequences per program
   - Measure execution time and binary size
   - Extract ~50 features from LLVM IR
2. Model Training
   - Train XGBoost on program features → performance
   - Learn which passes work best for which program types
   - Model: `program_features → best_pass_sequence`
3. Evaluation
   - Test on 20 unseen programs
   - Compare ML predictions vs `-O0`/`-O1`/`-O2`/`-O3`
   - Goal: beat `-O3` on >50% of test programs
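Step 3's success criterion (beat `-O3` on more than half the test programs) reduces to a per-program comparison. A minimal sketch, with an illustrative results layout (the field names here are not the tools' actual output format):

```python
def beats_o3(results):
    """results: {program: {"ml_time": float, "o3_time": float}}.
    Return the fraction of programs where the ML sequence runs faster
    than the -O3 baseline."""
    wins = sum(1 for r in results.values() if r["ml_time"] < r["o3_time"])
    return wins / len(results)

example = {
    "quicksort": {"ml_time": 0.050, "o3_time": 0.054},  # ML wins
    "mergesort": {"ml_time": 0.061, "o3_time": 0.058},  # -O3 wins
}
print(beats_o3(example))  # 0.5
```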
```
hackman/
├── README.md                        # This file
├── training_programs/               # 30+ programs for training (~176 programs)
│   ├── 01_insertion_sort.c
│   ├── 02_selection_sort.c
│   └── ...
├── test_programs/                   # 20 programs for evaluation
│   ├── 01_quicksort.c
│   ├── 02_mergesort.c
│   └── ...
├── tools/                           # ML pipeline tools
│   ├── feature_extractor.py         # Extract IR features
│   ├── pass_sequence_generator.py   # Generate pass sequences
│   ├── hybrid_sequence_generator.py # Hybrid pass + machine optimization
│   ├── machine_flags_generator_v2.py # RISC-V machine-level flags (ABI support)
│   ├── generate_training_data.py    # Main data generation script
│   ├── train_passformer.py          # Train ML model
│   ├── combined_model.py            # Combined pass + machine optimization model
│   ├── test_tools.sh                # Verify setup (Linux)
│   ├── run_full_generation.sh       # Convenience script (Linux)
│   └── training_data/               # Generated datasets
│       ├── training_data_hybrid.json # Hybrid pass + machine data
│       └── baselines.json           # -O0/-O1/-O2/-O3 results
├── combined_model.py                # Model training script
└── train_passformer.py              # Transformer-based model training
```
```bash
# Extract features from a C program
python3 feature_extractor.py program.c -o features.json

# Specify the RISC-V target
python3 feature_extractor.py program.c --target-arch riscv64

# Show all features
python3 feature_extractor.py program.c --verbose
```

```bash
# Generate random sequences
python3 pass_sequence_generator.py -n 10 -s random

# Mixed strategy (random + synergy-based)
python3 pass_sequence_generator.py -n 20 -s mixed

# Genetic algorithm
python3 pass_sequence_generator.py -n 50 -s genetic
```
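Conceptually, the `random` strategy just samples passes from a pool. A minimal sketch (the pass pool and function name here are illustrative, not the tool's actual API; the pass names are standard LLVM passes):

```python
import random

# Small illustrative pool; the real generator draws from a much larger list.
PASS_POOL = ["mem2reg", "simplifycfg", "gvn", "instcombine", "sroa",
             "licm", "loop-unroll", "sccp", "dce", "inline"]

def random_sequences(n, min_length=5, max_length=15, seed=0):
    """Mimic `pass_sequence_generator.py -n <n> -s random`: draw n sequences
    of random passes with lengths in [min_length, max_length]."""
    rng = random.Random(seed)
    return [[rng.choice(PASS_POOL)
             for _ in range(rng.randint(min_length, max_length))]
            for _ in range(n)]

seqs = random_sequences(10)
```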
```bash
# Custom length range
python3 pass_sequence_generator.py -n 10 --min-length 5 --max-length 15
```

```bash
# Generate hybrid sequences with machine-level flags
python3 hybrid_sequence_generator.py -n 10 --strategy mixed

# Include machine flags
python3 hybrid_sequence_generator.py -n 20 --include-machine-flags
```

```bash
# Generate machine-level configs with the default ABI
python3 machine_flags_generator_v2.py -n 5

# Vary the ABI for more diversity (lp64/lp64f/lp64d)
python3 machine_flags_generator_v2.py -n 10 --vary-abi

# For 32-bit RISC-V
python3 machine_flags_generator_v2.py -n 5 --target riscv32 --vary-abi
```

```bash
# Full pipeline with all options
python3 generate_training_data.py \
    --programs-dir ../training_programs \
    --output-dir ./training_data \
    --num-sequences 200 \
    --strategy mixed \
    --target-arch riscv64 \
    --max-workers 4 \
    --baselines \
    --runs 3
```
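Under the hood, each data point is one compile-and-run measurement: lower the program to IR, apply the candidate passes, compile for RISC-V, and run under QEMU. The helper below only builds the command strings (the helper itself and the file paths are illustrative; the real script also handles timeouts and averages over `--runs`). The `opt -passes=` form is LLVM's new-pass-manager syntax.

```python
import shlex

def build_commands(src, passes, out="a.out",
                   triple="riscv64-unknown-linux-gnu",
                   sysroot="/usr/riscv64-linux-gnu"):
    """Return the shell commands that measure one (program, sequence) pair.
    Illustrative sketch; see generate_training_data.py for the real logic."""
    ir = src.rsplit(".", 1)[0] + ".ll"
    opt_ir = src.rsplit(".", 1)[0] + ".opt.ll"
    return [
        # Lower C to LLVM IR without optimizations
        f"clang --target={triple} -O0 -emit-llvm -S {src} -o {ir}",
        # Apply the candidate pass sequence
        f"opt -passes={shlex.quote(','.join(passes))} {ir} -S -o {opt_ir}",
        # Compile the optimized IR to a RISC-V binary
        f"clang --target={triple} {opt_ir} -o {out}",
        # Run under QEMU user-mode emulation
        f"qemu-riscv64 -L {sysroot} ./{out}",
    ]

for cmd in build_commands("insertion_sort.c", ["mem2reg", "simplifycfg", "gvn"]):
    print(cmd)
```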
```bash
# Quick test run
python3 generate_training_data.py \
    --programs-dir ../training_programs \
    --output-dir ./training_data \
    -n 10 \
    --no-parallel
```

Example entry from the generated training data:

```json
{
  "metadata": {
    "num_programs": 30,
    "num_sequences": 200,
    "strategy": "mixed",
    "total_data_points": 5123
  },
  "data": [
    {
      "program": "insertion_sort",
      "sequence_id": 0,
      "features": {
        "total_instructions": 87,
        "num_load": 23,
        "num_store": 15,
        "cyclomatic_complexity": 5,
        "memory_intensity": 0.437
      },
      "pass_sequence": ["mem2reg", "simplifycfg", "gvn"],
      "machine_config": {
        "abi": "lp64d",
        "config": {"m": true, "a": true, "f": true, "d": true, "c": true}
      },
      "execution_time": 0.0234,
      "binary_size": 8192
    }
  ]
}
```

Example `baselines.json` with the -O0/-O1/-O2/-O3 results:

```json
{
  "insertion_sort": {
    "O0": {"time": 0.145, "size": 12288},
    "O1": {"time": 0.089, "size": 9216},
    "O2": {"time": 0.067, "size": 8704},
    "O3": {"time": 0.054, "size": 8192}
  }
}
```

```bash
# Train on hybrid data (pass sequences + machine flags)
python3 combined_model.py \
    --data tools/training_data/training_data_hybrid.json \
    --baselines tools/training_data/baselines.json \
    --output models/combined_model.pkl
```
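Both the dataset and the baselines are plain JSON (schemas as in the examples above), so they are easy to mine directly, for instance to find the fastest recorded sequence per program. The helper below is illustrative, not part of the tools:

```python
import json  # used in the commented-out usage example below

def best_sequences(dataset):
    """Map each program to the fastest pass sequence recorded for it.
    `dataset` follows the training-data schema: {"data": [{"program": ...,
    "pass_sequence": [...], "execution_time": ...}, ...]}."""
    best = {}
    for entry in dataset["data"]:
        prog = entry["program"]
        if prog not in best or entry["execution_time"] < best[prog]["execution_time"]:
            best[prog] = entry
    return {p: e["pass_sequence"] for p, e in best.items()}

# Usage:
# with open("tools/training_data/training_data_hybrid.json") as f:
#     print(best_sequences(json.load(f)))
```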
```bash
# Evaluate model
python3 combined_model.py \
    --data tools/training_data/training_data_hybrid.json \
    --baselines tools/training_data/baselines.json \
    --evaluate
```

Train PassFormer (Transformer-based sequence model):
```bash
python3 train_passformer.py \
    --data tools/training_data/training_data_hybrid.json \
    --epochs 50 \
    --batch-size 32 \
    --output models/passformer.pth
```

Issue: "clang: unknown target triple 'riscv64'"
```bash
# Verify RISC-V support
llc --version | grep riscv
# If missing, reinstall LLVM with the RISC-V backend
```

Issue: "qemu-riscv64: not found"

```bash
sudo apt install qemu-user-static
which qemu-riscv64  # Should show /usr/bin/qemu-riscv64
```

Issue: "error while loading shared libraries"

```bash
# Install the RISC-V sysroot
sudo apt install gcc-riscv64-linux-gnu
# Or run with an explicit library path
qemu-riscv64 -L /usr/riscv64-linux-gnu ./program
```

Issue: "clang not recognized" (Windows)

- Add `C:\Program Files\LLVM\bin` to PATH
- Restart the terminal after changing PATH

Issue: "qemu-riscv64.exe not found"

- Install QEMU for Windows
- Add `C:\Program Files\qemu` to PATH

Issue: Python package installation fails

```bat
REM Use the --user flag
pip install --user xgboost scikit-learn pandas numpy tqdm

REM Or create a virtual environment
python -m venv venv
venv\Scripts\activate
pip install xgboost scikit-learn pandas numpy tqdm
```

- Small test (10 sequences × 30 programs): ~10 minutes
- Medium run (50 sequences × 30 programs): ~1 hour
- Full dataset (200 sequences × 30 programs): 4-10 hours
- Success rate: ~85% (some sequences fail to compile)
- Data points: 5,000-6,000 valid samples
- File size: 10-50 MB JSON
- Model training: 5-30 minutes
- Evaluation: 10-60 minutes on 20 test programs
| Metric | Target |
|---|---|
| Beat -O3 | >50% of test programs |
| Average speedup | 5-10% faster than -O3 |
| Generalization | Works on unseen programs |
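The "average speedup" row is measured against the `-O3` baseline. Given `baselines.json`-style timings and a measured ML-sequence time (the numbers below are illustrative), the computation is just:

```python
def speedup_vs_o3(ml_time, o3_time):
    """Percent improvement of the ML-chosen sequence over -O3
    (positive means the ML sequence is faster)."""
    return (o3_time - ml_time) / o3_time * 100

# Illustrative values in the spirit of baselines.json:
print(round(speedup_vs_o3(ml_time=0.050, o3_time=0.054), 1))  # 7.4
```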
- Check the tools work: `./test_tools.sh` (Linux) or `python test_tools.py` (Windows)
- Verify RISC-V support: `llc --version | grep riscv`
- Test a simple compilation:
  ```bash
  echo 'int main() { return 0; }' > test.c
  clang --target=riscv64-unknown-linux-gnu test.c -o test
  qemu-riscv64 test
  ```
- Use the `--help` flag: all tools support `--help` for detailed options
- training_programs/: Programs used to train the ML model
- test_programs/: Programs used to evaluate the model (unseen during training)
- tools/training_data/: Generated training datasets
- combined_model.py: Main model training script
- train_passformer.py: Transformer-based model training
- LLVM Pass Documentation: https://llvm.org/docs/Passes.html
- RISC-V ISA: https://riscv.org/technical/specifications/
- QEMU User Mode: https://www.qemu.org/docs/master/user/main.html
- XGBoost: https://xgboost.readthedocs.io/
Linux:

```bash
cd tools
./run_full_generation.sh
```

Windows:

```bat
cd tools
python generate_training_data.py --programs-dir ..\training_programs --output-dir .\training_data -n 200 --strategy mixed
```

Good luck beating -O3!