# Heidi Engine

Heidi Engine is a research project focused on building an autonomous coding agent through iterative self-improvement, leveraging teacher-student distillation and advanced data pipelines.
## Contents

- Overview
- Installation
- Data Collection Pipeline
- Model Training
- Monitoring & Dashboard
- C++ Core Optimizations
- Hyperparameter Optimization (HPO)
- System Requirements
- Troubleshooting
## Overview

Heidi Engine automates the process of collecting, cleaning, and validating code data, then trains and evaluates models in a closed loop. It supports multi-language validation, distributed monitoring, and high-performance C++ extensions for efficiency.
## Installation

Clone the repository and install dependencies:

```bash
git clone https://github.com/heidi-dang/heidi-engine.git
cd heidi-engine
pip install -e .
```

## Data Collection Pipeline

The core data pipeline is managed by `loop_repos.sh`, which automates:
- Scraping GitHub repositories
- Generating and validating synthetic training data
- Filtering and deduplication
Key features:

- **Stack presets:**
  - `--stack python` (Python: `.py`, `.ipynb`)
  - `--stack cpp` (C++: `.cpp`, `.h`)
  - `--stack vite` (modern frontend: `.ts`, `.tsx`, `.vue`, `.svelte`)
  - `--stack web` (web: `.js`, `.ts`)
  - `--stack go` (Go: `.go`)
- **Smart filtering:** Excludes homework-like repos, checks for permissive licenses, and limits file sizes.
- **Golden repos:** Add curated, high-quality repos with `--golden`.
- **Resume support:** Continue previous runs with `--resume`.
- **Global deduplication:** Merge and deduplicate with `--dedupe`.
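The actual filtering and deduplication logic lives in `loop_repos.sh`; as a rough Python sketch of the two ideas (the size cap and function names here are illustrative, not the pipeline's real values):

```python
import hashlib

MAX_FILE_BYTES = 100_000  # illustrative size cap, not the pipeline's actual limit

def filter_and_dedupe(samples):
    """Drop oversized samples, then drop exact duplicates by content hash."""
    seen = set()
    kept = []
    for text in samples:
        if len(text.encode("utf-8")) > MAX_FILE_BYTES:
            continue  # smart filtering: skip oversized files
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # global deduplication: skip exact duplicates
        seen.add(digest)
        kept.append(text)
    return kept

print(filter_and_dedupe(["a", "a", "b"]))  # → ['a', 'b']
```

Hashing full file contents catches only exact duplicates; near-duplicate detection would need fuzzier techniques such as MinHash.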
Default: each round processes up to 33 samples. Override with `--rounds` and `--samples`.
Example:

```bash
./scripts/loop_repos.sh \
  --stack python \
  --max 100 \
  --rounds 1 \
  --samples 1000 \
  --resume \
  --golden \
  --dedupe
```

Upload to Hugging Face:

```bash
./scripts/loop_repos.sh --stack python --max 100 --push-to-hub my-org/my-dataset
```

## Model Training

Train models as part of the data loop (`--full`), or standalone for more control:
```bash
./scripts/train_only.py \
  --data ./autotrain_repos/merged_dataset.jsonl \
  --base-model microsoft/phi-2 \
  --steps 1000 \
  --out-dir ./my_model_output
```

## Monitoring & Dashboard

Track progress in real time with two dashboard options.

```bash
./scripts/dashboard.sh
```

Start the telemetry server:

```bash
python3 -m heidi_engine.telemetry init --server
```

Access at: http://127.0.0.1:7779/
Features:
- Real-time stats: generation, validation, failure rates
- Training metrics: loss, steps
- GPU VRAM monitoring
- API cost estimates
- Dark mode
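As a toy illustration of how a failure-rate stat like the one above can be derived from raw counters (the class and field names are hypothetical, not the telemetry module's actual API):

```python
from dataclasses import dataclass

@dataclass
class LoopStats:
    """Toy per-worker counters; the real telemetry server aggregates these."""
    generated: int = 0
    validated: int = 0
    failed: int = 0

    def failure_rate(self) -> float:
        """Fraction of checked samples that failed validation."""
        total = self.validated + self.failed
        return self.failed / total if total else 0.0

stats = LoopStats(generated=120, validated=90, failed=10)
print(f"{stats.failure_rate():.0%}")  # → 10%
```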
Monitor distributed training from a single dashboard.

On the dashboard machine:

```bash
python3 -m heidi_engine.telemetry init --server
```

View at http://127.0.0.1:7779/.

On worker machines:

```bash
./scripts/loop_repos.sh --stack python --monitor http://<dashboard-ip>:7779
```

## C++ Core Optimizations

High-performance C++ extensions accelerate data processing and resource management:
- Speed: Deduplication and transpose up to 3.4x faster than Python
- Efficiency: Arena allocation, vectorized compression
- Kernel Integration: Submodule linking with heidi-kernel
- Monitoring: Real-time GPU VRAM tracking via CUDA
See `docs/cpp_optimizations.md` for details.

## Hyperparameter Optimization (HPO)

Integrated Optuna-powered sweeps search for optimal training parameters:
- **Automated search:** Explores `learning_rate`, `batch_size`, and `lora_r`
- **Resource awareness:** Skips trials if GPU VRAM < 1 GB
- **Dashboard integration:** Broadcasts best params in real time
- **Fail-safe:** Infinite-loss fallback for OOM or script crashes
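The production sweep is driven by Optuna; the fail-safe idea can be sketched with a plain random search over the same parameters (every name below is illustrative, not the project's API):

```python
import math
import random

# Illustrative search space mirroring the swept parameters.
SEARCH_SPACE = {
    "learning_rate": (1e-5, 1e-3),
    "batch_size": [4, 8, 16],
    "lora_r": [8, 16, 32],
}

def sample_params(rng):
    lo, hi = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": rng.uniform(lo, hi),
        "batch_size": rng.choice(SEARCH_SPACE["batch_size"]),
        "lora_r": rng.choice(SEARCH_SPACE["lora_r"]),
    }

def run_trial(params, train_fn):
    """Fail-safe wrapper: any crash (e.g. OOM) scores as infinite loss."""
    try:
        return train_fn(params)
    except Exception:
        return math.inf

def sweep(train_fn, n_trials=20, seed=0):
    rng = random.Random(seed)
    best_loss, best_params = math.inf, None
    for _ in range(n_trials):
        params = sample_params(rng)
        loss = run_trial(params, train_fn)
        if loss < best_loss:
            best_loss, best_params = loss, params
    return best_loss, best_params
```

Because a crashed trial simply reports infinite loss, one OOM never aborts the whole sweep; Optuna achieves the same effect by catching the exception and recording the trial as failed.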
Example:
```bash
./scripts/train_only.py --data dataset.jsonl --optuna --n-trials 20
```

## System Requirements

Compiler requirements for validation:
- `g++` (C++)
- `node` (JavaScript/TypeScript)
- `go` (Go)
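A quick way to check that these binaries are on your `PATH` (an illustrative helper, not part of the repo):

```python
import shutil

# Validator binaries and the languages they cover.
REQUIRED = {"g++": "C++", "node": "JavaScript/TypeScript", "go": "Go"}

def missing_compilers():
    """Return the required binaries that are not found on PATH."""
    return [name for name in REQUIRED if shutil.which(name) is None]

for name in missing_compilers():
    print(f"missing {name} ({REQUIRED[name]} validation will fail)")
```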
## Troubleshooting

- **Connection issues:** Check firewall and telemetry server status
- **Authentication errors:** Set the `TELEMETRY_PASS` environment variable
- **Validation failures:** Ensure compilers are installed; fallback logging is enabled
For more, see `docs/walkthrough_v1.md`.