Description
Problem
The current experiment loop in `program.md` uses git branches for state management and a sequential single-agent loop. This works but has concrete engineering limitations:
- **Sequential execution on one GPU.** The loop is modify → train → evaluate → decide → repeat. With a 5-min time budget, throughput is ~12 experiments/hour. On a multi-GPU node (e.g. 8×H100), 7 GPUs sit idle.
- **Flat state.** `results.tsv` is append-only with no queryable structure. There's no way to ask "which architecture changes had the highest keep rate?" or "what near-misses should I combine?" without writing custom analysis code.
- **No crash recovery.** If the agent's context fills up or the process dies, the experiment state is lost. There's no way to resume from where you left off; you restart from scratch and re-read the TSV.
- **No search adaptation.** The agent uses raw LLM reasoning to pick the next experiment. There's no feedback loop that says "you've tried 15 hyperparameter changes with 0 keeps, switch to architecture changes" or "these two near-misses both touched attention, so combine them."
- **Git as experiment DAG.** `git reset --hard` on discard loses the diff history of failed experiments. You can't later go back and ask "what did experiment #47 actually change?" because that history is gone.
Proposed approach
Model the experiment loop as three primitives:
- **Worker**: A process bound to a GPU. Executes `uv run train.py`, parses the `---` metrics block from stdout, handles timeouts. One worker per GPU.
- **Function**: A stateless handler. `experiment::register`, `experiment::complete`, `search::adapt`, `pool::acquire`, `report::summary`, etc. All experiment state lives in a KV store keyed by scope (experiments, lineage, near_misses, gpu_pool, strategy).
- **Trigger**: Wires functions to events. HTTP triggers expose a REST API for external agents. Event triggers auto-adapt the search strategy when experiments complete. Cron triggers run GPU health checks.
The external agent (Claude, Codex, whatever) calls the same `uv run train.py` command but wraps it with API calls:
```bash
# Before training
curl -X POST localhost:3111/api/experiment/register -d '{
  "tag":"mar11", "hypothesis":"...", "category":"architecture",
  "commit_sha":"a1b2c3d", "diff_summary":"DEPTH 8->12"
}'

# After training
curl -X POST localhost:3111/api/experiment/complete -d '{
  "experiment_id":"exp-xxx", "val_bpb":0.993, "peak_vram_mb":45060, ...
}'
# Returns: { "improved": true, "action": "keep_commit", "best_val_bpb": 0.993 }
```

Concrete function inventory (23 total)
Experiment lifecycle (7):
- `experiment::setup`: init tag, branch, strategy
- `experiment::register`: record hypothesis + diff before training
- `experiment::complete`: record metrics, auto-decide keep/discard, detect near-misses (within 0.002 BPB of best)
- `experiment::crash`: track consecutive crashes per tag, abort after 3
- `experiment::history`: query by tag, status, limit
- `experiment::best`: current best for a tag
- `experiment::near_misses`: experiments within striking distance
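For illustration, here is a minimal sketch of the keep/discard decision that `experiment::complete` is described as making. The 0.002 BPB near-miss window and the `improved`/`action`/`best_val_bpb` response fields come from this proposal; the function shape and the extra `near_miss` field are assumptions, not the actual implementation.

```python
# Hypothetical sketch of the keep/discard/near-miss decision in
# experiment::complete. Only the 0.002 BPB window and the
# improved/action/best_val_bpb fields come from the proposal.
NEAR_MISS_DELTA = 0.002  # "within 0.002 BPB of best"

def decide(val_bpb: float, best_val_bpb: float | None) -> dict:
    """Decide what the agent should do with the commit after a run."""
    if best_val_bpb is None or val_bpb < best_val_bpb:
        return {"improved": True, "action": "keep_commit", "best_val_bpb": val_bpb}
    delta = val_bpb - best_val_bpb
    return {
        "improved": False,
        "action": "discard_commit",
        "near_miss": delta <= NEAR_MISS_DELTA,  # candidate for a later combine experiment
        "best_val_bpb": best_val_bpb,
    }
```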
Search strategy (4):
- `search::strategy`: get current mode
- `search::set_strategy`: manual override
- `search::adapt`: auto-adapt based on recent keep rate, crash rate, near-miss count. Transitions: explore → exploit (high keep rate), explore → combine (plateau + near-misses), explore → ablation (plateau, no near-misses), any → exploit (high crash rate)
- `search::suggest_direction`: analyze category-level stats, identify underexplored areas, suggest next experiment
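A sketch of the `search::adapt` transition logic under assumed thresholds; only the transitions named above are encoded, and every numeric cutoff is illustrative.

```python
# Hypothetical sketch of search::adapt. The numeric thresholds are made up;
# only the transitions listed in the proposal are encoded.
def adapt(mode: str, keep_rate: float, crash_rate: float, near_misses: int) -> str:
    if crash_rate >= 0.5:                          # any -> exploit (high crash rate)
        return "exploit"
    if mode == "explore":
        if keep_rate >= 0.3:                       # explore -> exploit (high keep rate)
            return "exploit"
        if keep_rate == 0.0 and near_misses >= 2:  # explore -> combine (plateau + near-misses)
            return "combine"
        if keep_rate == 0.0:                       # explore -> ablation (plateau, no near-misses)
            return "ablation"
    return mode
```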
GPU pool (6):
- `pool::register_gpu`: worker self-registers on startup with nvidia-smi metadata
- `pool::heartbeat`: 30s heartbeat, mark offline after 60s stale
- `pool::list`: all workers with status (idle/training/offline)
- `pool::acquire`: atomic claim of an idle GPU for an experiment
- `pool::release`: return GPU to pool
- `pool::deregister`: remove on shutdown
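A sketch of what an atomic `pool::acquire` amounts to, with a plain dict standing in for the `gpu_pool:{gpu_id}` scope and a lock standing in for the engine's serialized KV writes; names and structure are illustrative.

```python
# Hypothetical sketch of pool::acquire. A dict keyed like gpu_pool:{gpu_id}
# stands in for the KV scope; the lock models the engine's serialized writes.
import threading

_pool_lock = threading.Lock()

def acquire(gpu_pool: dict, experiment_id: str) -> str | None:
    """Atomically claim the first idle GPU for an experiment, or return None."""
    with _pool_lock:
        for gpu_id, gpu in gpu_pool.items():
            if gpu["status"] == "idle":
                gpu["status"] = "training"
                gpu["current_experiment_id"] = experiment_id
                return gpu_id
    return None
```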
Reporting (4):
- `report::summary`: full stats: keep rate, category breakdown, BPB progression, training time
- `report::tsv`: export in the original autoresearch TSV format for compatibility
- `report::diff`: compare two experiments (delta BPB, params, VRAM)
- `report::tags`: list all run tags
GPU execution (2, per worker):
- `gpu::train`: spawn `uv run train.py` with CUDA_VISIBLE_DEVICES isolation, parse the metrics block, enforce a kill timeout
- `gpu::health`: nvidia-smi temperature/memory/utilization
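A sketch of the `gpu::train` behavior described above: pin the process to one GPU via CUDA_VISIBLE_DEVICES, enforce a hard timeout, and pull metrics out of stdout. The metrics-block format, timeout value, and return shape are assumptions.

```python
# Hypothetical sketch of gpu::train: run the training script pinned to one
# GPU, kill it on timeout, and parse `key value` lines printed after a `---`
# marker. The metrics format and default timeout are assumptions.
import os
import re
import subprocess

def train(gpu_index: int, timeout_s: int = 360) -> dict:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_index)}
    try:
        proc = subprocess.run(
            ["uv", "run", "train.py"],
            env=env, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "metrics": {}}

    metrics, in_block = {}, False
    for line in proc.stdout.splitlines():
        if line.strip() == "---":  # start of the metrics block
            in_block = True
        elif in_block and (m := re.match(r"(\w+)\s+(-?\d+(?:\.\d+)?)", line.strip())):
            metrics[m.group(1)] = float(m.group(2))
    return {"status": "ok" if proc.returncode == 0 else "crashed", "metrics": metrics}
```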
KV state schema

```
experiments:{id}   → { id, tag, parent_id, commit_sha, hypothesis, category,
                       val_bpb, peak_vram_mb, status, gpu_id, diff_summary, ... }
lineage:{tag}      → [experiment_id, experiment_id, ...] (ordered)
best:{tag}         → { experiment_id, val_bpb, commit_sha }
near_misses:{id}   → { experiment_id, val_bpb, delta, hypothesis, category }
gpu_pool:{gpu_id}  → { id, gpu_index, gpu_name, vram_mb, status, current_experiment_id }
strategy:{tag}     → { mode, explore_ratio, temperature, reason }
tags:{name}        → { name, branch, best_val_bpb, total_experiments, kept_experiments }
crashes:{tag}      → consecutive crash count (int)
```
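As a reading aid for the schema, a sketch of the KV writes that `experiment::complete` implies, with a plain dict standing in for the engine's KV store; the key naming follows the schema above, while the helper itself is hypothetical.

```python
# Hypothetical helper showing the KV writes experiment::complete implies,
# using a plain dict with the same key naming as the schema above.
def record_completion(kv: dict, exp: dict) -> None:
    tag = exp["tag"]
    kv[f"experiments:{exp['id']}"] = exp                   # full experiment record
    kv.setdefault(f"lineage:{tag}", []).append(exp["id"])  # ordered lineage
    best = kv.get(f"best:{tag}")
    if best is None or exp["val_bpb"] < best["val_bpb"]:
        kv[f"best:{tag}"] = {
            "experiment_id": exp["id"],
            "val_bpb": exp["val_bpb"],
            "commit_sha": exp["commit_sha"],
        }
```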
Multi-GPU parallelism
With N GPU workers, N agents can run experiments simultaneously on the same tag. Each agent:
- Calls `pool::acquire` to claim a GPU
- Runs training on that GPU
- Calls `experiment::complete` (KV writes are serialized)
- Calls `pool::release`
The search strategy adapts globally across all parallel experiments.
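A sketch of one agent iteration against the HTTP API. Only the `/api/experiment/register` and `/api/experiment/complete` routes appear in this proposal; the `/api/pool/*` routes, and any payload fields beyond those shown in the curl examples and the KV schema, are assumptions.

```python
# Hypothetical agent-side wrapper for one iteration. The /api/experiment/*
# routes mirror the curl examples; the /api/pool/* routes and fields such as
# gpu_index/gpu_id (taken from the KV schema) are assumptions.
import requests

API = "http://localhost:3111/api"

def run_one(tag: str, hypothesis: str, commit_sha: str, train_fn) -> dict:
    gpu = requests.post(f"{API}/pool/acquire", json={"tag": tag}).json()
    exp = requests.post(f"{API}/experiment/register", json={
        "tag": tag, "hypothesis": hypothesis,
        "category": "architecture", "commit_sha": commit_sha,
    }).json()
    try:
        metrics = train_fn(gpu["gpu_index"])  # runs `uv run train.py` on the claimed GPU
        result = requests.post(f"{API}/experiment/complete", json={
            "experiment_id": exp["experiment_id"], **metrics,
        }).json()
    finally:
        requests.post(f"{API}/pool/release", json={"gpu_id": gpu["gpu_id"]})
    return result  # e.g. {"improved": True, "action": "keep_commit", ...}
```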
What stays the same
- `train.py` is still the only file agents modify
- `prepare.py` is still read-only
- 5-minute fixed time budget
- val_bpb as the single metric
- Git branches for version control
- External agents (Claude, Codex) drive the loop
- `program.md` as the agent instruction file
Implementation
Two polyglot workers:
- Orchestrator (Python, iii-sdk): All 21 orchestration functions + triggers
- GPU Worker (Rust, iii-sdk): Per-GPU process, training execution + health
The engine itself is iii-engine — Worker/Function/Trigger runtime with built-in KV state, PubSub, cron, and HTTP trigger modules.
Reference implementation: https://github.com/iii-hq/n-autoresearch
Tradeoffs
Added complexity: Two worker processes + iii-engine runtime vs. zero infrastructure today. Justified only if you're running multi-GPU or want persistent experiment state across sessions.
Not proposed: Built-in LLM, auto-modification of train.py, or any change to the training code itself. The agent is still external — this only structures the infrastructure around it.