
Feat: Structured experiment orchestration via Worker/Function/Trigger primitives #169

@rohitg00

Description


Problem

The current experiment loop in program.md uses git branches for state management and a sequential single-agent loop. This works but has concrete engineering limitations:

  1. Sequential execution on one GPU. The loop is modify → train → evaluate → decide → repeat. With a 5-min time budget, throughput is ~12 experiments/hour. On a multi-GPU node (e.g. 8×H100), 7 GPUs sit idle.

  2. Flat state. results.tsv is append-only with no queryable structure. There's no way to ask "which architecture changes had the highest keep rate?" or "what near-misses should I combine?" without writing custom analysis code.

  3. No crash recovery. If the agent's context fills up or the process dies, the experiment state is lost. There's no way to resume from where you left off — you restart from scratch and re-read the TSV.

  4. No search adaptation. The agent uses raw LLM reasoning to pick the next experiment. There's no feedback loop that says "you've tried 15 hyperparameter changes with 0 keeps, switch to architecture changes" or "these two near-misses both touched attention — combine them."

  5. Git as experiment DAG. git reset --hard on discard loses the diff history of failed experiments. You can't later go back and ask "what did experiment #47 actually change?" — it's gone.

Proposed approach

Model the experiment loop as three primitives:

  • Worker: A process bound to a GPU. Executes uv run train.py, parses the --- metrics block from stdout, handles timeouts. One worker per GPU.
  • Function: A stateless handler. experiment::register, experiment::complete, search::adapt, pool::acquire, report::summary, etc. All experiment state lives in a KV store keyed by scope (experiments, lineage, near_misses, gpu_pool, strategy).
  • Trigger: Wires functions to events. HTTP triggers expose a REST API for external agents. Event triggers auto-adapt search strategy when experiments complete. Cron triggers run GPU health checks.

The external agent (Claude, Codex, whatever) calls the same uv run train.py command but wraps it with API calls:

# Before training
curl -X POST localhost:3111/api/experiment/register -d '{
  "tag":"mar11", "hypothesis":"...", "category":"architecture",
  "commit_sha":"a1b2c3d", "diff_summary":"DEPTH 8->12"
}'

# After training
curl -X POST localhost:3111/api/experiment/complete -d '{
  "experiment_id":"exp-xxx", "val_bpb":0.993, "peak_vram_mb":45060, ...
}'
# Returns: { "improved": true, "action": "keep_commit", "best_val_bpb": 0.993 }

Concrete function inventory (23 total)

Experiment lifecycle (7):

  • experiment::setup — init tag, branch, strategy
  • experiment::register — record hypothesis + diff before training
  • experiment::complete — record metrics, auto-decide keep/discard, detect near-misses (within 0.002 BPB of best)
  • experiment::crash — track consecutive crashes per tag, abort after 3
  • experiment::history — query by tag, status, limit
  • experiment::best — current best for a tag
  • experiment::near_misses — experiments within striking distance
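The keep/discard decision and near-miss detection in experiment::complete can be sketched as below. Only the 0.002 BPB window and the response shape come from this issue; the dict standing in for the engine's KV scopes and the function signature are illustrative assumptions.

```python
# Minimal sketch of the experiment::complete decision logic.
# A plain dict stands in for the engine's KV store; lower BPB is better.
NEAR_MISS_DELTA = 0.002  # "within 0.002 BPB of best" window from the issue

def complete(store: dict, tag: str, experiment_id: str, val_bpb: float) -> dict:
    """Record metrics, auto-decide keep/discard, and flag near-misses."""
    best = store.get(f"best:{tag}")
    if best is None or val_bpb < best["val_bpb"]:
        # New best for this tag: keep the commit.
        store[f"best:{tag}"] = {"experiment_id": experiment_id, "val_bpb": val_bpb}
        return {"improved": True, "action": "keep_commit", "best_val_bpb": val_bpb}
    delta = val_bpb - best["val_bpb"]
    if delta <= NEAR_MISS_DELTA:
        # Within striking distance of the best: remember it for later combination.
        store[f"near_misses:{experiment_id}"] = {
            "experiment_id": experiment_id, "val_bpb": val_bpb, "delta": delta,
        }
    return {"improved": False, "action": "discard", "best_val_bpb": best["val_bpb"]}
```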

Search strategy (4):

  • search::strategy — get current mode
  • search::set_strategy — manual override
  • search::adapt — auto-adapt based on recent keep rate, crash rate, near-miss count. Transitions: explore → exploit (high keep rate), explore → combine (plateau + near-misses), explore → ablation (plateau, no near-misses), any → exploit (high crash rate)
  • search::suggest_direction — analyze category-level stats, identify underexplored areas, suggest next experiment

GPU pool (6):

  • pool::register_gpu — worker self-registers on startup with nvidia-smi metadata
  • pool::heartbeat — 30s heartbeat, mark offline after 60s stale
  • pool::list — all workers with status (idle/training/offline)
  • pool::acquire — atomic claim of idle GPU for an experiment
  • pool::release — return GPU to pool
  • pool::deregister — remove on shutdown

Reporting (4):

  • report::summary — full stats: keep rate, category breakdown, BPB progression, training time
  • report::tsv — export in original autoresearch TSV format for compatibility
  • report::diff — compare two experiments (delta BPB, params, VRAM)
  • report::tags — list all run tags
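As one example of the reporting surface, report::diff reduces to computing deltas between two experiment records. The field names come from the experiments:{id} schema below; the exact output shape is a hypothetical sketch.

```python
# Hypothetical report::diff: deltas between two experiment records.
def diff(a: dict, b: dict) -> dict:
    """Compare experiment b against baseline a (negative delta_val_bpb = better)."""
    return {
        "delta_val_bpb": round(b["val_bpb"] - a["val_bpb"], 6),
        "delta_peak_vram_mb": b["peak_vram_mb"] - a["peak_vram_mb"],
        "commits": (a["commit_sha"], b["commit_sha"]),
    }
```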

GPU execution (2, per worker):

  • gpu::train — spawn uv run train.py with CUDA_VISIBLE_DEVICES isolation, parse metrics block, enforce kill timeout
  • gpu::health — nvidia-smi temperature/memory/utilization
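A rough Python equivalent of what the Rust gpu::train handler does: pin the child process to one GPU via CUDA_VISIBLE_DEVICES, enforce the time budget, and parse the `---`-delimited metrics block from stdout. The key=value layout inside the block is an assumption about train.py's output format.

```python
# Sketch of gpu::train: GPU isolation, kill timeout, metrics parsing.
import os
import subprocess

def train_on_gpu(gpu_index: int, timeout_s: int = 330) -> dict:
    """Run one training job pinned to a single GPU; raises TimeoutExpired on overrun."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_index)}
    proc = subprocess.run(
        ["uv", "run", "train.py"],
        env=env, capture_output=True, text=True, timeout=timeout_s,
    )
    return parse_metrics(proc.stdout)

def parse_metrics(stdout: str) -> dict:
    """Collect key=value pairs between the opening and closing `---` markers."""
    metrics, inside = {}, False
    for line in stdout.splitlines():
        if line.strip() == "---":
            if inside:
                break               # closing marker: stop collecting
            inside = True           # opening marker: start collecting
            continue
        if inside and "=" in line:
            key, _, value = line.partition("=")
            metrics[key.strip()] = value.strip()
    return metrics
```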

KV state schema

experiments:{id}   → { id, tag, parent_id, commit_sha, hypothesis, category,
                       val_bpb, peak_vram_mb, status, gpu_id, diff_summary, ... }
lineage:{tag}      → [experiment_id, experiment_id, ...]  (ordered)
best:{tag}         → { experiment_id, val_bpb, commit_sha }
near_misses:{id}   → { experiment_id, val_bpb, delta, hypothesis, category }
gpu_pool:{gpu_id}  → { id, gpu_index, gpu_name, vram_mb, status, current_experiment_id }
strategy:{tag}     → { mode, explore_ratio, temperature, reason }
tags:{name}        → { name, branch, best_val_bpb, total_experiments, kept_experiments }
crashes:{tag}      → consecutive crash count (int)

Multi-GPU parallelism

With N GPU workers, N agents can run experiments simultaneously on the same tag. Each agent:

  1. Calls pool::acquire to claim a GPU
  2. Runs training on that GPU
  3. Calls experiment::complete (KV writes are serialized)
  4. Calls pool::release

The search strategy adapts globally across all parallel experiments.
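The four steps above can be sketched as one agent-side loop iteration. The callables are stand-ins for the HTTP triggers (pool::acquire, gpu::train, experiment::complete, pool::release); the try/finally guarantees the GPU returns to the pool even if training crashes.

```python
# Hypothetical per-agent loop body; the four callables stand in for API calls.
def run_one_experiment(pool_acquire, train_once, experiment_complete, pool_release):
    """Claim a GPU, train, record the result, and always release the GPU."""
    gpu_id = pool_acquire()                      # step 1: claim an idle GPU
    if gpu_id is None:
        return None                              # no idle GPU: back off and retry
    try:
        metrics = train_once(gpu_id)             # step 2: train on the claimed GPU
        return experiment_complete(metrics)      # step 3: serialized KV write
    finally:
        pool_release(gpu_id)                     # step 4: return the GPU to the pool
```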

What stays the same

  • train.py is still the only file agents modify
  • prepare.py is still read-only
  • 5-minute fixed time budget
  • val_bpb as the single metric
  • Git branches for version control
  • External agents (Claude, Codex) drive the loop
  • program.md as the agent instruction file

Implementation

Two polyglot workers:

  • Orchestrator (Python, iii-sdk): All 21 orchestration functions + triggers
  • GPU Worker (Rust, iii-sdk): Per-GPU process, training execution + health

The engine itself is iii-engine — Worker/Function/Trigger runtime with built-in KV state, PubSub, cron, and HTTP trigger modules.

Reference implementation: https://github.com/iii-hq/n-autoresearch

Tradeoffs

Added complexity: Two worker processes + iii-engine runtime vs. zero infrastructure today. Justified only if you're running multi-GPU or want persistent experiment state across sessions.

Not proposed: Built-in LLM, auto-modification of train.py, or any change to the training code itself. The agent is still external — this only structures the infrastructure around it.
