
Feat: Structured experiment orchestration via Worker/Function/Trigger primitives #169

@rohitg00

Description


Problem

The current experiment loop in program.md uses git branches for state management and a sequential single-agent loop. This works but has concrete engineering limitations:

  1. Sequential execution on one GPU. The loop is modify → train → evaluate → decide → repeat. With a 5-min time budget, throughput is ~12 experiments/hour. On a multi-GPU node (e.g. 8×H100), 7 GPUs sit idle.

  2. Flat state. results.tsv is append-only with no queryable structure. There's no way to ask "which architecture changes had the highest keep rate?" or "what near-misses should I combine?" without writing custom analysis code.

  3. No crash recovery. If the agent's context fills up or the process dies, the experiment state is lost. There's no way to resume from where you left off — you restart from scratch and re-read the TSV.

  4. No search adaptation. The agent uses raw LLM reasoning to pick the next experiment. There's no feedback loop that says "you've tried 15 hyperparameter changes with 0 keeps, switch to architecture changes" or "these two near-misses both touched attention — combine them."

  5. Git as experiment DAG. git reset --hard on discard loses the diff history of failed experiments. You can't later go back and ask "what did experiment #47 actually change?" — it's gone.

Proposed approach

Model the experiment loop as three primitives:

  • Worker: A process bound to a GPU. Executes uv run train.py, parses the --- metrics block from stdout, handles timeouts. One worker per GPU.
  • Function: A stateless handler. experiment::register, experiment::complete, search::adapt, pool::acquire, report::summary, etc. All experiment state lives in a KV store keyed by scope (experiments, lineage, near_misses, gpu_pool, strategy).
  • Trigger: Wires functions to events. HTTP triggers expose a REST API for external agents. Event triggers auto-adapt search strategy when experiments complete. Cron triggers run GPU health checks.

The external agent (Claude, Codex, whatever) calls the same uv run train.py command but wraps it with API calls:

# Before training
curl -X POST localhost:3111/api/experiment/register -d '{
  "tag":"mar11", "hypothesis":"...", "category":"architecture",
  "commit_sha":"a1b2c3d", "diff_summary":"DEPTH 8->12"
}'

# After training
curl -X POST localhost:3111/api/experiment/complete -d '{
  "experiment_id":"exp-xxx", "val_bpb":0.993, "peak_vram_mb":45060, ...
}'
# Returns: { "improved": true, "action": "keep_commit", "best_val_bpb": 0.993 }

Concrete function inventory (23 total)

Experiment lifecycle (7):

  • experiment::setup — init tag, branch, strategy
  • experiment::register — record hypothesis + diff before training
  • experiment::complete — record metrics, auto-decide keep/discard, detect near-misses (within 0.002 BPB of best)
  • experiment::crash — track consecutive crashes per tag, abort after 3
  • experiment::history — query by tag, status, limit
  • experiment::best — current best for a tag
  • experiment::near_misses — experiments within striking distance
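The keep/discard decision and near-miss detection in experiment::complete can be sketched as below. Only the 0.002 BPB window and the response shape come from this issue; the dict standing in for the engine's KV scopes and the function signature are illustrative assumptions.

```python
# Minimal sketch of the experiment::complete decision logic.
# A plain dict stands in for the engine's KV store; lower BPB is better.
NEAR_MISS_DELTA = 0.002  # "within 0.002 BPB of best" window from the issue

def complete(store: dict, tag: str, experiment_id: str, val_bpb: float) -> dict:
    """Record metrics, auto-decide keep/discard, and flag near-misses."""
    best = store.get(f"best:{tag}")
    if best is None or val_bpb < best["val_bpb"]:
        # New best for this tag: keep the commit.
        store[f"best:{tag}"] = {"experiment_id": experiment_id, "val_bpb": val_bpb}
        return {"improved": True, "action": "keep_commit", "best_val_bpb": val_bpb}
    delta = val_bpb - best["val_bpb"]
    if delta <= NEAR_MISS_DELTA:
        # Within striking distance of the best: remember it for later combination.
        store[f"near_misses:{experiment_id}"] = {
            "experiment_id": experiment_id, "val_bpb": val_bpb, "delta": delta,
        }
    return {"improved": False, "action": "discard", "best_val_bpb": best["val_bpb"]}
```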

Search strategy (4):

  • search::strategy — get current mode
  • search::set_strategy — manual override
  • search::adapt — auto-adapt based on recent keep rate, crash rate, near-miss count. Transitions: explore → exploit (high keep rate), explore → combine (plateau + near-misses), explore → ablation (plateau, no near-misses), any → exploit (high crash rate)
  • search::suggest_direction — analyze category-level stats, identify underexplored areas, suggest next experiment

GPU pool (6):

  • pool::register_gpu — worker self-registers on startup with nvidia-smi metadata
  • pool::heartbeat — 30s heartbeat, mark offline after 60s stale
  • pool::list — all workers with status (idle/training/offline)
  • pool::acquire — atomic claim of idle GPU for an experiment
  • pool::release — return GPU to pool
  • pool::deregister — remove on shutdown

Reporting (4):

  • report::summary — full stats: keep rate, category breakdown, BPB progression, training time
  • report::tsv — export in original autoresearch TSV format for compatibility
  • report::diff — compare two experiments (delta BPB, params, VRAM)
  • report::tags — list all run tags
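As one example of the reporting surface, report::diff reduces to computing deltas between two experiment records. The field names come from the experiments:{id} schema below; the exact output shape is a hypothetical sketch.

```python
# Hypothetical report::diff: deltas between two experiment records.
def diff(a: dict, b: dict) -> dict:
    """Compare experiment b against baseline a (negative delta_val_bpb = better)."""
    return {
        "delta_val_bpb": round(b["val_bpb"] - a["val_bpb"], 6),
        "delta_peak_vram_mb": b["peak_vram_mb"] - a["peak_vram_mb"],
        "commits": (a["commit_sha"], b["commit_sha"]),
    }
```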

GPU execution (2, per worker):

  • gpu::train — spawn uv run train.py with CUDA_VISIBLE_DEVICES isolation, parse metrics block, enforce kill timeout
  • gpu::health — nvidia-smi temperature/memory/utilization
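A rough Python equivalent of what the Rust gpu::train handler does: pin the child process to one GPU via CUDA_VISIBLE_DEVICES, enforce the time budget, and parse the `---`-delimited metrics block from stdout. The key=value layout inside the block is an assumption about train.py's output format.

```python
# Sketch of gpu::train: GPU isolation, kill timeout, metrics parsing.
import os
import subprocess

def train_on_gpu(gpu_index: int, timeout_s: int = 330) -> dict:
    """Run one training job pinned to a single GPU; raises TimeoutExpired on overrun."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_index)}
    proc = subprocess.run(
        ["uv", "run", "train.py"],
        env=env, capture_output=True, text=True, timeout=timeout_s,
    )
    return parse_metrics(proc.stdout)

def parse_metrics(stdout: str) -> dict:
    """Collect key=value pairs between the opening and closing `---` markers."""
    metrics, inside = {}, False
    for line in stdout.splitlines():
        if line.strip() == "---":
            if inside:
                break               # closing marker: stop collecting
            inside = True           # opening marker: start collecting
            continue
        if inside and "=" in line:
            key, _, value = line.partition("=")
            metrics[key.strip()] = value.strip()
    return metrics
```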

KV state schema

experiments:{id}   → { id, tag, parent_id, commit_sha, hypothesis, category,
                       val_bpb, peak_vram_mb, status, gpu_id, diff_summary, ... }
lineage:{tag}      → [experiment_id, experiment_id, ...]  (ordered)
best:{tag}         → { experiment_id, val_bpb, commit_sha }
near_misses:{id}   → { experiment_id, val_bpb, delta, hypothesis, category }
gpu_pool:{gpu_id}  → { id, gpu_index, gpu_name, vram_mb, status, current_experiment_id }
strategy:{tag}     → { mode, explore_ratio, temperature, reason }
tags:{name}        → { name, branch, best_val_bpb, total_experiments, kept_experiments }
crashes:{tag}      → consecutive crash count (int)

Multi-GPU parallelism

With N GPU workers, N agents can run experiments simultaneously on the same tag. Each agent:

  1. Calls pool::acquire to claim a GPU
  2. Runs training on that GPU
  3. Calls experiment::complete (KV writes are serialized)
  4. Calls pool::release

The search strategy adapts globally across all parallel experiments.
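The four steps above can be sketched as one agent-side loop iteration. The callables are stand-ins for the HTTP triggers (pool::acquire, gpu::train, experiment::complete, pool::release); the try/finally guarantees the GPU returns to the pool even if training crashes.

```python
# Hypothetical per-agent loop body; the four callables stand in for API calls.
def run_one_experiment(pool_acquire, train_once, experiment_complete, pool_release):
    """Claim a GPU, train, record the result, and always release the GPU."""
    gpu_id = pool_acquire()                      # step 1: claim an idle GPU
    if gpu_id is None:
        return None                              # no idle GPU: back off and retry
    try:
        metrics = train_once(gpu_id)             # step 2: train on the claimed GPU
        return experiment_complete(metrics)      # step 3: serialized KV write
    finally:
        pool_release(gpu_id)                     # step 4: return the GPU to the pool
```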

What stays the same

  • train.py is still the only file agents modify
  • prepare.py is still read-only
  • 5-minute fixed time budget
  • val_bpb as the single metric
  • Git branches for version control
  • External agents (Claude, Codex) drive the loop
  • program.md as the agent instruction file

Implementation

Two polyglot workers:

  • Orchestrator (Python, iii-sdk): All 21 orchestration functions + triggers
  • GPU Worker (Rust, iii-sdk): Per-GPU process, training execution + health

The engine itself is iii-engine — Worker/Function/Trigger runtime with built-in KV state, PubSub, cron, and HTTP trigger modules.

Reference implementation: https://github.com/iii-hq/n-autoresearch

Tradeoffs

Added complexity: Two worker processes + iii-engine runtime vs. zero infrastructure today. Justified only if you're running multi-GPU or want persistent experiment state across sessions.

Not proposed: Built-in LLM, auto-modification of train.py, or any change to the training code itself. The agent is still external — this only structures the infrastructure around it.
