4 changes: 4 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -21,3 +21,7 @@ dev/

# Results file
results.tsv

# Checkpoints and logs
checkpoints/
*.log
150 changes: 150 additions & 0 deletions AUTORESEARCH_V2.md
@@ -0,0 +1,150 @@
# Autoresearch v2 — Multi-Agent Autonomous ML Research Platform

Built on top of [Karpathy's autoresearch](https://github.com/karpathy/autoresearch): an AI agent autonomously experiments on a small GPT training setup, modifying code, training for a fixed time budget, checking if the result improved, and repeating. We extended this from a single-agent/single-file loop into a multi-agent, multi-scale, self-improving research platform.

## What's New

### Variable Time Scales

Instead of a fixed 5-minute budget, experiments run at different scales:

| Scale | Duration | Use |
|-------|----------|-----|
| probe | 30s | Memory/compilation check |
| quick | 2min | Rough signal, kill bad ideas fast |
| standard | 5min | Real evaluation (original behavior) |
| long | 15min | Confirm promising results |
| deep | 30min | Final validation |

Agents start with quick runs and escalate promising ideas to longer scales. All durations are configurable via `AR_*` environment variables.
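
The escalation pattern can be sketched as a small driver loop. This is a hypothetical illustration, not code from the repo: the scale ordering comes from the table above, while the `escalate` helper and its improvement check are assumptions about how an agent might decide to climb the ladder.

```python
SCALES = ["probe", "quick", "standard", "long", "deep"]

def escalate(run_fn, baseline_bpb):
    """Walk up the scale ladder; stop as soon as a scale fails to beat baseline."""
    best = None
    for scale in SCALES[1:]:  # probe is only a memory/compilation check
        result = run_fn(scale)  # returns val_bpb for this scale's time budget
        if result >= baseline_bpb:
            break  # idea didn't beat baseline at this scale; kill it early
        best = (scale, result)
    return best

# Toy run_fn: pretend the idea helps at quick/standard but regresses at long.
fake = {"quick": 0.95, "standard": 0.93, "long": 1.01, "deep": 1.00}
print(escalate(fake.get, 1.0))  # ('standard', 0.93)
```

The point of the early break is the same as the table's "kill bad ideas fast": a 2-minute run that loses to baseline never earns a 15-minute slot.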

### Research Memory

Replaces the flat `results.tsv` with a shared knowledge base in `results/`:

- **experiments.jsonl** — All experiment records from all agents (append-only, file-locked)
- **lessons.jsonl** — Discovered patterns and insights (e.g., "depth>12 always OOMs")
- **journal.md** — Research summaries and the director's agenda

A backward-compatible `results.tsv` is auto-generated for the analysis notebook.
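
`knowledge.py`'s internals aren't shown in this diff, but the core pattern it describes — append-only JSONL guarded by an exclusive lock — looks roughly like this sketch (`append_record` is a hypothetical name; `flock` is POSIX-only, which matches the Linux/GPU target):

```python
import fcntl
import json

def append_record(path, record):
    """Append one JSON object per line; an exclusive flock keeps
    concurrent agents from interleaving partial writes."""
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until this process owns the file
        f.write(json.dumps(record) + "\n")
        f.flush()
        fcntl.flock(f, fcntl.LOCK_UN)

append_record("experiments.jsonl", {"commit": "abc1234", "val_bpb": 0.912, "status": "OK"})
```

Because each record is a single line and writes are serialized, readers can always parse the file line-by-line even while other agents are appending.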

### Model Checkpoints

Promising runs save full model + optimizer state to `checkpoints/`. This enables the scaling ladder: a good 5-minute result can be resumed and trained for 15 more minutes without starting from scratch.
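
The save/resume roundtrip can be exercised with a toy model. This is a self-contained sketch of the layout `checkpoint.py` stores (weights + optimizer + step + metadata); the tiny `nn.Linear` and file name are stand-ins, not the real training setup.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Same layout checkpoint.py uses: weights + optimizer + step + metadata.
ckpt = {
    "model_state": model.state_dict(),
    "optimizer_state": opt.state_dict(),
    "step": 1000,
    "metadata": {"val_bpb": 0.912, "scale": "standard"},
}
torch.save(ckpt, "demo.pt")

# Resume: rebuild the same shapes, then restore model AND optimizer state.
resumed = torch.load("demo.pt", weights_only=False)
model2 = nn.Linear(4, 4)
model2.load_state_dict(resumed["model_state"])
opt2 = torch.optim.AdamW(model2.parameters(), lr=1e-3)
opt2.load_state_dict(resumed["optimizer_state"])
print(resumed["step"], resumed["metadata"]["scale"])  # 1000 standard
```

Restoring the optimizer state is what makes the ladder cheap: momentum and second-moment buffers carry over, so the resumed run doesn't pay a warmup tax.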

### Multi-Agent Research Org

Five specialized agent roles, each with tailored instructions:

| Role | What it does |
|------|-------------|
| **Explorer** | Bold architectural experiments — new activations, attention variants, normalization |
| **Optimizer** | Systematic hyperparameter tuning — LR sweeps, batch size, schedules |
| **Analyst** | Reads all results, finds patterns, writes lessons, suggests combinations |
| **Reviewer** | Validates improvements at longer scales, runs ablations, catches false positives |
| **Director** | Coordinates research agenda, cherry-picks across branches, redirects stuck agents |

Each agent runs in its own git worktree with its own branch, sharing results through the knowledge base.
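
`launch.py`'s exact commands aren't shown in this diff, but per-agent isolation reduces to `git worktree` with a branch per agent — roughly this sketch (`make_worktree` and the `<tag>/<agent>` branch naming are assumptions for illustration):

```python
import os
import subprocess

def make_worktree(repo, agent, tag):
    """Create an isolated branch + worktree for one agent (hypothetical naming)."""
    branch = f"{tag}/{agent}"
    path = os.path.join(repo, "worktrees", agent)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", branch, path],
        check=True,
    )
    return path
```

Each worktree shares the same object store, so the director can cherry-pick a winning commit from one agent's branch into another without any fetch/push round-trips.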

### GPU Queue

When multiple agents share one GPU, a priority queue manages access. Probe jobs (30s) jump ahead so agents get fast feedback. Agents can plan their next experiment while waiting.
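
`gpu_queue.py`'s internals aren't shown here, but the ordering rule follows `SCALE_PRIORITY` from `config.py` (lower number = served first). A minimal in-process sketch of that rule:

```python
import heapq
import itertools

SCALE_PRIORITY = {"probe": 0, "quick": 1, "standard": 2, "long": 3, "deep": 4}
counter = itertools.count()  # tie-breaker: FIFO within the same scale
queue = []

def submit(scale, job):
    heapq.heappush(queue, (SCALE_PRIORITY[scale], next(counter), job))

submit("deep", "reviewer-validation")
submit("probe", "explorer-memcheck")   # jumps ahead of the 30-min job
submit("quick", "optimizer-lr-sweep")

while queue:
    _, _, job = heapq.heappop(queue)
    print(job)
# explorer-memcheck, then optimizer-lr-sweep, then reviewer-validation
```

The real queue is file-based so it works across processes, but the invariant is the same: a 30-second probe never waits behind a 30-minute deep run.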

## File Structure

```
# New infrastructure
config.py — Configurable constants (env var overrides)
knowledge.py — JSONL research memory with file locking
checkpoint.py — Model checkpoint save/load
run_experiment.py — Experiment runner wrapper
gpu_queue.py — Priority queue for GPU sharing
launch.py — Multi-agent orchestrator

# Agent role instructions
programs/explorer.md
programs/optimizer.md
programs/analyst.md
programs/reviewer.md
programs/director.md

# Original files (modified)
prepare.py — Constants now configurable via config.py
train.py — Checkpoint save added (env-var gated)
program.md — Multi-agent awareness section added

# Shared directories (gitignored, created at runtime)
results/ — Knowledge base (experiments, lessons, journal)
queue/ — GPU job queue
checkpoints/ — Saved model states
worktrees/ — Per-agent git worktrees
```

## Quick Start

### Solo Mode (enhanced single agent)

```bash
python launch.py launch --tag mar10 --preset solo
```

One explorer agent with the full knowledge base, variable time scales, and checkpointing.

### Duo (explorer + optimizer sharing one GPU)

```bash
python launch.py launch --tag mar10 --preset duo --single-gpu
```

### Full Research Org in tmux

```bash
python launch.py launch --tag mar10 --preset full --single-gpu --tmux
```

All five agents run in tmux panes. Attach with `tmux attach -t autoresearch-mar10`.

### Custom Agent Mix

```bash
python launch.py launch --tag mar10 --agents explorer:2,optimizer,director --single-gpu
```

Two explorers, one optimizer, one director.
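
The `--agents` spec expands in the obvious way — a sketch of the parsing (the `parse_agents` helper is hypothetical; `launch.py`'s actual implementation isn't shown in this diff):

```python
def parse_agents(spec):
    """Expand 'explorer:2,optimizer,director' into a flat role list."""
    roles = []
    for part in spec.split(","):
        name, _, count = part.partition(":")
        roles += [name] * (int(count) if count else 1)
    return roles

print(parse_agents("explorer:2,optimizer,director"))
# ['explorer', 'explorer', 'optimizer', 'director']
```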

### Cleanup

```bash
python launch.py cleanup --tag mar10
```

Removes worktrees and branches for a tag.

## Standalone Tools

```bash
# View research briefing (what agents read before each experiment)
python run_experiment.py --briefing

# Run a single experiment manually
python run_experiment.py --scale quick --description "try GLU activation"

# Log a lesson
python run_experiment.py --lesson architecture medium "GLU helps at depth=8"

# Check GPU queue
python gpu_queue.py status

# List checkpoints
python checkpoint.py

# Sync knowledge base to legacy results.tsv
python knowledge.py sync-tsv
```

## Requirements

- Single NVIDIA GPU (tested on H100)
- Python 3.10+
- [uv](https://docs.astral.sh/uv/)
- [Claude Code](https://claude.com/claude-code) CLI (`claude` command)
18 changes: 2 additions & 16 deletions analysis.ipynb
@@ -16,21 +16,7 @@
"id": "v3r8c77lxhs",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# Load the TSV (tab-separated, 5 columns: commit, val_bpb, memory_gb, status, description)\n",
"df = pd.read_csv(\"results.tsv\", sep=\"\\t\")\n",
"df[\"val_bpb\"] = pd.to_numeric(df[\"val_bpb\"], errors=\"coerce\")\n",
"df[\"memory_gb\"] = pd.to_numeric(df[\"memory_gb\"], errors=\"coerce\")\n",
"df[\"status\"] = df[\"status\"].str.strip().str.upper()\n",
"\n",
"print(f\"Total experiments: {len(df)}\")\n",
"print(f\"Columns: {list(df.columns)}\")\n",
"df.head(10)"
]
"source": "import pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n# Load the TSV (tab-separated, 10 columns: commit, val_bpb, memory_gb, mfu, num_params_M, depth, total_batch_size, matrix_lr, status, description)\ndf = pd.read_csv(\"results.tsv\", sep=\"\\t\")\ndf[\"val_bpb\"] = pd.to_numeric(df[\"val_bpb\"], errors=\"coerce\")\ndf[\"memory_gb\"] = pd.to_numeric(df[\"memory_gb\"], errors=\"coerce\")\ndf[\"status\"] = df[\"status\"].str.strip().str.upper()\n\nprint(f\"Total experiments: {len(df)}\")\nprint(f\"Columns: {list(df.columns)}\")\ndf.head(10)"
},
{
"cell_type": "code",
@@ -243,4 +229,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
99 changes: 99 additions & 0 deletions checkpoint.py
@@ -0,0 +1,99 @@
"""
Model checkpoint save/load for autoresearch.
Allows promising runs to be resumed at longer time scales.
"""

import json
import os
import torch

from config import get_checkpoint_dir


def save_checkpoint(model, optimizer, step, config_dict, metadata, path=None):
"""Save model + optimizer state for later resumption.

Args:
model: The compiled or raw GPT model.
optimizer: The MuonAdamW optimizer.
step: Current training step.
config_dict: Model config as dict (from dataclasses.asdict).
metadata: Dict with val_bpb, scale, branch, commit, etc.
path: Override save path. Default: checkpoints/<commit>_<scale>.pt
"""
ckpt_dir = get_checkpoint_dir()
os.makedirs(ckpt_dir, exist_ok=True)

if path is None:
commit = metadata.get("commit", "unknown")
scale = metadata.get("scale", "unknown")
path = os.path.join(ckpt_dir, f"{commit}_{scale}.pt")

# Handle torch.compile'd models
raw_model = model._orig_mod if hasattr(model, "_orig_mod") else model

checkpoint = {
"model_state": raw_model.state_dict(),
"optimizer_state": optimizer.state_dict(),
"step": step,
"config": config_dict,
"metadata": metadata,
}
torch.save(checkpoint, path)
print(f"Checkpoint saved to {path}")
return path


def load_checkpoint(path, device="cuda"):
"""Load a checkpoint. Returns dict with model_state, optimizer_state, step, config, metadata."""
checkpoint = torch.load(path, map_location=device, weights_only=False)
print(f"Checkpoint loaded from {path}")
return checkpoint


def list_checkpoints(sort_by="val_bpb"):
"""List available checkpoints sorted by metric."""
ckpt_dir = get_checkpoint_dir()
if not os.path.exists(ckpt_dir):
return []

checkpoints = []
for fname in os.listdir(ckpt_dir):
if not fname.endswith(".pt"):
continue
path = os.path.join(ckpt_dir, fname)
try:
# Load only metadata (not full weights) for listing
ckpt = torch.load(path, map_location="cpu", weights_only=False)
meta = ckpt.get("metadata", {})
meta["path"] = path
meta["filename"] = fname
checkpoints.append(meta)
except Exception:
continue

if sort_by == "val_bpb":
checkpoints.sort(key=lambda c: c.get("val_bpb", float("inf")))
elif sort_by == "time":
checkpoints.sort(key=lambda c: os.path.getmtime(c["path"]), reverse=True)

return checkpoints


def find_best_checkpoint():
"""Return path to checkpoint with lowest val_bpb, or None."""
ckpts = list_checkpoints(sort_by="val_bpb")
if ckpts:
return ckpts[0]["path"]
return None


if __name__ == "__main__":
ckpts = list_checkpoints()
if not ckpts:
print("No checkpoints found.")
else:
print(f"{'val_bpb':>10} {'scale':>8} {'commit':>8} path")
print("-" * 60)
for c in ckpts:
print(f"{c.get('val_bpb', '?'):>10.6f} {c.get('scale', '?'):>8} {c.get('commit', '?'):>8} {c['path']}")
48 changes: 48 additions & 0 deletions config.py
@@ -0,0 +1,48 @@
"""
Centralized configuration for autoresearch.
All constants are overridable via environment variables (AR_ prefix).
Without env vars, defaults match the original hardcoded values.
"""

import os

TIME_BUDGETS = {
"probe": 30,
"quick": 120,
"standard": 300,
"long": 900,
"deep": 1800,
}

SCALE_PRIORITY = {"probe": 0, "quick": 1, "standard": 2, "long": 3, "deep": 4}


def get_scale() -> str:
return os.environ.get("AR_SCALE", "standard")


def get_time_budget() -> int:
explicit = os.environ.get("AR_TIME_BUDGET")
if explicit is not None:
return int(explicit)
return TIME_BUDGETS.get(get_scale(), 300)


def get_max_seq_len() -> int:
return int(os.environ.get("AR_MAX_SEQ_LEN", "2048"))


def get_eval_tokens() -> int:
return int(os.environ.get("AR_EVAL_TOKENS", str(160 * 524288)))


def get_checkpoint_dir() -> str:
return os.environ.get("AR_CHECKPOINT_DIR", os.path.join(os.path.dirname(__file__), "checkpoints"))


def get_results_dir() -> str:
return os.environ.get("AR_RESULTS_DIR", os.path.join(os.path.dirname(__file__), "results"))


def get_queue_dir() -> str:
return os.environ.get("AR_QUEUE_DIR", os.path.join(os.path.dirname(__file__), "queue"))