A/B testing platform for Claude agents.
Automates the plan → run → evaluate → repeat loop to find optimal agent configurations.
An experimentation platform that treats Claude agent configurations as hypotheses and tests them with statistical rigor. Define two agent configs that differ in exactly one variable, run them against the same tasks, and get a verdict backed by data — not vibes.
Over 7 days of dogfooding: ~200 experiment groups, ~6,000 trials, ~$110 USD across math, coding, reasoning, and security tasks.
|
AutoResearch Loop — generates hypotheses, runs A/B tests, iterates automatically
|
Research Complete — key findings, best config, and next steps
|
# Install
pip install -e .
# Run a single A/B experiment
openbench run experiments/quicktest_model.py
# Automated research from a natural language goal
openbench research "Find the best system prompt for a concise Q&A assistant" --max-iter 3
# N-way tournament
openbench tournament experiments/verified_math_tournament.py --yes
# Browse results interactively
openbench tuiRequires claude-agent-sdk (uses Claude Max subscription — no API key needed).
You: "Find the optimal turn budget for coding tasks"
↓
Planner (LLM) — generates A/B hypothesis & experiment config
↓
Runner — executes both agents in isolated temp dirs
│ collects: output, tokens, cost, tool calls, traces
↓
Evaluator — LLM-as-judge + optional ground-truth check_fn
↓
Comparator — statistical comparison, winner selection
↓
Loop (if auto-research) — winner becomes baseline, next hypothesis proposed
| Concept | Description |
|---|---|
| Experiment | Two agent configs (agent_a vs agent_b) differing in exactly one variable |
| DiffSpec | The single variable being tested: system_prompt, model, max_turns, allowed_tools, … |
| TaskItem | A task with optional expected_answer + check_fn for objective correctness |
| ModelRouter | Auto-select haiku vs sonnet per-task based on complexity signals |
| TournamentConfig | N-way round-robin comparison with correctness-based ranking |
| ResearchProgram | Natural language objective that drives automated experiment loops |
| Command | Description |
|---|---|
openbench run <file> |
Run a single A/B experiment |
openbench tournament <file> |
Run N-way round-robin tournament |
openbench research "<goal>" |
Auto-generate & run experiments from a goal |
openbench list |
List all experiment results |
openbench compare <name> |
Side-by-side comparison report |
openbench show <name> |
Show experiment details |
openbench runs <name> |
List individual trial runs |
openbench lineage <name> |
Show experiment lineage chain |
openbench save-program <name> |
Save a research program config |
openbench tui |
Interactive history browser (Textual TUI) |
Benchmark results from 240+ trials across math, coding, and reasoning tasks:
| Strategy | Best For | Correctness | Cost/Trial |
|---|---|---|---|
"claude-haiku-4-5" |
Budget-constrained, known-simple tasks | 87.5% | ~$0.011 |
ModelRouter() |
Mixed workloads, unknown complexity | 85.0% | ~$0.015 |
"claude-sonnet-4-6" |
Maximum efficiency, complex coding/analysis | 85.0% | ~$0.018 |
The 2.5% correctness gap is NOT statistically significant (McNemar's test, p > 0.30, n=80 paired trials). All three are equivalent on quality — the real difference is cost and efficiency.
from openbench.types import AgentConfig, ModelRouter
agent = AgentConfig(
name="auto",
model=ModelRouter(
default="claude-haiku-4-5", # Simple tasks (5× lower cost)
upgrade="claude-sonnet-4-6", # Complex tasks (2× fewer tokens)
threshold_tokens=200,
),
allowed_tools=["Read", "Bash", "Glob"],
max_turns=20,
)| Signal | Condition | Routes to |
|---|---|---|
| Difficulty tag | TaskItem(difficulty="hard") |
sonnet |
| Token count | Estimated input ≥ 200 tokens | sonnet |
| Keywords | "implement", "debug", "analyze", "build", … | sonnet |
| Default | None of the above | haiku |
From 200+ experiments on this platform:
| Finding | Recommendation |
|---|---|
| haiku + Bash = sonnet on math | Give haiku tools instead of upgrading models |
| Sonnet uses 66% fewer tokens on coding | Use sonnet for code tasks if efficiency matters |
| max_turns=20 is optimal (100% success) | Set max_turns = 2× expected tool calls |
| Structured prompts can hurt | Prefer minimal prompts over heavy templates |
| Turn budget matters more than model choice | Optimize turns before spending on bigger models |
| Posture ("be careful") > procedure (step lists) | Attitude-based prompts outperform rigid instructions |
Full experiment write-ups → reports/
src/openbench/
├── types.py # AgentConfig, TaskItem, ModelRouter, TournamentConfig
├── runner.py # ExperimentRunner — orchestrates agent execution
├── evaluator.py # LLM-as-judge with optional ground truth
├── compare.py # Rich terminal comparison reports
├── tournament.py # N-way round-robin tournaments
├── planner.py # ExperimentPlanner — NL goal → A/B experiment
├── program.py # ResearchProgram — NL objective for auto-research
├── autoloop.py # AutoResearchLoop — main orchestration loop
├── isolation.py # Environment isolation for agent runs
├── metrics.py # Metrics collection and calculation
├── storage.py # JSONL + JSON persistence
├── cli.py # Typer CLI
├── _sdk_call.py # Thin wrapper for LLM calls via SDK
├── _tui.py # TUI helpers (progress bars, callbacks)
├── _history_tui.py # Interactive Textual TUI for browsing results
└── _utils.py # Shared internal utilities
experiments/ # 80+ experiment definitions
programs/ # Saved ResearchProgram configs
results/ # Raw trial data (JSONL + metadata)
reports/ # 45+ experiment reports
tests/ # pytest test suite
docs/ # Guides, memory, plans
MIT © 2025 cklxx

