OpenBench

A/B testing platform for Claude agents.
Automates the plan → run → evaluate → repeat loop to find optimal agent configurations.

What is OpenBench?

An experimentation platform that treats Claude agent configurations as hypotheses and tests them with statistical rigor. Define two agent configs that differ in exactly one variable, run them against the same tasks, and get a verdict backed by data — not vibes.

Over 7 days of dogfooding: ~200 experiment groups, ~6,000 trials, ~$110 USD across math, coding, reasoning, and security tasks.

Preview

AutoResearch Loop — generates hypotheses, runs A/B tests, iterates automatically

Research Complete — key findings, best config, and next steps

Quick Start

# Install
pip install -e .

# Run a single A/B experiment
openbench run experiments/quicktest_model.py

# Automated research from a natural language goal
openbench research "Find the best system prompt for a concise Q&A assistant" --max-iter 3

# N-way tournament
openbench tournament experiments/verified_math_tournament.py --yes

# Browse results interactively
openbench tui

Requires claude-agent-sdk (uses Claude Max subscription — no API key needed).

How It Works

You: "Find the optimal turn budget for coding tasks"
        ↓
  Planner (LLM)           — generates A/B hypothesis & experiment config
        ↓
  Runner                   — executes both agents in isolated temp dirs
        │                     collects: output, tokens, cost, tool calls, traces
        ↓
  Evaluator                — LLM-as-judge + optional ground-truth check_fn
        ↓
  Comparator               — statistical comparison, winner selection
        ↓
  Loop (if auto-research)  — winner becomes baseline, next hypothesis proposed

Key Concepts

Concept	Description
Experiment	Two agent configs (`agent_a` vs `agent_b`) differing in exactly one variable
DiffSpec	The single variable being tested: `system_prompt`, `model`, `max_turns`, `allowed_tools`, …
TaskItem	A task with optional `expected_answer` + `check_fn` for objective correctness
ModelRouter	Auto-select haiku vs sonnet per-task based on complexity signals
TournamentConfig	N-way round-robin comparison with correctness-based ranking
ResearchProgram	Natural language objective that drives automated experiment loops

CLI

Command	Description
`openbench run <file>`	Run a single A/B experiment
`openbench tournament <file>`	Run N-way round-robin tournament
`openbench research "<goal>"`	Auto-generate & run experiments from a goal
`openbench list`	List all experiment results
`openbench compare <name>`	Side-by-side comparison report
`openbench show <name>`	Show experiment details
`openbench runs <name>`	List individual trial runs
`openbench lineage <name>`	Show experiment lineage chain
`openbench save-program <name>`	Save a research program config
`openbench tui`	Interactive history browser (Textual TUI)

Model Selection Guide

Benchmark results from 240+ trials across math, coding, and reasoning tasks:

Strategy	Best For	Correctness	Cost/Trial
`"claude-haiku-4-5"`	Budget-constrained, known-simple tasks	87.5%	~$0.011
`ModelRouter()`	Mixed workloads, unknown complexity	85.0%	~$0.015
`"claude-sonnet-4-6"`	Maximum efficiency, complex coding/analysis	85.0%	~$0.018

The 2.5% correctness gap is NOT statistically significant (McNemar's test, p > 0.30, n=80 paired trials). All three are equivalent on quality — the real difference is cost and efficiency.

ModelRouter — Automatic Model Selection

from openbench.types import AgentConfig, ModelRouter

agent = AgentConfig(
    name="auto",
    model=ModelRouter(
        default="claude-haiku-4-5",   # Simple tasks (5× lower cost)
        upgrade="claude-sonnet-4-6",  # Complex tasks (2× fewer tokens)
        threshold_tokens=200,
    ),
    allowed_tools=["Read", "Bash", "Glob"],
    max_turns=20,
)

Signal	Condition	Routes to
Difficulty tag	`TaskItem(difficulty="hard")`	sonnet
Token count	Estimated input ≥ 200 tokens	sonnet
Keywords	"implement", "debug", "analyze", "build", …	sonnet
Default	None of the above	haiku

Key Findings

From 200+ experiments on this platform:

Finding	Recommendation
haiku + Bash = sonnet on math	Give haiku tools instead of upgrading models
Sonnet uses 66% fewer tokens on coding	Use sonnet for code tasks if efficiency matters
max_turns=20 is optimal (100% success)	Set `max_turns = 2× expected tool calls`
Structured prompts can hurt	Prefer minimal prompts over heavy templates
Turn budget matters more than model choice	Optimize turns before spending on bigger models
Posture ("be careful") > procedure (step lists)	Attitude-based prompts outperform rigid instructions

Full experiment write-ups → reports/

Architecture

src/openbench/
├── types.py              # AgentConfig, TaskItem, ModelRouter, TournamentConfig
├── runner.py             # ExperimentRunner — orchestrates agent execution
├── evaluator.py          # LLM-as-judge with optional ground truth
├── compare.py            # Rich terminal comparison reports
├── tournament.py         # N-way round-robin tournaments
├── planner.py            # ExperimentPlanner — NL goal → A/B experiment
├── program.py            # ResearchProgram — NL objective for auto-research
├── autoloop.py           # AutoResearchLoop — main orchestration loop
├── isolation.py          # Environment isolation for agent runs
├── metrics.py            # Metrics collection and calculation
├── storage.py            # JSONL + JSON persistence
├── cli.py                # Typer CLI
├── _sdk_call.py          # Thin wrapper for LLM calls via SDK
├── _tui.py               # TUI helpers (progress bars, callbacks)
├── _history_tui.py       # Interactive Textual TUI for browsing results
└── _utils.py             # Shared internal utilities

experiments/              # 80+ experiment definitions
programs/                 # Saved ResearchProgram configs
results/                  # Raw trial data (JSONL + metadata)
reports/                  # 45+ experiment reports
tests/                    # pytest test suite
docs/                     # Guides, memory, plans

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
docs		docs
experiments		experiments
programs		programs
reports		reports
results		results
src/openbench		src/openbench
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
README.md		README.md
TODOS.md		TODOS.md
pyproject.toml		pyproject.toml
run_injection_matrix.py		run_injection_matrix.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OpenBench

What is OpenBench?

Preview

Quick Start

How It Works

Key Concepts

CLI

Model Selection Guide

ModelRouter — Automatic Model Selection

Key Findings

Architecture

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

OpenBench

What is OpenBench?

Preview

Quick Start

How It Works

Key Concepts

CLI

Model Selection Guide

ModelRouter — Automatic Model Selection

Key Findings

Architecture

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages