Tools for auditing and improving test coverage quality across the 89 challenges in Terminal-Bench 2.0. See https://jonathanpgabor.substack.com/p/every-benchmark-is-broken for background.
This repository contains:
- `test_audit.py` — Audits each challenge's test suite by sending the task description, container filesystem state, test code, and reference solution to an LLM, which rates coverage from 1-5.
- `harden_tests.py` — Takes audit results and generates improved tests for challenges that scored at or below a threshold.
- `terminal_bench_2.py` — The Inspect AI evaluation task definition for running Terminal-Bench 2.0 challenges.
Requires Python 3.12+. Install dependencies with uv:
```sh
uv sync
```

Set API keys for your chosen provider:

```sh
export ANTHROPIC_API_KEY=...
# or
export OPENAI_API_KEY=...
```

Docker must be running, as the audit tool pulls and inspects challenge container images.
```sh
# Audit all challenges using Claude
uv run python test_audit.py --provider anthropic

# Audit all challenges using GPT
uv run python test_audit.py --provider openai

# Audit a specific challenge
uv run python test_audit.py --challenge chess-best-move --provider anthropic

# Audit the first 10 challenges with 4 parallel workers
uv run python test_audit.py --limit 10 --parallel 4 --provider anthropic
```

Results are saved to a timestamped directory under `audit_results/` containing:
| File | Description |
|---|---|
| `audit.txt` | Full LLM feedback for each challenge |
| `prompts.txt` | Prompts sent to the auditor |
| `ratings.json` | `{challenge_name: rating}` dictionary |
| `stats.json` | Summary statistics (average, distribution, etc.) |
After an audit run, improve tests that scored at or below a threshold:
```sh
# Harden tests rated 3 or below (default threshold)
uv run python harden_tests.py --audit-dir audit_results/20260120_125158 --provider anthropic

# Use a stricter threshold
uv run python harden_tests.py --audit-dir audit_results/20260120_125158 --threshold 2 --provider anthropic
```

Results are saved to a timestamped directory under `harden_results/`.
The evaluation uses Inspect AI:
```sh
inspect eval terminal_bench_2.py
```

See TERMINALBENCH.md for full benchmark documentation.
Each of the 89 challenges lives under challenges/ with the following layout:
```
challenges/<name>/
├── eval.yaml           # Task description and variants
├── compose.yaml        # Docker container specification
├── tests/
│   ├── test_outputs.py # Python test suite
│   └── test.sh         # Test runner script
└── solution/
    └── solve.sh        # Reference solution
```
For each challenge, the auditor:
- Reads the task description from `eval.yaml`
- Pulls the Docker image and inspects the container filesystem (directory tree + key file contents)
- Reads the test suite (`test_outputs.py` and any helper files)
- Reads the reference solution (`solve.sh`)
- Sends everything to an LLM with a prompt asking it to evaluate test coverage quality
- Extracts a 1-5 rating from the response
The LLM rates coverage considering whether the tests:
- Verify correctness of the solution (not just that it runs)
- Cover edge cases and failure modes
- Are resistant to shortcut solutions that don't actually solve the task
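Putting the pieces together, the auditor prompt has roughly this shape. The template below is a hypothetical reconstruction from the criteria above, not the actual prompt (which is saved verbatim in `prompts.txt` for each run):

```python
# Illustrative prompt template; field names are assumptions.
AUDIT_PROMPT = """\
You are auditing the test suite for a Terminal-Bench challenge.

Task description:
{description}

Container filesystem:
{fs_summary}

Test suite:
{tests}

Reference solution:
{solution}

Rate the test coverage from 1-5, considering whether the tests:
- verify correctness of the solution (not just that it runs)
- cover edge cases and failure modes
- are resistant to shortcut solutions that don't actually solve the task
End your answer with a line "Rating: <n>/5".
"""

def build_audit_prompt(description: str, fs_summary: str,
                       tests: str, solution: str) -> str:
    """Fill the audit template with one challenge's materials."""
    return AUDIT_PROMPT.format(
        description=description, fs_summary=fs_summary,
        tests=tests, solution=solution,
    )
```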