SearchBench - Honest benchmarks for agentic search APIs

SearchBench is a lean, opinionated benchmark for evaluating search APIs with strict, comparable settings. It answers one question: which provider performs best for real-world research today?

What it does

Strict provider modes - no fallbacks, no escalation, one mode per API.
Curated queries - 50 public questions + a private, gitignored set.
Bias-mitigated judging - GPT-4o-mini with preflight and fallback grading.
Shareable reports - static HTML output with history tracking.
Adaptive timeouts - data-driven calibration from prior runs.

Quickstart

python3 -m venv .venv
./.venv/bin/python -m pip install -r requirements.txt

cp .env.example .env
# Add your API keys

./scripts/searchbench run
# Optional: install CLI entrypoint
# . .venv/bin/activate
# python3 -m pip install -e .
# searchbench run

Core commands

./scripts/searchbench run
./scripts/searchbench quick
./scripts/searchbench history
./scripts/searchbench summary
./scripts/searchbench validate
./scripts/searchbench report
./scripts/searchbench calibrate
./scripts/searchbench debug --provider exa --queries hard --count 5

Query sets

searchbench/queries/public.json - 50 curated, verifiable questions.
searchbench/queries/hard.json - hard, evidence-gated questions.
searchbench/queries/private.json - personal set (gitignored). Copy from searchbench/queries/private.json.template.

Hard benchmark

Run the evidence-gated hard set:

./scripts/searchbench run --queries hard
./scripts/searchbench summary

# Evidence modes: strict (default), min (citations only), off (ignore evidence)
./scripts/searchbench run --queries hard --evidence min

Results

Reports and history are written to:

results/latest.html
results/<YYYY-MM-DD>.html
results/history.json
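
As a rough illustration of how the dated reports could be consumed, the sketch below lists files that follow the results/<YYYY-MM-DD>.html naming pattern, oldest first. The helper name dated_reports is hypothetical and not part of SearchBench; only the file naming comes from the list above.

```python
from datetime import date
from pathlib import Path


def dated_reports(results_dir):
    """Return (date, path) pairs for files named YYYY-MM-DD.html, oldest first.

    Hypothetical helper: it relies only on the file naming shown above.
    """
    found = []
    for path in Path(results_dir).glob("*.html"):
        try:
            found.append((date.fromisoformat(path.stem), path))
        except ValueError:
            # Skips latest.html and any other file whose stem is not a date.
            continue
    return sorted(found)


if __name__ == "__main__":
    for day, path in dated_reports("results"):
        print(day.isoformat(), path)
```

Run from the repository root, this prints one line per dated report; results/latest.html is skipped because its name is not a date.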

Pricing (January 2025)

Provider	Cost/Query	Free Tier	Notes
Exa (/answer)	$0.01	$10 credits	Search + answer
Parallel	$0.005	16,000 queries	Strict v1beta search
Brave	$0.005	2,000/month	AI summary enabled
Linkup	~$0.0055	Free tier	Standard mode
Tavily	$0.008 paid	1,000/month	Basic search
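
To put the per-query prices in context, the sketch below estimates what one pass over the 50 public queries might cost per provider. It is back-of-the-envelope arithmetic only: it ignores free tiers and judge (GPT-4o-mini) costs, treats the approximate Linkup price as exact, and the function name estimate_run_cost is hypothetical.

```python
# Per-query prices in USD, copied from the pricing table; Linkup is approximate.
COST_PER_QUERY = {
    "Exa (/answer)": 0.01,
    "Parallel": 0.005,
    "Brave": 0.005,
    "Linkup": 0.0055,
    "Tavily": 0.008,
}


def estimate_run_cost(num_queries, prices=COST_PER_QUERY):
    """Estimate USD cost per provider for one run, ignoring free tiers."""
    return {name: round(price * num_queries, 4) for name, price in prices.items()}


if __name__ == "__main__":
    costs = estimate_run_cost(50)  # 50 = size of the public query set
    for name, cost in costs.items():
        print(f"{name:<15} ${cost:.3f}")
    print(f"{'Total':<15} ${sum(costs.values()):.3f}")
```

Under these assumptions, a full public run costs well under two dollars across all five providers, and most of it would likely fall inside the listed free tiers.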

Configuration

Timeouts live in config.toml and can be recalibrated from history with the calibrate commands listed at the end of this section.

Optional environment knobs:

QUERY_CONCURRENCY (default 2) controls how many queries run in parallel.
JUDGE_CONCURRENCY (default 6) caps concurrent judge calls.
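
One way to picture the two knobs is as a pair of semaphores: one bounding in-flight provider queries, the other bounding in-flight judge calls. The sketch below is an assumption about how such limits are commonly implemented with asyncio, not SearchBench's actual code; fake_search, fake_judge, and run_all are hypothetical stand-ins.

```python
import asyncio
import os

# Defaults mirror the values documented above.
QUERY_CONCURRENCY = int(os.environ.get("QUERY_CONCURRENCY", "2"))
JUDGE_CONCURRENCY = int(os.environ.get("JUDGE_CONCURRENCY", "6"))


async def fake_search(query):
    """Hypothetical stand-in for a provider call."""
    await asyncio.sleep(0.01)
    return f"answer to {query}"


async def fake_judge(answer):
    """Hypothetical stand-in for a judge call."""
    await asyncio.sleep(0.01)
    return answer.startswith("answer")


async def run_all(queries):
    # Each semaphore admits at most N coroutines into its guarded block.
    query_slots = asyncio.Semaphore(QUERY_CONCURRENCY)
    judge_slots = asyncio.Semaphore(JUDGE_CONCURRENCY)

    async def run_one(query):
        async with query_slots:
            answer = await fake_search(query)
        async with judge_slots:
            return await fake_judge(answer)

    return await asyncio.gather(*(run_one(q) for q in queries))


if __name__ == "__main__":
    print(asyncio.run(run_all([f"q{i}" for i in range(5)])))
```

Keeping the two limits separate lets judge calls for finished queries proceed while slower provider calls are still waiting for a query slot.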

./scripts/searchbench calibrate
./scripts/searchbench debug --provider exa --queries hard --count 5
./scripts/searchbench calibrate --apply
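
The README does not spell out how calibration derives a timeout from history, so the following is only a plausible sketch: take a high percentile of previously observed latencies, add headroom, and clamp the result. The nearest-rank 95th percentile, the 1.5x headroom, and the 5 s and 120 s bounds are all illustrative assumptions, as is the function name calibrate_timeout.

```python
import math


def calibrate_timeout(latencies_s, percentile=95.0, headroom=1.5,
                      floor_s=5.0, ceiling_s=120.0):
    """Suggest a timeout in seconds from prior latencies.

    Hypothetical: every default here is an illustrative assumption.
    """
    if not latencies_s:
        return floor_s  # no history yet, fall back to the floor
    ordered = sorted(latencies_s)
    # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    baseline = ordered[rank - 1]
    # Add headroom, then clamp to the [floor_s, ceiling_s] range.
    return min(ceiling_s, max(floor_s, baseline * headroom))


if __name__ == "__main__":
    history = [1.2, 1.9, 2.4, 3.1, 8.7]  # made-up latencies in seconds
    print(calibrate_timeout(history))
```

A percentile-based rule of this kind keeps a single slow outlier from inflating the timeout while still leaving room for normal variance.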

Tests

./.venv/bin/python -m unittest discover -s tests
