CLI tool for benchmarking LLM providers and models. Measures response time, quality (tool accuracy, coherence, relevance), and cost — then generates a presentation-grade PDF report with LLM-generated insights.
Designed for comparing the same model across different providers (e.g., Claude Sonnet via Anthropic direct vs. via Gloo AI) to answer: is the provider adding latency, reducing quality, or saving enough money to justify the tradeoff?
- Multi-provider benchmarking — test multiple providers in a single run, each with its own auth (API key or OAuth client_credentials)
- Head-to-head comparisons — tag models with `base_model` to group the same underlying model across providers
- LLM-generated insights — a configurable report LLM generates executive summaries and per-comparison analysis
- Per-provider framing — `insights_context` lets you control the tone of generated insights (e.g., acknowledge a provider's guardrails or value-add)
- PDF reports — timestamped, presentation-ready reports with side-by-side metrics and delta indicators
- Flexible auth — API keys via env vars, or OAuth2 client_credentials flow with Basic or POST body methods
- Custom headers — per-provider and per-model `extra_headers` for toggling features like extended context
Requires Python 3.11+.
```bash
# Clone the repo
git clone <repo-url> && cd ai-benchpress

# Install in editable mode
pip install -e .

# Or with pipx for an isolated install
pipx install -e .
```

- Copy the example config and environment file:

  ```bash
  cp benchpress.yaml.example benchpress.yaml
  cp .env.example .env
  ```

- Add your API keys to `.env`:

  ```bash
  ANTHROPIC_API_KEY=sk-ant-...
  # For OAuth providers:
  # GLOO_AI_CLIENT_ID=...
  # GLOO_AI_CLIENT_SECRET=...
  ```

- Edit `benchpress.yaml` to configure your providers and models.

- Validate your config:

  ```bash
  benchpress validate
  ```

- Run benchmarks:

  ```bash
  benchpress run -v
  ```

All configuration lives in `benchpress.yaml`. See `benchpress.yaml.example` for a fully documented example.
Each provider needs a `name`, `base_url`, and authentication:

```yaml
providers:
  # API key auth
  - name: "anthropic"
    base_url: "https://api.anthropic.com/v1"
    api_key_env: "ANTHROPIC_API_KEY"
    models: [...]

  # OAuth client_credentials auth
  - name: "gloo-ai"
    base_url: "https://platform.ai.gloo.com/ai/v2/"
    oauth:
      token_url: "https://platform.ai.gloo.com/oauth2/token"
      client_id_env: "GLOO_AI_CLIENT_ID"
      client_secret_env: "GLOO_AI_CLIENT_SECRET"
      scopes: ["api/access"]
      auth_method: "basic"  # or "post_body"
    models: [...]
```
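For reference, the client_credentials exchange configured above is a single form-encoded POST to `token_url`; `auth_method` decides where the client credentials go. A minimal sketch of that request construction (a hypothetical `build_token_request` helper, not the tool's actual code):

```python
import base64


def build_token_request(client_id: str, client_secret: str,
                        scopes: list[str], auth_method: str = "basic"):
    """Build headers and form data for an OAuth2 client_credentials token POST."""
    data = {"grant_type": "client_credentials", "scope": " ".join(scopes)}
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if auth_method == "basic":
        # Credentials travel in an HTTP Basic Authorization header (RFC 6749 §2.3.1)
        raw = f"{client_id}:{client_secret}".encode()
        headers["Authorization"] = "Basic " + base64.b64encode(raw).decode()
    elif auth_method == "post_body":
        # Credentials travel in the form body instead
        data["client_id"] = client_id
        data["client_secret"] = client_secret
    else:
        raise ValueError(f"unknown auth_method: {auth_method!r}")
    return headers, data


headers, data = build_token_request("my-id", "my-secret", ["api/access"])
```

POST the resulting `data` with those `headers` to the provider's `token_url` and read `access_token` from the JSON response.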
```yaml
  # No auth (e.g., local Ollama)
  - name: "ollama"
    base_url: "http://localhost:11434/v1"
    models: [...]
```

Tag models with `base_model` to group the same underlying model across providers:
```yaml
# Under the anthropic provider
- id: "claude-sonnet-4-5"
  display_name: "Sonnet 4.5"
  base_model: "claude-sonnet-4-5"

# Under the gloo-ai provider
- id: "gloo-anthropic-claude-sonnet-4.5"
  display_name: "Gloo AI - Sonnet 4.5"
  base_model: "claude-sonnet-4-5"  # same tag = head-to-head comparison
```

Control how the report LLM frames each provider:
- name: "gloo-ai"
insights_context: |
Gloo AI provides faith-based guardrails and prompt enhancements.
Acknowledge the value these safety features add beyond raw performance.Add custom headers at the provider or model level:
- name: "anthropic"
extra_headers:
X-Custom-Header: "value" # applied to all models
models:
- id: "claude-opus-4-6"
extra_headers:
anthropic-beta: "interleaved-thinking-2025-05-14" # per-modelModel headers merge on top of provider headers (model wins on conflict).
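That merge is a shallow overlay: equivalent to a dict union where model-level keys win (a sketch of the behavior, not the tool's code):

```python
def merge_headers(provider: dict[str, str], model: dict[str, str]) -> dict[str, str]:
    # Model-level headers are layered on top of provider-level ones;
    # on a key conflict the model's value wins.
    return {**provider, **model}


headers = merge_headers(
    {"X-Custom-Header": "value", "X-Trace-Id": "abc"},  # provider level
    {"X-Custom-Header": "override"},                    # model level
)
# headers == {"X-Custom-Header": "override", "X-Trace-Id": "abc"}
```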
Configure which LLM generates report insights:

```yaml
report:
  results_dir: "./results"
  report_path: "./reports"

  # Option A: reference an existing provider
  llm_provider: "anthropic"
  llm_model: "claude-sonnet-4-5"

  # Option B: standalone config (mutually exclusive with Option A)
  # llm:
  #   base_url: "https://api.anthropic.com/v1"
  #   api_key_env: "ANTHROPIC_API_KEY"
  #   model: "claude-sonnet-4-5"
```

If no LLM is configured, reports are generated with metrics only.
```bash
# Run benchmarks
benchpress run [OPTIONS]
  -c, --config        Config file path (default: benchpress.yaml)
  -e, --env           .env file path
  -m, --models        Comma-separated model filter
  -p, --providers     Comma-separated provider filter
  -n, --num-requests  Override number of requests per model
  --concurrency       Override concurrency level
  --interval          Override interval between requests (ms)
  -o, --output        Override report output directory
  --no-pdf            Skip PDF generation; only save JSON
  -v, --verbose       Print per-model results during the run

# Re-generate a PDF from previous JSON results
benchpress report -i results/benchpress-YYYYMMDD-HHMMSS.json
  -c, --config  Config file (needed for LLM insights)
  -o, --output  Override report output directory

# List configured models
benchpress list-models

# Validate config and check env vars
benchpress validate
```

Each run produces:
- JSON results in `results/benchpress-YYYYMMDD-HHMMSS.json` — raw data for all requests
- PDF report in `reports/benchpress-report-YYYYMMDD-HHMMSS.pdf` — the presentation-ready report
The PDF report contains four sections:

- Title & Executive Summary — config, winner badges, LLM-generated narrative
- Head-to-Head Comparisons — side-by-side metrics with deltas for matched models
- Additional Models — unmatched models in a compact table
- Data Appendix — full metrics table, per-category breakdown, error summary
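Head-to-head matching amounts to grouping models by their `base_model` tag: tags shared by more than one model become comparisons, and singletons land in Additional Models. A sketch of that grouping (a hypothetical `split_head_to_head` helper, not the tool's implementation):

```python
from collections import defaultdict


def split_head_to_head(models: list[dict]) -> tuple[dict[str, list[dict]], list[dict]]:
    """Group models by base_model tag: shared tags become head-to-head
    comparisons; models with a unique tag go to 'Additional Models'."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for m in models:
        # Untagged models fall back to their own id, so they never match anything
        groups[m.get("base_model") or m["id"]].append(m)
    matched = {tag: ms for tag, ms in groups.items() if len(ms) > 1}
    additional = [ms[0] for ms in groups.values() if len(ms) == 1]
    return matched, additional


matched, additional = split_head_to_head([
    {"id": "claude-sonnet-4-5", "base_model": "claude-sonnet-4-5"},
    {"id": "gloo-anthropic-claude-sonnet-4.5", "base_model": "claude-sonnet-4-5"},
    {"id": "llama3", "base_model": None},
])
# matched has one group of two under "claude-sonnet-4-5"; llama3 is additional
```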
```
ai-benchpress/
  benchpress/
    cli.py      # CLI commands (typer)
    config.py   # YAML config loading + auth (API key, OAuth)
    client.py   # HTTP client for OpenAI-compatible APIs
    runner.py   # Async benchmark orchestration
    prompts.py  # Test prompt generation
    scorer.py   # Result scoring (tool accuracy, coherence, relevance)
    models.py   # Pydantic data models
    report.py   # PDF report generation + LLM insight calls
  benchpress.yaml.example
  .env.example
  pyproject.toml
```
- Fork the repo and create a feature branch
- Install in editable mode: `pip install -e .`
- Make your changes
- Test with a small run: `benchpress run -n 3 -v`
- Verify config loading: `benchpress validate`
- Submit a pull request
To add a new provider:

- Add the provider block to `benchpress.yaml` with auth config
- Add models with `base_model` tags if comparing against existing providers
- Add `insights_context` to frame the provider in report narratives
- Run `benchpress validate` to check credentials
- Test with `benchpress run -p your-provider -n 3 -v`
- All providers use the OpenAI-compatible `/chat/completions` endpoint
- Keep `benchpress.yaml.example` up to date with any new config fields
- Credential values belong in `.env`, never in YAML files — use `*_env` fields to reference env var names