Benchpress

CLI tool for benchmarking LLM providers and models. Measures response time, quality (tool accuracy, coherence, relevance), and cost — then generates a presentation-grade PDF report with LLM-generated insights.

Designed for comparing the same model across different providers (e.g., Claude Sonnet via Anthropic direct vs. via Gloo AI) to answer: is the provider adding latency, reducing quality, or saving enough money to justify the tradeoff?

Features

Multi-provider benchmarking — test multiple providers in a single run, each with its own auth (API key or OAuth client_credentials)
Head-to-head comparisons — tag models with base_model to group the same underlying model across providers
LLM-generated insights — configurable report LLM generates executive summaries and per-comparison analysis
Per-provider framing — insights_context lets you control the tone of generated insights (e.g., acknowledge a provider's guardrails or value-add)
PDF reports — timestamped, presentation-ready reports with side-by-side metrics and delta indicators
Flexible auth — API keys via env vars, or OAuth2 client_credentials flow with Basic or POST body methods
Custom headers — per-provider and per-model extra_headers for toggling features like extended context

Installation

Requires Python 3.11+.

# Clone the repo
git clone <repo-url> && cd ai-benchpress

# Install in editable mode
pip install -e .

# Or with pipx for isolated install
pipx install -e .

Quick Start

Copy the example config and environment file:

cp benchpress.yaml.example benchpress.yaml
cp .env.example .env

Add your API keys to .env:

ANTHROPIC_API_KEY=sk-ant-...
# For OAuth providers:
# GLOO_AI_CLIENT_ID=...
# GLOO_AI_CLIENT_SECRET=...

Edit benchpress.yaml to configure your providers and models.
Validate your config:

benchpress validate

Run benchmarks:

benchpress run -v

Configuration

All configuration lives in benchpress.yaml. See benchpress.yaml.example for a fully documented example.

Providers

Each provider needs a name, base_url, and authentication:

providers:
  # API key auth
  - name: "anthropic"
    base_url: "https://api.anthropic.com/v1"
    api_key_env: "ANTHROPIC_API_KEY"
    models: [...]

  # OAuth client_credentials auth
  - name: "gloo-ai"
    base_url: "https://platform.ai.gloo.com/ai/v2/"
    oauth:
      token_url: "https://platform.ai.gloo.com/oauth2/token"
      client_id_env: "GLOO_AI_CLIENT_ID"
      client_secret_env: "GLOO_AI_CLIENT_SECRET"
      scopes: ["api/access"]
      auth_method: "basic"    # or "post_body"
    models: [...]

  # No auth (e.g., local Ollama)
  - name: "ollama"
    base_url: "http://localhost:11434/v1"
    models: [...]

Cross-Provider Comparisons

Tag models with base_model to group the same underlying model across providers:

# Under anthropic provider
- id: "claude-sonnet-4-5"
  display_name: "Sonnet 4.5"
  base_model: "claude-sonnet-4-5"

# Under gloo-ai provider
- id: "gloo-anthropic-claude-sonnet-4.5"
  display_name: "Gloo AI - Sonnet 4.5"
  base_model: "claude-sonnet-4-5"    # same tag = head-to-head comparison

Provider Insights Context

Control how the report LLM frames each provider:

- name: "gloo-ai"
  insights_context: |
    Gloo AI provides faith-based guardrails and prompt enhancements.
    Acknowledge the value these safety features add beyond raw performance.

Extra Headers

Add custom headers at the provider or model level:

- name: "anthropic"
  extra_headers:
    X-Custom-Header: "value"        # applied to all models
  models:
    - id: "claude-opus-4-6"
      extra_headers:
        anthropic-beta: "interleaved-thinking-2025-05-14"  # per-model

Model headers merge on top of provider headers (model wins on conflict).

Report & Insights LLM

Configure which LLM generates report insights:

report:
  results_dir: "./results"
  report_path: "./reports"

  # Option A: reference an existing provider
  llm_provider: "anthropic"
  llm_model: "claude-sonnet-4-5"

  # Option B: standalone config (mutually exclusive with Option A)
  # llm:
  #   base_url: "https://api.anthropic.com/v1"
  #   api_key_env: "ANTHROPIC_API_KEY"
  #   model: "claude-sonnet-4-5"

If no LLM is configured, reports are generated with metrics only.

CLI Reference

# Run benchmarks
benchpress run [OPTIONS]
  -c, --config        Config file path (default: benchpress.yaml)
  -e, --env           .env file path
  -m, --models        Comma-separated model filter
  -p, --providers     Comma-separated provider filter
  -n, --num-requests  Override number of requests per model
  --concurrency       Override concurrency level
  --interval          Override interval between requests (ms)
  -o, --output        Override report output directory
  --no-pdf            Skip PDF generation, only save JSON
  -v, --verbose       Print per-model results during run

# Re-generate PDF from previous JSON results
benchpress report -i results/benchpress-YYYYMMDD-HHMMSS.json
  -c, --config        Config file (needed for LLM insights)
  -o, --output        Override report output directory

# List configured models
benchpress list-models

# Validate config and check env vars
benchpress validate

Output

Each run produces:

JSON results in results/benchpress-YYYYMMDD-HHMMSS.json — raw data for all requests
PDF report in reports/benchpress-report-YYYYMMDD-HHMMSS.pdf — presentation-ready report

Report Structure

Title & Executive Summary — config, winner badges, LLM-generated narrative
Head-to-Head Comparisons — side-by-side metrics with deltas for matched models
Additional Models — unmatched models in a compact table
Data Appendix — full metrics table, per-category breakdown, error summary

Project Structure

ai-benchpress/
  benchpress/
    cli.py          # CLI commands (typer)
    config.py       # YAML config loading + auth (API key, OAuth)
    client.py       # HTTP client for OpenAI-compatible APIs
    runner.py       # Async benchmark orchestration
    prompts.py      # Test prompt generation
    scorer.py       # Result scoring (tool accuracy, coherence, relevance)
    models.py       # Pydantic data models
    report.py       # PDF report generation + LLM insight calls
  benchpress.yaml.example
  .env.example
  pyproject.toml

Contributing

Fork the repo and create a feature branch
Install in editable mode: pip install -e .
Make your changes
Test with a small run: benchpress run -n 3 -v
Verify config loading: benchpress validate
Submit a pull request

Adding a New Provider

Add the provider block to benchpress.yaml with auth config
Add models with base_model tags if comparing against existing providers
Add insights_context to frame the provider in report narratives
Run benchpress validate to check credentials
Test with benchpress run -p your-provider -n 3 -v

Guidelines

All providers use the OpenAI-compatible /chat/completions endpoint
Keep benchpress.yaml.example up to date with any new config fields
Credential values belong in .env, never in YAML files — use *_env fields to reference env var names

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
benchpress		benchpress
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
benchpress.yaml		benchpress.yaml
benchpress.yaml.example		benchpress.yaml.example
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchpress

Features

Installation

Quick Start

Configuration

Providers

Cross-Provider Comparisons

Provider Insights Context

Extra Headers

Report & Insights LLM

CLI Reference

Output

Report Structure

Project Structure

Contributing

Adding a New Provider

Guidelines

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Benchpress

Features

Installation

Quick Start

Configuration

Providers

Cross-Provider Comparisons

Provider Insights Context

Extra Headers

Report & Insights LLM

CLI Reference

Output

Report Structure

Project Structure

Contributing

Adding a New Provider

Guidelines

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages