The open-source benchmark runner for AI agent efficiency.
Evaluate any API across 8 dimensions of agent-readiness using multi-LLM scoring.
Installation | Quickstart | How It Works | Templates | Scoring | prowl.world
APIs are designed for humans to read docs and figure things out. But agents don't read docs -- they make HTTP calls and parse responses. An API that's great for humans can be terrible for agents.
prowl-bench measures what matters for agents:
- Does the API respond with parseable, predictable JSON? Not HTML error pages, not XML, not random formats.
- Can an agent authenticate on the first try? Or does it need 47 steps, an OAuth dance, and a CAPTCHA?
- Are errors actionable? `{"error": "invalid"}` tells an agent nothing. `{"error": "missing required field 'email'", "code": "VALIDATION_ERROR"}` tells it exactly what to fix.
- How many tokens does it cost to understand? A 50-page OpenAPI spec vs a clean `/llms.txt` -- the difference is real money.
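As a sketch of the error-actionability point, here is a rough heuristic an agent (or a benchmark) might apply to an error payload. This is illustrative only -- it is not prowl-bench's actual scoring logic, and the function name is hypothetical:

```python
# Rough heuristic for "actionable" error payloads: parseable JSON with a
# machine-readable code and a message that names what to fix.
# Illustrative only -- not prowl-bench's actual scoring logic.
import json

def is_actionable_error(body: str) -> bool:
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # HTML error pages and plain text fail immediately
    if not isinstance(payload, dict):
        return False
    message = str(payload.get("error") or payload.get("message") or "")
    has_code = "code" in payload
    return has_code and len(message.split()) >= 3  # more than a bare word

print(is_actionable_error('{"error": "invalid"}'))  # False
print(is_actionable_error(
    '{"error": "missing required field \'email\'", "code": "VALIDATION_ERROR"}'
))  # True
```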
Traditional API testing tools measure uptime and response time. prowl-bench measures whether an AI agent can actually use your API.
```text
$ prowl-bench run https://api.stripe.com

Benchmarking https://api.stripe.com ...

SPEC       Fetched llms.txt ................................ OK (0.8s)
ANALYZE    Extracting service structure .................... OK (2.1s)
PLAN       Designing 12 test cases ......................... OK (1.4s)
EXECUTE    Running tests against live API .................. OK (3.2s)
INTERPRET  Normalizing scores (3 LLMs) ..................... OK (2.8s)

prowl-bench v0.1.0 | Template: api_benchmark | LLMs: claude, gpt-4o, gemini

┌─ Stripe API Score: 82 ───────────────────────────┐
│                                                  │
│  auth simplicity     ████████░░   8.0            │
│  consistency         █████████░   9.0            │
│  doc quality         █████████░   8.5            │
│  error clarity       █████████░   9.2            │
│  first try success   ███████░░░   7.0            │
│  latency             ████████░░   8.0            │
│  response parseab..  ██████████   9.5            │
│  token efficiency    ███████░░░   7.0            │
│                                                  │
└──────────────────────────────────────────────────┘

Issues: 2
  - OpenAPI spec is 48,000+ tokens -- consider publishing /llms.txt
  - POST /v1/charges returns HTML on 402 status codes

Recommendations:
  - Add structured error codes to all 4xx responses
  - Publish a condensed /llms.txt for agent consumers
```
```shell
pip install prowl-bench
```

Requires Python 3.10+. No system dependencies.
```shell
# Set at least one LLM API key
export ANTHROPIC_API_KEY="sk-ant-..."
# or OPENAI_API_KEY, or GOOGLE_API_KEY -- more keys = more balanced scoring

# Benchmark any API
prowl-bench run https://api.stripe.com

# With a specific template
prowl-bench run https://api.stripe.com --template api_benchmark

# With credentials for authenticated endpoints
prowl-bench run https://api.openai.com \
  --credential "sk-proj-abc123" \
  --credential-type bearer_token

# Output as JSON (for pipelines)
prowl-bench run https://api.example.com --output json > results.json

# CI mode: exit 1 if score below threshold
prowl-bench run https://api.example.com --min-score 70
```

prowl-bench runs a 4-phase pipeline. Each phase is driven by an LLM that reads real data, not hardcoded heuristics.
```text
┌──────────────────────────────────────────────────────────────┐
│                     prowl-bench pipeline                     │
└──────────────────────────────────────────────────────────────┘

┌───────────┐    ┌──────────┐    ┌───────────┐    ┌─────────────┐
│  ANALYZE  │───>│   PLAN   │───>│  EXECUTE  │───>│  INTERPRET  │
│           │    │          │    │           │    │             │
│ Read spec │    │ Design   │    │ Run real  │    │ Score 0-10  │
│ Extract   │    │ test     │    │ HTTP      │    │ across 8    │
│ structure │    │ cases    │    │ requests  │    │ dimensions  │
│ + auth    │    │ + probes │    │ + record  │    │ (multi-LLM) │
└───────────┘    └──────────┘    └───────────┘    └─────────────┘
      ▲                                                  │
      │  OpenAPI / llms.txt / HTML                       │  Claude + GPT-4o
      └───────────────── input                           │  + Gemini average
                                                         ▼
                                                 ┌─────────────┐
                                                 │   REPORT    │
                                                 │ Terminal /  │
                                                 │  JSON / CI  │
                                                 └─────────────┘
```
Phase 1 -- ANALYZE: The LLM reads the API spec (OpenAPI, llms.txt, or raw HTML) and extracts the service type, authentication method, endpoints, pricing model, and rate limits.
Phase 2 -- PLAN: Based on the analysis, the LLM designs targeted test cases: endpoint probes, error handling checks, auth flow tests, and pricing verification.
Phase 3 -- EXECUTE: Real HTTP requests are made against the live API. Every request goes through a sandbox that blocks SSRF, validates payloads, and prevents prompt injection. Responses, latencies, and errors are recorded.
Phase 4 -- INTERPRET: All available LLMs score the results independently across 8 dimensions. Scores are averaged for balance. More LLM providers = less bias.
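The INTERPRET averaging step can be sketched as follows. The function and data shapes are illustrative assumptions, not prowl-bench's actual internals:

```python
# Illustrative sketch of multi-LLM score averaging: each provider scores the
# same dimensions independently, and the per-dimension mean is reported.
# Hypothetical helper -- not prowl-bench's actual API.
from statistics import mean

def average_scores(per_model: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each dimension's score across every model that scored it."""
    dimensions: set[str] = set()
    for scores in per_model.values():
        dimensions.update(scores)
    return {
        dim: round(mean(s[dim] for s in per_model.values() if dim in s), 2)
        for dim in sorted(dimensions)
    }

per_model = {
    "claude": {"error_clarity": 9.0, "latency": 8.0},
    "gpt-4o": {"error_clarity": 9.5, "latency": 7.5},
    "gemini": {"error_clarity": 9.2, "latency": 8.5},
}
print(average_scores(per_model))
# {'error_clarity': 9.23, 'latency': 8.0}
```

Averaging over whichever providers are configured is why adding more API keys reduces single-model bias.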
6 benchmark templates, auto-detected from service metadata:
| Template | Credentials | Auto-detected when | Best for |
|---|---|---|---|
| `api_benchmark` | Required | Has OpenAPI spec or benchmark guide | REST APIs, LLM providers |
| `platform_profile` | No | No API indicators found | SaaS platforms, web tools |
| `mcp_compliance` | No | Has MCP manifest URL | MCP servers |
| `docs_quality` | No | Has API docs URL only | Documentation audits |
| `defi_yield` | Required | Categories: defi, staking, yield | DeFi protocols |
| `crypto_app` | Required | Categories: crypto, exchange, wallet | Exchanges, wallets |
```shell
# List all templates with details
prowl-bench templates

# Force a specific template
prowl-bench run https://example.com --template platform_profile
```

8 dimensions, weighted for real-world agent efficiency:
| Dimension | Weight | What it measures |
|---|---|---|
| token_efficiency | 25% | How many tokens an agent needs to understand and use the API |
| first_try_success | 20% | Percentage of calls that succeed on the first attempt |
| response_parseability | 15% | Clean, predictable JSON vs HTML error pages and mixed formats |
| error_clarity | 15% | Whether errors tell the agent exactly what to fix |
| doc_quality | 10% | Completeness of spec, docs, or llms.txt |
| auth_simplicity | 5% | How many steps to authenticate (1 header vs OAuth dance) |
| latency | 5% | Raw response speed |
| consistency | 5% | Same request always returns the same response shape |
Each dimension is scored 0-10, then weighted to produce an overall score of 0-100.
Token efficiency and first-try success carry the most weight because they directly impact agent cost and reliability.
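The weighting can be sketched as a simple computation using the table above. The helper itself is illustrative, not prowl-bench's internal `scoring` module:

```python
# Weighted overall score, using the dimension weights from the table above.
# Hypothetical helper for illustration -- not prowl-bench's internal API.
WEIGHTS = {
    "token_efficiency": 0.25,
    "first_try_success": 0.20,
    "response_parseability": 0.15,
    "error_clarity": 0.15,
    "doc_quality": 0.10,
    "auth_simplicity": 0.05,
    "latency": 0.05,
    "consistency": 0.05,
}

def overall_score(dimensions: dict[str, float]) -> float:
    """Combine 0-10 dimension scores into a 0-100 overall score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must sum to 100%
    return round(sum(dimensions[d] * w for d, w in WEIGHTS.items()) * 10, 1)

# Dimension scores from the Stripe demo output above (rounded display values)
stripe = {
    "token_efficiency": 7.0, "first_try_success": 7.0,
    "response_parseability": 9.5, "error_clarity": 9.2,
    "doc_quality": 8.5, "auth_simplicity": 8.0,
    "latency": 8.0, "consistency": 9.0,
}
print(overall_score(stripe))
```

Plugging in the rounded demo numbers lands close to the 82 shown in the example run.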
prowl-bench runs the INTERPRET phase across every available LLM provider and averages scores to reduce single-model bias:
| Provider | Env Variable | Model |
|---|---|---|
| Claude | `ANTHROPIC_API_KEY` | Claude Sonnet |
| GPT-4o | `OPENAI_API_KEY` | GPT-4o |
| Gemini | `GOOGLE_API_KEY` | Gemini 2.5 Flash |
| Claude CLI | (fallback) | Uses web subscription |
Set multiple keys for more balanced results:
```shell
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="AI..."
```

With all three, each model scores independently and results are averaged. The JSON output includes per-model breakdowns.
```python
import asyncio

from prowl_bench import BenchmarkReport
from prowl_bench.core.pipeline import run_benchmark

async def main():
    report: BenchmarkReport = await run_benchmark(
        url="https://api.stripe.com",
        name="Stripe",
        spec_content="...",  # OpenAPI spec, llms.txt, or any text
    )
    print(f"Overall: {report.overall_score}/100")
    print(f"Template: {report.template}")
    for dim, score in sorted(report.dimensions.items()):
        print(f"  {dim}: {score}/10")
    for issue in report.issues:
        print(f"  Issue: {issue}")

asyncio.run(main())
```

For JSON export:
```python
from prowl_bench.output.json_export import report_to_json

json_str = report_to_json(report)
```

Add prowl-bench to your CI pipeline to catch agent-efficiency regressions:
```yaml
# .github/workflows/api-bench.yml
name: API Benchmark

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6am

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install prowl-bench
      - name: Run benchmark
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          prowl-bench run https://api.yourservice.com \
            --min-score 70 \
            --output json > benchmark.json
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: benchmark.json
```

The `--min-score` flag exits with code 1 if the overall score drops below the threshold, failing the CI job.
Results can be submitted to prowl.world for public aggregation. Submitted benchmarks are weighted by runner trust tier and averaged across all contributors.
```shell
# One-time: register for an agent key
prowl-bench register

# Set the key
export PROWL_AGENT_KEY="ak_abc123..."

# Benchmark and submit
prowl-bench run https://api.example.com --submit
```

Submitted results appear on the service's public profile at prowl.world/app#/service/{slug}.
prowl-bench sandboxes all outbound requests:
- SSRF prevention -- URLs are validated against blocked networks, private IPs, cloud metadata endpoints, and localhost
- Payload caps -- Request bodies are capped at 10KB
- Prompt injection protection -- All user inputs are sanitized before being sent to LLMs
- Rate limiting -- Max 20 HTTP requests per benchmark run
- No credential leakage -- Credentials are never included in LLM prompts or output
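The SSRF guard in the first bullet can be sketched with the standard library. This is a simplified illustration of the idea, not prowl-bench's actual `url_validator` module:

```python
# Simplified SSRF guard: resolve the host and reject private, loopback,
# link-local (incl. cloud metadata at 169.254.169.254), and reserved ranges.
# Illustrative only -- not prowl-bench's actual url_validator.
import ipaddress
import socket
from urllib.parse import urlparse

def is_url_allowed(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False  # unresolvable hosts are rejected, not retried
    for info in infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            return False
    return True

print(is_url_allowed("http://169.254.169.254/latest/meta-data/"))  # False
print(is_url_allowed("http://localhost:8080/"))                    # False
```

Note that a production validator must also re-check the resolved IP at connection time to defeat DNS-rebinding, which this sketch omits.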
```text
src/prowl_bench/
├── cli.py                   # Typer CLI (run, templates, register)
├── config.py                # Settings from env vars
├── core/
│   ├── pipeline.py          # 4-phase benchmark pipeline
│   ├── scoring.py           # Weighted score computation
│   ├── types.py             # Dataclasses (BenchmarkReport, etc.)
│   └── json_utils.py        # Safe JSON extraction from LLM output
├── llm/
│   ├── router.py            # Multi-provider LLM router
│   ├── providers.py         # Claude, GPT-4o, Gemini, CLI fallback
│   └── prompts.py           # System prompts for each phase
├── output/
│   ├── terminal.py          # Rich terminal rendering
│   └── json_export.py       # JSON report export
├── sandbox/
│   ├── url_validator.py     # SSRF prevention
│   ├── payload_validator.py # Size + content validation
│   └── prompt_sanitizer.py  # Injection protection
├── submission/
│   └── client.py            # Submit results to prowl.world
└── templates/
    ├── base.py              # Base template class
    ├── api_benchmark.py     # REST API benchmark
    ├── platform_profile.py  # SaaS platform profile
    ├── mcp_compliance.py    # MCP server compliance
    ├── docs_quality.py      # Documentation audit
    ├── defi_yield.py        # DeFi protocol benchmark
    └── crypto_app.py        # Crypto app benchmark
```
See CONTRIBUTING.md for guidelines.
```shell
# Clone and install dev dependencies
git clone https://github.com/opcastil11/prowl-bench.git
cd prowl-bench
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest tests/ -q

# Lint
ruff check src/ tests/
```

Prowl is funded by the $PROWL token on Solana. Token proceeds fund LLM inference costs, crawler infrastructure, and open source development.
- Token: $PROWL on Pump.fun
- Mint: `DRg2EnkqTNFVnBegv1KReGTWs1cGBNCfyyUnY6bkpump`
- Chain: Solana
- Payment: Paid API endpoints accept $PROWL token transfers (any amount)
Apache 2.0 -- see LICENSE.
