Skip to content

Benchmarking and Diff

github-actions[bot] edited this page Mar 15, 2026 · 1 revision

Benchmarking and Diff

contextbudget benchmark

Compare deterministic packing strategies for one task:

Strategy Description
naive full-context All files included without compression
top-k selection Top-ranked files included without compression
compressed pack Full compression pipeline
cache-assisted pack Compressed with summary cache reuse
contextbudget benchmark "add rate limiting to auth API" --repo .
contextbudget benchmark <task> --workspace <workspace.toml>

Outputs:

  • Terminal summary
  • JSON artifact with strategy comparisons
  • Markdown report
  • Active token-estimator backend report
  • estimator_samples — compact comparison of built-in estimators on local sample text

contextbudget diff

Compare two run artifacts and inspect what changed.

contextbudget diff old-run.json new-run.json

Inspects:

  • Task differences
  • Files added/removed in packed context
  • Ranked score changes
  • Token, savings, risk, and cache deltas

Included Benchmark Dataset

A realistic multi-file Python service ships with the repository at benchmarks/dataset/ so you can run reproducible benchmarks without an external project.

The dataset is a task-manager API with:

  • Models (task.py, user.py)
  • Services (task_service.py, user_service.py)
  • Database layer (connection.py, repository.py)
  • Route handlers (tasks.py, users.py)
  • Tests

This is representative of a production codebase with meaningful cross-file dependencies.


Pre-Generated Results

Task Baseline Compressed Reduction
Add Redis caching to task lookup endpoints 12,230 tok 7,937 tok 35%
Add JWT authentication 12,230 tok 3,259 tok 73%
Refactor database module 12,230 tok 1,768 tok 86%

See docs/benchmarks/ in the repository for the full pre-generated reports.


Reproduce Locally

Run all three canonical tasks:

python benchmarks/run_benchmarks.py

Or run a single task:

contextbudget benchmark "Add Redis caching to task lookup endpoints" \
    --repo benchmarks/dataset --max-tokens 8000

Using Diff in CI

After two runs (e.g. before and after a code change), diff them to detect context regressions:

contextbudget pack "add caching" --repo . --max-tokens 30000
mv run.json baseline-run.json

# make changes to the repo...

contextbudget pack "add caching" --repo . --max-tokens 30000
contextbudget diff baseline-run.json run.json

For PR-level diffing, see contextbudget pr-audit in the CLI Reference.

Clone this wiki locally