A framework for the evaluation of local AI systems.
The Proxy Evaluation Suite (PES) is a command-line framework designed to provide a stable, reproducible, and extensible environment for benchmarking AI inference engines, with a primary focus on local, on-device performance on Apple Silicon.
While many tools measure raw throughput, PES is built on the philosophy that true performance is a holistic measure of the user experience. It provides the instrumentation to capture not just speed, but also resource efficiency and stability—the critical factors for local AI.
- Holistic Performance Metrics: Go beyond tokens-per-second. PES is architected to measure the complete lifecycle of a local inference task (see the sketch after this list), including:
  - Time-To-First-Token (TTFT)
  - Total Generation Time & Throughput
  - Inter-Token Latency (mean of per-token deltas)
  - Model Load Time
  - Peak & Average Memory Usage (RSS)
  - Average CPU Utilization
- Declarative Benchmark Suites: Define complex, multi-step, comparative benchmarks in simple YAML files. This allows you to version-control your entire evaluation suite alongside your code, ensuring reproducible results and a clear history of performance over time.
- Extensible by Design: PES is a framework, not just a script. The clear separation of concerns into `Targets` (the engine being tested), `Scenarios` (the benchmark logic), and `Analyzers` (the reporting logic) allows for simple, decoupled extension.
- Local-First Focus: Natively handles the complete lifecycle of local inference servers—startup, readiness probing, execution, and robust teardown—providing an accurate picture of real-world performance.
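For intuition, the latency metrics above relate to one another roughly as in the sketch below, which derives them from a list of per-token arrival times. This is an illustrative calculation only, not PES's internal implementation; the function name and field names are made up.

```python
# Illustrative only: deriving TTFT, throughput, and inter-token latency from
# per-token arrival times (seconds since the request was issued).
def summarize_stream(token_times: list[float]) -> dict:
    if not token_times:
        return {}
    deltas = [b - a for a, b in zip(token_times, token_times[1:])]
    total_time = token_times[-1]
    return {
        "ttft_s": token_times[0],                      # Time-To-First-Token
        "total_time_s": total_time,                    # total generation time
        "tokens_per_second": len(token_times) / total_time,
        # Inter-token latency: mean of per-token deltas
        "inter_token_latency_s": sum(deltas) / len(deltas) if deltas else 0.0,
    }


print(summarize_stream([0.12, 0.15, 0.18, 0.21, 0.25]))
```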
You can run your first comparative benchmark in a single command. The following will test a local mlx-lm instance against the OpenAI API for Time-To-First-Token.
```bash
pes --scenario ttft \
  --target mlx_lm --mlx-model "meta-llama/Llama-3.1-8B-Instruct" \
  --target openai --openai-model "gpt-4o" \
  --repetitions 3 \
  --prompt-lengths "32,256"
```

After the run, a new directory will be created in `runs/`, containing:

- `results.json`: The raw, detailed metrics from the run.
- `ttft_summary.csv`: A CSV file with summary statistics.
- `ttft_summary.png`: A bar chart visualizing the performance comparison.
For complex or recurring evaluations, the YAML configuration is the recommended approach.
- Create a `suite.yml` file:

```yaml
# suite.yml
run_settings:
  # Directory where all runs from this suite will be stored
  output_dir: "runs/my_benchmark_suite"

benchmarks:
  - name: "Nightly TTFT Regression (PIE vs MLX)"
    scenario: ttft
    repetitions: 7
    prompt_lengths: "32, 64, 128, 256"
    targets:
      - name: pie_local
        pie_config: "nightly"  # Fetches the latest dev build
        pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
      - name: mlx_lm
        mlx_model: "meta-llama/Llama-3.1-8B-Instruct"

  - name: "Long Context Stability Check"
    scenario: ttft
    repetitions: 1
    prompt_lengths: "4096, 8192"
    targets:
      - name: pie_local
        pie_config: "nightly"
        pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
```
- Execute the suite:

```bash
# Run the entire suite and persist results to a database
pes --config suite.yml --db-path results.db
```

This command will execute each benchmark defined in the file sequentially, creating a separate run directory for each and saving the results to the specified SQLite database.
PES is built on a few simple, powerful abstractions:
- Target: The system under test. This could be a local C++ binary (`PIELocalTarget`), a Python server (`MLXTarget`), or a remote API (`OpenAITarget`). The target is responsible for its own setup, execution, and teardown.
- Scenario: The benchmark logic. A scenario defines the specific workload to be run against a target, such as measuring TTFT (`TTFTScenario`), single-stream throughput (`ThroughputScenario`), or a holistic latency+throughput run (`LatencyAndThroughputScenario`).
- Orchestrator: The core component that manages the lifecycle of a single run, pairing a `Scenario` with a `Target`, invoking the `Monitor`, and collecting results (sketched after this list).
- Storage: The persistence layer. By default, PES saves results to JSON files, but it can be configured to use a persistent `SQLiteStorage` backend for longitudinal performance tracking.
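Conceptually, the Orchestrator wires these pieces together for a single run roughly as below. This is pseudocode for orientation only; the real class and method names live in the PES source and may differ.

```python
# Conceptual sketch of one orchestrated run (names are illustrative, not the PES API).
def run_benchmark(scenario, target, monitor, storage):
    target.setup()                             # start the engine / open the client
    monitor.start()                            # begin sampling memory (RSS) and CPU
    try:
        results = scenario.execute(target)     # drive the workload, collect timings
    finally:
        resource_stats = monitor.stop()
        target.teardown()                      # always tear the engine down cleanly
    storage.save(results, resource_stats)      # JSON by default, SQLite if configured
    return results
```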
PES includes a powerful reporting tool (`pes report`) that generates both interactive HTML visualizations and concise text summaries of your benchmark results.
- Interactive HTML Dashboards: Visualize performance trends over time with Plotly charts
- Text Summaries: Get quick, actionable insights in plain text format
- Flexible Input Sources: Read from SQLite database or directly from JSON result files
- Historical Analysis: Track performance regressions and improvements across runs
Generate comprehensive dashboards with trend analysis from your results database:
```bash
# Generate both HTML dashboard and text summary
pes report --db-path results.db \
  --output-html dashboard.html \
  --output-txt summary.txt

# Generate only the text summary for quick feedback
pes report --db-path results.db --output-txt latest.txt
```

Analyze individual benchmark runs directly without database persistence:
```bash
# Generate summary from a specific run's results.json
pes report --json-file runs/my_run/results.json \
  --output-html report.html

# Useful for rapid development iteration
pes --scenario ttft --target pie_local ... --output-dir runs/dev_test
pes report --json-file runs/dev_test/*/results.json --output-txt quick_report.txt
```

- Text summary: deterministic snapshot with batch throughput and multi-candidate headline metrics.
- Artifacts: persisted `ScenarioRun` (`scenario_run.json`) plus raw database entries for downstream analysis.
Example text summary:
```text
Run Name: 20241012_batch_throughput_on_pie_local
Scenario: batch_throughput
Targets: pie_local
Timestamp: 2024-10-12T03:14:15Z

Batch Throughput Metrics:
- Batch 1: 210.2 tok/s | makespan 485 ms | TTFT p95 92 ms | latency p95 140 ms
- Batch 8: 1,412.4 tok/s | makespan 780 ms | TTFT p95 165 ms | latency p95 310 ms
```
The `multi_candidate` scenario appends a separate section with candidate throughput and best-of overhead data.
- Purpose: the `batch_throughput` scenario measures aggregate throughput and prompt-level latency distributions across batch sizes.
- CLI example:
```bash
pes --scenario batch_throughput \
  --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
  --batch-sizes 1,4,8 --max-new-tokens 128
```

- YAML example:
```yaml
benchmarks:
  - name: "Batch Throughput Benchmark"
    scenario: batch_throughput
    batch_sizes: "1, 8, 16"
    max_new_tokens: 128
    prompt_text: "Summarize the refactor status."
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
```

- Headline metrics surfaced by reporting: `batch_tokens_per_second`, `prompt_ttft_p95_ms`, `prompt_latency_p95_ms`
- Purpose: the `multi_candidate` scenario evaluates multi-candidate (`n`) and server-side `best_of` sampling efficiency.
- CLI example:
```bash
pes --scenario multi_candidate \
  --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
  --n-values 1,2,4 --best-of-values 1,2,4 --max-new-tokens 128
```

- YAML example:
```yaml
benchmarks:
  - name: "Multi-Candidate Sweep"
    scenario: multi_candidate
    n_values: "1, 2, 4"
    best_of_values: "1, 2, 4"
    repetitions: 2
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
```

- Headline metrics: `candidate_tokens_per_second`, `best_of_overhead`, `candidate_ttft_avg_ms`, `candidate_total_time_avg_ms`
Adding new components is designed to be straightforward. To add a custom target:

- Create a new file in `pes/targets/`.
- Implement the `Target` abstract base class.
- Register your new target with the framework using a decorator.
```python
# pes/targets/my_custom_target.py
from pes.core.registry import register_target
from pes.targets.base import Target


@register_target("my_engine")
class MyEngineTarget(Target):
    # ... implement abstract methods: setup, run_generation, teardown ...
    pass
```
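For orientation, a filled-in target might look something like the sketch below. The method signatures, the returned metrics dictionary, and the hypothetical local HTTP endpoint are assumptions for illustration; consult `pes/targets/base.py` for the actual `Target` contract.

```python
# Illustrative sketch of a concrete target; signatures and endpoint are assumptions.
import time

import httpx

from pes.core.registry import register_target
from pes.targets.base import Target


@register_target("my_engine")
class MyEngineTarget(Target):
    """Benchmarks a hypothetical local engine exposed over HTTP."""

    def setup(self) -> None:
        # Connect to the engine; here we assume it is already listening locally.
        self.client = httpx.Client(base_url="http://127.0.0.1:8080", timeout=60.0)

    def run_generation(self, prompt: str, max_new_tokens: int = 128) -> dict:
        # Issue a single generation request and record a coarse wall-clock timing.
        start = time.perf_counter()
        response = self.client.post(
            "/generate",
            json={"prompt": prompt, "max_new_tokens": max_new_tokens},
        )
        response.raise_for_status()
        return {
            "text": response.json().get("text", ""),
            "total_time_s": time.perf_counter() - start,
        }

    def teardown(self) -> None:
        # Release resources so repeated runs start from a clean state.
        self.client.close()
```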
To add a custom scenario, the steps are the same:

- Create a new file in `pes/scenarios/`.
- Implement the `Scenario` abstract base class.
- Register your new scenario with the framework.
```python
# pes/scenarios/my_scenario.py
from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario


@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    # ... implement abstract methods: add_cli_args, execute, etc. ...
    pass
```
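Likewise, a concrete scenario is mostly a matter of filling in the abstract methods. The sketch below is illustrative only: the `add_cli_args`/`execute` signatures, the return shape, and the dataset format are assumptions, so check `pes/scenarios/base.py` for the real contract.

```python
# Illustrative sketch of a concrete scenario; signatures and helpers are assumptions.
import argparse
import json
import time
from pathlib import Path

from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario


@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    """Scores generations against expected answers (hypothetical workload)."""

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        # Expose scenario-specific knobs on the pes command line.
        parser.add_argument("--dataset-path", default="data/eval.jsonl")
        parser.add_argument("--max-new-tokens", type=int, default=128)

    def execute(self, target, args) -> dict:
        # Drive the target once per example and aggregate scenario-level metrics.
        examples = [
            json.loads(line)
            for line in Path(args.dataset_path).read_text().splitlines()
            if line.strip()
        ]
        correct = 0
        start = time.perf_counter()
        for example in examples:
            result = target.run_generation(example["prompt"], args.max_new_tokens)
            correct += int(example["answer"] in result["text"])
        return {
            "accuracy": correct / max(len(examples), 1),
            "wall_time_s": time.perf_counter() - start,
        }
```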
Install PES from source:

```bash
git clone https://github.com/TheProxyCompany/proxy-evaluation-suite.git
cd proxy-evaluation-suite
pip install -e .
```

Core dependencies are automatically installed, including:

- `pandas`, `plotly`, `scipy` for data analysis and visualization
- `httpx` for async HTTP client functionality
- `pydantic` for data validation
- `psutil` for system monitoring
For development:

```bash
pip install -e .[dev]
```

For longitudinal performance tracking, use the SQLite storage backend:
```bash
# Run benchmarks and save to database
pes --config suite.yml --db-path results.db

# Query historical results
sqlite3 results.db "SELECT * FROM runs WHERE scenario='ttft' ORDER BY timestamp DESC LIMIT 10"
```

PES is designed for CI/CD integration. Example GitHub Actions workflow:
```yaml
- name: Run Performance Benchmarks
  run: |
    pes --config suite/nightly.yml --db-path results.db
    pes report --db-path results.db --output-txt performance.txt

- name: Check Performance Regression
  run: |
    # Parse performance.txt and fail if regression detected
    python scripts/check_regression.py performance.txt
```
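The regression check itself is left to your pipeline. As one possible shape for `scripts/check_regression.py`, the sketch below scans the text summary for throughput figures (matching the `tok/s` lines in the example summary above) and fails the job if the best value drops below a floor stored in a hypothetical `baseline.json`:

```python
# scripts/check_regression.py (illustrative sketch)
# Assumes the text summary contains throughput lines such as
# "- Batch 8: 1,412.4 tok/s | ..." and that baseline.json stores
# {"min_tokens_per_second": <float>}. Both are assumptions, not PES contracts.
import json
import re
import sys
from pathlib import Path

TOKS_RE = re.compile(r"([\d,]+(?:\.\d+)?)\s*tok/s")


def main(summary_path: str, baseline_path: str = "baseline.json") -> int:
    text = Path(summary_path).read_text()
    values = [float(m.replace(",", "")) for m in TOKS_RE.findall(text)]
    if not values:
        print("No throughput figures found in summary; nothing to check.")
        return 0
    floor = json.loads(Path(baseline_path).read_text())["min_tokens_per_second"]
    best = max(values)
    if best < floor:
        print(f"Regression: best throughput {best:.1f} tok/s is below baseline {floor:.1f} tok/s")
        return 1
    print(f"OK: best throughput {best:.1f} tok/s (baseline {floor:.1f} tok/s)")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```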