A framework for the evaluation of local AI systems.
The Proxy Evaluation Suite (PES) is a command-line framework designed to provide a stable, reproducible, and extensible environment for benchmarking AI inference engines, with a primary focus on local, on-device performance on Apple Silicon.
While many tools measure raw throughput, PES is built on the philosophy that true performance is a holistic measure of the user experience. It provides the instrumentation to capture not just speed, but also resource efficiency and stability—the critical factors for local AI.
- Holistic Performance Metrics: Go beyond tokens-per-second. PES is architected to measure the complete lifecycle of a local inference task (see the sketch after this list), including:
  - Time-To-First-Token (TTFT)
  - Total Generation Time & Throughput
  - Inter-Token Latency (mean of per-token deltas)
  - Model Load Time
  - Peak & Average Memory Usage (RSS)
  - Average CPU Utilization
- Declarative Benchmark Suites: Define complex, multi-step, comparative benchmarks in simple YAML files. This allows you to version-control your entire evaluation suite alongside your code, ensuring reproducible results and a clear history of performance over time.
- Extensible by Design: PES is a framework, not just a script. The clear separation of concerns into `Targets` (the engine being tested), `Scenarios` (the benchmark logic), and `Analyzers` (the reporting logic) allows for simple, decoupled extension.
- Local-First Focus: Natively handles the complete lifecycle of local inference servers—startup, readiness probing, execution, and robust teardown—providing an accurate picture of real-world performance.
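For intuition, the latency metrics above relate to one another roughly as in the sketch below, which derives them from a list of per-token arrival times. This is an illustrative calculation only, not PES's internal implementation; the function name and field names are made up.

```python
# Illustrative only: deriving TTFT, throughput, and inter-token latency from
# per-token arrival times (seconds since the request was issued).
def summarize_stream(token_times: list[float]) -> dict:
    if not token_times:
        return {}
    deltas = [b - a for a, b in zip(token_times, token_times[1:])]
    total_time = token_times[-1]
    return {
        "ttft_s": token_times[0],                      # Time-To-First-Token
        "total_time_s": total_time,                    # total generation time
        "tokens_per_second": len(token_times) / total_time,
        # Inter-token latency: mean of per-token deltas
        "inter_token_latency_s": sum(deltas) / len(deltas) if deltas else 0.0,
    }


print(summarize_stream([0.12, 0.15, 0.18, 0.21, 0.25]))
```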
You can run your first comparative benchmark in a single command. The following will test a local mlx-lm instance against the OpenAI API for Time-To-First-Token.
```bash
pes --scenario ttft \
  --target mlx_lm --mlx-model "meta-llama/Llama-3.1-8B-Instruct" \
  --target openai --openai-model "gpt-4o" \
  --repetitions 3 \
  --prompt-lengths "32,256"
```

After the run, a new directory will be created in `runs/`, containing:

- `results.json`: The raw, detailed metrics from the run.
- `ttft_summary.csv`: A CSV file with summary statistics.
- `ttft_summary.png`: A bar chart visualizing the performance comparison.
For complex or recurring evaluations, the YAML configuration is the recommended approach.
- Create a `suite.yml` file:

```yaml
# suite.yml
run_settings:
  # Directory where all runs from this suite will be stored
  output_dir: "runs/my_benchmark_suite"

benchmarks:
  - name: "Nightly TTFT Regression (PIE vs MLX)"
    scenario: ttft
    repetitions: 7
    prompt_lengths: "32, 64, 128, 256"
    targets:
      - name: pie_local
        pie_config: "nightly"  # Fetches the latest dev build
        pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
      - name: mlx_lm
        mlx_model: "meta-llama/Llama-3.1-8B-Instruct"

  - name: "Long Context Stability Check"
    scenario: ttft
    repetitions: 1
    prompt_lengths: "4096, 8192"
    targets:
      - name: pie_local
        pie_config: "nightly"
        pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
```
- Execute the suite:

```bash
# Run the entire suite and persist results to a database
pes --config suite.yml --db-path results.db
```

This command will execute each benchmark defined in the file sequentially, creating a separate run directory for each and saving the results to the specified SQLite database.
PES is built on a few simple, powerful abstractions:
- Target: The system under test. This could be a local C++ binary (`PIELocalTarget`), a Python server (`MLXTarget`), or a remote API (`OpenAITarget`). The target is responsible for its own setup, execution, and teardown.
- Scenario: The benchmark logic. A scenario defines the specific workload to be run against a target, such as measuring TTFT (`TTFTScenario`), single-stream throughput (`ThroughputScenario`), or a holistic latency+throughput run (`LatencyAndThroughputScenario`).
- Orchestrator: The core component that manages the lifecycle of a single run, pairing a `Scenario` with a `Target`, invoking the `Monitor`, and collecting results (sketched after this list).
- Storage: The persistence layer. By default, PES saves results to JSON files, but it can be configured to use a persistent `SQLiteStorage` backend for longitudinal performance tracking.
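Conceptually, the Orchestrator wires these pieces together for a single run roughly as below. This is pseudocode for orientation only; the real class and method names live in the PES source and may differ.

```python
# Conceptual sketch of one orchestrated run (names are illustrative, not the PES API).
def run_benchmark(scenario, target, monitor, storage):
    target.setup()                             # start the engine / open the client
    monitor.start()                            # begin sampling memory (RSS) and CPU
    try:
        results = scenario.execute(target)     # drive the workload, collect timings
    finally:
        resource_stats = monitor.stop()
        target.teardown()                      # always tear the engine down cleanly
    storage.save(results, resource_stats)      # JSON by default, SQLite if configured
    return results
```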
PES includes a powerful reporting tool (`pes report`) that generates both interactive HTML visualizations and concise text summaries of your benchmark results.
- Interactive HTML Dashboards: Visualize performance trends over time with Plotly charts
- Text Summaries: Get quick, actionable insights in plain text format
- Flexible Input Sources: Read from SQLite database or directly from JSON result files
- Historical Analysis: Track performance regressions and improvements across runs
Generate comprehensive dashboards with trend analysis from your results database:
```bash
# Generate both HTML dashboard and text summary
pes report --db-path results.db \
  --output-html dashboard.html \
  --output-txt summary.txt

# Generate only the text summary for quick feedback
pes report --db-path results.db --output-txt latest.txt
```

Analyze individual benchmark runs directly without database persistence:
```bash
# Generate summary from a specific run's results.json
pes report --json-file runs/my_run/results.json \
  --output-html report.html

# Useful for rapid development iteration
pes --scenario ttft --target pie_local ... --output-dir runs/dev_test
pes report --json-file runs/dev_test/*/results.json --output-txt quick_report.txt
```

- Text summary: deterministic snapshot with batch throughput and multi-candidate headline metrics.
- Artifacts: persisted `ScenarioRun` (`scenario_run.json`) plus raw database entries for downstream analysis.
Example text summary:
```text
Run Name: 20241012_batch_throughput_on_pie_local
Scenario: batch_throughput
Targets: pie_local
Timestamp: 2024-10-12T03:14:15Z

Batch Throughput Metrics:
- Batch 1: 210.2 tok/s | makespan 485 ms | TTFT p95 92 ms | latency p95 140 ms
- Batch 8: 1,412.4 tok/s | makespan 780 ms | TTFT p95 165 ms | latency p95 310 ms
```
The `multi_candidate` scenario appends a separate section with candidate throughput and best-of overhead data.
- Purpose: the `batch_throughput` scenario measures aggregate throughput and prompt-level latency distributions across batch sizes.
- CLI example:
```bash
pes --scenario batch_throughput \
  --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
  --batch-sizes 1,4,8 --max-new-tokens 128
```

- YAML example:
```yaml
benchmarks:
  - name: "Batch Throughput Benchmark"
    scenario: batch_throughput
    batch_sizes: "1, 8, 16"
    max_new_tokens: 128
    prompt_text: "Summarize the refactor status."
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
```

- Headline metrics surfaced by reporting: `batch_tokens_per_second`, `prompt_ttft_p95_ms`, `prompt_latency_p95_ms`
- Purpose: the `multi_candidate` scenario evaluates multi-candidate (`n`) and server-side `best_of` sampling efficiency.
- CLI example:
```bash
pes --scenario multi_candidate \
  --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
  --n-values 1,2,4 --best-of-values 1,2,4 --max-new-tokens 128
```

- YAML example:
```yaml
benchmarks:
  - name: "Multi-Candidate Sweep"
    scenario: multi_candidate
    n_values: "1, 2, 4"
    best_of_values: "1, 2, 4"
    repetitions: 2
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
```

- Headline metrics: `candidate_tokens_per_second`, `best_of_overhead`, `candidate_ttft_avg_ms`, `candidate_total_time_avg_ms`
Adding new components is designed to be straightforward. To add a custom target:

- Create a new file in `pes/targets/`.
- Implement the `Target` abstract base class.
- Register your new target with the framework using a decorator.
```python
# pes/targets/my_custom_target.py
from pes.core.registry import register_target
from pes.targets.base import Target


@register_target("my_engine")
class MyEngineTarget(Target):
    # ... implement abstract methods: setup, run_generation, teardown ...
    pass
```
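For orientation, a filled-in target might look something like the sketch below. The method signatures, the returned metrics dictionary, and the hypothetical local HTTP endpoint are assumptions for illustration; consult `pes/targets/base.py` for the actual `Target` contract.

```python
# Illustrative sketch of a concrete target; signatures and endpoint are assumptions.
import time

import httpx

from pes.core.registry import register_target
from pes.targets.base import Target


@register_target("my_engine")
class MyEngineTarget(Target):
    """Benchmarks a hypothetical local engine exposed over HTTP."""

    def setup(self) -> None:
        # Connect to the engine; here we assume it is already listening locally.
        self.client = httpx.Client(base_url="http://127.0.0.1:8080", timeout=60.0)

    def run_generation(self, prompt: str, max_new_tokens: int = 128) -> dict:
        # Issue a single generation request and record a coarse wall-clock timing.
        start = time.perf_counter()
        response = self.client.post(
            "/generate",
            json={"prompt": prompt, "max_new_tokens": max_new_tokens},
        )
        response.raise_for_status()
        return {
            "text": response.json().get("text", ""),
            "total_time_s": time.perf_counter() - start,
        }

    def teardown(self) -> None:
        # Release resources so repeated runs start from a clean state.
        self.client.close()
```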
To add a custom scenario, the steps are the same:

- Create a new file in `pes/scenarios/`.
- Implement the `Scenario` abstract base class.
- Register your new scenario with the framework.
```python
# pes/scenarios/my_scenario.py
from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario


@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    # ... implement abstract methods: add_cli_args, execute, etc. ...
    pass
```
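Likewise, a concrete scenario is mostly a matter of filling in the abstract methods. The sketch below is illustrative only: the `add_cli_args`/`execute` signatures, the return shape, and the dataset format are assumptions, so check `pes/scenarios/base.py` for the real contract.

```python
# Illustrative sketch of a concrete scenario; signatures and helpers are assumptions.
import argparse
import json
import time
from pathlib import Path

from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario


@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    """Scores generations against expected answers (hypothetical workload)."""

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        # Expose scenario-specific knobs on the pes command line.
        parser.add_argument("--dataset-path", default="data/eval.jsonl")
        parser.add_argument("--max-new-tokens", type=int, default=128)

    def execute(self, target, args) -> dict:
        # Drive the target once per example and aggregate scenario-level metrics.
        examples = [
            json.loads(line)
            for line in Path(args.dataset_path).read_text().splitlines()
            if line.strip()
        ]
        correct = 0
        start = time.perf_counter()
        for example in examples:
            result = target.run_generation(example["prompt"], args.max_new_tokens)
            correct += int(example["answer"] in result["text"])
        return {
            "accuracy": correct / max(len(examples), 1),
            "wall_time_s": time.perf_counter() - start,
        }
```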
Install PES from source:

```bash
git clone https://github.com/TheProxyCompany/proxy-evaluation-suite.git
cd proxy-evaluation-suite
pip install -e .
```

Core dependencies are automatically installed, including:

- `pandas`, `plotly`, `scipy` for data analysis and visualization
- `httpx` for async HTTP client functionality
- `pydantic` for data validation
- `psutil` for system monitoring
For development:

```bash
pip install -e .[dev]
```

For longitudinal performance tracking, use the SQLite storage backend:
```bash
# Run benchmarks and save to database
pes --config suite.yml --db-path results.db

# Query historical results
sqlite3 results.db "SELECT * FROM runs WHERE scenario='ttft' ORDER BY timestamp DESC LIMIT 10"
```

PES is designed for CI/CD integration. Example GitHub Actions workflow:
```yaml
- name: Run Performance Benchmarks
  run: |
    pes --config suite/nightly.yml --db-path results.db
    pes report --db-path results.db --output-txt performance.txt

- name: Check Performance Regression
  run: |
    # Parse performance.txt and fail if regression detected
    python scripts/check_regression.py performance.txt
```
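The regression check itself is left to your pipeline. As one possible shape for `scripts/check_regression.py`, the sketch below scans the text summary for throughput figures (matching the `tok/s` lines in the example summary above) and fails the job if the best value drops below a floor stored in a hypothetical `baseline.json`:

```python
# scripts/check_regression.py (illustrative sketch)
# Assumes the text summary contains throughput lines such as
# "- Batch 8: 1,412.4 tok/s | ..." and that baseline.json stores
# {"min_tokens_per_second": <float>}. Both are assumptions, not PES contracts.
import json
import re
import sys
from pathlib import Path

TOKS_RE = re.compile(r"([\d,]+(?:\.\d+)?)\s*tok/s")


def main(summary_path: str, baseline_path: str = "baseline.json") -> int:
    text = Path(summary_path).read_text()
    values = [float(m.replace(",", "")) for m in TOKS_RE.findall(text)]
    if not values:
        print("No throughput figures found in summary; nothing to check.")
        return 0
    floor = json.loads(Path(baseline_path).read_text())["min_tokens_per_second"]
    best = max(values)
    if best < floor:
        print(f"Regression: best throughput {best:.1f} tok/s is below baseline {floor:.1f} tok/s")
        return 1
    print(f"OK: best throughput {best:.1f} tok/s (baseline {floor:.1f} tok/s)")
    return 0


if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:]))
```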