Proxy Evaluation Suite (PES)

A framework for evaluating local AI systems.


The Proxy Evaluation Suite (PES) is a command-line framework designed to provide a stable, reproducible, and extensible environment for benchmarking AI inference engines, with a primary focus on local, on-device performance on Apple Silicon.

While many tools measure raw throughput, PES is built on the philosophy that true performance is a holistic measure of the user experience. It provides the instrumentation to capture not just speed, but also resource efficiency and stability—the critical factors for local AI.

Core Philosophy & Features

  • Holistic Performance Metrics: Go beyond tokens-per-second. PES is architected to measure the complete lifecycle of a local inference task (a short sketch after this list shows how the latency metrics are derived), including:

    • Time-To-First-Token (TTFT)
    • Total Generation Time & Throughput
    • Inter-Token Latency (mean of per-token deltas)
    • Model Load Time
    • Peak & Average Memory Usage (RSS)
    • Average CPU Utilization
  • Declarative Benchmark Suites: Define complex, multi-step, comparative benchmarks in simple YAML files. This allows you to version-control your entire evaluation suite alongside your code, ensuring reproducible results and a clear history of performance over time.

  • Extensible by Design: PES is a framework, not just a script. The clear separation of concerns into Targets (the engine being tested), Scenarios (the benchmark logic), and Analyzers (the reporting logic) allows for simple, decoupled extension.

  • Local-First Focus: Natively handles the complete lifecycle of local inference servers—startup, readiness probing, execution, and robust teardown—providing an accurate picture of real-world performance.
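
To make the latency metrics above concrete, here is a minimal, illustrative sketch of how Time-To-First-Token, inter-token latency, and throughput fall out of per-token arrival timestamps. The variable names and sample values are assumptions for illustration, not PES internals.

# Illustrative only: how TTFT, inter-token latency, and throughput are derived
# from raw timestamps. Names and values below are assumptions, not PES internals.
import statistics

request_start = 0.000                          # seconds, when the prompt was submitted
token_times = [0.182, 0.214, 0.247, 0.281]     # arrival time of each generated token

ttft = token_times[0] - request_start          # Time-To-First-Token
deltas = [b - a for a, b in zip(token_times, token_times[1:])]
inter_token_latency = statistics.mean(deltas)  # mean of per-token deltas
total_time = token_times[-1] - request_start
throughput = len(token_times) / total_time     # tokens per second

print(f"TTFT {ttft * 1000:.0f} ms | ITL {inter_token_latency * 1000:.1f} ms | {throughput:.1f} tok/s")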

Quick Start

You can run your first comparative benchmark in a single command. The following will test a local mlx-lm instance against the OpenAI API for Time-To-First-Token.

pes --scenario ttft \
    --target mlx_lm --mlx-model "meta-llama/Llama-3.1-8B-Instruct" \
    --target openai --openai-model "gpt-4o" \
    --repetitions 3 \
    --prompt-lengths "32,256"

After the run, a new directory will be created in runs/, containing the following artifacts (a quick way to inspect them is sketched after the list):

  • results.json: The raw, detailed metrics from the run.
  • ttft_summary.csv: A CSV file with summary statistics.
  • ttft_summary.png: A bar chart visualizing the performance comparison.
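
For a quick first look at those artifacts, they can be loaded with the standard library and pandas; the snippet below is a sketch that only assumes valid JSON and CSV, not any particular result schema, and the run directory name is a placeholder.

# Quick look at the run artifacts. Replace <your_run_dir> with the directory
# PES created under runs/; nothing here assumes a particular result schema.
import json
import pandas as pd

run_dir = "runs/<your_run_dir>"

with open(f"{run_dir}/results.json") as f:
    raw = json.load(f)
print(f"results.json loaded ({type(raw).__name__})")

summary = pd.read_csv(f"{run_dir}/ttft_summary.csv")
print(summary.head())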

Configuration using Benchmark Suites

For complex or recurring evaluations, the YAML configuration is the recommended approach.

  1. Create a suite.yml file:

    # suite.yml
    run_settings:
      # Directory where all runs from this suite will be stored
      output_dir: "runs/my_benchmark_suite"
    
    benchmarks:
      - name: "Nightly TTFT Regression (PIE vs MLX)"
        scenario: ttft
        repetitions: 7
        prompt_lengths: "32, 64, 128, 256"
        targets:
          - name: pie_local
            pie_config: "nightly" # Fetches the latest dev build
            pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
          - name: mlx_lm
            mlx_model: "meta-llama/Llama-3.1-8B-Instruct"
    
      - name: "Long Context Stability Check"
        scenario: ttft
        repetitions: 1
        prompt_lengths: "4096, 8192"
        targets:
          - name: pie_local
            pie_config: "nightly"
            pie_model_path: "meta-llama/Llama-3.1-8B-Instruct"
  2. Execute the suite:

    # Run the entire suite and persist results to a database
    pes --config suite.yml --db-path results.db

    This command will execute each benchmark defined in the file sequentially, creating a separate run directory for each and saving the results to the specified SQLite database.

Core Concepts

PES is built on a few simple, powerful abstractions (a rough sketch of how they fit together follows this list):

  • Target: The system under test. This could be a local C++ binary (PIELocalTarget), a Python server (MLXTarget), or a remote API (OpenAITarget). The target is responsible for its own setup, execution, and teardown.
  • Scenario: The benchmark logic. A scenario defines the specific workload to be run against a target, such as measuring TTFT (TTFTScenario), single-stream throughput (ThroughputScenario), or a holistic latency+throughput run (LatencyAndThroughputScenario).
  • Orchestrator: The core component that manages the lifecycle of a single run, pairing a Scenario with a Target, invoking the Monitor, and collecting results.
  • Storage: The persistence layer. By default, PES saves results to JSON files, but can be configured to use a persistent SQLiteStorage backend for longitudinal performance tracking.
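
The sketch below shows roughly how these pieces compose. It is a conceptual illustration only; the class and method names are assumptions and do not mirror the actual PES source.

# Rough composition sketch: names and signatures here are illustrative
# assumptions, not the real PES classes.
class Orchestrator:
    def __init__(self, scenario, target, storage, monitor):
        self.scenario = scenario
        self.target = target
        self.storage = storage
        self.monitor = monitor

    def run(self):
        self.target.setup()                 # start the engine / server
        self.monitor.start()                # begin sampling CPU and RSS
        try:
            results = self.scenario.execute(self.target)  # run the workload
        finally:
            self.monitor.stop()
            self.target.teardown()          # always clean up the engine
        results["resource_usage"] = self.monitor.summary()
        self.storage.save(results)          # JSON files or SQLite backend
        return results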

Dashboard and Reporting

PES includes a powerful reporting tool (pes report) that generates both interactive HTML visualizations and concise text summaries of your benchmark results.

Dashboard Features

  • Interactive HTML Dashboards: Visualize performance trends over time with Plotly charts
  • Text Summaries: Get quick, actionable insights in plain text format
  • Flexible Input Sources: Read from SQLite database or directly from JSON result files
  • Historical Analysis: Track performance regressions and improvements across runs

Using the Dashboard

From Database (Historical Analysis)

Generate comprehensive dashboards with trend analysis from your results database:

# Generate both HTML dashboard and text summary
pes report --db-path results.db \
           --output-html dashboard.html \
           --output-txt summary.txt

# Generate only the text summary for quick feedback
pes report --db-path results.db --output-txt latest.txt

From JSON Files (Quick Development Feedback)

Analyze individual benchmark runs directly without database persistence:

# Generate summary from a specific run's results.json
pes report --json-file runs/my_run/results.json \
           --output-html report.html

# Useful for rapid development iteration
pes --scenario ttft --target pie_local ... --output-dir runs/dev_test
pes report --json-file runs/dev_test/*/results.json --output-txt quick_report.txt

Reporting Outputs

  • Text summary: deterministic snapshot with batch throughput and multi-candidate headline metrics.
  • Artifacts: persisted ScenarioRun (scenario_run.json) plus raw database entries for downstream analysis.

Example text summary:

Run Name: 20241012_batch_throughput_on_pie_local
Scenario: batch_throughput
Targets: pie_local
Timestamp: 2024-10-12T03:14:15Z

Batch Throughput Metrics:
  - Batch 1: 210.2 tok/s | makespan 485 ms | TTFT p95 92 ms | latency p95 140 ms
  - Batch 8: 1,412.4 tok/s | makespan 780 ms | TTFT p95 165 ms | latency p95 310 ms

The multi_candidate scenario appends a separate section with candidate throughput and best-of overhead data.

New Scenarios & Metrics

BatchThroughputScenario

  • Purpose: measure aggregate throughput and prompt-level latency distributions across batch sizes.
  • CLI example:
pes --scenario batch_throughput \
    --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
    --batch-sizes 1,4,8 --max-new-tokens 128
  • YAML example:
benchmarks:
  - name: "Batch Throughput Benchmark"
    scenario: batch_throughput
    batch_sizes: "1, 8, 16"
    max_new_tokens: 128
    prompt_text: "Summarize the refactor status."
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
  • Headline metrics surfaced by reporting (one plausible derivation is sketched after this list):
    • batch_tokens_per_second
    • prompt_ttft_p95_ms
    • prompt_latency_p95_ms
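
One plausible way to derive these headline numbers from per-prompt results is sketched below. The exact definitions used by PES may differ; the per-prompt structure and the nearest-rank percentile are assumptions for illustration.

# One plausible derivation of the batch headline metrics. The per-prompt
# structure and nearest-rank p95 here are assumptions, not the PES schema.
def p95(values):
    ordered = sorted(values)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]  # nearest-rank p95

prompts = [
    {"tokens": 128, "ttft_ms": 88.0, "latency_ms": 131.0},
    {"tokens": 128, "ttft_ms": 95.0, "latency_ms": 142.0},
    {"tokens": 128, "ttft_ms": 91.0, "latency_ms": 138.0},
]
makespan_s = 0.485  # wall-clock time for the whole batch

batch_tokens_per_second = sum(p["tokens"] for p in prompts) / makespan_s
prompt_ttft_p95_ms = p95([p["ttft_ms"] for p in prompts])
prompt_latency_p95_ms = p95([p["latency_ms"] for p in prompts])
print(batch_tokens_per_second, prompt_ttft_p95_ms, prompt_latency_p95_ms)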

MultiCandidateScenario

  • Purpose: evaluate multi-candidate (n) and server-side best_of sampling efficiency.
  • CLI example:
pes --scenario multi_candidate \
    --target pie_local --pie-config $PIE_DEV_PATH --pie-model-path meta-llama/Llama-3.1-8B-Instruct \
    --n-values 1,2,4 --best-of-values 1,2,4 --max-new-tokens 128
  • YAML example:
benchmarks:
  - name: "Multi-Candidate Sweep"
    scenario: multi_candidate
    n_values: "1, 2, 4"
    best_of_values: "1, 2, 4"
    repetitions: 2
    targets:
      - name: pie_local
        pie_config: $PIE_DEV_PATH
        pie_model_path: meta-llama/Llama-3.1-8B-Instruct
  • Headline metrics:
    • candidate_tokens_per_second
    • best_of_overhead
    • candidate_ttft_avg_ms
    • candidate_total_time_avg_ms

Extending PES

Adding new components is designed to be straightforward.

Adding a New Target

  1. Create a new file in pes/targets/.
  2. Implement the Target abstract base class.
  3. Register your new target with the framework using a decorator.
# pes/targets/my_custom_target.py
from pes.core.registry import register_target
from pes.targets.base import Target

@register_target("my_engine")
class MyEngineTarget(Target):
    # ... implement abstract methods: setup, run_generation, teardown ...
    pass
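
A slightly fuller version of the same hypothetical target is sketched below. The method signatures, return shapes, and HTTP endpoint are assumptions based on the method names mentioned above, not the actual Target interface.

# Hypothetical fleshed-out target: signatures and return shapes are assumptions
# based on the method names above, not the real base class contract.
import subprocess
import httpx

from pes.core.registry import register_target
from pes.targets.base import Target

@register_target("my_engine")
class MyEngineTarget(Target):
    def setup(self):
        # Launch the engine and (readiness probing omitted) wait until it serves.
        self.proc = subprocess.Popen(["my-engine", "serve", "--port", "8081"])

    def run_generation(self, prompt, max_new_tokens):
        # Send one generation request and return the metrics the scenario collects.
        resp = httpx.post(
            "http://localhost:8081/generate",
            json={"prompt": prompt, "max_new_tokens": max_new_tokens},
            timeout=120.0,
        )
        resp.raise_for_status()
        return resp.json()

    def teardown(self):
        # Always stop the engine, even if the run failed.
        self.proc.terminate()
        self.proc.wait(timeout=10)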

Adding a New Scenario

  1. Create a new file in pes/scenarios/.
  2. Implement the Scenario abstract base class.
  3. Register your new scenario with the framework.
# pes/scenarios/my_scenario.py
from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario

@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    # ... implement abstract methods: add_cli_args, execute, etc. ...
    pass
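
As with targets, a fuller scenario might look like the sketch below. The argument names and the execute() contract are assumptions based on the comment above, not the real Scenario interface.

# Hypothetical scenario body: argument names and the execute() contract are
# assumptions, not the real Scenario interface.
from pes.core.registry import register_scenario
from pes.scenarios.base import Scenario

@register_scenario("accuracy_test")
class AccuracyScenario(Scenario):
    @classmethod
    def add_cli_args(cls, parser):
        # Expose scenario-specific knobs on the pes CLI.
        parser.add_argument("--dataset-path", required=True)

    def execute(self, target):
        # Drive the target with the workload and return collected metrics.
        results = []
        for prompt in ["2 + 2 =", "Capital of France?"]:
            results.append(target.run_generation(prompt, max_new_tokens=16))
        return {"scenario": "accuracy_test", "samples": results}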

Installation

From Source

git clone https://github.com/TheProxyCompany/proxy-evaluation-suite.git
cd proxy-evaluation-suite
pip install -e .

Dependencies

Core dependencies are automatically installed, including:

  • pandas, plotly, scipy for data analysis and visualization
  • httpx for async HTTP client functionality
  • pydantic for data validation
  • psutil for system monitoring

For development:

pip install -e ".[dev]"

Advanced Usage

Database Persistence

For longitudinal performance tracking, use the SQLite storage backend:

# Run benchmarks and save to database
pes --config suite.yml --db-path results.db

# Query historical results
sqlite3 results.db "SELECT * FROM runs WHERE scenario='ttft' ORDER BY timestamp DESC LIMIT 10"

Continuous Integration

PES is designed for CI/CD integration. Example GitHub Actions workflow:

- name: Run Performance Benchmarks
  run: |
    pes --config suite/nightly.yml --db-path results.db
    pes report --db-path results.db --output-txt performance.txt

- name: Check Performance Regression
  run: |
    # Parse performance.txt and fail if regression detected
    python scripts/check_regression.py performance.txt
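
scripts/check_regression.py is your own gating script. One minimal sketch, assuming throughput figures appear in the text summary as "<number> tok/s" and using a hard-coded threshold, could look like this; adapt the parsing and threshold to your summary format and hardware.

# scripts/check_regression.py: a minimal, illustrative CI gate.
# Assumes throughput appears in the summary as "<number> tok/s".
import re
import sys

THRESHOLD_TOK_S = 150.0  # example floor; tune to your hardware and model

def main(path):
    text = open(path).read()
    rates = [float(m.replace(",", "")) for m in re.findall(r"([\d,.]+)\s*tok/s", text)]
    if not rates:
        sys.exit(f"No throughput figures found in {path}")
    worst = min(rates)
    if worst < THRESHOLD_TOK_S:
        sys.exit(f"Regression: {worst:.1f} tok/s is below the {THRESHOLD_TOK_S} tok/s floor")
    print(f"OK: worst throughput {worst:.1f} tok/s")

if __name__ == "__main__":
    main(sys.argv[1])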
