# Alloy Gateway Benchmarks

This folder contains benchmarking tools and results for measuring the Alloy AI Gateway's performance overhead.

## Quick Start

```bash
# 1. Start the Alloy server (from project root)
OPENAI_API_KEY=your-key cargo run --release --bin alloy -- --port 3000

# 2. Run the gateway latency benchmark
cd benchmarks
rustc -O latency_bench.rs -o latency_bench && ./latency_bench

# 3. Run the end-to-end comparison (requires OPENAI_API_KEY)
./e2e_bench.sh
```

## Benchmark Scripts

### 1. `latency_bench.rs` - Gateway Latency Measurement

Measures the pure gateway overhead using persistent HTTP connections. This gives the most accurate measurement of how much latency Alloy adds.

```bash
rustc -O latency_bench.rs -o latency_bench
./latency_bench
```

What it measures:

- TCP write (send request)
- Gateway routing and processing
- Response serialization
- TCP read (receive response)
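The measurement loop behind these steps can be sketched with the standard library alone. A local stub thread stands in for the gateway here (the real benchmark targets the running Alloy server's health endpoint), and the stub assumes each small loopback request arrives in a single read:

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::time::Instant;

const RESPONSE: &[u8] =
    b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: keep-alive\r\n\r\nok";

/// Round-trip latency in microseconds for `n` requests sent over a single
/// persistent connection to a local stub server.
fn measure(n: usize) -> Vec<u64> {
    let listener = TcpListener::bind("127.0.0.1:0").unwrap();
    let addr = listener.local_addr().unwrap();
    // Stub "gateway": answers every request on one persistent connection.
    std::thread::spawn(move || {
        let (mut conn, _) = listener.accept().unwrap();
        let mut buf = [0u8; 1024];
        while let Ok(len) = conn.read(&mut buf) {
            if len == 0 {
                break; // client closed the connection
            }
            conn.write_all(RESPONSE).unwrap();
        }
    });

    let mut stream = TcpStream::connect(addr).unwrap();
    stream.set_nodelay(true).unwrap(); // don't let Nagle batch the tiny writes
    let request = format!("GET /health HTTP/1.1\r\nHost: {addr}\r\n\r\n");
    let mut resp = vec![0u8; RESPONSE.len()];
    let mut samples = Vec::with_capacity(n);
    for _ in 0..n {
        let start = Instant::now();
        stream.write_all(request.as_bytes()).unwrap(); // TCP write
        stream.read_exact(&mut resp).unwrap();         // TCP read
        samples.push(start.elapsed().as_micros() as u64);
    }
    samples
}

fn main() {
    let mut samples = measure(1000);
    samples.sort_unstable();
    println!(
        "min {} µs, median {} µs, max {} µs",
        samples[0],
        samples[samples.len() / 2],
        samples[samples.len() - 1]
    );
}
```

Because the connection stays open, only the four steps listed above land inside the timed region; TCP and TLS handshakes never appear in the samples.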

### 2. `e2e_bench.sh` - End-to-End Comparison

Compares direct OpenAI API calls with requests routed through Alloy to measure real-world overhead.

```bash
export OPENAI_API_KEY=your-key
./e2e_bench.sh
```

### 3. `micro_bench.rs` - Component-Level Benchmarks

Measures individual operations (JSON parsing, serialization, lookups) to identify optimization opportunities.

```bash
rustc -O micro_bench.rs -o micro_bench
./micro_bench
```
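The general shape of such a micro-benchmark is a tight timed loop with the optimizer prevented from deleting the work. A standard-library-only sketch, with float parsing standing in for the real workloads (JSON parsing, serialization, lookups):

```rust
use std::hint::black_box;
use std::time::Instant;

/// Run `op` for `iters` iterations and return the average ns per operation.
fn bench<F: FnMut()>(iters: u32, mut op: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        op();
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    let payload = "1735.5";
    // black_box hides the input and output from the optimizer so the
    // parse cannot be hoisted out of the loop or removed entirely.
    let ns = bench(1_000_000, || {
        let v: f64 = black_box(payload).parse().unwrap();
        black_box(v);
    });
    println!("parse f64: {ns:.1} ns/op");
}
```

Per the tips below, the iteration count matters: a million iterations amortizes timer resolution and smooths scheduler noise.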

### 4. `memory_bench.sh` - Memory Usage Benchmark

Measures memory footprint at rest and under load.

```bash
./memory_bench.sh
```

What it measures:

- Binary size on disk
- Baseline memory (RSS) at rest
- Memory growth under concurrent load
- Memory stability after chat requests
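On Linux, the resident set size (RSS) this script reports can also be read programmatically from `/proc`; a sketch (the mechanism is an assumption for illustration, the script itself may use `ps` or another tool):

```rust
use std::fs;

/// Resident set size in kB for `pid`, parsed from /proc/<pid>/status.
/// Linux only; returns None where /proc is unavailable (e.g. macOS).
fn rss_kb(pid: u32) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:")) // "VmRSS:   10392 kB"
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    match rss_kb(std::process::id()) {
        Some(kb) => println!("RSS: {:.2} MB", kb as f64 / 1024.0),
        None => println!("VmRSS unavailable (non-Linux system?)"),
    }
}
```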

## Results

### Gateway Latency (Health Endpoint)

| Metric | Value |
| --- | --- |
| Min | 15.92 µs |
| Median | 23.63 µs |
| Mean | 25.29 µs |
| P95 | 37.79 µs |
| P99 | 56.96 µs |
| Max | 121.33 µs |
| Throughput | 39,540 req/s |
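The percentile rows above can be reproduced from the raw samples with a nearest-rank calculation over the sorted data; a sketch (the benchmark itself may round or interpolate differently):

```rust
/// Nearest-rank percentile over already-sorted samples.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    let idx = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

fn main() {
    let mut samples = vec![30u64, 16, 24, 57, 121, 22, 38, 25, 23, 26];
    samples.sort_unstable(); // percentiles are defined over sorted data
    println!("p50 = {} µs", percentile(&samples, 50.0)); // → p50 = 26 µs
    println!("p95 = {} µs", percentile(&samples, 95.0)); // → p95 = 121 µs
}
```

Note how a single 121 µs outlier dominates P95 and up in a small sample, which is why the benchmark runs at least 1,000 iterations.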

### End-to-End LLM Requests

| Route | Avg Latency |
| --- | --- |
| Direct to OpenAI | ~1,112 ms |
| Through Alloy | ~817 ms |
| Gateway Overhead | ~24 µs (0.003%) |

Note that the Alloy route happened to measure *faster* than the direct route here: run-to-run network variance is orders of magnitude larger than the gateway's own overhead, so the two routes are statistically indistinguishable.

### Memory Usage

| Metric | Value |
| --- | --- |
| Binary size | 22.22 MB |
| Baseline memory (idle) | 10.15 MB |
| Under load (1,000 reqs) | ~11.2 MB |
| Memory growth | ~1 MB |

### Latency Breakdown

```
┌────────────────────────────────────────────────────────────────┐
│                  Typical LLM Request (~800ms)                  │
├────────────────────────────────────────────────────────────────┤
│ Alloy │  Network to OpenAI  │     OpenAI Processing            │
│ 24µs  │      ~200ms         │        ~600ms                    │
│ 0.003%│      (~25%)         │        (~75%)                    │
└────────────────────────────────────────────────────────────────┘
```

## Interpreting Results

- Gateway overhead is negligible (~24 µs) compared to LLM API latency (~800 ms)
- The overhead percentage is < 0.01% of total request time
- Network variance between runs is larger than the gateway overhead itself
- The gateway achieves ~40K requests/second throughput for health checks
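The overhead percentage in the first two points is plain arithmetic over the measured values:

```rust
fn main() {
    let gateway_overhead_us = 24.0; // measured gateway latency, µs
    let llm_request_ms = 800.0;     // typical end-to-end LLM call, ms
    // Convert ms → µs, then express the overhead as a percentage.
    let pct = gateway_overhead_us / (llm_request_ms * 1000.0) * 100.0;
    println!("overhead: {pct:.3}% of total request time"); // → overhead: 0.003%
}
```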

## Test Environment

- **Hardware:** Apple Silicon (M-series) / x86_64
- **OS:** macOS / Linux
- **Rust:** 1.75+ (release build with optimizations)
- **Server:** Alloy running in release mode

## Tips for Accurate Benchmarking

1. **Always use release builds:** `cargo build --release`
2. **Warm up the server:** run a few hundred requests before measuring
3. **Reuse connections:** persistent connections eliminate TCP handshake overhead
4. **Run multiple iterations:** at least 1,000 for statistical significance
5. **Minimize background processes:** close other applications while benchmarking