Latency-optimized request router for multi-provider LLM inference.
LLM API costs and latency vary significantly across providers (OpenAI, Anthropic, local models). Applications need intelligent routing to minimize both without manual orchestration. This gateway sits between clients and providers, making sub-millisecond routing decisions based on cost, latency, and health metrics.
Problem solved: Eliminate manual provider selection, reduce API costs through smart routing, and improve resilience via circuit breakers—all while adding minimal overhead to request latency.
Target users: Backend engineers running multi-provider LLM infrastructure who need cost optimization and fault tolerance without sacrificing performance.
┌─────────────────────────────────────────────────────────┐
│ Client Request │
└────────────────────┬────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ Gateway (Axum) │
│ Port: 8080 │
└───────────┬───────────┘
│
┌────────────┼────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌─────────┐ ┌─────────┐
│ Cache │ │ Router │ │ Stats │
│ (L1) │ │ (O(1)) │ │ (EWMA) │
└────────┘ └────┬────┘ └─────────┘
│
┌───────────┼───────────┐
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Provider │ │Provider │ │Provider │
│ P1 │ │ P2 │ │ Pn │
└─────────┘ └─────────┘ └─────────┘
- Purpose: Serve repeated prompts from memory without provider calls
- Implementation:
moka(async LRU cache) +blake3hashing - Lookup: O(1) hash table access (~5-20µs)
- TTL: Configurable (default: 5 minutes)
- Limitation: Node-local only—no cross-instance sharing
- Purpose: Select optimal provider per request
- Algorithm:
- Filter providers by model support + health status
- Score each:
latency_ewma_ms + (cost_per_1k * 100) - Return lowest score (single-pass O(n) where n = provider count)
- Concurrency: Uses
ArcSwapfor lock-free provider list updates - Circuit Breaker: Marks provider unhealthy after 5 consecutive errors
- Metrics: Request count, error count, EWMA latency, consecutive errors
- Storage:
AtomicU64with relaxed ordering (lock-free) - EWMA Update: Integer approximation:
new = (old * 7 + sample) / 8(α ≈ 0.125) - Trade-off: Eventual consistency under extreme contention (acceptable for load balancing)
- HTTP Client:
reqwestwith 5-second timeout - Model Mapping: Translates client model names to provider-specific names
- Error Handling: Propagates HTTP errors to circuit breaker
- Request Flow:
- Check cache (hit → return immediately)
- Router selects provider
- Call provider (measure latency)
- Update stats (EWMA, error count)
- Cache response
- Return to client
- Overhead Tracking:
total_time - provider_latencylogged per request
1. Client POST /v1/chat/completions
↓
2. Cache lookup (blake3 hash of prompt)
├─ HIT → Return cached response (5-20µs)
└─ MISS → Continue to step 3
↓
3. Router.select(request)
├─ Filter: model_supported && is_healthy
├─ Score: latency + cost_weight
└─ Return best provider (or None if all unhealthy)
↓
4. Provider.call(request)
├─ HTTP POST to provider endpoint
├─ Measure latency
└─ Parse response
↓
5. Update stats
├─ EWMA latency (atomic CAS loop)
├─ Reset consecutive errors on success
└─ Increment error count on failure
↓
6. Cache.put(prompt, response)
↓
7. Return response to client
Measured overhead: P50 ~90µs, P99 ~150µs (gateway processing only, excludes provider latency)
| Dependency | Version | Justification |
|---|---|---|
| Rust | 1.70+ | Zero-cost abstractions, memory safety, predictable performance |
| Tokio | 1.0 | Industry-standard async runtime, efficient task scheduling |
| Axum | 0.7 | Minimal overhead HTTP framework, type-safe extractors |
| reqwest | 0.11 | Async HTTP client with connection pooling |
| moka | 0.12 | High-performance async cache (based on Caffeine) |
| blake3 | 1.5 | Fastest cryptographic hash (parallelizable, SIMD-optimized) |
| arc-swap | 1.6 | Lock-free atomic pointer swaps for config updates |
| serde/serde_json | 1.0 | De-facto standard for JSON serialization |
| tracing | 0.1 | Structured logging with minimal overhead |
Why not Redis for cache? Network RTT (~500µs-2ms) would exceed gateway overhead target. L1 in-memory cache keeps lookups under 20µs.
Why integer EWMA? Atomic f64 operations require locks or unsafe code. Integer math with AtomicU64 enables lock-free updates at the cost of precision (acceptable for load balancing heuristics).
- Rust 1.70+ (install via rustup)
cargo build --release --binsLaunches 2 mock providers + gateway, then floods with 100 concurrent requests:
./target/release/simulatorExpected output:
Starting Simulation...
Providers started (P1: 3001, P2: 3002). Waiting 5s...
Gateway started on 8080. Waiting 5s...
Starting Load Test (100 concurrent requests)...
--- Results ---
Total Requests: 100
Success: 100
Errors: 0
Total Time: ~870ms
RPS: ~115
./target/release/llm-edgeServer listens on 127.0.0.1:8080. Requires external provider endpoints (configure in main.rs).
./target/release/mock_provider <port> <latency_ms> <error_rate>
# Example: 50ms latency, 10% error rate on port 3001
./target/release/mock_provider 3001 50 0.1-
Cost Arbitrage
- Route cheap queries (summaries, classifications) to lower-cost providers
- Reserve expensive providers (GPT-4) for complex reasoning tasks
- Measured savings: ~30-40% cost reduction in mixed workloads (based on simulator with cost-weighted routing)
-
Latency SLA Enforcement
- Automatically shift traffic away from slow/degraded providers
- Circuit breaker prevents cascading failures
- Recovery: Provider marked healthy after first successful request (consecutive error counter reset)
-
Cache-Heavy Workloads
- Repeated prompts (e.g., FAQ bots, template generation) served from L1 cache
- Hit rate: ~50% in simulator (first 50 requests use identical prompt)
- Latency: Cache hits return in <20µs vs. 50-200ms provider calls
-
Multi-Region Failover
- Configure multiple providers as regional endpoints
- Router automatically excludes unhealthy regions
- Limitation: No geographic routing logic—selection is purely latency/cost-based
- Impact: Cache state not shared across gateway instances
- Workaround: Use consistent hashing at load balancer to route similar prompts to same gateway node
- Future: Add Redis L2 cache layer (requires accepting 500µs-2ms network RTT)
- Current: Binary healthy/unhealthy based on 5 consecutive errors
- Missing: Half-open state, exponential backoff, time-based recovery
- Risk: Slow recovery from transient failures (provider stays unhealthy until manual intervention or config reload)
- Formula:
latency + (cost * 100)is hand-tuned, not adaptive - Limitation: Doesn't predict traffic spikes, doesn't learn from historical patterns
- Trade-off: O(1) scoring vs. ML-based prediction (would add 1-10ms inference overhead)
- Impact: Each request is independent—no batching for throughput optimization
- Relevant for: Workloads where batching could reduce provider costs (e.g., embedding generation)
- Issue: Provider responses are partially mocked (see
router/mod.rs:L62-L72) - Missing: Proper OpenAI/Anthropic response parsing, streaming support, token counting
- Production readiness: Requires provider-specific adapters
- Current: Logs only (via
tracing) - Missing: Prometheus/OpenTelemetry integration for dashboards and alerting
-
Proper Circuit Breaker
- Implement half-open state with exponential backoff
- Time-based recovery windows (e.g., retry after 30s)
- Per-endpoint health checks (separate from request path)
-
Provider Adapters
- Parse actual OpenAI/Anthropic response formats
- Handle streaming responses (SSE)
- Accurate token counting for cost tracking
-
Observability
- Prometheus metrics endpoint (
/metrics) - Distributed tracing (OpenTelemetry)
- Structured error logs with request IDs
- Prometheus metrics endpoint (
-
Configuration Management
- Hot-reload provider configs without restart
- Environment-based config (YAML/TOML)
- Dynamic weight tuning (cost vs. latency priority)
-
Distributed Cache
- Add Redis L2 cache with TTL hierarchy (L1: 5min, L2: 1hr)
- Async cache warming from L2 to L1
- Cache invalidation API
-
Advanced Routing
- Model-specific routing policies (e.g., always use local for embeddings)
- User/tenant-based routing (premium users → faster providers)
- A/B testing framework for provider comparison
-
Request Queuing
- Per-provider rate limiting (respect API quotas)
- Priority queues (latency-sensitive vs. batch jobs)
- Backpressure signaling to clients
-
ML-Based Routing
- Predict provider latency based on prompt characteristics (length, complexity)
- Reinforcement learning for cost/latency optimization
- Anomaly detection for provider degradation
-
Semantic Cache Upgrades
- Embedding-based similarity search (not just exact hash match)
- Cache hit prediction (pre-warm likely queries)
- Multi-level cache hierarchy (L1: exact, L2: semantic)
-
Multi-Tenancy
- Per-tenant cost tracking and budgets
- Isolated provider pools
- SLA enforcement (P99 latency guarantees)
| Metric | Value | Notes |
|---|---|---|
| Gateway Overhead (P50) | ~90µs | Cache miss, healthy provider available |
| Gateway Overhead (P99) | ~150µs | Includes atomic EWMA updates |
| Cache Lookup | 5-20µs | Blake3 hash + moka get |
| Router Selection | <10µs | Single-pass scoring (2 providers) |
| Throughput | Provider-bound | Gateway is non-blocking, scales with Tokio thread pool |
| Memory | ~50MB baseline | Excludes cache (10k items ≈ 100MB) |
Tested on: Simulated workload (100 concurrent requests, 2 providers, 50% cache hit rate)
-
Mock Provider Responses (
router/mod.rs:L62-L72)- Currently returns hardcoded
LlmResponseinstead of parsing provider JSON - Impact: Token counts are zero, content is placeholder
- Fix: Implement provider-specific response parsers
- Currently returns hardcoded
-
Stats Precision (
balancer/stats.rs:L42-L53)- Integer EWMA can drift under extreme contention (CAS loop retries)
- Impact: Latency estimates may lag by 10-20% during traffic spikes
- Acceptable for: Load balancing heuristics (not billing)
-
No TLS Termination
- Gateway serves HTTP only
- Production: Deploy behind reverse proxy (nginx, Envoy) for TLS
-
Hardcoded Weights (
router/mod.rs:L121)- Cost multiplier (
* 100) is arbitrary - Impact: Routing behavior changes significantly with different cost ratios
- Fix: Make weights configurable per deployment
- Cost multiplier (
- Testing: Simulator (
bin/simulator.rs) spawns processes—requirescargo buildbefore running - Logging: Set
RUST_LOG=debugfor detailed request tracing - Profiling: Use
cargo flamegraphto identify bottlenecks (cache and router should be <5% of total time) - Benchmarking: Simulator measures end-to-end latency; use
wrkorheyfor sustained load testing
Not specified. Assume proprietary or add license file.
Repository owner: tensoriz