feat: prompt cache-aware routing — prefer providers with warm caches #124
Open
Labels
enhancement (New feature or request)
Description
Problem
The proxy already does raw body passthrough to preserve upstream prompt caching, but routing is static (weight-based). Two providers serving the same model may have very different cache hit rates, yet traffic doesn't favor the one with the warmer cache.
Proposal
Track per-provider cache hit/miss signals from response headers (e.g., Anthropic's x-cache-hit, token usage cache_read_input_tokens) and use this to influence routing:
- Record cache hit rates per `(provider, model)` pair
- When multiple providers can serve a request, boost the routing weight of the provider with the higher cache hit rate
- This directly reduces input token costs and latency
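A minimal sketch of the tracking side, in Python. All names here (`CacheHitTracker`, the `ALPHA` smoothing factor) are hypothetical, not existing proxy code; the `usage` fields follow Anthropic's response shape, where `cache_read_input_tokens` counts tokens served from the prompt cache and `input_tokens` counts uncached input tokens. An exponentially weighted moving average keeps the signal fresh without storing per-request history:

```python
from collections import defaultdict

ALPHA = 0.2  # EWMA smoothing factor (illustrative choice)

class CacheHitTracker:
    """Tracks an exponentially weighted cache hit rate per (provider, model)."""

    def __init__(self) -> None:
        # (provider, model) -> smoothed hit rate in [0, 1]
        self.hit_rate: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, provider: str, model: str, usage: dict) -> None:
        """Update the hit rate from one response's token usage."""
        cached = usage.get("cache_read_input_tokens", 0)
        total = cached + usage.get("input_tokens", 0)
        hit = cached / total if total else 0.0
        key = (provider, model)
        self.hit_rate[key] = (1 - ALPHA) * self.hit_rate[key] + ALPHA * hit
```

A token-weighted ratio (rather than a binary hit/miss flag) naturally rewards providers that serve a large cached prefix, which is what drives the cost and latency savings.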
Expected Benefit
- Lower latency on cache hits (no re-processing of cached prefix)
- Reduced token costs (cache reads are cheaper)
- Smarter traffic distribution without manual weight tuning
Related
Part of the stability and performance exploration comparing this proxy against direct Claude Code CLI → LLM gateway connections.