feat: prompt cache-aware routing — prefer providers with warm caches #124

@kianwoon

Description

Problem

The proxy already does raw body passthrough to preserve upstream prompt caching, but routing is static (weight-based). Two providers serving the same model may have very different cache hit rates, but traffic doesn't favor the one with the warmer cache.

Proposal

Track per-provider cache hit/miss signals from response headers (e.g., Anthropic's x-cache-hit, token usage cache_read_input_tokens) and use this to influence routing:

  1. Record cache hit rates per (provider, model) pair
  2. When multiple providers can serve a request, boost routing weight for the provider with higher cache hit rate
  3. This directly reduces input token costs and latency
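The tracking-plus-boost scheme above could be sketched roughly as follows. This is a minimal illustration, not the proxy's actual code: the `CacheAwareRouter` class, its method names, and the candidate/weight shapes are all hypothetical; only the `cache_read_input_tokens` usage field mirrors Anthropic's response format.

```python
import random
from collections import defaultdict

class CacheAwareRouter:
    """Hypothetical sketch: track per-(provider, model) cache hit rates
    via an exponential moving average and bias weighted routing toward
    providers with warmer caches."""

    def __init__(self, alpha: float = 0.2, boost: float = 2.0):
        self.alpha = alpha  # EWMA smoothing factor for hit-rate updates
        self.boost = boost  # max weight multiplier at a 100% hit rate
        # (provider, model) -> smoothed cache hit rate in [0, 1]
        self.hit_rate = defaultdict(float)

    def record(self, provider: str, model: str, usage: dict) -> None:
        """Update the hit rate from one response's token usage.
        Treats the response as a cache hit when
        cache_read_input_tokens > 0 (Anthropic-style usage block)."""
        hit = 1.0 if usage.get("cache_read_input_tokens", 0) > 0 else 0.0
        key = (provider, model)
        self.hit_rate[key] = (1 - self.alpha) * self.hit_rate[key] + self.alpha * hit

    def pick(self, candidates: list[tuple[str, float]], model: str) -> str:
        """Weighted random choice among (provider, static_weight) pairs,
        scaling each static weight by 1 + (boost - 1) * hit_rate."""
        weights = [
            w * (1 + (self.boost - 1) * self.hit_rate[(p, model)])
            for p, w in candidates
        ]
        return random.choices([p for p, _ in candidates], weights=weights, k=1)[0]
```

With `boost=2.0`, a provider with a fully warm cache gets at most twice its configured weight, so static weights still dominate until real hit-rate evidence accumulates; the EWMA also lets a provider's advantage decay if its cache goes cold.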

Expected Benefit

  • Lower latency on cache hits (no re-processing of cached prefix)
  • Reduced token costs (cache reads are cheaper)
  • Smarter traffic distribution without manual weight tuning

Related

Part of the stability & performance exploration comparing this proxy against direct Claude Code CLI → LLM gateway connections.

Metadata

Labels: enhancement (New feature or request)
