feat: request deduplication / response coalescing #126

@kianwoon

Problem

When Claude Code sends overlapping requests (e.g., multiple tool calls hitting the same model with identical payloads), each request independently hits the upstream provider. This wastes rate-limit quota and tokens, and adds avoidable latency.

Proposal

Detect identical request payloads within a short time window and share the upstream response:

  1. Hash the request body (excluding the model field) to create a dedup key
  2. If a request with the same key is already in-flight, attach the new client as an additional subscriber
  3. When the upstream response arrives, stream it to all subscribers
  4. Keep a TTL-based cache of completed responses for very recent identical requests (e.g., a 1-2 s window)
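
To make steps 1, 2, and 4 concrete, here is a minimal sketch, not a proposed implementation. The names `stableStringify`, `dedupKey`, `Coalescer`, and `fetchUpstream` are hypothetical, and SHA-256, the 1500 ms TTL, and the 256-entry cap are assumptions. It coalesces whole (non-streaming) responses as Promises. Because the key ignores `model`, a real implementation would probably need to confirm that requests sharing a key are also routed to the same upstream model.

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so that key order does not affect the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => stableStringify(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// Step 1: hash the request body without its `model` field.
function dedupKey(body: Record<string, unknown>): string {
  const { model: _ignored, ...rest } = body;
  return createHash("sha256").update(stableStringify(rest)).digest("hex");
}

interface Entry<T> {
  promise: Promise<T>;
  expiresAt: number; // Infinity while the request is still in flight
}

// Hypothetical coalescer: one upstream call per key, shared by all callers.
class Coalescer<T> {
  private entries = new Map<string, Entry<T>>();
  private ttlMs: number;
  private maxEntries: number;

  constructor(ttlMs = 1500, maxEntries = 256) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
  }

  run(key: string, fetchUpstream: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    // Steps 2 and 4: reuse an in-flight or recently completed response.
    if (hit && hit.expiresAt > Date.now()) return hit.promise;

    const promise = fetchUpstream();
    const entry: Entry<T> = { promise, expiresAt: Infinity };
    this.entries.delete(key); // drop an expired entry so the new one is newest
    this.entries.set(key, entry);

    // Keep the map bounded by evicting the oldest inserted key.
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }

    promise.then(
      () => {
        entry.expiresAt = Date.now() + this.ttlMs; // TTL starts on completion
      },
      () => {
        // Failures are not cached, so the next caller retries upstream.
        if (this.entries.get(key) === entry) this.entries.delete(key);
      },
    );
    return promise;
  }
}
```

A handler could then call `coalescer.run(dedupKey(body), () => callProvider(body))`, where `callProvider` stands in for whatever function actually contacts the upstream provider. Every caller still has to handle rejection itself, because all subscribers share the same Promise.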

Important considerations:

  • Must handle streaming responses (e.g., feed the upstream response into a separate PassThrough stream for each subscriber)
  • Should be opt-in via config, since coalescing requests that have non-idempotent side effects could cause surprises
  • Cache size must be bounded
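
The streaming consideration above could look roughly like the following sketch, which assumes a Node.js runtime. The `Broadcast` class and its `subscribe` method are hypothetical names. It gives each subscriber its own `PassThrough` and replays the chunks already received, so a client that attaches mid-stream still sees the full response.

```typescript
import { PassThrough, Readable } from "node:stream";

// Hypothetical fan-out: one upstream stream, many subscriber streams.
class Broadcast {
  private chunks: Buffer[] = []; // replay buffer for late subscribers
  private subscribers = new Set<PassThrough>();
  private ended = false;
  private failure: Error | null = null;

  constructor(upstream: Readable) {
    upstream.on("data", (chunk: Buffer | string) => {
      const buf = typeof chunk === "string" ? Buffer.from(chunk) : chunk;
      this.chunks.push(buf);
      for (const sub of this.subscribers) sub.write(buf);
    });
    upstream.on("end", () => {
      this.ended = true;
      for (const sub of this.subscribers) sub.end();
      this.subscribers.clear();
    });
    upstream.on("error", (err: Error) => {
      this.failure = err;
      for (const sub of this.subscribers) sub.destroy(err);
      this.subscribers.clear();
    });
  }

  // Returns an independent stream carrying the whole response from the start.
  subscribe(): PassThrough {
    const out = new PassThrough();
    for (const buf of this.chunks) out.write(buf); // replay what was missed
    if (this.failure) {
      out.destroy(this.failure);
    } else if (this.ended) {
      out.end();
    } else {
      this.subscribers.add(out);
      // Stop writing to a subscriber whose client has gone away.
      out.on("close", () => this.subscribers.delete(out));
    }
    return out;
  }
}
```

This sketch ignores backpressure and holds the whole response in memory until the `Broadcast` is dropped, so a real version would need a size cap in line with the bounded-cache requirement.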

Expected Benefit

  • Reduced upstream API costs (fewer duplicate requests)
  • Lower latency for duplicate requests (piggyback on in-flight response)
  • Better rate limit utilization

Related

Part of stability & performance exploration vs direct Claude Code CLI → LLM gateway connections.

Metadata

Labels

enhancement (New feature or request)
