
Cache optimization: system prompt instability + context distribution #75

@nexus-marbell

Description


Findings from cache diagnostics (#74)

What we learned

  1. System prompt hash changes every request: cbc86cf7 → dbd72931 → a304d4a0 across 3 consecutive requests. Same token count (~8,216) but different content. Something non-deterministic is being injected per turn.

  2. Tools are stable: tools_hash=02f1e5f2, consistent across all requests. ~26k tokens. Not the cache-breaking culprit.

  3. System prompt is only ~8k tokens — Claude Code's full context load (rules, skills, memory, agents, CLAUDE.md — estimated 50k+) is NOT in the system prompt field. It's likely injected as user/system-reminder messages throughout the input array.

  4. Cache hits are 64-1,216 tokens out of 94k+ per request (~0.1-1.3%). The x-grok-conv-id header is working but there's almost nothing stable to cache.
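The hash comparison above can be reproduced with a small diagnostic. A sketch, assuming the outgoing payload carries `system` and `tools` fields (field names are illustrative, not confirmed from the bridge):

```python
import hashlib
import json

def short_hash(obj) -> str:
    """First 8 hex chars of a SHA-256 over a canonical JSON dump."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:8]

def diagnose(payload: dict) -> dict:
    """Hash the stable-candidate fields of one outgoing request."""
    return {
        "system_hash": short_hash(payload.get("system", "")),
        "tools_hash": short_hash(payload.get("tools", [])),
    }
```

Logging these two values per request makes the pattern obvious: tools_hash stays fixed while system_hash churns every turn.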

Why this matters

At grok-4.20 rates, Kelvin's session is burning ~94k input tokens per turn with effectively zero caching. A 30-turn session costs ~$5.60 in input tokens alone.
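As a sanity check on that figure, a sketch of the arithmetic. The per-million input rate is an assumption chosen to match the quoted total; it is not stated in this issue:

```python
tokens_per_turn = 94_000
turns = 30
rate_per_mtok = 2.00  # assumed input price in $/Mtok; substitute the actual grok rate

session_cost = tokens_per_turn * turns * rate_per_mtok / 1_000_000
# ≈ $5.64 under these assumptions, consistent with the ~$5.60 above
```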

Potential approaches (for discussion)

A. Pin system prompt at bridge layer
Capture the system prompt on first request, serve the exact bytes on subsequent requests. Only update if the content materially changes (beyond timestamp/session noise).
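A minimal sketch of the pinning idea, assuming the bridge can key on a per-conversation ID. All names here are hypothetical, not the bridge's actual API:

```python
_pinned_prompts: dict[str, str] = {}  # conv_id -> first-seen system prompt

def pinned_system_prompt(conv_id: str, incoming: str) -> str:
    """Serve the first-seen system prompt for a conversation, byte for byte."""
    if conv_id not in _pinned_prompts:
        _pinned_prompts[conv_id] = incoming
    # A real implementation would also diff `incoming` against the pin and
    # re-pin when the content materially changes (beyond timestamp noise).
    return _pinned_prompts[conv_id]
```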

B. Restructure request for cache-friendly ordering
Move stable content (pinned system prompt + tools) to the front of the serialized body. Push volatile content (messages, dynamic context) to the end so the prefix match extends further.
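A sketch of the reordering, assuming the payload is assembled bridge-side. Python dicts preserve insertion order, and `json.dumps` keeps that order unless `sort_keys=True` is set:

```python
def cache_friendly_payload(system: str, tools: list, messages: list) -> dict:
    """Stable content first, volatile content last, to extend the prefix match."""
    return {
        "system": system,      # pinned, stable across turns
        "tools": tools,        # stable across turns
        "messages": messages,  # volatile: changes every turn
    }
```

Note the interaction with approach D: `sort_keys=True` would re-alphabetize the keys and put `messages` first, so deterministic serialization should preserve insertion order rather than sort.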

C. Investigate what's changing in the system prompt
Diff the system prompt across requests to find the non-deterministic element. Could be a timestamp, conversation ID, compaction counter, or dynamic context injection. If it's a single field, we can strip/pin just that.
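The stdlib's difflib is enough for that comparison. A sketch, assuming two captured system prompt strings:

```python
import difflib

def diff_system_prompts(prev: str, curr: str) -> list[str]:
    """Return only the changed lines between two captured system prompts."""
    return [
        line
        for line in difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm="", n=0
        )
        # Keep +/- change lines, drop the "---"/"+++" file headers.
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]
```

If the output is a single changed line (a timestamp, a conversation ID), stripping or pinning just that field is the cheapest fix.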

D. Pre-serialize with deterministic JSON
Use json.dumps(sort_keys=True, separators=(',', ':')) to ensure byte-identical serialization. Send raw bytes to xAI instead of letting httpx re-serialize.
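A sketch of the deterministic path; the httpx call is shown in a comment because the exact bridge wiring isn't in this issue:

```python
import json

def serialize_deterministic(payload: dict) -> bytes:
    """Byte-identical JSON: sorted keys, compact separators, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")

# Send the raw bytes via httpx's content= parameter so httpx does not
# re-serialize the body, e.g.:
#   httpx.post(url, content=serialize_deterministic(payload),
#              headers={"Content-Type": "application/json"})
```

The same input dict then always produces the same bytes, regardless of how the dict was built.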

Priority

Not urgent — but the cost impact is significant for heavy Grok usage. Park for now, revisit when bridge sees regular production traffic.
