Findings from cache diagnostics (#74)
What we learned
- System prompt hash changes every request — cbc86cf7 → dbd72931 → a304d4a0 across 3 consecutive requests. Same token count (~8,216) but different content. Something non-deterministic is being injected per turn.
- Tools are stable — tools_hash=02f1e5f2 consistent across all requests. ~26k tokens. Not the cache-breaking culprit.
- System prompt is only ~8k tokens — Claude Code's full context load (rules, skills, memory, agents, CLAUDE.md — estimated 50k+) is NOT in the system prompt field. It's likely injected as user/system-reminder messages throughout the input array.
- Cache hits are 64-1,216 tokens out of 94k+ per request (~0.1-1.3%). The x-grok-conv-id header is working but there's almost nothing stable to cache.
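The diagnostics above boil down to hashing the cache-relevant fields of each outgoing request and watching which hashes drift. A minimal sketch of that check (field names `system`/`tools` and the 8-hex-char hash width are assumptions matching the findings, not the bridge's actual code):

```python
import hashlib
import json

def prefix_hashes(body: dict) -> dict:
    """Hash the cache-relevant fields of a request body so drift shows up in logs."""
    def h(obj) -> str:
        # deterministic serialization so identical content yields identical hashes
        raw = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
        return hashlib.sha256(raw).hexdigest()[:8]

    return {
        "system_hash": h(body.get("system", "")),
        "tools_hash": h(body.get("tools", [])),
    }

# Two requests whose system prompts differ only by a timestamp hash differently,
# while identical tools hash identically — exactly the pattern observed above.
a = prefix_hashes({"system": "rules... ts=1700000001", "tools": [{"name": "bash"}]})
b = prefix_hashes({"system": "rules... ts=1700000002", "tools": [{"name": "bash"}]})
assert a["tools_hash"] == b["tools_hash"]
assert a["system_hash"] != b["system_hash"]
```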
Why this matters
At grok-4.20 rates, Kelvin's session is burning ~94k input tokens per turn with effectively zero caching. A 30-turn session costs ~$5.60 in input tokens alone.
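The $5.60 figure is back-of-envelope arithmetic; it works out if the uncached input rate is around $2 per 1M tokens (the rate here is an assumption chosen to match the quoted total, not a published price):

```python
# 94k input tokens per turn x 30 turns, with effectively zero cache hits,
# at an assumed ~$2 per 1M uncached input tokens.
tokens_per_turn = 94_000
turns = 30
rate_per_million = 2.00  # assumption, not a quoted price
cost = tokens_per_turn * turns * rate_per_million / 1_000_000
print(f"${cost:.2f}")  # → $5.64, i.e. the ~$5.60 cited above
```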
Potential approaches (for discussion)
A. Pin system prompt at bridge layer
Capture the system prompt on first request, serve the exact bytes on subsequent requests. Only update if the content materially changes (beyond timestamp/session noise).
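A minimal sketch of the pinning idea, assuming the bridge can intercept the request body before forwarding (the class and method names are illustrative; a real version would also diff incoming prompts and re-pin on material changes rather than always serving the first copy):

```python
class SystemPromptPinner:
    """Capture the system prompt on the first request and serve those exact
    bytes on subsequent requests, so the cached prefix stays byte-identical."""

    def __init__(self):
        self._pinned = None

    def pin(self, system_prompt: str) -> str:
        if self._pinned is None:
            self._pinned = system_prompt  # capture exact content on first request
        return self._pinned  # later volatile copies are discarded

pinner = SystemPromptPinner()
first = pinner.pin("base rules ts=1700000001")
second = pinner.pin("base rules ts=1700000002")  # timestamp noise is dropped
assert second == first
```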
B. Restructure request for cache-friendly ordering
Move stable content (pinned system prompt + tools) to the front of the serialized body. Push volatile content (messages, dynamic context) to the end so the prefix match extends further.
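One way to sketch that reordering at the bridge, assuming the body is a flat dict and `model`/`system`/`tools` are the stable fields (field names assumed from the findings; Python dicts preserve insertion order, so `json.dumps` emits keys in the order built here):

```python
import json

def cache_friendly_body(body: dict) -> bytes:
    """Re-serialize with stable fields first so the shared byte prefix
    across consecutive requests extends as far as possible."""
    ordered = {}
    for key in ("model", "system", "tools"):  # stable prefix, fixed order
        if key in body:
            ordered[key] = body[key]
    for key in body:  # volatile tail: messages, dynamic context, etc.
        if key not in ordered:
            ordered[key] = body[key]
    return json.dumps(ordered, separators=(",", ":")).encode()

r1 = cache_friendly_body({"messages": ["hi"], "system": "rules", "tools": [], "model": "grok"})
r2 = cache_friendly_body({"messages": ["bye"], "system": "rules", "tools": [], "model": "grok"})
# the common byte prefix now covers model + system + tools before diverging
assert r1[:50] == r2[:50] and r1 != r2
```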
C. Investigate what's changing in the system prompt
Diff the system prompt across requests to find the non-deterministic element. Could be a timestamp, conversation ID, compaction counter, or dynamic context injection. If it's a single field, we can strip/pin just that.
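The diff step can be done with stdlib `difflib` once two consecutive system prompts are captured; the sample prompts below are invented to show the shape of the output, not real captures:

```python
import difflib

def prompt_diff(prev: str, curr: str) -> list:
    """Line-level diff of two captured system prompts; only changed lines survive."""
    return [
        line for line in difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm="", n=0
        )
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# Hypothetical captures: if a single field like a conversation ID is the culprit,
# the diff collapses to one removed/added line pair that can be stripped or pinned.
prev = "You are Claude Code.\nconv_id: abc123\nFollow the rules."
curr = "You are Claude Code.\nconv_id: def456\nFollow the rules."
print(prompt_diff(prev, curr))  # → ['-conv_id: abc123', '+conv_id: def456']
```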
D. Pre-serialize with deterministic JSON
Use json.dumps(sort_keys=True, separators=(',', ':')) to ensure byte-identical serialization. Send raw bytes to xAI instead of letting httpx re-serialize.
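A sketch of what that looks like end to end; the httpx call is shown as a comment since it needs a live endpoint, and `httpx.post(url, content=...)` is the standard way to send pre-built bytes without re-serialization:

```python
import json

def serialize(body: dict) -> bytes:
    """Deterministic, compact serialization: same dict -> same bytes, always."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

body = {"system": "rules", "model": "grok", "tools": []}
# key insertion order no longer matters, so re-built dicts still hash identically
assert serialize(body) == serialize(dict(reversed(list(body.items()))))

# send the exact bytes instead of letting the client re-serialize:
# httpx.post(url, content=serialize(body),
#            headers={"content-type": "application/json"})
```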
Priority
Not urgent — but the cost impact is significant for heavy Grok usage. Park for now, revisit when the bridge sees regular production traffic.
Related