Findings from cache diagnostics (#74)
What we learned
- System prompt hash changes every request — cbc86cf7 → dbd72931 → a304d4a0 across 3 consecutive requests. Same token count (~8,216) but different content. Something non-deterministic is being injected per turn.
- Tools are stable — tools_hash=02f1e5f2 consistent across all requests. ~26k tokens. Not the cache-breaking culprit.
- System prompt is only ~8k tokens — Claude Code's full context load (rules, skills, memory, agents, CLAUDE.md — estimated 50k+) is NOT in the system prompt field. It's likely injected as user/system-reminder messages throughout the input array.
- Cache hits are 64-1,216 tokens out of 94k+ per request (~0.1-1.3%). The x-grok-conv-id header is working but there's almost nothing stable to cache.
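The diagnostics above boil down to hashing the cache-relevant fields of each outgoing request and watching which hashes drift. A minimal sketch of that check (field names `system`/`tools` and the 8-hex-char hash width are assumptions matching the findings, not the bridge's actual code):

```python
import hashlib
import json

def prefix_hashes(body: dict) -> dict:
    """Hash the cache-relevant fields of a request body so drift shows up in logs."""
    def h(obj) -> str:
        # deterministic serialization so identical content yields identical hashes
        raw = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()
        return hashlib.sha256(raw).hexdigest()[:8]

    return {
        "system_hash": h(body.get("system", "")),
        "tools_hash": h(body.get("tools", [])),
    }

# Two requests whose system prompts differ only by a timestamp hash differently,
# while identical tools hash identically — exactly the pattern observed above.
a = prefix_hashes({"system": "rules... ts=1700000001", "tools": [{"name": "bash"}]})
b = prefix_hashes({"system": "rules... ts=1700000002", "tools": [{"name": "bash"}]})
assert a["tools_hash"] == b["tools_hash"]
assert a["system_hash"] != b["system_hash"]
```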
Why this matters
At grok-4.20 rates, Kelvin's session is burning ~94k input tokens per turn with effectively zero caching. A 30-turn session costs ~$5.60 in input tokens alone.
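The $5.60 figure is back-of-envelope arithmetic; it works out if the uncached input rate is around $2 per 1M tokens (the rate here is an assumption chosen to match the quoted total, not a published price):

```python
# 94k input tokens per turn x 30 turns, with effectively zero cache hits,
# at an assumed ~$2 per 1M uncached input tokens.
tokens_per_turn = 94_000
turns = 30
rate_per_million = 2.00  # assumption, not a quoted price
cost = tokens_per_turn * turns * rate_per_million / 1_000_000
print(f"${cost:.2f}")  # → $5.64, i.e. the ~$5.60 cited above
```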
Potential approaches (for discussion)
A. Pin system prompt at bridge layer
Capture the system prompt on first request, serve the exact bytes on subsequent requests. Only update if the content materially changes (beyond timestamp/session noise).
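A minimal sketch of the pinning idea, assuming the bridge can intercept the request body before forwarding (the class and method names are illustrative; a real version would also diff incoming prompts and re-pin on material changes rather than always serving the first copy):

```python
class SystemPromptPinner:
    """Capture the system prompt on the first request and serve those exact
    bytes on subsequent requests, so the cached prefix stays byte-identical."""

    def __init__(self):
        self._pinned = None

    def pin(self, system_prompt: str) -> str:
        if self._pinned is None:
            self._pinned = system_prompt  # capture exact content on first request
        return self._pinned  # later volatile copies are discarded

pinner = SystemPromptPinner()
first = pinner.pin("base rules ts=1700000001")
second = pinner.pin("base rules ts=1700000002")  # timestamp noise is dropped
assert second == first
```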
B. Restructure request for cache-friendly ordering
Move stable content (pinned system prompt + tools) to the front of the serialized body. Push volatile content (messages, dynamic context) to the end so the prefix match extends further.
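One way to sketch that reordering at the bridge, assuming the body is a flat dict and `model`/`system`/`tools` are the stable fields (field names assumed from the findings; Python dicts preserve insertion order, so `json.dumps` emits keys in the order built here):

```python
import json

def cache_friendly_body(body: dict) -> bytes:
    """Re-serialize with stable fields first so the shared byte prefix
    across consecutive requests extends as far as possible."""
    ordered = {}
    for key in ("model", "system", "tools"):  # stable prefix, fixed order
        if key in body:
            ordered[key] = body[key]
    for key in body:  # volatile tail: messages, dynamic context, etc.
        if key not in ordered:
            ordered[key] = body[key]
    return json.dumps(ordered, separators=(",", ":")).encode()

r1 = cache_friendly_body({"messages": ["hi"], "system": "rules", "tools": [], "model": "grok"})
r2 = cache_friendly_body({"messages": ["bye"], "system": "rules", "tools": [], "model": "grok"})
# the common byte prefix now covers model + system + tools before diverging
assert r1[:50] == r2[:50] and r1 != r2
```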
C. Investigate what's changing in the system prompt
Diff the system prompt across requests to find the non-deterministic element. Could be a timestamp, conversation ID, compaction counter, or dynamic context injection. If it's a single field, we can strip/pin just that.
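The diff step can be done with stdlib `difflib` once two consecutive system prompts are captured; the sample prompts below are invented to show the shape of the output, not real captures:

```python
import difflib

def prompt_diff(prev: str, curr: str) -> list:
    """Line-level diff of two captured system prompts; only changed lines survive."""
    return [
        line for line in difflib.unified_diff(
            prev.splitlines(), curr.splitlines(), lineterm="", n=0
        )
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]

# Hypothetical captures: if a single field like a conversation ID is the culprit,
# the diff collapses to one removed/added line pair that can be stripped or pinned.
prev = "You are Claude Code.\nconv_id: abc123\nFollow the rules."
curr = "You are Claude Code.\nconv_id: def456\nFollow the rules."
print(prompt_diff(prev, curr))  # → ['-conv_id: abc123', '+conv_id: def456']
```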
D. Pre-serialize with deterministic JSON
Use json.dumps(sort_keys=True, separators=(',', ':')) to ensure byte-identical serialization. Send raw bytes to xAI instead of letting httpx re-serialize.
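A sketch of what that looks like end to end; the httpx call is shown as a comment since it needs a live endpoint, and `httpx.post(url, content=...)` is the standard way to send pre-built bytes without re-serialization:

```python
import json

def serialize(body: dict) -> bytes:
    """Deterministic, compact serialization: same dict -> same bytes, always."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

body = {"system": "rules", "model": "grok", "tools": []}
# key insertion order no longer matters, so re-built dicts still hash identically
assert serialize(body) == serialize(dict(reversed(list(body.items()))))

# send the exact bytes instead of letting the client re-serialize:
# httpx.post(url, content=serialize(body),
#            headers={"content-type": "application/json"})
```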
Priority
Not urgent — but the cost impact is significant for heavy Grok usage. Park for now, revisit when the bridge sees regular production traffic.
Related