Summary
In agentic loops, the same conversation hits Plano's routing endpoint multiple times. Today each call re-evaluates routing independently, so the selected model can change mid-conversation. Session pinning ensures that once a model is selected for a session, subsequent routing calls in that session return the same model.
How it works
The caller sends an X-Session-Id header in the routing request. Plano caches the routing decision keyed by that ID:
- First call with X-Session-Id: abc → run routing, select model, cache abc → gpt-4o
- Subsequent calls with X-Session-Id: abc → skip routing, return cached gpt-4o
- Cache entries expire via configurable TTL (default 30 min)
- No X-Session-Id header → routing runs fresh every time (current behavior, no breaking change)
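The four cases above can be sketched as a single lookup-or-route function. This is only an illustration, not Plano's actual code: `resolve_model`, `RouteDecision`, and the `route` closure are hypothetical names, and TTL expiry is left out here to keep the control flow visible.

```rust
use std::collections::HashMap;

/// Outcome of one routing call: the chosen model and whether it came from the cache.
#[derive(Debug, PartialEq)]
pub struct RouteDecision {
    pub model: String,
    pub pinned: bool,
}

/// Hypothetical helper: resolves a model for a request, pinning by session ID.
/// `route` stands in for the real routing logic and is only called on a miss
/// or when no session ID was supplied.
pub fn resolve_model<F>(
    pins: &mut HashMap<String, String>,
    session_id: Option<&str>,
    route: F,
) -> RouteDecision
where
    F: FnOnce() -> String,
{
    match session_id {
        // No X-Session-Id header: route fresh and leave the cache untouched.
        None => RouteDecision { model: route(), pinned: false },
        Some(id) => {
            // Cache hit: skip routing and return the pinned model.
            if let Some(model) = pins.get(id) {
                return RouteDecision { model: model.clone(), pinned: true };
            }
            // Cache miss: run routing once and pin the result for this session.
            let model = route();
            pins.insert(id.to_string(), model.clone());
            RouteDecision { model, pinned: false }
        }
    }
}
```

Passing the routing step as a closure makes it easy to see that it never runs on a cache hit, which is the property agentic loops depend on.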
Request
POST /routing/v1/chat/completions
X-Session-Id: session-abc-123
Content-Type: application/json
{
"messages": [...]
}
Response
{
"model": "gpt-4o",
"route": "quick",
"session_id": "session-abc-123",
"pinned": true,
"trace_id": "abc123..."
}
pinned: true indicates the result came from cache; pinned: false is returned on the first routing decision.
Implementation
- Extract X-Session-Id from request headers in the routing handler
- Add an in-memory TTL cache in RouterService keyed by session ID (e.g. HashMap<String, (String, Instant)> behind a mutex)
- Before calling determine_route(), check cache for a valid (non-expired) entry
- On cache miss, run routing and store the result
- TTL configurable via routing.session_ttl_seconds in plano config (default 1800)
- No database or external state needed
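A minimal sketch of the TTL cache described above, using only the standard library. `SessionCache` and its method names are hypothetical, and two details are assumptions the issue does not specify: the stored `Instant` is the expiry time rather than the insertion time, and a cache hit does not extend the TTL.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical session-pinning cache: session ID -> (model, expiry time).
pub struct SessionCache {
    ttl: Duration,
    entries: Mutex<HashMap<String, (String, Instant)>>,
}

impl SessionCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: Mutex::new(HashMap::new()) }
    }

    /// Returns the pinned model if an entry exists and has not expired as of `now`.
    /// An expired entry is removed during the lookup.
    pub fn get_at(&self, session_id: &str, now: Instant) -> Option<String> {
        let mut entries = self.entries.lock().unwrap();
        let live = match entries.get(session_id) {
            Some((model, expires_at)) if now < *expires_at => Some(model.clone()),
            _ => None,
        };
        if live.is_none() {
            // Drops an expired entry; a no-op if the key was never present.
            entries.remove(session_id);
        }
        live
    }

    /// Pins `model` for `session_id` until `now + ttl`.
    pub fn insert_at(&self, session_id: &str, model: &str, now: Instant) {
        let mut entries = self.entries.lock().unwrap();
        entries.insert(session_id.to_string(), (model.to_string(), now + self.ttl));
    }

    /// Convenience wrappers that use the current time.
    pub fn get(&self, session_id: &str) -> Option<String> {
        self.get_at(session_id, Instant::now())
    }

    pub fn insert(&self, session_id: &str, model: &str) {
        self.insert_at(session_id, model, Instant::now())
    }
}
```

The handler would call `get` before `determine_route()` and `insert` after a miss. Taking `now` as a parameter in the `_at` variants keeps expiry testable without sleeping. Expired entries are only evicted when looked up, so sessions that never return stay in memory; a periodic sweep may be worth adding if session IDs are high-cardinality.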