[DEV-1433] Add kapacitor eval command (sequential, transcript-based) #11
Conversation
`kapacitor eval <sessionId>` runs an LLM-as-judge evaluation over a stored
session. 13 questions across 4 categories (safety, plan adherence, quality,
efficiency) are sent to Claude in sequence — each in its own headless
invocation with no tools — with the server's compacted eval-context
(DEV-1432) embedded in the prompt. Verdicts are aggregated into per-category
+ overall scores, rendered to the terminal, and POSTed to the server's
eval results endpoint (DEV-1433a) for persistence as a SessionEvalCompleted
event on the session's stream.
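The sequential flow described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual C# implementation; the names (`build_prompt`, `evaluate`, `run_judge`) and the two-question list are hypothetical stand-ins (the real command asks 13 questions).

```python
import json

# Abbreviated question set; the real command asks 13 questions across
# four categories (safety, plan adherence, quality, efficiency).
QUESTIONS = [
    {"id": "safety-1", "category": "safety"},
    {"id": "quality-1", "category": "quality"},
]

def build_prompt(trace: str, question: dict) -> str:
    # The compacted eval-context is embedded directly in the prompt;
    # the judge needs no tools or retrieval.
    return f"Transcript:\n{trace}\n\nAnswer question {question['id']} as JSON."

def evaluate(trace: str, run_judge) -> list[dict]:
    verdicts = []
    for q in QUESTIONS:  # strictly sequential: one headless invocation each
        raw = run_judge(build_prompt(trace, q))
        verdicts.append(json.loads(raw))
    return verdicts

# Stub standing in for a no-tools `claude -p` invocation:
fake_judge = lambda prompt: '{"score": 5, "verdict": "pass"}'
results = evaluate("<compacted trace>", fake_judge)
```

Each invocation is independent, so a malformed answer to one question cannot corrupt the others.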
Key decisions vs. the superseded PR #223:
- Transcript is embedded directly, not searched via MCP. No Kontext, no
submodule, no embeddings. Judges read the full (compacted) trace.
- Question taxonomy and category structure salvaged verbatim from #223;
prompt template rewritten for the full-trace approach and stored as an
embedded resource.
- ClaudeCliRunner gained optional `model` and `maxTurns` parameters;
default (haiku, 1) preserves behaviour for TitleGenerator / WhatsDone.
Eval defaults to sonnet — overridable via `--model`.
Verdict parsing is tolerant of markdown code fences (judges sometimes wrap
their JSON despite the prompt's instruction otherwise) and overrides
category/question_id mismatches against the asked-about question to defend
against mild hallucination.
Unit tests cover verdict parsing (clean JSON, fenced JSON, malformed,
category/id override), aggregation (per-category averaging, canonical
category ordering, fail-verdict derivation), and prompt-template placeholder
substitution — 8/8 passing, full suite 184/184. `dotnet publish -c Release`
produces zero IL3050/IL2026 warnings.
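The aggregation behaviour those tests exercise can be sketched like this (hypothetical names; the canonical category order is taken from the taxonomy above):

```python
from statistics import mean

# Fixed rendering order for per-category results.
CANONICAL = ["safety", "plan adherence", "quality", "efficiency"]

def aggregate(verdicts: list[dict]) -> dict:
    """Average scores per category, in canonical order, plus an overall mean."""
    per_category = {}
    for cat in CANONICAL:
        scores = [v["score"] for v in verdicts if v["category"] == cat]
        if scores:
            per_category[cat] = round(mean(scores), 2)
    overall = round(mean(v["score"] for v in verdicts), 2)
    return {"categories": per_category, "overall": overall}
```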
Server endpoint (POST /api/sessions/{id}/evals) and eval-context endpoint
are in separate PRs on the kapacitor-server repo (#474 and already-merged
#473).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review Summary by Qodo
Add kapacitor eval command with transcript-based LLM-as-judge evaluation
Walkthrough
Description
- Add kapacitor eval command for LLM-as-judge session evaluation
- Embed full session trace directly in prompts, no MCP/embeddings needed
- Aggregate verdicts into per-category and overall scores (1-5 scale)
- Persist eval results to server via POST endpoint
- Extend ClaudeCliRunner with configurable model and maxTurns parameters

Diagram
```mermaid
flowchart LR
    A["Session ID"] -->|Fetch eval-context| B["Compacted Trace JSON"]
    B -->|13 questions| C["Sequential Judge Invocations"]
    C -->|Parse verdicts| D["Verdict List"]
    D -->|Aggregate scores| E["Per-Category Results"]
    E -->|Render + POST| F["Terminal Display & Server Persistence"]
```
File Changes
1. src/kapacitor/ClaudeCliRunner.cs
Code Review by Qodo
Defense-in-depth layer to complement the server-side validator (#474):
- Reject verdicts whose score is outside 1..5 (null return; the judge output is skipped from the aggregate)
- Always derive the verdict from the score, ignoring whatever string the judge placed in the verdict field

The prompt documents the mapping (pass=4-5, warn=2-3, fail=1); trusting the score over the judge's verdict string canonicalizes away mild hallucinations (verdict="banana", score=5/verdict="fail", etc.) without discarding otherwise-useful scores. This means the CLI never sends contract-violating payloads to the server in the common case, while the server's validator remains authoritative.

Three new tests:
- Score out of range → null
- Score/verdict mismatch → verdict derived from score
- Garbage verdict string → sanitized via derivation

11 EvalCommandTests pass, full suite 187/187, AOT publish clean.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes from PR #11 review (Finding #3 on invalid verdicts was already addressed in 31cd70d):

1. Flag value becomes sessionId (Action required) — ResolveSessionId picked the first non-"--" token, so `kapacitor eval --model sonnet <id>` ended up calling the server with "sonnet" as the session id. Extracted ResolveSessionId into a testable ArgParsing helper that accepts a list of value-bearing flags and correctly skips their values. The eval dispatcher declares ["--model", "--threshold"] as such. 8 unit tests cover positional-before-flags, flags-before-positional, mixed value and boolean flags, and the env-var fallback (serialized via NotInParallel since the tests mutate process-level env).
2. Prompt exceeds argv limits (Action required) — eval embeds the full compacted trace in the prompt; for larger sessions that easily exceeds Windows' 32K argv limit (and stresses macOS/Linux limits too). Added a `promptViaStdin` flag to ClaudeCliRunner; when true, the prompt is written to the child process's stdin after start instead of being passed as a positional argv, and `claude -p` reads it from stdin. EvalCommand opts in; TitleGenerator / WhatsDone keep their existing argv behaviour.
3. Eval not in main help — added an entry under the Session section of help-usage.txt so `kapacitor --help` surfaces the command.

AOT publish clean (0 IL warnings), full suite 195/195 pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New "Session evaluation (LLM-as-judge)" section under post-session analysis commands. Explains the no-tools judge pattern, lists example flags, and notes that results persist to the session stream as SessionEvalCompleted events. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes DEV-1433 pending the server endpoint merging (kurrent-io/Kurrent.Capacitor#474).
🤖 Generated with Claude Code