
[DEV-1433] Add kapacitor eval command (sequential, transcript-based)#11

Merged
alexeyzimarev merged 4 commits into main from alexeyzimarev/dev-1433b-cli-eval-command on Apr 13, 2026

Conversation

@alexeyzimarev (Member)

Summary

`kapacitor eval <sessionId>` runs an LLM-as-judge evaluation over a stored session. 13 questions across 4 categories (safety, plan adherence, quality, efficiency) are sent to Claude in sequence — each in its own headless invocation with no tools — with the server's compacted eval-context (DEV-1432) embedded in the prompt. Verdicts are aggregated into per-category + overall scores, rendered to the terminal, and POSTed to the server's eval results endpoint (DEV-1433a, open at kurrent-io/Kurrent.Capacitor#474) for persistence.
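The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the tuple shape and the choice of overall = mean of the category means are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy verdict list standing in for the 13 judge outputs (1-5 scale).
var verdicts = new List<(string Category, int Score)> {
    ("safety", 5), ("safety", 4),
    ("quality", 3),
};

// Per-category score: mean of that category's per-question scores.
var perCategory = verdicts
    .GroupBy(v => v.Category)
    .ToDictionary(g => g.Key, g => g.Average(v => v.Score));

// Overall score: here taken as the mean of the category means (an assumption).
var overall = perCategory.Values.Average();

foreach (var (category, score) in perCategory)
    Console.WriteLine($"{category}: {score:0.00}");
Console.WriteLine($"overall: {overall:0.00}");
```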

What's different from the superseded PR #223

  • Transcript is embedded directly, not searched via MCP. No Kontext, no submodule, no embeddings. Judges read the full (compacted) trace.
  • Question taxonomy salvaged verbatim; prompt template rewritten for the full-trace approach and stored as an embedded resource.
  • Judge fidelity isn't capped by search queries — this was the main concern with the PR #223 approach.

Implementation notes

  • `ClaudeCliRunner.RunAsync` gained optional `model` and `maxTurns` parameters; defaults preserve behaviour for `TitleGenerator` / `WhatsDone`. Eval defaults to sonnet, overridable via `--model`.
  • Verdict parsing is tolerant of markdown code fences (judges sometimes wrap JSON despite the prompt's instruction) and overrides `category` / `question_id` mismatches to defend against mild hallucination.
  • Options: `--model`, `--chain`, `--threshold` (passes through to the server's eval-context endpoint).
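The default-preserving shape of the `ClaudeCliRunner` change might look like this sketch. The flag-string builder is a hypothetical stand-in for the real process invocation; only the optional-parameter pattern (existing callers keep the haiku/1-turn defaults, eval overrides the model) reflects the PR.

```csharp
using System;

// Optional parameters preserve existing call sites: TitleGenerator / WhatsDone
// pass nothing and keep (haiku, 1 turn); eval passes model: "sonnet".
// BuildClaudeArgs and its output format are illustrative assumptions.
static string BuildClaudeArgs(string model = "haiku", int maxTurns = 1) =>
    $"--model {model} --max-turns {maxTurns} -p";

Console.WriteLine(BuildClaudeArgs());         // --model haiku --max-turns 1 -p
Console.WriteLine(BuildClaudeArgs("sonnet")); // --model sonnet --max-turns 1 -p
```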

Closes DEV-1433 pending the server endpoint merging (kurrent-io/Kurrent.Capacitor#474).

Test plan

  • `dotnet build src/kapacitor/kapacitor.csproj` — clean
  • `dotnet publish src/kapacitor/kapacitor.csproj -c Release` — zero IL3050/IL2026 warnings (AOT-clean)
  • Unit tests — 8 new EvalCommandTests pass; full suite 184/184
    • Verdict parsing: clean JSON, fenced JSON, malformed, category/id override
    • Aggregation: per-category averaging, canonical category ordering, fail-verdict derivation
    • Prompt template placeholder substitution
  • CI
  • End-to-end against a local server (deferred until #474 lands)

🤖 Generated with Claude Code

kapacitor eval <sessionId> runs an LLM-as-judge evaluation over a stored
session. 13 questions across 4 categories (safety, plan adherence, quality,
efficiency) are sent to Claude in sequence — each in its own headless
invocation with no tools — with the server's compacted eval-context
(DEV-1432) embedded in the prompt. Verdicts are aggregated into per-category
+ overall scores, rendered to the terminal, and POSTed to the server's
eval results endpoint (DEV-1433a) for persistence as a SessionEvalCompleted
event on the session's stream.

Key decisions vs. the superseded PR #223:
- Transcript is embedded directly, not searched via MCP. No Kontext, no
  submodule, no embeddings. Judges read the full (compacted) trace.
- Question taxonomy and category structure salvaged verbatim from #223;
  prompt template rewritten for the full-trace approach and stored as an
  embedded resource.
- ClaudeCliRunner gained optional `model` and `maxTurns` parameters;
  default (haiku, 1) preserves behaviour for TitleGenerator / WhatsDone.
  Eval defaults to sonnet — overridable via `--model`.

Verdict parsing is tolerant of markdown code fences (judges sometimes wrap
their JSON despite the prompt's instruction otherwise) and overrides
category/question_id mismatches against the asked-about question to defend
against mild hallucination.
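The fence-tolerant parsing might look like this sketch. The helper name `StripCodeFences` appears in the later review excerpt; this body is an assumed implementation, not the PR's actual code.

```csharp
using System;

// Assumed implementation: drop a leading ```lang line and the trailing ```
// before handing the remainder to the JSON parser; pass clean input through.
static string StripCodeFences(string s) {
    s = s.Trim();
    if (s.StartsWith("```")) {
        var firstNewline = s.IndexOf('\n');
        var lastFence    = s.LastIndexOf("```", StringComparison.Ordinal);
        if (firstNewline >= 0 && lastFence > firstNewline)
            s = s[(firstNewline + 1)..lastFence];
    }
    return s.Trim();
}

Console.WriteLine(StripCodeFences("```json\n{\"score\":5}\n```")); // {"score":5}
```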

Unit tests cover verdict parsing (clean JSON, fenced JSON, malformed,
category/id override), aggregation (per-category averaging, canonical
category ordering, fail-verdict derivation), and prompt-template placeholder
substitution — 8/8 passing, full suite 184/184. `dotnet publish -c Release`
produces zero IL3050/IL2026 warnings.

Server endpoint (POST /api/sessions/{id}/evals) and eval-context endpoint
are in separate PRs on the kapacitor-server repo (#474 and already-merged
#473).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@linear

linear bot commented Apr 13, 2026

@qodo-code-review

Review Summary by Qodo

Add kapacitor eval command with transcript-based LLM-as-judge evaluation

✨ Enhancement


Walkthroughs

Description
• Add kapacitor eval command for LLM-as-judge session evaluation
• Embed full session trace directly in prompts, no MCP/embeddings needed
• Aggregate verdicts into per-category and overall scores (1-5 scale)
• Persist eval results to server via POST endpoint
• Extend ClaudeCliRunner with configurable model and maxTurns parameters
Diagram
flowchart LR
  A["Session ID"] -->|Fetch eval-context| B["Compacted Trace JSON"]
  B -->|13 questions| C["Sequential Judge Invocations"]
  C -->|Parse verdicts| D["Verdict List"]
  D -->|Aggregate scores| E["Per-Category Results"]
  E -->|Render + POST| F["Terminal Display & Server Persistence"]


File Changes

1. src/kapacitor/ClaudeCliRunner.cs ✨ Enhancement +30/-10

Parameterize Claude model and max turns

• Add optional model and maxTurns parameters to RunAsync method
• Default to haiku model and 1 turn for backward compatibility
• Pass parameters through to process invocation
• Update documentation to reflect configurable model selection

2. src/kapacitor/Commands/EvalCommand.cs ✨ Enhancement +300/-0

Implement full eval command orchestration and verdict processing

• Implement HandleEval orchestration: fetch eval-context, run 13 sequential judge invocations,
 aggregate verdicts, render results, persist to server
• Define 13-question taxonomy across 4 categories (safety, plan_adherence, quality, efficiency)
• Implement ParseVerdict with markdown code fence stripping and category/question_id override for
 hallucination defense
• Implement Aggregate to compute per-category averages and overall score with canonical category
 ordering
• Implement Render for terminal display with per-question markers (✓/!/✗) and evidence quoting
• Implement BuildQuestionPrompt for template placeholder substitution

3. src/kapacitor/Models.cs ✨ Enhancement +108/-0

Add eval-related data models and JSON serialization

• Add EvalContextEntry record for trace entries (kind, timestamp, text, tool)
• Add EvalContextCompactionSummary record for compaction metadata
• Add EvalContextResult record for eval-context endpoint response
• Add EvalQuestionVerdict record for per-question judge verdicts
• Add EvalCategoryResult record for aggregated category scores
• Add SessionEvalCompletedPayload record for POST payload to server
• Register new types in KapacitorJsonContext for AOT serialization

4. src/kapacitor/Program.cs ✨ Enhancement +18/-0

Wire eval command into CLI entry point

• Add eval command case to main switch statement
• Parse --model, --chain, and --threshold options
• Resolve session ID from args or environment variable
• Delegate to EvalCommand.HandleEval

5. test/kapacitor.Tests.Unit/EvalCommandTests.cs 🧪 Tests +170/-0

Add comprehensive unit tests for eval command

• Test ParseVerdict with clean JSON, markdown-fenced JSON, malformed JSON, and category/id
 override scenarios
• Test Aggregate for per-category averaging, canonical category ordering, and fail-verdict
 derivation
• Test BuildQuestionPrompt for placeholder substitution completeness

6. src/kapacitor/Resources/help-eval.txt 📝 Documentation +23/-0

Add help documentation for eval command

• Document kapacitor eval usage, options (--model, --chain, --threshold), and implementation
 notes
• Explain sequential judge invocation, no-tools constraint, and session ID resolution

7. src/kapacitor/Resources/prompt-eval-question.txt 📝 Documentation +61/-0

Add embedded prompt template for judge invocations

• Embedded prompt template for judge invocations with placeholders for session metadata, trace JSON,
 and question details
• Define response format (JSON with category, question_id, score 1-5, verdict, finding, evidence)
• Provide scoring rubric (5=no concerns, 4=minor, 3=moderate, 2=significant, 1=critical)
• Instruct judges to cite specific tool calls, files, or turns in findings


@qodo-code-review

qodo-code-review bot commented Apr 13, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🖥 UI issues (0) 🎨 UX Issues (0)



Action required

1. Flag value becomes sessionId🐞
Description
ResolveSessionId treats any non-`--` token as positional, so `kapacitor eval --model sonnet
<sessionId>` resolves the sessionId as `sonnet` and calls the server with the wrong id. This breaks
common invocation patterns and is reinforced by the eval usage string printed by the new command
handler.
Code

src/kapacitor/Program.cs[R120-136]

+    case "eval": {
+        var evalSessionId = ResolveSessionId(args);
+
+        if (evalSessionId is null) {
+            Console.Error.WriteLine("Usage: kapacitor eval [--model sonnet] [--chain] [--threshold N] [sessionId]");
+            Console.Error.WriteLine("  No session ID provided and KAPACITOR_SESSION_ID not set.");
+
+            return 1;
+        }
+
+        var evalChain     = args.Contains("--chain");
+        var evalModel     = GetArg(args, "--model") ?? "sonnet";
+        var evalThreshold = GetArg(args, "--threshold") is { } ts && int.TryParse(ts, out var parsed)
+            ? parsed
+            : (int?)null;
+
+        return await EvalCommand.HandleEval(baseUrl!, evalSessionId, evalModel, evalChain, evalThreshold);
Evidence
The new eval command handler resolves sessionId from raw args, while ResolveSessionId selects the
first token not starting with --, which includes values for flags like --model and --threshold
(e.g., sonnet). This makes the eval handler send requests to /api/sessions/sonnet/... instead of
the intended session.

src/kapacitor/Program.cs[120-136]
src/kapacitor/Program.cs[733-738]
src/kapacitor/Program.cs[727-731]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ResolveSessionId` currently grabs the first arg that doesn't start with `--`. For commands that have flags with values (like `eval --model sonnet` / `--threshold 2000`), this incorrectly treats the flag value as the session id.

### Issue Context
The new `eval` command supports `--model <name>` and `--threshold <n>` and calls `ResolveSessionId(args)` before executing the command.

### Fix Focus Areas
- src/kapacitor/Program.cs[120-136]
- src/kapacitor/Program.cs[733-738]
- src/kapacitor/Program.cs[727-731]

### Suggested fix
- Replace `ResolveSessionId` with a small arg scanner that:
 - skips `--flag value` pairs for known value-bearing flags (at least `--model`, `--threshold`, and any other existing value flags in Program.cs like `--owner/--repo/--pr/...`),
 - skips boolean flags (`--chain`, `--full`, etc.),
 - returns the first remaining positional arg as sessionId.
- Update the eval usage line printed on missing session id to match the supported ordering (either require `<sessionId>` first, or support `[options] [sessionId]` reliably).



2. Prompt exceeds argv limits🐞
Description
Eval embeds the full traceJson into each question prompt and then calls
ClaudeCliRunner.RunAsync(prompt, ...), which passes the prompt via claude -p <prompt>; large
sessions can exceed OS command-line length limits and cause eval to fail to start the process. This
is likely for longer traces since the code logs traceJson.Length and repeats the same large
payload 13 times.
Code

src/kapacitor/Commands/EvalCommand.cs[R96-111]

+        var promptTemplate = EmbeddedResources.Load("prompt-eval-question.txt");
+        var verdicts       = new List<EvalQuestionVerdict>();
+
+        for (var i = 0; i < Questions.Length; i++) {
+            var q = Questions[i];
+            Log($"[{i + 1}/{Questions.Length}] {q.Category}/{q.Id}...");
+
+            var prompt = BuildQuestionPrompt(promptTemplate, context.SessionId, evalRunId, q, traceJson);
+
+            var result = await ClaudeCliRunner.RunAsync(
+                prompt,
+                TimeSpan.FromMinutes(5),
+                msg => Log($"  {msg}"),
+                model: model,
+                maxTurns: 1
+            );
Evidence
Eval constructs prompts by inlining the entire trace JSON (traceJson) into the prompt template,
and the Claude runner passes that prompt as a command-line argument (-p + prompt). Command-line
length is finite (notably on Windows), so sufficiently large traces will prevent claude from
starting or receiving a truncated prompt.

src/kapacitor/Commands/EvalCommand.cs[96-111]
src/kapacitor/Resources/prompt-eval-question.txt[10-26]
src/kapacitor/ClaudeCliRunner.cs[76-85]
src/kapacitor/Commands/EvalCommand.cs[79-85]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`EvalCommand` builds a very large prompt (full trace JSON) and `ClaudeCliRunner` passes it on the command line via `claude -p <prompt>`. For large traces this can exceed OS argv limits and cause `Process.Start` / CLI invocation to fail.

### Issue Context
Eval explicitly embeds `{TRACE_JSON}` and repeats it for 13 sequential judge invocations.

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[96-111]
- src/kapacitor/Resources/prompt-eval-question.txt[10-26]
- src/kapacitor/ClaudeCliRunner.cs[62-87]

### Suggested fix
- Update `ClaudeCliRunner` to avoid passing the prompt via argv for large prompts. Preferred options (choose what the `claude` CLI supports):
 - Stream prompt over stdin (e.g., `RedirectStandardInput = true` and a CLI mode that reads prompt from stdin), or
 - Write the prompt to a temp file and use a CLI flag that reads prompt content from file.
- If neither is available, add a hard length guard (with a clear error) before starting the process, and in eval suggest using a smaller server compaction threshold / disabling `--chain` to reduce prompt size.
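The stdin route can be sketched as below. This is illustrative only: `cat` (POSIX) stands in for the real `claude` CLI, and the point is just that redirecting standard input keeps argv small regardless of prompt size.

```csharp
using System;
using System.Diagnostics;

// Redirect stdin, start the child, write the prompt after start, close stdin
// to signal EOF. `cat` simply echoes its stdin back to stdout.
var psi = new ProcessStartInfo("cat") {
    RedirectStandardInput  = true,
    RedirectStandardOutput = true,
    UseShellExecute        = false,
};
using var proc = Process.Start(psi)!;
await proc.StandardInput.WriteAsync("large prompt body");
proc.StandardInput.Close();   // EOF lets the child finish
var output = await proc.StandardOutput.ReadToEndAsync();
proc.WaitForExit();
Console.WriteLine(output);    // large prompt body
```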




Remediation recommended

3. Invalid verdicts persisted🐞
Description
ParseVerdict accepts any score/verdict values from the judge and persists them into the
per-question list without normalization; Render treats unknown verdict strings as failures and
aggregation can be skewed by out-of-range scores. This can produce incorrect terminal output and
potentially server-side validation errors if the POST endpoint expects score in 1–5 and verdict
in {pass,warn,fail}.
Code

src/kapacitor/Commands/EvalCommand.cs[R187-204]

+    internal static EvalQuestionVerdict? ParseVerdict(string rawResponse, EvalQuestion question) {
+        var json = StripCodeFences(rawResponse.Trim());
+
+        try {
+            var parsed = JsonSerializer.Deserialize(json, KapacitorJsonContext.Default.EvalQuestionVerdict);
+            if (parsed is null) return null;
+
+            // Reject verdicts that claim a different category/question than
+            // we asked about — judges sometimes hallucinate ids.
+            if (parsed.Category != question.Category || parsed.QuestionId != question.Id) {
+                return parsed with { Category = question.Category, QuestionId = question.Id };
+            }
+
+            return parsed;
+        } catch (JsonException) {
+            return null;
+        }
+    }
Evidence
The prompt contract requires score 1–5 and verdict in pass|warn|fail, but the parser only
strips code fences and overrides mismatched category/question ids; it does not clamp/validate score
or normalize verdict casing/values. The raw per-question verdict is later rendered via a strict
string switch, where anything other than pass/warn is displayed as a failure marker.

src/kapacitor/Resources/prompt-eval-question.txt[36-45]
src/kapacitor/Commands/EvalCommand.cs[187-204]
src/kapacitor/Commands/EvalCommand.cs[221-233]
src/kapacitor/Commands/EvalCommand.cs[278-287]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Per-question judge outputs are accepted as-is; invalid `score` values (outside 1–5) and unexpected `verdict` strings (wrong casing/unknown values) are not corrected, yet they are rendered and persisted.

### Issue Context
The eval prompt defines a strict schema, but models sometimes deviate (this PR already accounts for fenced JSON). Similar defensive normalization should be applied to `score`/`verdict`.

### Fix Focus Areas
- src/kapacitor/Commands/EvalCommand.cs[187-204]
- src/kapacitor/Commands/EvalCommand.cs[221-233]
- src/kapacitor/Commands/EvalCommand.cs[278-287]
- src/kapacitor/Resources/prompt-eval-question.txt[36-45]

### Suggested fix
- In `ParseVerdict` (or immediately after parsing):
 - Clamp or reject scores outside 1..5.
 - Normalize verdict to lowercase and either:
   - recompute verdict from score (`VerdictForScore(score)`), or
   - validate it is one of {pass,warn,fail} and override otherwise.
- Consider also trimming `Finding`/`Evidence` fields to avoid accidental huge outputs.
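The score-to-verdict derivation later adopted in the follow-up commit (pass=4-5, warn=2-3, fail=1, out-of-range rejected) can be sketched as:

```csharp
using System;

// Deriving the verdict from the score canonicalizes away mild hallucinations
// (verdict="banana", or score=5 with verdict="fail") without discarding
// otherwise-usable scores. Helper name is assumed from the review text.
static string? VerdictForScore(int score) => score switch {
    4 or 5 => "pass",
    2 or 3 => "warn",
    1      => "fail",
    _      => null,   // out of range: reject the whole verdict
};

Console.WriteLine(VerdictForScore(5));               // pass
Console.WriteLine(VerdictForScore(3));               // warn
Console.WriteLine(VerdictForScore(9) ?? "rejected"); // rejected
```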



4. Eval not in main help🐞
Description
kapacitor --help (help-usage.txt) doesn’t list the new eval command even though Program.cs adds
it and a per-command help file exists. This reduces discoverability and can lead users to assume the
feature doesn’t exist.
Code

src/kapacitor/Program.cs[R120-136]

+    case "eval": {
+        var evalSessionId = ResolveSessionId(args);
+
+        if (evalSessionId is null) {
+            Console.Error.WriteLine("Usage: kapacitor eval [--model sonnet] [--chain] [--threshold N] [sessionId]");
+            Console.Error.WriteLine("  No session ID provided and KAPACITOR_SESSION_ID not set.");
+
+            return 1;
+        }
+
+        var evalChain     = args.Contains("--chain");
+        var evalModel     = GetArg(args, "--model") ?? "sonnet";
+        var evalThreshold = GetArg(args, "--threshold") is { } ts && int.TryParse(ts, out var parsed)
+            ? parsed
+            : (int?)null;
+
+        return await EvalCommand.HandleEval(baseUrl!, evalSessionId, evalModel, evalChain, evalThreshold);
Evidence
The new command is wired in Program.cs and has its own help file, but the main usage text’s Session
section omits eval, so users won’t see it in the general command list.

src/kapacitor/Program.cs[120-137]
src/kapacitor/Resources/help-usage.txt[31-41]
src/kapacitor/Resources/help-eval.txt[1-15]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The new `eval` command is not listed in the top-level `kapacitor --help` output.

### Issue Context
`Program.cs` includes a new `case "eval"`, and `Resources/help-eval.txt` exists, but `Resources/help-usage.txt` does not mention it.

### Fix Focus Areas
- src/kapacitor/Resources/help-usage.txt[31-41]

### Suggested fix
Add an `eval` line to the `Session:` section, matching the supported syntax (including `--chain`, `--model`, `--threshold`, and optional `[id]`).




Comment thread src/kapacitor/Program.cs
Comment thread src/kapacitor/Commands/EvalCommand.cs
alexeyzimarev and others added 3 commits April 13, 2026 13:25
Defense-in-depth layer to complement the server-side validator (#474):
- Reject verdicts whose score is outside 1..5 (null return, judge output
  skipped from the aggregate)
- Always derive verdict from score, ignoring whatever string the judge
  placed in the verdict field. The prompt documents the mapping
  (pass=4-5, warn=2-3, fail=1); trusting the score over the judge's
  verdict string canonicalizes away mild hallucinations (verdict="banana",
  score=5/verdict="fail", etc.) without discarding otherwise-useful scores.

Means the CLI never sends contract-violating payloads to the server in the
common case, while the server's validator remains authoritative.

Three new tests:
- Score out of range → null
- score/verdict mismatch → verdict derived from score
- Garbage verdict string → sanitized via derivation

11 EvalCommandTests pass, full suite 187/187, AOT publish clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes from PR #11 review (Finding #3 on invalid verdicts was already
addressed in 31cd70d):

1. Flag value becomes sessionId (Action required) — ResolveSessionId picked
   the first non-"--" token, so `kapacitor eval --model sonnet <id>` ended
   up calling the server with "sonnet" as the session id. Extracted
   ResolveSessionId into a testable ArgParsing helper that accepts a list
   of value-bearing flags and correctly skips their values. The eval
   dispatcher declares ["--model", "--threshold"] as such. 8 unit tests
   cover positional-before-flags, flags-before-positional, mixed value and
   boolean flags, and the env-var fallback (serialized via NotInParallel
   since tests mutate process-level env).

2. Prompt exceeds argv limits (Action required) — eval embeds the full
   compacted trace in the prompt; for larger sessions that easily exceeds
   Windows' 32K argv limit (and stresses macOS/Linux limits too). Added a
   `promptViaStdin` flag to ClaudeCliRunner; when true the prompt is
   written to the child process's stdin after start instead of as a
   positional argv, and `claude -p` reads it from stdin. EvalCommand opts
   in; TitleGenerator / WhatsDone keep their existing argv behaviour.

3. Eval not in main help — added an entry under the Session section of
   help-usage.txt so `kapacitor --help` surfaces the command.

AOT publish clean (0 IL warnings), full suite 195/195 pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New "Session evaluation (LLM-as-judge)" section under post-session analysis
commands. Explains the no-tools judge pattern, lists example flags, and
notes that results persist to the session stream as SessionEvalCompleted
events.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexeyzimarev alexeyzimarev merged commit 8657fa3 into main Apr 13, 2026
3 checks passed
@alexeyzimarev alexeyzimarev deleted the alexeyzimarev/dev-1433b-cli-eval-command branch April 13, 2026 11:43