[DEV-1440] M1: Extract shared eval library#14
Conversation
Refactors kapacitor.Commands.EvalCommand into a reusable kapacitor.Eval library so the daemon (milestone 2) can reuse the same orchestration without duplicating it. No behaviour change — `kapacitor eval <id>` produces identical output and the server contracts are untouched. New namespace layout: - kapacitor.Eval.EvalQuestions: canonical 13-question / 4-category taxonomy and category-order helper. The single source of truth; both prompt building and aggregation reference it. - kapacitor.Eval.IEvalObserver: observer surface for progress. The CLI supplies a stderr-logging implementation; milestone 2 will add a SignalR-pushing implementation for the daemon. Callbacks are shaped specifically so EvalStarted / OnQuestionCompleted / OnFinished / OnFailed map 1:1 to the SignalR events documented in DEV-1440. - kapacitor.Eval.EvalService: RunAsync drives the full pipeline (fetch context, fetch retained facts, run 13 judges sequentially, aggregate, persist, retain new facts) and reports every phase through IEvalObserver. Returns the aggregate on success, null on failure; OnFinished / OnFailed are fired either way so observers don't need to also inspect the return value. kapacitor.Commands.EvalCommand shrinks to a thin adapter: - Creates the authenticated HTTP client - Provides a ConsoleEvalObserver that maps each callback to a timestamped stderr log line (matching the pre-refactor output exactly) - Renders the returned aggregate as the terminal report Types remain internal — the daemon lives in the same assembly, so public isn't needed yet; revisit when/if the server repo consumes the library across assembly boundaries. Tests renamed EvalCommandTests -> EvalServiceTests and retargeted to the new namespace. All 21 existing eval tests continue to pass without changes to their assertions. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Review Summary by QodoExtract shared eval library for daemon reuse (DEV-1440 M1)
WalkthroughsDescription• Extract eval orchestration into reusable kapacitor.Eval library • Move 13-question taxonomy to EvalQuestions for single source of truth • Introduce IEvalObserver interface for progress reporting across environments • Refactor EvalCommand to thin CLI adapter over EvalService • Rename test class to EvalServiceTests with retargeted assertions Diagramflowchart LR
EvalCommand["EvalCommand<br/>(thin CLI adapter)"]
EvalService["EvalService<br/>(core orchestration)"]
EvalQuestions["EvalQuestions<br/>(taxonomy)"]
IEvalObserver["IEvalObserver<br/>(progress surface)"]
ConsoleObserver["ConsoleEvalObserver<br/>(stderr logging)"]
EvalCommand -- "calls RunAsync" --> EvalService
EvalCommand -- "implements" --> IEvalObserver
EvalCommand -- "creates" --> ConsoleObserver
EvalService -- "references" --> EvalQuestions
EvalService -- "reports via" --> IEvalObserver
ConsoleObserver -- "implements" --> IEvalObserver
File Changes1. src/kapacitor/Commands/EvalCommand.cs
|
Code Review by Qodo
1.
|
Daemon side of the dashboard-driven eval pipeline. Pairs with the server M3 endpoint in kurrent-io/Kurrent.Capacitor#477 and depends on the M1 shared eval library in #14. - New SignalR wire types in Models.cs match the server's DaemonCommands.cs: RunEvalCommand (server -> daemon dispatch) plus the four daemon -> server progress events (EvalStarted, EvalQuestionCompleted, EvalFinished, EvalFailed). Registered in KapacitorJsonContext for source-gen serialization. - ServerConnection registers a "RunEval" handler and exposes per-event send methods (EvalStartedAsync etc.) that mirror the existing AgentRegisteredAsync / LaunchFailedAsync pattern. - New EvalRunner singleton subscribes to OnRunEval. Each incoming command spawns a fire-and-forget Task that builds an authenticated HttpClient, instantiates a DaemonEvalObserver bound to the run, and drives EvalService.RunAsync. Unhandled exceptions are caught and translated to an EvalFailed relay so the dashboard learns about daemon-side failures rather than waiting forever. - DaemonEvalObserver maps the IEvalObserver surface to SignalR sends: OnStarted -> EvalStartedAsync, OnQuestionCompleted -> EvalQuestionCompletedAsync, OnFinished -> EvalFinishedAsync, OnFailed -> EvalFailedAsync. Info / per-question-start / per-question-failure / fact-retained callbacks just log locally — they're not interesting enough to justify SignalR chatter for every judge. - Wired into DaemonRunner DI: AddSingleton<EvalRunner> + an explicit GetRequiredService at startup so the constructor's OnRunEval subscription happens before the host starts taking traffic. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four findings on PR #14: 1. Observer exceptions abort eval (Action required) — IEvalObserver documented that observer throws are caught and don't abort the eval, but EvalService called callbacks directly. A SignalR push failure on the daemon would have crashed the run mid-flight, possibly skipping OnFailed. Fixed via a SafeObserver wrapper inside RunAsync that delegates to the caller's observer with a try/catch around each call; exceptions log to stderr (with a nested try/catch in case stderr itself fails) and the eval continues. 2. Progress events reversed (Recommended) — OnContextFetched was emitted before OnStarted. The CLI observer maps these to "Fetched..." then "Evaluating session..." log lines, so the user-facing output order was the reverse of the pre-refactor shape. Swapped — now OnStarted fires first, then OnContextFetched. 3. 401 prints extra line (Recommended) — HandleUnauthorizedAsync writes to stderr directly, then EvalService called observer.OnFailed with "unauthenticated", which the CLI observer also wrote to stderr — resulting in two lines for the same condition. Replaced the HandleUnauthorizedAsync call with a direct StatusCode == 401 check and a single observer.OnFailed("authentication failed — run 'kapacitor login' to re-authenticate"). The observer is now the single reporting channel; daemon callers also benefit (they get EvalFailed instead of nothing for 401s). 4. Cancellation partly ignored (Action required) — RunAsync took a CancellationToken but didn't forward it to FetchAllJudgeFactsAsync / PostJudgeFactAsync, and ThrowIfCancellationRequested could escape without firing OnFailed. Now: ct threads through both helpers (and their HTTP calls + ReadAsStringAsync), and the body of RunAsync is wrapped in a try/catch (OperationCanceledException) that fires observer.OnFailed("cancelled") before returning null — observers always see exactly one terminal callback. Doc updated to reflect that the SafeObserver guarantee + cancellation contract are now actually enforced. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Daemon side of the dashboard-driven eval pipeline. Pairs with the server M3 endpoint in kurrent-io/Kurrent.Capacitor#477 and depends on the M1 shared eval library in #14. - New SignalR wire types in Models.cs match the server's DaemonCommands.cs: RunEvalCommand (server -> daemon dispatch) plus the four daemon -> server progress events (EvalStarted, EvalQuestionCompleted, EvalFinished, EvalFailed). Registered in KapacitorJsonContext for source-gen serialization. - ServerConnection registers a "RunEval" handler and exposes per-event send methods (EvalStartedAsync etc.) that mirror the existing AgentRegisteredAsync / LaunchFailedAsync pattern. - New EvalRunner singleton subscribes to OnRunEval. Each incoming command spawns a fire-and-forget Task that builds an authenticated HttpClient, instantiates a DaemonEvalObserver bound to the run, and drives EvalService.RunAsync. Unhandled exceptions are caught and translated to an EvalFailed relay so the dashboard learns about daemon-side failures rather than waiting forever. - DaemonEvalObserver maps the IEvalObserver surface to SignalR sends: OnStarted -> EvalStartedAsync, OnQuestionCompleted -> EvalQuestionCompletedAsync, OnFinished -> EvalFinishedAsync, OnFailed -> EvalFailedAsync. Info / per-question-start / per-question-failure / fact-retained callbacks just log locally — they're not interesting enough to justify SignalR chatter for every judge. - Wired into DaemonRunner DI: AddSingleton<EvalRunner> + an explicit GetRequiredService at startup so the constructor's OnRunEval subscription happens before the host starts taking traffic. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Daemon side of the dashboard-driven eval pipeline. Pairs with the server M3 endpoint in kurrent-io/Kurrent.Capacitor#477 and depends on the M1 shared eval library in #14. - New SignalR wire types in Models.cs match the server's DaemonCommands.cs: RunEvalCommand (server -> daemon dispatch) plus the four daemon -> server progress events (EvalStarted, EvalQuestionCompleted, EvalFinished, EvalFailed). Registered in KapacitorJsonContext for source-gen serialization. - ServerConnection registers a "RunEval" handler and exposes per-event send methods (EvalStartedAsync etc.) that mirror the existing AgentRegisteredAsync / LaunchFailedAsync pattern. - New EvalRunner singleton subscribes to OnRunEval. Each incoming command spawns a fire-and-forget Task that builds an authenticated HttpClient, instantiates a DaemonEvalObserver bound to the run, and drives EvalService.RunAsync. Unhandled exceptions are caught and translated to an EvalFailed relay so the dashboard learns about daemon-side failures rather than waiting forever. - DaemonEvalObserver maps the IEvalObserver surface to SignalR sends: OnStarted -> EvalStartedAsync, OnQuestionCompleted -> EvalQuestionCompletedAsync, OnFinished -> EvalFinishedAsync, OnFailed -> EvalFailedAsync. Info / per-question-start / per-question-failure / fact-retained callbacks just log locally — they're not interesting enough to justify SignalR chatter for every judge. - Wired into DaemonRunner DI: AddSingleton<EvalRunner> + an explicit GetRequiredService at startup so the constructor's OnRunEval subscription happens before the host starts taking traffic. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [DEV-1440] milestone 2: daemon RunEvalCommand handler Daemon side of the dashboard-driven eval pipeline. Pairs with the server M3 endpoint in kurrent-io/Kurrent.Capacitor#477 and depends on the M1 shared eval library in #14. - New SignalR wire types in Models.cs match the server's DaemonCommands.cs: RunEvalCommand (server -> daemon dispatch) plus the four daemon -> server progress events (EvalStarted, EvalQuestionCompleted, EvalFinished, EvalFailed). Registered in KapacitorJsonContext for source-gen serialization. - ServerConnection registers a "RunEval" handler and exposes per-event send methods (EvalStartedAsync etc.) that mirror the existing AgentRegisteredAsync / LaunchFailedAsync pattern. - New EvalRunner singleton subscribes to OnRunEval. Each incoming command spawns a fire-and-forget Task that builds an authenticated HttpClient, instantiates a DaemonEvalObserver bound to the run, and drives EvalService.RunAsync. Unhandled exceptions are caught and translated to an EvalFailed relay so the dashboard learns about daemon-side failures rather than waiting forever. - DaemonEvalObserver maps the IEvalObserver surface to SignalR sends: OnStarted -> EvalStartedAsync, OnQuestionCompleted -> EvalQuestionCompletedAsync, OnFinished -> EvalFinishedAsync, OnFailed -> EvalFailedAsync. Info / per-question-start / per-question-failure / fact-retained callbacks just log locally — they're not interesting enough to justify SignalR chatter for every judge. - Wired into DaemonRunner DI: AddSingleton<EvalRunner> + an explicit GetRequiredService at startup so the constructor's OnRunEval subscription happens before the host starts taking traffic. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [DEV-1440] address review feedback on daemon eval runner Three findings on PR #15 (the other two — observer-throw guard and judge-fact cancellation propagation — were already addressed by the M1 follow-up in 1f655f4): 1. EvalRunId mismatch (Action required) — server dispatches RunEvalCommand with an EvalRunId, but EvalService generated its own GUID, leading to two different ids in one run's event stream (EvalStarted used the service-generated id; subsequent question / finished / failed events used the dispatched id captured in DaemonEvalObserver). Fixed by adding an optional `evalRunId` parameter to EvalService.RunAsync; CLI passes null (mints a fresh id, current behaviour) and the daemon passes cmd.EvalRunId so the whole run, including the persisted SessionEvalCompleted aggregate, shares one correlation id end-to-end. 2. Out-of-order progress events (Recommended) — DaemonEvalObserver's per-event Task.Run can interleave concurrent SignalR sends. Added a SemaphoreSlim(1,1) gate inside Relay so the background sends drain in their enqueue order — the dashboard sees EvalStarted before any question completion, and EvalFinished/EvalFailed last, deterministically. 3. Daemon evals not cancellable on shutdown (Recommended) — EvalRunner spawned Task.Run with no link to the host lifecycle. Now injects IHostApplicationLifetime, captures ApplicationStopping, and passes it as ct to EvalService.RunAsync. M1's outer try/catch turns in-flight cancellation into a clean OnFailed("cancelled") relay so the dashboard learns the eval stopped instead of waiting forever. Full suite 205/205, AOT publish clean. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Refactors
kapacitor.Commands.EvalCommandinto a reusablekapacitor.Evallibrary so the daemon (milestone 2) can reuse the same orchestration without duplicating it. No behaviour change — `kapacitor eval ` produces identical output and the server contracts are untouched.First milestone of DEV-1440.
New namespace layout
CLI adapter
`kapacitor.Commands.EvalCommand` shrinks to:
Visibility
Types remain `internal` — the daemon lives in the same assembly, so `public` isn't needed yet. Revisit if/when the server repo consumes this library across assembly boundaries.
Test plan
What's next
M2 will introduce a `RunEvalCommand` SignalR command on the daemon side, implementing `IEvalObserver` to push progress events back to the server. The server dispatch endpoint (M3) and UI tab (M5) depend on it.
🤖 Generated with Claude Code