
[DEV-1440] M1: Extract shared eval library #14

Merged
alexeyzimarev merged 2 commits into main from
alexeyzimarev/dev-1440-m1-shared-eval-library
Apr 13, 2026

@alexeyzimarev
Member

Summary

Refactors kapacitor.Commands.EvalCommand into a reusable kapacitor.Eval library so the daemon (milestone 2) can reuse the same orchestration without duplicating it. No behaviour change: `kapacitor eval <id>` produces identical output and the server contracts are untouched.

First milestone of DEV-1440.

New namespace layout

  • `kapacitor.Eval.EvalQuestions` — canonical 13-question / 4-category taxonomy and category-order helper. Single source of truth.
  • `kapacitor.Eval.IEvalObserver` — observer surface for progress. The CLI supplies a stderr-logging implementation; M2 will add a SignalR-pushing implementation for the daemon. Callbacks are shaped so `OnStarted` / `OnQuestionCompleted` / `OnFinished` / `OnFailed` map 1:1 to the SignalR events documented in DEV-1440.
  • `kapacitor.Eval.EvalService` — `RunAsync` drives the full pipeline (fetch context, fetch retained facts, run 13 judges sequentially, aggregate, persist, retain new facts) and reports every phase through `IEvalObserver`. Returns the aggregate on success, null on failure.
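
A rough sketch of the observer split described in the list above. This is an illustration under assumptions, not the code in this PR: `ISketchEvalObserver`, `SketchConsoleObserver` and `SketchEvalService` are hypothetical names, the callback parameters are guesses, the real `IEvalObserver` has more callbacks, and aborting on the first failed judge is a simplification.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical, reduced observer surface; the parameter lists are assumptions.
public interface ISketchEvalObserver
{
    void OnStarted(string evalRunId, int totalQuestions);
    void OnQuestionCompleted(int index, string questionId, int score);
    void OnFinished(double overallScore);
    void OnFailed(string reason);
}

// CLI-style observer: one timestamped line per callback, written to the given writer.
public sealed class SketchConsoleObserver : ISketchEvalObserver
{
    private readonly TextWriter _out;

    public SketchConsoleObserver(TextWriter output) => _out = output;

    private void Log(string message) => _out.WriteLine($"[{DateTime.Now:HH:mm:ss}] {message}");

    public void OnStarted(string evalRunId, int totalQuestions) =>
        Log($"Evaluating run {evalRunId} ({totalQuestions} questions)");

    public void OnQuestionCompleted(int index, string questionId, int score) =>
        Log($"[{index}] {questionId}: score {score}");

    public void OnFinished(double overallScore) => Log($"Finished, overall score {overallScore:F2}");

    public void OnFailed(string reason) => Log($"Failed: {reason}");
}

public static class SketchEvalService
{
    // Runs each judge sequentially and reports progress through the observer.
    // Returns the mean score on success, null on failure (simplified: the
    // first judge that throws ends the run).
    public static async Task<double?> RunAsync(
        IReadOnlyList<(string Id, Func<Task<int>> Judge)> questions,
        ISketchEvalObserver observer)
    {
        observer.OnStarted(Guid.NewGuid().ToString("N"), questions.Count);

        var total = 0;
        for (var i = 0; i < questions.Count; i++)
        {
            try
            {
                var score = await questions[i].Judge();
                total += score;
                observer.OnQuestionCompleted(i + 1, questions[i].Id, score);
            }
            catch (Exception ex)
            {
                observer.OnFailed($"{questions[i].Id}: {ex.Message}");
                return null;
            }
        }

        var mean = questions.Count == 0 ? 0.0 : (double)total / questions.Count;
        observer.OnFinished(mean);
        return mean;
    }
}
```

A CLI adapter in this shape would pass `new SketchConsoleObserver(Console.Error)`; a daemon could supply a different implementation of the same interface without touching the service.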

CLI adapter

`kapacitor.Commands.EvalCommand` shrinks to:

  • Create authenticated HTTP client
  • `ConsoleEvalObserver` — maps each callback to a timestamped stderr log line (matches pre-refactor output exactly)
  • Render the returned aggregate as the terminal report

Visibility

Types remain `internal` — the daemon lives in the same assembly, so `public` isn't needed yet. Revisit if/when the server repo consumes this library across assembly boundaries.

Test plan

  • `dotnet build src/kapacitor/kapacitor.csproj` — clean
  • `dotnet publish -c Release` — zero IL3050/IL2026 warnings (AOT-clean)
  • Full unit suite — 205/205 pass
    • 21 existing eval tests (ParseVerdict, ExtractRetainFact, Aggregate, FormatKnownPatterns, BuildQuestionPrompt) migrated to target the new namespace — no assertion changes needed
    • `EvalCommandTests` renamed to `EvalServiceTests` to match the actual SUT
  • CI
  • Manual smoke of `kapacitor eval <id>` against a local server (behaviour-preserving refactor, but the argv → observer → stderr path is worth exercising once)

What's next

M2 will introduce a `RunEvalCommand` SignalR command on the daemon side, implementing `IEvalObserver` to push progress events back to the server. The server dispatch endpoint (M3) and UI tab (M5) depend on it.

🤖 Generated with Claude Code

Refactors kapacitor.Commands.EvalCommand into a reusable kapacitor.Eval
library so the daemon (milestone 2) can reuse the same orchestration
without duplicating it. No behaviour change — `kapacitor eval <id>`
produces identical output and the server contracts are untouched.

New namespace layout:
- kapacitor.Eval.EvalQuestions: canonical 13-question / 4-category
  taxonomy and category-order helper. The single source of truth; both
  prompt building and aggregation reference it.
- kapacitor.Eval.IEvalObserver: observer surface for progress. The CLI
  supplies a stderr-logging implementation; milestone 2 will add a
  SignalR-pushing implementation for the daemon. Callbacks are shaped
  specifically so OnStarted / OnQuestionCompleted / OnFinished /
  OnFailed map 1:1 to the SignalR events documented in DEV-1440.
- kapacitor.Eval.EvalService: RunAsync drives the full pipeline (fetch
  context, fetch retained facts, run 13 judges sequentially, aggregate,
  persist, retain new facts) and reports every phase through
  IEvalObserver. Returns the aggregate on success, null on failure;
  OnFinished / OnFailed are fired either way so observers don't need to
  also inspect the return value.

kapacitor.Commands.EvalCommand shrinks to a thin adapter:
- Creates the authenticated HTTP client
- Provides a ConsoleEvalObserver that maps each callback to a timestamped
  stderr log line (matching the pre-refactor output exactly)
- Renders the returned aggregate as the terminal report

Types remain internal — the daemon lives in the same assembly, so public
isn't needed yet; revisit when/if the server repo consumes the library
across assembly boundaries.

Tests renamed EvalCommandTests -> EvalServiceTests and retargeted to the
new namespace. All 21 existing eval tests continue to pass without
changes to their assertions. Full suite 205/205, AOT publish clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@linear

linear bot commented Apr 13, 2026

@qodo-code-review

Review Summary by Qodo

Extract shared eval library for daemon reuse (DEV-1440 M1)

✨ Enhancement


Walkthroughs

Description
• Extract eval orchestration into reusable kapacitor.Eval library
• Move 13-question taxonomy to EvalQuestions for single source of truth
• Introduce IEvalObserver interface for progress reporting across environments
• Refactor EvalCommand to thin CLI adapter over EvalService
• Rename test class to EvalServiceTests with retargeted assertions
Diagram
flowchart LR
  EvalCommand["EvalCommand<br/>(thin CLI adapter)"]
  EvalService["EvalService<br/>(core orchestration)"]
  EvalQuestions["EvalQuestions<br/>(taxonomy)"]
  IEvalObserver["IEvalObserver<br/>(progress surface)"]
  ConsoleObserver["ConsoleEvalObserver<br/>(stderr logging)"]
  
  EvalCommand -- "calls RunAsync" --> EvalService
  EvalCommand -- "implements" --> IEvalObserver
  EvalCommand -- "creates" --> ConsoleObserver
  EvalService -- "references" --> EvalQuestions
  EvalService -- "reports via" --> IEvalObserver
  ConsoleObserver -- "implements" --> IEvalObserver

File Changes

1. src/kapacitor/Commands/EvalCommand.cs Refactoring +47/-395

Thin CLI adapter over EvalService library

• Refactored from 410 lines to 29 lines, moving orchestration logic to EvalService
• Removed question taxonomy, verdict parsing, aggregation, and HTTP logic
• Added ConsoleEvalObserver class implementing IEvalObserver for stderr logging
• Simplified HandleEval to create HTTP client, call EvalService.RunAsync, and render results
• Updated Render method to use EvalService.VerdictForScore instead of local method

src/kapacitor/Commands/EvalCommand.cs


2. src/kapacitor/Eval/EvalQuestions.cs ✨ Enhancement +51/-0

Canonical question taxonomy and category ordering

• New file establishing canonical 13-question taxonomy across 4 categories
• Defines Question record with Category, Id, and Text fields
• Exports All array as single source of truth for question definitions
• Provides Categories array and CategoryOrder method for consistent ordering
• Replaces inline question definitions previously in EvalCommand

src/kapacitor/Eval/EvalQuestions.cs


3. src/kapacitor/Eval/EvalService.cs ✨ Enhancement +410/-0

Core eval orchestration with observer-based progress reporting

• New file containing core eval orchestration logic extracted from EvalCommand
• Implements RunAsync method driving full pipeline: fetch context, run judges, aggregate, persist
• Includes prompt construction, verdict parsing, fact extraction, and aggregation logic
• Reports all phases through IEvalObserver callbacks for progress tracking
• Provides public static methods for verdict parsing, prompt building, and aggregation
• Handles HTTP communication with server for context, judge facts, and result persistence

src/kapacitor/Eval/EvalService.cs


4. src/kapacitor/Eval/IEvalObserver.cs ✨ Enhancement +45/-0

Observer interface for progress reporting across environments

• New interface defining progress surface for eval runs
• Includes 9 callback methods: OnInfo, OnStarted, OnContextFetched, OnQuestionStarted,
 OnQuestionCompleted, OnQuestionFailed, OnFactRetained, OnFinished, OnFailed
• Callbacks shaped to map 1:1 to SignalR events for daemon milestone 2
• Allows different implementations (CLI stderr logging vs daemon SignalR pushing)
• Includes comprehensive XML documentation for each callback

src/kapacitor/Eval/IEvalObserver.cs


5. test/kapacitor.Tests.Unit/EvalServiceTests.cs 🧪 Tests +25/-25

Retarget eval tests to EvalService namespace

• Renamed from EvalCommandTests to EvalServiceTests to match new SUT
• Updated all 21 test method calls from EvalCommand.* to EvalService.*
• Changed question definition from EvalCommand.EvalQuestion to EvalQuestions.Question
• All assertions remain unchanged; tests continue to pass without modification
• Covers verdict parsing, aggregation, prompt building, pattern formatting, and fact extraction

test/kapacitor.Tests.Unit/EvalServiceTests.cs



@qodo-code-review

qodo-code-review bot commented Apr 13, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Observer exceptions abort eval🐞
Description
IEvalObserver promises observer exceptions are caught and don’t abort the eval, but EvalService
invokes observer callbacks directly without any try/catch, so an observer throw will terminate
RunAsync and may skip OnFailed/OnFinished.
Code

src/kapacitor/Eval/EvalService.cs[R89-97]

+        observer.OnContextFetched(
+            context.Trace.Count,
+            traceJson.Length,
+            context.Compaction.ToolResultsTotal,
+            context.Compaction.ToolResultsTruncated,
+            context.Compaction.BytesSaved
+        );
+        observer.OnStarted(evalRunId, context.SessionId, model, EvalQuestions.All.Length);
+
Evidence
The IEvalObserver contract explicitly states the service catches observer exceptions, but
EvalService calls observer methods directly (e.g., OnContextFetched/OnStarted) with no guarding
wrapper; any exception will propagate out of RunAsync.

src/kapacitor/Eval/IEvalObserver.cs[11-16]
src/kapacitor/Eval/EvalService.cs[89-97]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`IEvalObserver` documents that observer exceptions are caught/logged and do not abort the eval, but `EvalService.RunAsync` calls observer methods directly. If an observer throws (e.g., SignalR push fails), the eval orchestration will crash and may not emit `OnFailed`/`OnFinished`.

### Issue Context
This library is intended for reuse by the daemon (M2). In that environment, observer callbacks are more likely to do I/O and fail transiently.

### Fix Focus Areas
- src/kapacitor/Eval/EvalService.cs[55-179]
- src/kapacitor/Eval/IEvalObserver.cs[11-16]

### Suggested fix
- Add a small helper in `EvalService` like `SafeNotify(Action notify, string context)` that wraps each `observer.*` call in try/catch.
- On catch, log to a safe sink (e.g., `Console.Error.WriteLine`) or a dedicated internal logger; avoid calling back into the observer in the catch path.
- Use the helper for *all* observer calls (`OnInfo`, `OnStarted`, `OnContextFetched`, `OnQuestion*`, `OnFinished`, `OnFailed`).
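
A minimal sketch of the suggested helper, for illustration only. The `SafeNotify(Action notify, string context)` shape comes from the suggestion above; the containing class `ObserverGuard`, the `bool` return value and the log wording are assumptions.

```csharp
using System;

public static class ObserverGuard
{
    // Invokes one observer callback. A throwing observer is logged to stderr
    // and swallowed so the caller can carry on; returns false if it threw.
    public static bool SafeNotify(Action notify, string context)
    {
        try
        {
            notify();
            return true;
        }
        catch (Exception ex)
        {
            try
            {
                // Log to a sink that is not the observer itself.
                Console.Error.WriteLine($"observer callback '{context}' threw: {ex.Message}");
            }
            catch
            {
                // Nothing sensible left to do if stderr is unavailable.
            }

            return false;
        }
    }
}
```

A call site would then look something like `ObserverGuard.SafeNotify(() => observer.OnFailed(reason), "OnFailed")`, so a failing push in one callback cannot end the run.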




Remediation recommended

2. Progress events reversed🐞
Description
EvalService emits OnContextFetched before OnStarted, but the CLI observer maps these to user-facing
lines (“Fetched …” and “Evaluating session …”), causing progress output ordering to contradict the
CLI’s stated “pre-refactor shape”.
Code

src/kapacitor/Eval/EvalService.cs[R89-97]

+        observer.OnContextFetched(
+            context.Trace.Count,
+            traceJson.Length,
+            context.Compaction.ToolResultsTotal,
+            context.Compaction.ToolResultsTruncated,
+            context.Compaction.BytesSaved
+        );
+        observer.OnStarted(evalRunId, context.SessionId, model, EvalQuestions.All.Length);
+
Evidence
EvalService calls OnContextFetched and only then OnStarted. ConsoleEvalObserver logs OnStarted
as “Evaluating session …” and OnContextFetched as “Fetched …”, while its comment claims it matches
the pre-refactor output shape.

src/kapacitor/Eval/EvalService.cs[89-97]
src/kapacitor/Commands/EvalCommand.cs[57-70]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`EvalService.RunAsync` currently calls `observer.OnContextFetched(...)` before `observer.OnStarted(...)`. The CLI observer logs these in a user-facing way, so normal runs will print “Fetched …” before “Evaluating session …”, contradicting the intent of preserving the CLI output shape.

### Issue Context
`ConsoleEvalObserver` is explicitly documented as matching the old stderr format.

### Fix Focus Areas
- src/kapacitor/Eval/EvalService.cs[89-97]
- src/kapacitor/Commands/EvalCommand.cs[62-70]

### Suggested fix
- Emit `OnStarted(...)` before `OnContextFetched(...)` (or adjust observer contract / CLI observer to preserve the intended log order).
- If `OnStarted` must remain “after context fetched” semantically, still reorder those two calls because both are already after the fetch.



3. 401 prints extra line🐞
Description
On a 401, HandleUnauthorizedAsync already writes the server message to stderr, but EvalService
additionally calls observer.OnFailed("unauthenticated"); with ConsoleEvalObserver this prints an
extra unprefixed line.
Code

src/kapacitor/Eval/EvalService.cs[R57-61]

+            if (await HttpClientExtensions.HandleUnauthorizedAsync(resp)) {
+                observer.OnFailed("unauthenticated");
+
+                return null;
+            }
Evidence
HandleUnauthorizedAsync prints the error message to Console.Error. EvalService then calls
observer.OnFailed("unauthenticated"), and the CLI observer writes the reason directly to stderr,
resulting in duplicated/changed error output for the same condition.

src/kapacitor/HttpClientExtensions.cs[111-130]
src/kapacitor/Eval/EvalService.cs[55-61]
src/kapacitor/Commands/EvalCommand.cs[86-88]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When the eval-context call returns 401, the code path prints to stderr via `HandleUnauthorizedAsync`, then also emits `observer.OnFailed("unauthenticated")`. In the CLI this produces an extra line and changes the error shape.

### Issue Context
The extracted library should ideally not write directly to the console; instead, errors should flow through `IEvalObserver` so CLI/daemon can render appropriately.

### Fix Focus Areas
- src/kapacitor/Eval/EvalService.cs[55-61]
- src/kapacitor/HttpClientExtensions.cs[111-130]
- src/kapacitor/Commands/EvalCommand.cs[86-88]

### Suggested fix
- Prefer a single reporting mechanism:
 - Option A (minimal): if `HandleUnauthorizedAsync(resp)` returns true, return null without calling `observer.OnFailed(...)`.
 - Option B (better for library): refactor `HandleUnauthorizedAsync` to *return* the message (or provide a non-printing overload) and let `EvalService` call `observer.OnFailed(message)` without any direct console writes.
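
One possible shape for Option B, sketched under assumptions: `UnauthorizedCheck.FailureReason` is a hypothetical helper (not an existing method in `HttpClientExtensions`) and the message text is illustrative. It returns the reason instead of printing it, so the caller decides where it is reported.

```csharp
#nullable enable
using System.Net;
using System.Net.Http;

public static class UnauthorizedCheck
{
    // Hypothetical non-printing check: returns a failure reason for a 401
    // response and null for anything else. It never writes to the console.
    public static string? FailureReason(HttpResponseMessage response) =>
        response.StatusCode == HttpStatusCode.Unauthorized
            ? "authentication failed; run 'kapacitor login' to re-authenticate"
            : null;
}
```

The service could then forward a non-null reason to `observer.OnFailed(...)` exactly once, which keeps the observer as the only reporting channel.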



4. Cancellation partly ignored🐞
Description
RunAsync accepts a CancellationToken, but judge-fact fetch/post requests don’t pass it to
GetWithRetryAsync/PostWithRetryAsync and cancellation via ThrowIfCancellationRequested can bypass
the method’s documented “OnFinished/OnFailed either way” behavior.
Code

src/kapacitor/Eval/EvalService.cs[R346-358]

+    static async Task<Dictionary<string, List<JudgeFact>>> FetchAllJudgeFactsAsync(
+            HttpClient    httpClient,
+            string        baseUrl,
+            string        encodedSessionId,
+            IEvalObserver observer
+        ) {
+        var result = new Dictionary<string, List<JudgeFact>>();
+
+        foreach (var category in EvalQuestions.Categories) {
+            try {
+                using var resp = await httpClient.GetWithRetryAsync(
+                    $"{baseUrl}/api/sessions/{encodedSessionId}/judge-facts?category={Uri.EscapeDataString(category)}"
+                );
Evidence
EvalService documents that observers receive a final OnFinished or OnFailed, but it throws on
cancellation inside the loop and doesn’t catch OperationCanceledException. Additionally, the
judge-facts HTTP helper methods omit passing ct even though the underlying extension methods
support it, so cancellation won’t be honored during those requests.

src/kapacitor/Eval/EvalService.cs[22-27]
src/kapacitor/Eval/EvalService.cs[106-110]
src/kapacitor/Eval/EvalService.cs[346-358]
src/kapacitor/Eval/EvalService.cs[377-407]
src/kapacitor/HttpClientExtensions.cs[80-91]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`EvalService.RunAsync` takes a `CancellationToken` but does not propagate it to judge-fact fetch/post HTTP calls, and it can throw `OperationCanceledException` without emitting `OnFailed`, contradicting the method’s contract comment that observers receive a final `OnFinished`/`OnFailed`.

### Issue Context
This matters more for the daemon use case where users may cancel long-running evals.

### Fix Focus Areas
- src/kapacitor/Eval/EvalService.cs[29-180]
- src/kapacitor/Eval/EvalService.cs[346-408]

### Suggested fix
- Thread `CancellationToken ct` into `FetchAllJudgeFactsAsync(...)` and `PostJudgeFactAsync(...)` and pass `ct: ct` to `GetWithRetryAsync`/`PostWithRetryAsync` and `ReadAsStringAsync(ct)`.
- Wrap the body of `RunAsync` (or at least the question loop) in a `try { ... } catch (OperationCanceledException) { observer.OnFailed("cancelled"); return null; }` (using the same safe-notify wrapper from the observer-exception fix).
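
The second bullet can be sketched in isolation as below. This is a generic illustration, not the `RunAsync` body from this PR: `CancellableRun`, the `work` delegate and the `onFailed` callback are hypothetical stand-ins for the pipeline and for `observer.OnFailed`.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public static class CancellableRun
{
    // Passes the token down to the work and converts cancellation into a
    // terminal callback plus a null result, instead of letting the
    // OperationCanceledException escape to the caller.
    public static async Task<T?> RunAsync<T>(
        Func<CancellationToken, Task<T>> work,
        Action<string> onFailed,
        CancellationToken ct) where T : class
    {
        try
        {
            ct.ThrowIfCancellationRequested();
            return await work(ct);
        }
        catch (OperationCanceledException)
        {
            // TaskCanceledException derives from OperationCanceledException,
            // so cancelled HTTP calls and delays land here as well.
            onFailed("cancelled");
            return null;
        }
    }
}
```

The point of the pattern is that the token only helps if every awaited call inside `work` actually receives it; calls that omit `ct` keep running until they finish on their own.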



alexeyzimarev added a commit that referenced this pull request Apr 13, 2026
Daemon side of the dashboard-driven eval pipeline. Pairs with the server
M3 endpoint in kurrent-io/Kurrent.Capacitor#477 and depends on the M1
shared eval library in #14.

- New SignalR wire types in Models.cs match the server's DaemonCommands.cs:
  RunEvalCommand (server -> daemon dispatch) plus the four daemon -> server
  progress events (EvalStarted, EvalQuestionCompleted, EvalFinished,
  EvalFailed). Registered in KapacitorJsonContext for source-gen
  serialization.

- ServerConnection registers a "RunEval" handler and exposes per-event
  send methods (EvalStartedAsync etc.) that mirror the existing
  AgentRegisteredAsync / LaunchFailedAsync pattern.

- New EvalRunner singleton subscribes to OnRunEval. Each incoming
  command spawns a fire-and-forget Task that builds an authenticated
  HttpClient, instantiates a DaemonEvalObserver bound to the run, and
  drives EvalService.RunAsync. Unhandled exceptions are caught and
  translated to an EvalFailed relay so the dashboard learns about
  daemon-side failures rather than waiting forever.

- DaemonEvalObserver maps the IEvalObserver surface to SignalR sends:
  OnStarted -> EvalStartedAsync, OnQuestionCompleted ->
  EvalQuestionCompletedAsync, OnFinished -> EvalFinishedAsync, OnFailed
  -> EvalFailedAsync. Info / per-question-start / per-question-failure /
  fact-retained callbacks just log locally — they're not interesting
  enough to justify SignalR chatter for every judge.

- Wired into DaemonRunner DI: AddSingleton<EvalRunner> + an explicit
  GetRequiredService at startup so the constructor's OnRunEval
  subscription happens before the host starts taking traffic.

Full suite 205/205, AOT publish clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four findings on PR #14:

1. Observer exceptions abort eval (Action required) — IEvalObserver
   documented that observer throws are caught and don't abort the eval,
   but EvalService called callbacks directly. A SignalR push failure on
   the daemon would have crashed the run mid-flight, possibly skipping
   OnFailed. Fixed via a SafeObserver wrapper inside RunAsync that
   delegates to the caller's observer with a try/catch around each call;
   exceptions log to stderr (with a nested try/catch in case stderr
   itself fails) and the eval continues.

2. Progress events reversed (Recommended) — OnContextFetched was emitted
   before OnStarted. The CLI observer maps these to "Fetched..." then
   "Evaluating session..." log lines, so the user-facing output order
   was the reverse of the pre-refactor shape. Swapped — now OnStarted
   fires first, then OnContextFetched.

3. 401 prints extra line (Recommended) — HandleUnauthorizedAsync writes
   to stderr directly, then EvalService called observer.OnFailed with
   "unauthenticated", which the CLI observer also wrote to stderr —
   resulting in two lines for the same condition. Replaced the
   HandleUnauthorizedAsync call with a direct StatusCode == 401 check
   and a single observer.OnFailed("authentication failed — run
   'kapacitor login' to re-authenticate"). The observer is now the
   single reporting channel; daemon callers also benefit (they get
   EvalFailed instead of nothing for 401s).

4. Cancellation partly ignored (Action required) — RunAsync took a
   CancellationToken but didn't forward it to FetchAllJudgeFactsAsync /
   PostJudgeFactAsync, and ThrowIfCancellationRequested could escape
   without firing OnFailed. Now: ct threads through both helpers (and
   their HTTP calls + ReadAsStringAsync), and the body of RunAsync is
   wrapped in a try/catch (OperationCanceledException) that fires
   observer.OnFailed("cancelled") before returning null — observers
   always see exactly one terminal callback.

Doc updated to reflect that the SafeObserver guarantee + cancellation
contract are now actually enforced.

Full suite 205/205, AOT publish clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexeyzimarev alexeyzimarev merged commit dec2102 into main Apr 13, 2026
3 checks passed
@alexeyzimarev alexeyzimarev deleted the alexeyzimarev/dev-1440-m1-shared-eval-library branch April 13, 2026 15:05
alexeyzimarev added a commit that referenced this pull request Apr 13, 2026
* [DEV-1440] milestone 2: daemon RunEvalCommand handler

* [DEV-1440] address review feedback on daemon eval runner

Three findings on PR #15 (the other two — observer-throw guard and
judge-fact cancellation propagation — were already addressed by the
M1 follow-up in 1f655f4):

1. EvalRunId mismatch (Action required) — server dispatches
   RunEvalCommand with an EvalRunId, but EvalService generated its own
   GUID, leading to two different ids in one run's event stream
   (EvalStarted used the service-generated id; subsequent question /
   finished / failed events used the dispatched id captured in
   DaemonEvalObserver). Fixed by adding an optional `evalRunId`
   parameter to EvalService.RunAsync; CLI passes null (mints a fresh
   id, current behaviour) and the daemon passes cmd.EvalRunId so the
   whole run, including the persisted SessionEvalCompleted aggregate,
   shares one correlation id end-to-end.

2. Out-of-order progress events (Recommended) — DaemonEvalObserver's
   per-event Task.Run can interleave concurrent SignalR sends. Added a
   SemaphoreSlim(1,1) gate inside Relay so the background sends drain
   in their enqueue order — the dashboard sees EvalStarted before any
   question completion, and EvalFinished/EvalFailed last, deterministically.

3. Daemon evals not cancellable on shutdown (Recommended) — EvalRunner
   spawned Task.Run with no link to the host lifecycle. Now injects
   IHostApplicationLifetime, captures ApplicationStopping, and passes
   it as ct to EvalService.RunAsync. M1's outer try/catch turns
   in-flight cancellation into a clean OnFailed("cancelled") relay so
   the dashboard learns the eval stopped instead of waiting forever.
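
For finding 2 above, here is one way to keep background sends in enqueue order. Note that this sketch swaps in a different technique from the `SemaphoreSlim(1,1)` gate the commit describes: each send is chained onto the previous one, so ordering follows from the chain itself. `OrderedRelay` and `Enqueue` are hypothetical names, not types from the daemon.

```csharp
using System;
using System.Threading.Tasks;

public sealed class OrderedRelay
{
    private readonly object _gate = new object();
    private Task _tail = Task.CompletedTask;

    // Queues `send` to run only after every previously queued send has
    // finished. The returned task completes when this send is done.
    public Task Enqueue(Func<Task> send)
    {
        lock (_gate)
        {
            _tail = RunAfter(_tail, send);
            return _tail;
        }
    }

    private static async Task RunAfter(Task previous, Func<Task> send)
    {
        // Yield first so `send` never runs synchronously inside the caller's lock.
        await Task.Yield();
        await previous.ConfigureAwait(false);

        try
        {
            await send().ConfigureAwait(false);
        }
        catch (Exception ex)
        {
            // A failed send is logged and dropped so later events still go out.
            Console.Error.WriteLine($"relay send failed: {ex.Message}");
        }
    }
}
```

Because the order is fixed at the moment `Enqueue` is called, a started-style event queued first is always sent before any later question or terminal event, even if an earlier send is slow.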

Full suite 205/205, AOT publish clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>