# feat: add eval harness and initial eval suites #309

Hunter Lovell (hntrl) wants to merge 6 commits into `main`.
Conversation
> ### Custom matchers
>
> The harness provides vitest matchers that also log LangSmith feedback:
Are we sure we want efficiency metrics to induce failures? In the Python version, the `agent_steps`, `tool_call_requests`, and `tool_call` checks are `EfficiencyAssertion`s that never fail the test: they're logged to LangSmith for tracking but don't block CI. I believe this is intentional, because a model that takes 3 steps instead of 2 still answered correctly; we'd call that an efficiency regression, not an outright failure. Making these blocking could be a problem, since swapping models will likely cause mass "failures" even when correctness is fine. I can see the case for the other side too, just wanted to flag it.
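The non-blocking pattern described above could be sketched as a check that computes a score and reports it as feedback without ever throwing. Everything here (the `Feedback` shape, `efficiencyFeedback`) is a hypothetical illustration, not code from this PR:

```typescript
// Sketch: an efficiency check that reports a score but never fails the test.
// The Feedback shape loosely mirrors LangSmith feedback (key + numeric score);
// all names below are made up for illustration.
interface Feedback {
  key: string;
  score: number; // 1 = within budget, 0 = over budget
  comment?: string;
}

function efficiencyFeedback(actualSteps: number, maxSteps: number): Feedback {
  const withinBudget = actualSteps <= maxSteps;
  return {
    key: "agent_steps",
    score: withinBudget ? 1 : 0,
    comment: withinBudget
      ? undefined
      : `took ${actualSteps} steps, budget was ${maxSteps} (efficiency regression, not a failure)`,
  };
}

// A test would log this to LangSmith but never assert on it,
// so CI stays green when only efficiency regresses.
const fb = efficiencyFeedback(3, 2);
console.log(fb.score); // 0 — over budget, logged but non-blocking
```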
## Summary
Adds a beefier eval framework for testing agent behavior.
## Changes
### `@deepagents/evals` harness (`internal/eval-harness`)

The harness decouples what you're testing from which model you're testing against. The central abstraction is `EvalRunner`:

- `run({ query, initialFiles? })` — pure invocation params. The agent runs against an in-memory `StateBackend` (no containers needed — file tools operate on a virtual FS in LangGraph state).
- `extend({ systemPrompt?, tools?, subagents?, memory?, skills?, backend? })` — returns a derived runner with agent configuration baked in. Tests customize the agent without rebuilding runners.
- The `EVAL_RUNNER` env var selects the model. 6 runners registered out of the box: `sonnet-4-5`, `sonnet-4-5-thinking`, `opus-4-6`, `gpt-4.1`, `gpt-4.1-mini`, `o3-mini`.

Trajectory parsing walks LangGraph message arrays and groups `AIMessage` → `ToolMessage` pairs into structured steps. 4 custom vitest matchers (`toHaveAgentSteps`, `toHaveToolCallRequests`, `toHaveToolCallInStep`, `toHaveFinalTextContaining`) double as LangSmith feedback loggers — every assertion also pushes a numeric score to the experiment dashboard.

### Eval suites
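The grouping of `AIMessage` → `ToolMessage` pairs into steps could look roughly like this. This is a self-contained sketch over plain objects: the real harness walks actual LangChain message classes, and the `Msg`/`Step` shapes here are assumptions, not the PR's API:

```typescript
// Sketch: group an AIMessage and the ToolMessages answering its tool calls
// into one structured step. Message shapes are simplified stand-ins for
// LangChain's message classes; all types below are illustrative.
type Msg =
  | { type: "ai"; toolCalls: { id: string; name: string }[]; text?: string }
  | { type: "tool"; toolCallId: string; content: string };

interface Step {
  toolCalls: { id: string; name: string }[];
  toolResults: { toolCallId: string; content: string }[];
}

function parseTrajectory(messages: Msg[]): Step[] {
  const steps: Step[] = [];
  for (const msg of messages) {
    if (msg.type === "ai") {
      // Each AIMessage that requests tools opens a new step.
      if (msg.toolCalls.length > 0) {
        steps.push({ toolCalls: msg.toolCalls, toolResults: [] });
      }
    } else {
      // Attach each ToolMessage to the step that requested it.
      const step = steps.find((s) =>
        s.toolCalls.some((c) => c.id === msg.toolCallId),
      );
      step?.toolResults.push({ toolCallId: msg.toolCallId, content: msg.content });
    }
  }
  return steps;
}

const steps = parseTrajectory([
  { type: "ai", toolCalls: [{ id: "1", name: "read_file" }] },
  { type: "tool", toolCallId: "1", content: "hello" },
  { type: "ai", toolCalls: [], text: "done" },
]);
console.log(steps.length); // 1 — the final text-only AIMessage opens no step
```

Matchers like `toHaveToolCallInStep` would then assert against these parsed steps rather than raw message arrays.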
- `basic`
- `files`
- `subagents` — `task` tool routing to named and general-purpose subagents
- `hitl` — doesn't go through `EvalRunner`; uses `createDeepAgent` + `MemorySaver` + `Command` directly
- `memory`
- `skills`
- `tool-usage-relational` — port of `test_tool_usage_relational.py`. Relational data with fake users/locations/foods connected by IDs. Tests sequential tool chaining (1→2→3→4 hops) and parallel fan-out (e.g. resolve 3 food IDs in parallel after a sequential lookup chain)

### Skills
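The relational fixture could be modeled along these lines. This is a self-contained sketch: the table contents, IDs, and tool names below are invented for illustration and are not the suite's actual data:

```typescript
// Sketch: relational lookup "tools" over in-memory tables, so an agent must
// chain calls sequentially (user -> location -> food) and fan out in
// parallel over several IDs. All data below is made up.
const users = { u1: { name: "Ada", locationId: "l1" } } as const;
const locations = {
  l1: { city: "Austin", favoriteFoodIds: ["f1", "f2", "f3"] },
} as const;
const foods = { f1: "tacos", f2: "bbq", f3: "queso" } as const;

const getUser = (id: keyof typeof users) => users[id];
const getLocation = (id: keyof typeof locations) => locations[id];
const getFood = (id: keyof typeof foods) => foods[id];

// Sequential chain (2 hops): user -> location.
const loc = getLocation(getUser("u1").locationId);
// Parallel fan-out: resolve all food IDs at once after the chain.
const favorites = loc.favoriteFoodIds.map((id) => getFood(id));
console.log(favorites.join(", ")); // tacos, bbq, queso
```

An eval then checks (via the trajectory matchers) that the model issues the lookups in the right order and batches the independent food lookups into one parallel step instead of three sequential ones.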
- `eval-creator` (`.agents/skills/eval-creator/`) — comprehensive guide for scaffolding new eval suites. Covers inline, fixture, external dataset, and LangSmith dataset patterns; sandbox-backed evals; LLM-as-judge scoring; published benchmark implementation.
- `langsmith-trace` (`.agents/skills/langsmith-trace/`) — replaces the trace skill from the deleted langsmith-skills subtree, installed via `skills-lock.json`.