
feat: add eval harness and initial eval suites #309

Open
Hunter Lovell (hntrl) wants to merge 6 commits into main from hunter/add-eval-harness

Conversation

@hntrl
Member

Summary

Adds a beefier eval framework for testing agent behavior.

Changes

@deepagents/evals harness (internal/eval-harness)

The harness decouples what you're testing from which model you're testing against. The central abstraction is EvalRunner:

  • run({ query, initialFiles? }) — pure invocation params. The agent runs against an in-memory StateBackend (no containers needed — file tools operate on virtual FS in LangGraph state).
  • extend({ systemPrompt?, tools?, subagents?, memory?, skills?, backend? }) — returns a derived runner with agent configuration baked in. Tests customize the agent without rebuilding runners.
  • EVAL_RUNNER env var selects the model. 6 runners registered out of the box: sonnet-4-5, sonnet-4-5-thinking, opus-4-6, gpt-4.1, gpt-4.1-mini, o3-mini.
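To illustrate the decoupling, here is a minimal sketch of the run/extend surface described above. The interfaces and field names are assumptions inferred from this description, not the actual `@deepagents/evals` implementation:

```typescript
// Hypothetical sketch of the EvalRunner surface; all names are assumptions.
interface RunParams {
  query: string;
  initialFiles?: Record<string, string>; // seeded into the in-memory StateBackend
}

interface RunnerConfig {
  systemPrompt?: string;
  memory?: string[];
  skills?: string[];
}

class EvalRunner {
  constructor(private config: RunnerConfig = {}) {}

  // Returns a derived runner with overrides merged in; the base runner is untouched.
  extend(overrides: RunnerConfig): EvalRunner {
    return new EvalRunner({ ...this.config, ...overrides });
  }

  // The real run() would invoke the agent; this stub just echoes the effective config.
  run(params: RunParams): { query: string; config: RunnerConfig } {
    return { query: params.query, config: this.config };
  }
}

const base = new EvalRunner({ systemPrompt: "You are a helpful agent." });
const withMemory = base.extend({ memory: ["AGENTS.md"] });
console.log(withMemory.run({ query: "list files" }).config.memory); // [ 'AGENTS.md' ]
```

The point of the derived-runner pattern is that a suite can bake in its own system prompt or memory files once, while the model choice stays external (via EVAL_RUNNER).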

Trajectory parsing walks LangGraph message arrays and groups AIMessage/ToolMessage pairs into structured steps. 4 custom vitest matchers (toHaveAgentSteps, toHaveToolCallRequests, toHaveToolCallInStep, toHaveFinalTextContaining) double as LangSmith feedback loggers: every assertion also pushes a numeric score to the experiment dashboard.
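The grouping idea can be sketched with a simplified message shape (real LangGraph messages carry many more fields; this only illustrates how AI/tool pairs become steps):

```typescript
// Simplified message shapes; assumptions for illustration, not LangGraph's types.
type Message =
  | { type: "ai"; toolCalls: { id: string; name: string }[] }
  | { type: "tool"; toolCallId: string; content: string };

interface Step {
  toolCalls: { id: string; name: string }[];
  toolResults: { toolCallId: string; content: string }[];
}

// Each AIMessage opens a new step; subsequent ToolMessages attach to it.
function groupSteps(messages: Message[]): Step[] {
  const steps: Step[] = [];
  for (const msg of messages) {
    if (msg.type === "ai") {
      steps.push({ toolCalls: msg.toolCalls, toolResults: [] });
    } else if (steps.length > 0) {
      steps[steps.length - 1].toolResults.push({
        toolCallId: msg.toolCallId,
        content: msg.content,
      });
    }
  }
  return steps;
}

const steps = groupSteps([
  { type: "ai", toolCalls: [{ id: "1", name: "read" }] },
  { type: "tool", toolCallId: "1", content: "file contents" },
]);
console.log(steps.length, steps[0].toolCalls[0].name); // 1 read
```

A matcher like toHaveAgentSteps can then assert on `steps.length` while also emitting that count as a LangSmith feedback score.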

Eval suites

| Suite | Tests | What it covers |
| --- | --- | --- |
| basic | 2 | System prompt adherence, avoiding unnecessary tool calls |
| files | 15 | read, write, edit, ls, grep, glob, parallel I/O, deep-nested search, multi-file reasoning |
| subagents | 2 | task tool routing to named and general-purpose subagents |
| hitl | 3 | Human-in-the-loop interrupts: pause on configured tools, present correct review configs, resume after approval. Tests both direct agent and subagent HITL (bypasses EvalRunner; uses createDeepAgent + MemorySaver + Command directly) |
| memory | 6 | AGENTS.md memory injection: recall, guided naming conventions, code style influence, multiple sources, graceful handling of missing files, avoiding redundant reads |
| skills | 6 | SKILL.md discovery and usage: read by description match, read by name, combine two skills in parallel, edit skills (with and without prior read), resolve the correct skill across multiple source paths |
| tool-usage-relational | 18 | Ported from the Python test_tool_usage_relational.py. Relational data with fake users/locations/foods connected by IDs. Tests sequential tool chaining (1→2→3→4 hops) and parallel fan-out (e.g. resolve 3 food IDs in parallel after a sequential lookup chain) |
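The relational fixture pattern behind tool-usage-relational can be sketched as follows. The data and helper names are invented for illustration; only the shape (entities linked by IDs, sequential hops, then a parallel fan-out) mirrors the suite:

```typescript
// Fake relational data: users → locations → foods, connected by IDs (all invented).
const users = new Map([["u1", { name: "Ada", locationId: "l1" }]]);
const locations = new Map([
  ["l1", { city: "Paris", favoriteFoodIds: ["f1", "f2", "f3"] }],
]);
const foods = new Map([
  ["f1", "croissant"],
  ["f2", "baguette"],
  ["f3", "crepe"],
]);

// Async to stand in for a tool call the agent would make.
async function lookupFood(id: string): Promise<string> {
  return foods.get(id)!;
}

async function favoriteFoodsOf(userId: string): Promise<string[]> {
  const user = users.get(userId)!; // hop 1: resolve the user
  const loc = locations.get(user.locationId)!; // hop 2: follow the location ID
  // Fan-out: all three food IDs can be resolved in parallel after the chain.
  return Promise.all(loc.favoriteFoodIds.map((id) => lookupFood(id)));
}

favoriteFoodsOf("u1").then((f) => console.log(f)); // [ 'croissant', 'baguette', 'crepe' ]
```

An eval then checks not just the final answer but the call pattern: the two lookups must be sequential (each needs the previous result), while the food resolutions should appear as parallel tool calls in a single step.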

Skills

  • eval-creator (.agents/skills/eval-creator/) — comprehensive guide for scaffolding new eval suites. Covers inline, fixture, external dataset, and LangSmith dataset patterns; sandbox-backed evals; LLM-as-judge scoring; published benchmark implementation.
  • langsmith-trace (.agents/skills/langsmith-trace/) — replaces the trace skill from the deleted langsmith-skills subtree, installed via skills-lock.json.

@changeset-bot

changeset-bot bot commented Mar 13, 2026

⚠️ No Changeset found

Latest commit: bbdf186

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



### Custom matchers

The harness provides vitest matchers that also log LangSmith feedback:
Contributor


Are we sure we want efficiency metrics to induce failures? In the Python version, the agent_steps, tool_call_requests, and tool_call checks are EfficiencyAssertions that never fail the test: they're logged to LangSmith for tracking but don't block CI. I believe this is intentional, because a model that takes 3 steps instead of 2 still answered correctly. We'd call that an efficiency regression, not an explicit failure. This could be a problem because swapping models will likely cause mass "failures" even when correctness is fine. I can see a case for the other side too, just wanted to flag it.
