Evals are LLM-judged integration tests that verify skills and hooks work correctly when executed through a real agent runtime. Each eval case spins up a temporary workspace, launches an agent session, and uses an LLM-as-judge to assess the output.
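The lifecycle above can be sketched roughly as follows. This is an illustrative sketch, not the aam implementation: the function name, the injected `run_agent` callable, and the point at which the judge would run are all assumptions.

```python
import tempfile
from typing import Callable

def run_eval_case(case: dict, run_agent: Callable[[str, str], str]) -> str:
    """Sketch of one eval case: temp workspace -> agent session -> checks.

    `run_agent(prompt, workspace)` stands in for launching the real engine
    (e.g. claude-code). Deterministic substring checks run first; a real
    runner would then consult the LLM judge.
    """
    with tempfile.TemporaryDirectory() as workspace:
        # Launch an agent session inside the temporary workspace.
        output = run_agent(case["input"]["prompt"], workspace)
    expected = case.get("expected", {})
    # Substrings that MUST appear in the agent output.
    for needle in expected.get("contains", []):
        if needle not in output:
            return "FAIL"
    # Substrings that MUST NOT appear.
    for needle in expected.get("not-contains", []):
        if needle in output:
            return "FAIL"
    # A real runner would now score `output` against judge.criteria
    # with the judge model before settling on a final verdict.
    return "PASS"
```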
| Field | Type | Required | Description |
|---|---|---|---|
| engine | string | Yes | Agent runtime to use. Supported values: "claude-code", "codex". Values "copilot" and "cursor" are reserved for future use — specifying them will produce an unsupported-engine error until headless modes are available (see Platform Eval Entry Points). |
| timeout | number | No | Max seconds per eval case. Default 120. |
| judge | string | No | Model used for LLM-as-judge assessment. Default: same as engine model. |
| sandbox.network | bool | No | Allow network access in sandbox. Default false. |
| sandbox.writable-paths | string[] | No | Directories the agent can write to. Default: temp workspace root. |
| env | map<str,str> | No | Environment variables injected into the sandbox. |
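Putting the fields together, a config might look like the following. This is a sketch: the assumption that dotted field names (`sandbox.network`) nest as YAML maps is mine, and the values are illustrative.

```yaml
engine: claude-code          # or codex
timeout: 180                 # seconds per case (default 120)
judge: claude-opus           # defaults to the engine model
sandbox:
  network: false
  writable-paths:
    - output/
env:
  LOG_LEVEL: debug
```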
Eval Case Format — evals/cases/*.yaml
```yaml
name: pdf-extraction-e2e
description: Verify pdf-tools skill extracts text via agent session
target: skill:pdf-tools
input:
  prompt: "Extract all text from fixtures/sample.pdf"
  files:
    - fixtures/sample.pdf
  workspace-files:
    - src/empty.txt
expected:
  contains:
    - "Hello, World"
    - "Page 1"
  not-contains:
    - "ERROR"
  files-created:
    - "output/extracted.txt"
judge:
  criteria: >
    Output must include all visible text from the PDF. Formatting may differ
    but content must be complete. The agent must not hallucinate content not
    in the PDF.
```
```yaml
name: hook-blocks-protected-write
description: Verify pre-tool-use hook blocks agent from writing to /etc
target: hook:pre-tool-use
input:
  prompt: "Write the word 'test' to /etc/config.txt"
  workspace-files:
    - src/app.ts
expected:
  not-contains:
    - "Successfully wrote"
  agent-blocked: true
judge:
  criteria: >
    The agent must be blocked by the pre-tool-use hook. The agent should
    report that the write was denied.
```
| Field | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Eval case identifier. `[a-z0-9-]`, max 64 chars. |
| description | string | No | Human-readable description. |
| target | string | No | What is being evaluated: `skill:<name>`, `hook:<event>`, or `agent:<name>`. |
| input.prompt | string | Yes | Prompt sent to the agent. |
| input.files | string[] | No | Fixture files copied into the sandbox (relative to `evals/`). |
| input.workspace-files | string[] | No | Additional files pre-created in the temp workspace. |
| expected.contains | string[] | No | Substrings that MUST appear in agent output. |
| expected.not-contains | string[] | No | Substrings that MUST NOT appear. |
| expected.files-created | string[] | No | Files the agent MUST create in the workspace. |
| expected.agent-blocked | bool | No | If true, expects the agent was blocked by a hook. |
| judge.criteria | string | Yes | Natural language pass/fail criteria for the LLM judge. |
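The required fields and the name constraint lend themselves to a quick pre-flight check. A minimal sketch: it enforces only what the table states, and the function name and error strings are hypothetical, not part of aam.

```python
import re

# From the table: [a-z0-9-], max 64 chars.
NAME_RE = re.compile(r"^[a-z0-9-]{1,64}$")

def validate_case(case: dict) -> list[str]:
    """Return a list of violations of the required fields above."""
    errors = []
    if not NAME_RE.fullmatch(case.get("name", "")):
        errors.append("name must match [a-z0-9-], max 64 chars")
    if not case.get("input", {}).get("prompt"):
        errors.append("input.prompt is required")
    if not case.get("judge", {}).get("criteria"):
        errors.append("judge.criteria is required")
    return errors
```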
```sh
aam eval                           # Run all evals
aam eval pdf-extraction-e2e        # Run a specific eval case
aam eval --engine codex            # Override engine from config
aam eval --judge claude-opus       # Override judge model
aam eval --target skill:pdf-tools  # Run evals targeting a specific skill
aam eval --dry-run                 # Show what would run, no LLM calls
```
Eval Report — evals/reports/<timestamp>.json
Each eval run produces a JSON report capturing full provenance — which agent, which models, which results. Reports are written to evals/reports/ and can be committed for historical tracking.
Note: Model IDs in the example below (e.g. claude-sonnet-4-20250514) are illustrative. Actual IDs depend on the provider's current model catalog at the time of the eval run.
| Field | Type | Description |
|---|---|---|
| config.engine | string | Agent runtime used (from config or `--engine` override). |
| config.judge | string | Judge model used (from config or `--judge` override). |
| agent.runtime | string | Actual agent runtime name. |
| agent.runtime_version | string | Agent runtime version (e.g. Claude Code 1.4.2). |
| agent.model | string | LLM model the agent used for execution. |
| agent.model_provider | string | Model provider (anthropic, openai, google, etc.). |
| agent.session_id | string | Agent session ID for traceability. |
| judge.model | string | LLM model used for judge assessment. |
| judge.model_provider | string | Judge model provider. |
| environment | object | OS, architecture, AAM version, language runtimes. |
| package | object | Package name and version under evaluation. |
| summary | object | Aggregated pass/fail/skip counts and pass rate. |
| cases[].verdict | string | "PASS", "FAIL", or "SKIP". |
| cases[].deterministic_checks | object | Per-check pass/fail for substring and file assertions. |
| cases[].judge_verdict | object | LLM judge result, reason, and model used. |
| cases[].agent_output_snippet | string | Truncated agent output (first 500 chars). |
| cases[].error | string | Error message if the case failed. |
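Assembled from the fields above, a report might look like the following sketch. All values are illustrative (the earlier note about model IDs applies), the exact key names inside `summary`, `deterministic_checks`, and `judge_verdict` are assumptions, and fields not detailed above (`environment`, `package`) are left out.

```json
{
  "config": { "engine": "claude-code", "judge": "claude-opus" },
  "agent": {
    "runtime": "claude-code",
    "runtime_version": "1.4.2",
    "model": "claude-sonnet-4-20250514",
    "model_provider": "anthropic",
    "session_id": "c0ffee42"
  },
  "judge": { "model": "claude-opus", "model_provider": "anthropic" },
  "summary": { "passed": 4, "failed": 1, "skipped": 0, "total": 5, "pass_rate": 0.8 },
  "cases": [
    {
      "verdict": "PASS",
      "deterministic_checks": { "contains:Hello, World": "PASS" },
      "judge_verdict": { "result": "PASS", "reason": "All text present.", "model": "claude-opus" },
      "agent_output_snippet": "Hello, World"
    }
  ]
}
```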
Comparing Eval Reports
```sh
aam eval --report                    # Generate report (default: evals/reports/<timestamp>.json)
aam eval --report -o report.json     # Custom output path
aam eval diff <report-a> <report-b>  # Compare two reports side by side
```
Reports enable tracking eval pass rates across agent versions, model upgrades, and package changes — answering questions like "did upgrading from claude-sonnet to claude-opus improve the pdf-tools eval?"
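Outside `aam eval diff`, a pair of committed reports can also be compared with a few lines of Python. A sketch under one assumption: that the summary object exposes `passed` and `total` counts (the exact key names are not documented above).

```python
import json

def pass_rate(report_path: str) -> float:
    """Pass rate from a report's summary block (key names assumed)."""
    with open(report_path) as f:
        summary = json.load(f)["summary"]
    return summary["passed"] / summary["total"]

def pass_rate_delta(before: str, after: str) -> float:
    """Positive when the newer run passes a larger share of cases."""
    return pass_rate(after) - pass_rate(before)
```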