docs: add JSON Schema and documentation for evals#60

Open
galElmalah wants to merge 3 commits into main from docs/evals-documentation-and-schema

@galElmalah galElmalah commented Feb 4, 2026

Summary

Add IDE support and comprehensive documentation for the skills eval framework:

  • Create eval.schema.json with definitions for all 9 check types
  • Update README.md with detailed check type reference and examples
  • Add $schema reference to all 7 fixture eval.json files

The Evals System

The evals framework tests Claude's ability to correctly use skills when responding to prompts. It validates that Claude can:

  • Identify and invoke relevant skills for a given prompt
  • Generate code and configuration files that meet specifications
  • Produce output containing expected content
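
The first capability, skill invocation, can be pictured as a set comparison between the `expectedSkills` array of a fixture and the skills reported as invoked during a run. The sketch below is an illustration only: `missingSkills` is a hypothetical helper, and the PR does not show how the runner actually scores skill usage.

```typescript
// Hypothetical helper: the name and signature are illustrative and are not
// taken from the eval runner's source.
function missingSkills(expected: string[], invoked: string[]): string[] {
  const seen = new Set(invoked);
  // Keep only the expected skills that never appeared in the invoked list.
  return expected.filter((skill) => !seen.has(skill));
}

const notInvoked = missingSkills(["base44-cli"], ["base44-sdk", "base44-cli"]);
console.log(
  notInvoked.length === 0
    ? "all expected skills invoked"
    : `missing: ${notInvoked.join(", ")}`
);
```

Extra invoked skills are ignored here; whether the real runner treats them as a failure is not stated in this PR.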

Architecture

evals/
├── fixtures/                    # Test cases (7 fixtures)
│   ├── skill-check/            # Simple skill detection
│   ├── react-task-manager/     # SDK usage with React
│   ├── nextjs-todo/            # Entity creation
│   ├── create-support-agent/   # Agent configuration
│   ├── existing-app-add-feature/  # Backend functions
│   ├── agent-chat-component/   # Agents SDK usage
│   └── agent-with-function/    # Agent + function combo
├── experiments/                 # CLAUDE.md variants for A/B testing
├── src/
│   ├── checks/                 # 9 check type implementations
│   ├── agents/                 # Claude Code CLI interface
│   └── reporters/              # Markdown report generators
├── results/                    # Generated reports
└── eval.schema.json           # NEW: JSON Schema for eval.json

Running Evals

# Run all fixtures
npm run eval

# Filter to specific fixtures
npm run eval -- --filter nextjs --verbose

# Compare experiments (A/B testing skill variations)
npm run eval:compare

# Run multiple times for consistency check
npm run eval:compare -- --runs 3

Check Types

| Type | Description | Config |
|------|-------------|--------|
| contains | Search agent output for text | value |
| file-exists | Verify file was created | filePath |
| file-content | Match content with regex | filePath, pattern or value |
| valid-json | Validate JSON/JSONC syntax | filePath |
| command-passes | Run command, check exit 0 | command |
| json-schema | Validate against JSON Schema | schema |
| entity-config | Validate Base44 entity | filePath, target |
| agent-config | Validate Base44 agent | filePath, target |
| function-def | Validate Base44 function | filePath, target |
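
To make the table concrete, here is a minimal sketch of how three of the check types could be modelled as a discriminated union and dispatched on `type`. The config field names (`value`, `filePath`, `pattern`) come from the table; the type names, `RunContext`, and the in-memory `files` map (standing in for the generated project directory) are assumptions for illustration, not the implementation in `src/checks/`.

```typescript
// Field names follow the table above; everything else is hypothetical.
type ContainsCheck = { type: "contains"; description: string; value: string };
type FileExistsCheck = { type: "file-exists"; description: string; filePath: string };
type FileContentCheck = {
  type: "file-content";
  description: string;
  filePath: string;
  pattern?: string; // regular expression source
  value?: string; // literal substring
};
type Check = ContainsCheck | FileExistsCheck | FileContentCheck;

interface RunContext {
  agentOutput: string;
  files: Map<string, string>; // path -> content (assumed stand-in for the file system)
}

function runCheck(check: Check, ctx: RunContext): boolean {
  switch (check.type) {
    case "contains":
      return ctx.agentOutput.includes(check.value);
    case "file-exists":
      return ctx.files.has(check.filePath);
    case "file-content": {
      const content = ctx.files.get(check.filePath);
      if (content === undefined) return false; // file was never created
      if (check.pattern !== undefined) return new RegExp(check.pattern).test(content);
      if (check.value !== undefined) return content.includes(check.value);
      return false; // neither pattern nor value supplied
    }
  }
}

const sampleContext: RunContext = {
  agentOutput: "Then run the agents push command.",
  files: new Map([["base44/agents/support_agent.jsonc", '{ "name": "support_agent" }']]),
};

console.log(
  runCheck({ type: "contains", description: "Mentions agents push command", value: "agents push" }, sampleContext)
);
```

Tagging each check with a literal `type` lets the compiler verify that a given config only carries the fields its check type accepts, which mirrors what the JSON Schema does for editors.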

Example eval.json

With the new schema reference, VS Code provides autocomplete and validation:

{
  "$schema": "../../eval.schema.json",
  "name": "create-support-agent",
  "description": "Test creating an AI agent configuration with entity access",
  "prompt": "Create a support agent called 'support_agent' that helps users...",
  "expectedSkills": ["base44-cli"],
  "checks": [
    {
      "type": "file-exists",
      "description": "Agent config file created",
      "filePath": "base44/agents/support_agent.jsonc"
    },
    {
      "type": "valid-json",
      "description": "Agent config is valid JSON",
      "filePath": "base44/agents/support_agent.jsonc"
    },
    {
      "type": "file-content",
      "description": "Has correct agent name",
      "filePath": "base44/agents/support_agent.jsonc",
      "pattern": "\"name\"\\s*:\\s*\"support_agent\""
    },
    {
      "type": "contains",
      "description": "Mentions agents push command",
      "value": "agents push"
    }
  ]
}
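
The `pattern` in the `file-content` check above is escaped twice: once for JSON and once for the regular expression. The sketch below decodes the JSON and applies the resulting regex to sample file contents. It assumes the runner passes `pattern` to a JavaScript `RegExp` unchanged, which this PR does not confirm; the sample strings are invented for illustration.

```typescript
// String.raw keeps the backslashes exactly as they appear in eval.json.
const checkJson = String.raw`{
  "type": "file-content",
  "filePath": "base44/agents/support_agent.jsonc",
  "pattern": "\"name\"\\s*:\\s*\"support_agent\""
}`;

const parsedCheck = JSON.parse(checkJson) as { pattern: string };

// After JSON decoding the pattern reads: "name"\s*:\s*"support_agent"
const nameRegex = new RegExp(parsedCheck.pattern);

// Whitespace around the colon is allowed by \s*, so both spellings match.
console.log(nameRegex.test('{ "name": "support_agent" }'));
console.log(nameRegex.test('{ "name"   :   "support_agent" }'));
// A different agent name does not match.
console.log(nameRegex.test('{ "name": "sales_agent" }'));
```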

Example Run Output

Single Eval Run

$ npm run eval -- --name baseline --verbose

Starting eval run: baseline
Agent: claude-code
Fixtures: fixtures

Found 7 fixtures

Running suite: skill-check
  Running fixture: skill-check
  Skills invoked: (none)
  ✅ skill-check

Running suite: react-task-manager
  Running fixture: react-task-manager-sdk
  Skills invoked: base44-sdk
  ✅ react-task-manager-sdk

Running suite: nextjs-todo
  Running fixture: nextjs-todo-entities
  Skills invoked: base44-cli
  ✅ nextjs-todo-entities

==================================================
Eval Run Complete
==================================================
Passed: 7
Failed: 0
Report: results/run-2026-02-04-065143-baseline.md

Comparison Run (A/B Testing)

$ npm run eval:compare -- --runs 3

Starting comparison: comparison
Agent: claude-code
Runs: 3
Experiments: experiments
Fixtures: fixtures

##################################################
# Run 1/3
##################################################
Run 1 Results:
  ✅ explicit-read: 7/7 (100%)
  ✅ strict-read: 7/7 (100%)
  ❌ baseline: 6/7 (86%)

##################################################
# Run 2/3
##################################################
...

============================================================
AGGREGATED RESULTS (3 runs)
============================================================

| Experiment   | Avg Pass Rate | Consistency | Perfect Runs |
|--------------|---------------|-------------|--------------|
| explicit-read| 100%          | 3/3         | 3/3 ⭐        |
| strict-read  | 100%          | 3/3         | 3/3 ⭐        |
| baseline     | 90%           | 1/3         | 1/3          |

Legend: ⭐ = 100% consistency (all runs passed)
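
One way the aggregated columns could be derived is sketched below, assuming "Avg Pass Rate" is the mean of per-run pass rates and "Perfect Runs" counts runs in which every fixture passed. Both definitions are assumptions; the PR does not spell them out. `RunResult` and `aggregate` are hypothetical names. The first baseline entry (6/7) is taken from Run 1 above, while the second and third are invented for illustration because the output elides runs 2 and 3.

```typescript
interface RunResult {
  passed: number;
  total: number;
}

// Hypothetical aggregation, not the reporter's actual code.
function aggregate(runs: RunResult[]): { avgPassRate: number; perfectRuns: number; totalRuns: number } {
  const rates = runs.map((r) => r.passed / r.total);
  const mean = rates.reduce((sum, rate) => sum + rate, 0) / runs.length;
  return {
    avgPassRate: Math.round(mean * 100), // whole-number percentage
    perfectRuns: runs.filter((r) => r.passed === r.total).length,
    totalRuns: runs.length,
  };
}

// Run 1 is from the output above; runs 2 and 3 are made-up sample values.
const baselineRuns: RunResult[] = [
  { passed: 6, total: 7 },
  { passed: 7, total: 7 },
  { passed: 6, total: 7 },
];

const summary = aggregate(baselineRuns);
console.log(`${summary.avgPassRate}% | ${summary.perfectRuns}/${summary.totalRuns}`);
```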

Generated Report (Markdown)

# Experiment Comparison: comparison-run1
**Date**: 2026-02-04 06:51:43
**Agent**: claude-code
**Experiments**: 3 | **Fixtures**: 7

## Summary
| Experiment | Passed | Failed | Pass Rate |
|------------|--------|--------|-----------|
| **explicit-read** | 7 | 0 | 100% |
| strict-read | 7 | 0 | 100% |
| baseline | 6 | 1 | 86% |

## Comparison Matrix
| Fixture | baseline | explicit-read | strict-read |
|---|---|---|---|
| add-notification-function | ✅ | ✅ | ✅ |
| agent-chat-component | ❌ | ✅ | ✅ |
| agent-with-function | ✅ | ✅ | ✅ |
| create-support-agent | ✅ | ✅ | ✅ |
| nextjs-todo-entities | ✅ | ✅ | ✅ |
| react-task-manager-sdk | ✅ | ✅ | ✅ |
| skill-check | ✅ | ✅ | ✅ |

## Inconsistent Results
- **agent-chat-component**: Passed in explicit-read, strict-read, Failed in baseline
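
The "Inconsistent Results" section lists fixtures whose outcome differs between experiments. A minimal sketch of that detection, assuming results are available as a fixture-to-experiment map of pass/fail flags (the `ResultMatrix` shape and `findInconsistent` are hypothetical, not the reporter's code):

```typescript
// fixture name -> experiment name -> passed?
type ResultMatrix = Record<string, Record<string, boolean>>;

function findInconsistent(matrix: ResultMatrix): string[] {
  const lines: string[] = [];
  for (const fixture of Object.keys(matrix)) {
    const byExperiment = matrix[fixture];
    const experiments = Object.keys(byExperiment);
    const passedIn = experiments.filter((e) => byExperiment[e] === true);
    const failedIn = experiments.filter((e) => byExperiment[e] !== true);
    // A fixture is inconsistent only when it both passed and failed somewhere.
    if (passedIn.length > 0 && failedIn.length > 0) {
      lines.push(`${fixture}: Passed in ${passedIn.join(", ")}, Failed in ${failedIn.join(", ")}`);
    }
  }
  return lines;
}

const sampleMatrix: ResultMatrix = {
  "agent-chat-component": { "explicit-read": true, "strict-read": true, baseline: false },
  "skill-check": { "explicit-read": true, "strict-read": true, baseline: true },
};

for (const line of findInconsistent(sampleMatrix)) {
  console.log(`- ${line}`);
}
```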

Test plan

  • Schema validates as valid JSON
  • Open any eval.json in VS Code - should get autocomplete for check types
  • README renders correctly in GitHub
  • Existing evals still pass: npm run eval

🤖 Generated with Claude Code

Add IDE support and comprehensive documentation for the eval test framework:

- Create eval.schema.json with definitions for all 9 check types
- Update README.md with detailed check type reference and examples
- Add $schema reference to all 7 fixture eval.json files

The schema enables autocomplete and validation in VS Code and other
editors when editing eval.json files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude bot commented Feb 4, 2026

No issues found. Checked for bugs and CLAUDE.md compliance.

galElmalah and others added 2 commits February 4, 2026 10:39
… support

Adds a complete evaluation system for testing skills:
- CLI runner with support for specific fixture selection via positional args
- Multiple check types (file-exists, file-content, contains, json-schema, etc.)
- Markdown report generation
- Claude Code agent integration
- 7 test fixtures covering various Base44 SDK scenarios

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude bot commented Feb 4, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.
