docs: add JSON Schema and documentation for evals by galElmalah · Pull Request #60 · base44/skills

galElmalah · 2026-02-04T07:34:14Z

Summary

Add IDE support and comprehensive documentation for the skills eval framework:

- Create eval.schema.json with definitions for all 9 check types
- Update README.md with detailed check type reference and examples
- Add $schema reference to all 7 fixture eval.json files

The Evals System

The evals framework tests Claude's ability to correctly use skills when responding to prompts. It validates that Claude can:

- Identify and invoke relevant skills for a given prompt
- Generate code and configuration files that meet specifications
- Produce output containing expected content
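
The first point, skill identification, could be checked by comparing the skills a fixture expects against the skills the agent actually invoked. The following is only a minimal sketch under that assumption: `missingSkills` is a hypothetical helper, not a function from this repository, and the real runner may compare skills differently.

```typescript
// Hypothetical helper, not taken from the repository.
// Returns the expected skills that the agent never invoked.
function missingSkills(expected: string[], invoked: string[]): string[] {
  const invokedSet = new Set(invoked);
  return expected.filter((skill) => !invokedSet.has(skill));
}

// Skill names as they appear in this PR's example output.
const missing = missingSkills(["base44-cli"], ["base44-sdk"]);
console.log(missing.length === 0 ? "skills ok" : `missing: ${missing.join(", ")}`);
```

An empty result would mean every expected skill was invoked; extra skills beyond the expected ones are ignored in this sketch.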

Architecture

evals/
├── fixtures/                    # Test cases (7 fixtures)
│   ├── skill-check/            # Simple skill detection
│   ├── react-task-manager/     # SDK usage with React
│   ├── nextjs-todo/            # Entity creation
│   ├── create-support-agent/   # Agent configuration
│   ├── existing-app-add-feature/  # Backend functions
│   ├── agent-chat-component/   # Agents SDK usage
│   └── agent-with-function/    # Agent + function combo
├── experiments/                 # CLAUDE.md variants for A/B testing
├── src/
│   ├── checks/                 # 9 check type implementations
│   ├── agents/                 # Claude Code CLI interface
│   └── reporters/              # Markdown report generators
├── results/                    # Generated reports
└── eval.schema.json           # NEW: JSON Schema for eval.json

Running Evals

# Run all fixtures
npm run eval

# Filter to specific fixtures
npm run eval -- --filter nextjs --verbose

# Compare experiments (A/B testing skill variations)
npm run eval:compare

# Run multiple times for consistency check
npm run eval:compare -- --runs 3

Check Types

| Type | Description | Config |
|------|-------------|--------|
| `contains` | Search agent output for text | `value` |
| `file-exists` | Verify file was created | `filePath` |
| `file-content` | Match content with regex | `filePath`, `pattern` or `value` |
| `valid-json` | Validate JSON/JSONC syntax | `filePath` |
| `command-passes` | Run command, check exit 0 | `command` |
| `json-schema` | Validate against JSON Schema | `schema` |
| `entity-config` | Validate Base44 entity | `filePath`, `target` |
| `agent-config` | Validate Base44 agent | `filePath`, `target` |
| `function-def` | Validate Base44 function | `filePath`, `target` |
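
To make the table more concrete, here is a rough sketch of how two of these check types might be modeled as a discriminated union and evaluated. It is an illustration only: the `Check` type, the `runCheck` function, and the in-memory `files` map are assumptions made for this example, and the actual implementations under `src/checks/` may look quite different (for instance, they presumably read from disk).

```typescript
// Illustrative types only: field names follow the check type table, but the
// real definitions in src/checks/ may differ.
type Check =
  | { type: "contains"; description: string; value: string }
  | { type: "file-content"; description: string; filePath: string; pattern?: string; value?: string };

// Hypothetical in-memory runner: `output` is the agent's text output and
// `files` maps a file path to that file's contents.
function runCheck(check: Check, output: string, files: Map<string, string>): boolean {
  switch (check.type) {
    case "contains":
      // Assumes a plain, case-sensitive substring match.
      return output.includes(check.value);
    case "file-content": {
      const content = files.get(check.filePath);
      if (content === undefined) return false; // a missing file fails the check
      if (check.pattern !== undefined) return new RegExp(check.pattern).test(content);
      return check.value !== undefined && content.includes(check.value);
    }
  }
}

const files = new Map<string, string>([
  ["base44/agents/support_agent.jsonc", '{ "name": "support_agent" }'],
]);
const checks: Check[] = [
  { type: "contains", description: "Mentions agents push command", value: "agents push" },
  {
    type: "file-content",
    description: "Has correct agent name",
    filePath: "base44/agents/support_agent.jsonc",
    pattern: '"name"\\s*:\\s*"support_agent"',
  },
];
for (const check of checks) {
  const ok = runCheck(check, "Run `base44 agents push` to deploy.", files);
  console.log(`${ok ? "PASS" : "FAIL"} ${check.description}`);
}
```

Modeling the checks as a union keyed on `type` lets the compiler flag a check that lacks its required config field, which is similar in spirit to what the JSON Schema offers inside the editor.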

Example eval.json

With the new schema reference, VS Code provides autocomplete and validation:

{
  "$schema": "../../eval.schema.json",
  "name": "create-support-agent",
  "description": "Test creating an AI agent configuration with entity access",
  "prompt": "Create a support agent called 'support_agent' that helps users...",
  "expectedSkills": ["base44-cli"],
  "checks": [
    {
      "type": "file-exists",
      "description": "Agent config file created",
      "filePath": "base44/agents/support_agent.jsonc"
    },
    {
      "type": "valid-json",
      "description": "Agent config is valid JSON",
      "filePath": "base44/agents/support_agent.jsonc"
    },
    {
      "type": "file-content",
      "description": "Has correct agent name",
      "filePath": "base44/agents/support_agent.jsonc",
      "pattern": "\"name\"\\s*:\\s*\"support_agent\""
    },
    {
      "type": "contains",
      "description": "Mentions agents push command",
      "value": "agents push"
    }
  ]
}
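
The `pattern` value in the `file-content` check is escaped twice over: once for JSON and once for the regular expression. The sketch below shows what the pattern decodes to and what it matches. It assumes the check compiles the decoded string with JavaScript's `RegExp`, which this PR does not confirm, and `sampleConfig` is a made-up file body used only for illustration.

```typescript
// The "pattern" entry exactly as it is written in eval.json (a JSON string literal).
const rawJson = String.raw`{"pattern": "\"name\"\\s*:\\s*\"support_agent\""}`;

// After JSON decoding, the escapes collapse into an ordinary regex source.
const pattern: string = JSON.parse(rawJson).pattern;
console.log(pattern); // "name"\s*:\s*"support_agent"

// Assumption: the check compiles the pattern with RegExp.
const regex = new RegExp(pattern);

// Hypothetical agent config body, for illustration only.
const sampleConfig = `{
  // JSONC comments do not affect a regex match
  "name" : "support_agent"
}`;

console.log(regex.test(sampleConfig)); // true
console.log(regex.test('{"name": "sales_agent"}')); // false
```

Because of `\s*`, the pattern tolerates any whitespace around the colon, so the check does not depend on how the generated file is formatted.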

Example Run Output

Single Eval Run

$ npm run eval -- --name baseline --verbose

Starting eval run: baseline
Agent: claude-code
Fixtures: fixtures

Found 7 fixtures

Running suite: skill-check
  Running fixture: skill-check
  Skills invoked: (none)
  ✅ skill-check

Running suite: react-task-manager
  Running fixture: react-task-manager-sdk
  Skills invoked: base44-sdk
  ✅ react-task-manager-sdk

Running suite: nextjs-todo
  Running fixture: nextjs-todo-entities
  Skills invoked: base44-cli
  ✅ nextjs-todo-entities

==================================================
Eval Run Complete
==================================================
Passed: 7
Failed: 0
Report: results/run-2026-02-04-065143-baseline.md

Comparison Run (A/B Testing)

$ npm run eval:compare -- --runs 3

Starting comparison: comparison
Agent: claude-code
Runs: 3
Experiments: experiments
Fixtures: fixtures

##################################################
# Run 1/3
##################################################
Run 1 Results:
  ✅ explicit-read: 7/7 (100%)
  ✅ strict-read: 7/7 (100%)
  ❌ baseline: 6/7 (86%)

##################################################
# Run 2/3
##################################################
...

============================================================
AGGREGATED RESULTS (3 runs)
============================================================

| Experiment   | Avg Pass Rate | Consistency | Perfect Runs |
|--------------|---------------|-------------|--------------|
| explicit-read| 100%          | 3/3         | 3/3 ⭐        |
| strict-read  | 100%          | 3/3         | 3/3 ⭐        |
| baseline     | 90%           | 1/3         | 1/3          |

Legend: ⭐ = 100% consistency (all runs passed)
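
As a rough illustration of how the "Avg Pass Rate" and "Perfect Runs" columns could be derived, the sketch below averages per-run pass rates and counts runs in which every fixture passed. The `aggregate` function and the `RunResult` shape are assumptions, not the reporter's actual code. The results for runs 2 and 3 are also hypothetical, since only run 1 (6/7) is shown in the output above.

```typescript
interface RunResult {
  passed: number; // fixtures that passed in this run
  total: number;  // fixtures executed in this run
}

// Hypothetical aggregation: mean of per-run pass rates (rounded to a whole
// percent) and the number of runs where every fixture passed.
function aggregate(runs: RunResult[]): { avgPassRate: number; perfectRuns: number } {
  if (runs.length === 0) throw new Error("at least one run is required");
  const rates = runs.map((run) => run.passed / run.total);
  const mean = rates.reduce((sum, rate) => sum + rate, 0) / runs.length;
  return {
    avgPassRate: Math.round(mean * 100),
    perfectRuns: runs.filter((run) => run.passed === run.total).length,
  };
}

// Run 1 is taken from the output above; runs 2 and 3 are made-up values.
const baselineRuns: RunResult[] = [
  { passed: 6, total: 7 },
  { passed: 7, total: 7 },
  { passed: 6, total: 7 },
];
console.log(aggregate(baselineRuns)); // { avgPassRate: 90, perfectRuns: 1 }
```

Under these assumed inputs the sketch reproduces the 90% and 1/3 shown for `baseline`, but other combinations of run results could yield the same row.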

Generated Report (Markdown)

# Experiment Comparison: comparison-run1
**Date**: 2026-02-04 06:51:43
**Agent**: claude-code
**Experiments**: 3 | **Fixtures**: 7

## Summary
| Experiment | Passed | Failed | Pass Rate |
|------------|--------|--------|-----------|
| **explicit-read** | 7 | 0 | 100% |
| strict-read | 7 | 0 | 100% |
| baseline | 6 | 1 | 86% |

## Comparison Matrix
| Fixture | baseline | explicit-read | strict-read |
|---|---|---|---|
| add-notification-function | ✅ | ✅ | ✅ |
| agent-chat-component | ❌ | ✅ | ✅ |
| agent-with-function | ✅ | ✅ | ✅ |
| create-support-agent | ✅ | ✅ | ✅ |
| nextjs-todo-entities | ✅ | ✅ | ✅ |
| react-task-manager-sdk | ✅ | ✅ | ✅ |
| skill-check | ✅ | ✅ | ✅ |

## Inconsistent Results
- **agent-chat-component**: Passed in explicit-read, strict-read, Failed in baseline
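
The "Inconsistent Results" section appears to list fixtures whose outcome differs between experiments. One way to detect that, sketched here with a hypothetical `inconsistentFixtures` function and an assumed fixture-to-experiment result map (neither is taken from the reporter's code), is to look for rows of the comparison matrix that contain both a pass and a fail.

```typescript
// fixture name -> experiment name -> whether the fixture passed
type ComparisonMatrix = Record<string, Record<string, boolean>>;

// Hypothetical helper: a fixture is inconsistent when its results across
// experiments contain more than one distinct value.
function inconsistentFixtures(matrix: ComparisonMatrix): string[] {
  return Object.entries(matrix)
    .filter(([, byExperiment]) => new Set(Object.values(byExperiment)).size > 1)
    .map(([fixture]) => fixture);
}

// Two rows copied from the comparison matrix above.
const matrix: ComparisonMatrix = {
  "agent-chat-component": { baseline: false, "explicit-read": true, "strict-read": true },
  "skill-check": { baseline: true, "explicit-read": true, "strict-read": true },
};
console.log(inconsistentFixtures(matrix)); // [ 'agent-chat-component' ]
```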

Test plan

- Schema file parses as valid JSON
- Open any eval.json in VS Code; autocomplete for check types should appear
- README renders correctly on GitHub
- Existing evals still pass: `npm run eval`

🤖 Generated with Claude Code

Add IDE support and comprehensive documentation for the eval test framework:

- Create eval.schema.json with definitions for all 9 check types
- Update README.md with detailed check type reference and examples
- Add $schema reference to all 7 fixture eval.json files

The schema enables autocomplete and validation in VS Code and other editors when editing eval.json files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude · 2026-02-04T07:38:22Z

No issues found. Checked for bugs and CLAUDE.md compliance.

… support

Adds a complete evaluation system for testing skills:

- CLI runner with support for specific fixture selection via positional args
- Multiple check types (file-exists, file-content, contains, json-schema, etc.)
- Markdown report generation
- Claude Code agent integration
- 7 test fixtures covering various Base44 SDK scenarios

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

claude · 2026-02-04T08:49:20Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

galElmalah and others added 2 commits February 4, 2026 10:39

push skills level experiments

d387274

galElmalah wants to merge 3 commits into main from docs/evals-documentation-and-schema
