skill-eval

Evaluation framework for AI agent skills — measure whether capabilities actually work before they ship.

Why

ClawHub has 13,729 skills. 67% of them fail in practice. The ecosystem needs a way to answer: does this skill actually do what it claims?

skill-eval is a framework for writing, running, and scoring evaluations against agent skills. It draws on:

The ClawHub corpus analysis — empirical failure patterns across 13K+ skills
Anthropic's skill evaluation patterns in Claude Code
mgechev/skill-eval — open-source skill assessment methodology
The effector-spec type system for capability interfaces

What It Measures

Structural Quality (static, no execution)

Metric	What it checks	Weight
`frontmatter_completeness`	All required YAML fields present and valid	0.15
`section_coverage`	Purpose, When to Use, When NOT to Use, Setup, Commands, Examples, Notes	0.15
`type_declaration`	Has `effector.toml` with `[effector.interface]` input/output/context	0.10
`permission_alignment`	Declared permissions match detected behavior (via `effector-audit`)	0.10
`description_quality`	Length, specificity, starts with verb, avoids vague language	0.05
`install_completeness`	At least one install method with all required fields	0.05

Functional Quality (requires execution sandbox)

Metric	What it checks	Weight
`prerequisite_resolution`	All declared `requires.bins` and `requires.env` resolve	0.10
`invocation_success`	Skill produces parseable output for a reference input	0.15
`output_type_match`	Actual output matches declared output type shape	0.10
`error_handling`	Graceful behavior on invalid input (no crash, clear message)	0.05

Composite Score

score = Σ(metric_score × weight)

Scale: 0.0 (broken) → 1.0 (production-ready). Thresholds:

Grade	Range	Meaning
A	0.85–1.00	Production-ready, publish to ClawHub
B	0.70–0.84	Functional, needs polish
C	0.50–0.69	Partially working, significant gaps
D	0.25–0.49	Fundamentally broken
F	0.00–0.24	Non-functional

Eval File Format

Evals are YAML files describing expected behavior:

# evals/linear.eval.yml
skill: linear
version: ">=1.0.0"

prerequisites:
  env:
    - LINEAR_API_KEY
  bins:
    - curl
    - jq

cases:
  - name: list-open-issues
    input: "What are my open Linear issues?"
    expect:
      output_type: JSON
      contains_fields: ["id", "title", "state"]
      no_error: true

  - name: create-issue
    input: "Create a Linear issue titled 'Test from skill-eval'"
    expect:
      output_type: JSON
      contains_fields: ["id", "identifier"]
      no_error: true
    teardown: "Delete the created issue"

  - name: invalid-key
    input: "List my issues"
    env_override:
      LINEAR_API_KEY: "invalid_key"
    expect:
      no_crash: true
      error_message_contains: ["unauthorized", "401", "invalid"]

scoring:
  pass_threshold: 0.70
  weights:
    invocation_success: 0.4
    output_type_match: 0.3
    error_handling: 0.2
    prerequisite_resolution: 0.1

Directory Structure

skill-eval/
├── src/
│   ├── runner.js          # Eval execution engine
│   ├── scorer.js          # Metric computation + grading
│   ├── reporter.js        # Output formatting (terminal, JSON, markdown)
│   └── static-analyzer.js # Structural quality checks (no execution)
├── evals/
│   ├── linear.eval.yml    # Reference eval for linear-skill
│   └── README.md          # How to write evals
├── fixtures/
│   ├── passing-skill/     # A skill that scores A
│   └── failing-skill/     # A skill that scores F (for testing the framework)
├── scripts/
│   └── run-eval.js        # CLI entry point
├── package.json
└── README.md

Usage

# Evaluate a single skill (structural only — no execution)
npx skill-eval ./path/to/skill --static-only

# Evaluate with execution (requires sandbox + prerequisites)
npx skill-eval ./path/to/skill --eval evals/linear.eval.yml

# Evaluate all skills in a directory
npx skill-eval ./skills/ --static-only --report markdown > report.md

# JSON output for CI
npx skill-eval ./path/to/skill --format json

Output Example

skill-eval v0.1.0 — linear-skill

Structural Quality
  ✓ frontmatter_completeness    1.00  (all fields present)
  ✓ section_coverage            1.00  (7/7 sections)
  ✓ type_declaration            1.00  (effector.toml with typed interface)
  ✓ permission_alignment        1.00  (no drift detected)
  ✓ description_quality         0.90  (good length, specific)
  ✓ install_completeness        1.00  (manual install with steps)

Functional Quality
  ✓ prerequisite_resolution     1.00  (curl, jq found; LINEAR_API_KEY set)
  ✓ invocation_success          1.00  (3/3 cases passed)
  ✓ output_type_match           1.00  (JSON output matches declaration)
  ✓ error_handling              0.80  (graceful on invalid key, no crash)

──────────────────────────────────
Score: 0.97 / 1.00  →  Grade A
──────────────────────────────────

Integration with effectorHQ

skill-lint checks syntax and structure → skill-eval checks behavior and quality
effector-audit checks permission drift → skill-eval uses that as one metric
effector-types defines the type vocabulary → skill-eval checks output against declared types
clawhub-analysis provides corpus baselines → skill-eval grades against ecosystem norms

Roadmap

v0.1.0 — Static analyzer + reference eval for linear-skill
v0.2.0 — Execution sandbox (Docker-based) + functional metrics
v0.3.0 — CI integration (skill-eval-action for GitHub Actions)
v0.4.0 — Batch mode for ClawHub-wide audits
v1.0.0 — Stable API, published to npm as @effectorhq/skill-eval

Prior Art

mgechev/skill-eval — Skill evaluation methodology
Anthropic Claude Code Skills — Skill authoring + evaluation patterns
OpenClaw SKILL.md spec — The skill format this evaluates
effectorHQ/clawhub-analysis — Empirical data backing this framework

MIT License — effectorHQ Contributors

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

skill-eval

Why

What It Measures

Structural Quality (static, no execution)

Functional Quality (requires execution sandbox)

Composite Score

Eval File Format

Directory Structure

Usage

Output Example

Integration with effectorHQ

Roadmap

Prior Art

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
evals		evals
scripts		scripts
src		src
LICENSE		LICENSE
README.md		README.md
package.json		package.json

Folders and files

Latest commit

History

Repository files navigation

skill-eval

Why

What It Measures

Structural Quality (static, no execution)

Functional Quality (requires execution sandbox)

Composite Score

Eval File Format

Directory Structure

Usage

Output Example

Integration with effectorHQ

Roadmap

Prior Art

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages