## Summary
Research and design a production-grade benchmark framework for evaluating Agent Skills in Claude Code. The current pytest harness works, but it needs LLM-as-judge scoring to produce meaningful evaluations.
## Problem
- No off-the-shelf eval framework supports CLI-based AI agents (Inspect AI, DeepEval, and Promptfoo all assume direct API access)
- Current string-matching assertions are brittle and don't capture semantic correctness (see the example after this list)
- Need reproducible, trackable benchmarks for skill quality over time
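
To illustrate the brittleness, a minimal sketch of the kind of assertion the current harness relies on; the `run_skill` fixture and the skill question are hypothetical, not existing harness code:

```python
# A correct answer phrased differently ("use a virtual environment")
# fails this exact-substring check even though it is semantically right.
def test_dependency_isolation(run_skill):  # `run_skill` is a hypothetical fixture
    output = run_skill("How should I isolate project dependencies?")
    assert "create a venv" in output.lower()  # brittle: wording-dependent
```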
## Research Areas
### 1. LLM-as-Judge Implementation
- Research LLM-as-judge patterns (G-Eval, Prometheus, etc.)
- Design scoring rubrics for skill evaluation
- Evaluate cost/latency tradeoffs (Haiku vs Sonnet for judging)
- Handle judge consistency/calibration
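
One possible way to handle the consistency point is to sample the judge more than once and aggregate; a minimal sketch, assuming the judge call is exposed as a zero-argument async callable (not part of the current harness):

```python
import asyncio
import statistics
from typing import Awaitable, Callable


async def calibrated_score(judge: Callable[[], Awaitable[int]], samples: int = 3) -> float:
    """Run the same judge prompt several times and take the median score."""
    scores = await asyncio.gather(*(judge() for _ in range(samples)))
    return statistics.median(scores)
```

The median damps single-run outliers at the cost of a few extra Haiku calls, which feeds directly into the cost/latency tradeoff above.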
### 2. Benchmark Infrastructure
- Result persistence and tracking over time
- Retry logic for flaky LLM behavior
- Parallel execution for speed (retry and parallelism are sketched after this list)
- CI/CD integration (GitHub Actions)
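
A minimal sketch of the retry and bounded-parallelism pieces; function names and limits are illustrative assumptions, not existing harness code:

```python
import asyncio


async def run_with_retry(case, retries: int = 2, delay: float = 2.0):
    """Retry a flaky async benchmark case before reporting it as a failure."""
    for attempt in range(retries + 1):
        try:
            return await case()
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(delay * (attempt + 1))  # simple linear backoff


async def run_suite(cases, max_parallel: int = 4):
    """Run benchmark cases concurrently, capping simultaneous CLI sessions."""
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(case):
        async with sem:
            return await run_with_retry(case)

    return await asyncio.gather(*(guarded(c) for c in cases))
```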
### 3. Evaluation Dimensions
- Correctness: Does the skill provide accurate information?
- Relevance: Does the response use skill knowledge vs hallucinating?
- Tool Use: Does Claude invoke appropriate tools?
- Completeness: Are all aspects of the query addressed?
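
These four dimensions could be encoded as a weighted rubric the judge is prompted with per dimension; the weights and wording below are placeholders to show the shape, not settled values:

```python
# Per-dimension rubric; weights and questions are placeholder assumptions.
RUBRIC = {
    "correctness":  {"weight": 0.4, "question": "Is the information accurate?"},
    "relevance":    {"weight": 0.3, "question": "Does it draw on the skill rather than generic knowledge?"},
    "tool_use":     {"weight": 0.2, "question": "Were appropriate tools invoked?"},
    "completeness": {"weight": 0.1, "question": "Are all parts of the query addressed?"},
}


def weighted_total(scores: dict[str, float]) -> float:
    """Collapse per-dimension 1-5 scores into a single weighted value."""
    return sum(RUBRIC[d]["weight"] * scores[d] for d in RUBRIC)
```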
### 4. Framework Options to Evaluate
| Framework | Approach | Notes |
|---|---|---|
| Custom pytest + LLM judge | Wrap CLI, add semantic scoring | Current direction |
| Inspect AI custom solver | Write solver that shells out to CLI | May be overkill |
| Promptfoo custom provider | YAML config, CLI wrapper | Good for CI |
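
Whichever framework wins, every option in the table ultimately shells out to the Claude Code CLI, so a shared async wrapper is needed. A sketch, assuming the CLI's non-interactive `-p`/`--print` mode and `--model` flag (worth verifying against the installed version):

```python
import asyncio


async def run_claude_async(prompt: str, model: str = "haiku") -> str:
    """Shell out to the Claude Code CLI in non-interactive (print) mode."""
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"claude exited with {proc.returncode}: {stderr.decode()}")
    return stdout.decode().strip()
```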
## Proposed Design
```python
# Sketch of an Inspect-style scorer interface. In the custom pytest harness,
# `scorer` / `Score` would be local helpers, while `run_claude_async` and
# `extract_score` wrap the Claude CLI and parse the judge's reply.
@scorer
def skill_relevance_judge():
    """LLM-as-judge for skill evaluation."""

    async def score(state, target):
        prompt = f"""
        Evaluate if this response demonstrates knowledge from the skill.
        Question: {state.input}
        Response: {state.output}
        Expected concepts: {target}
        Score 1-5:
        1 = No skill knowledge used, hallucinated
        2 = Partial, mostly generic knowledge
        3 = Adequate skill usage
        4 = Good skill usage with specifics
        5 = Excellent, comprehensive skill application
        Respond with just the score and one sentence reason.
        """
        result = await run_claude_async(prompt, model="haiku")
        score = extract_score(result)
        return Score(value=score, explanation=result)

    return score
```
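A possible `extract_score` companion, assuming the judge follows the "score plus one sentence" instruction; the regex is a defensive fallback for chattier replies:

```python
import re


def extract_score(judge_output: str) -> int:
    """Pull the leading 1-5 score out of the judge's free-text reply."""
    match = re.search(r"\b([1-5])\b", judge_output)
    if not match:
        raise ValueError(f"No score found in judge output: {judge_output!r}")
    return int(match.group(1))
```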
## Success Criteria
- Benchmark runs in < 5 minutes for full suite
- Results are reproducible (within LLM variance)
- Clear pass/fail with semantic reasoning
- Historical tracking shows skill quality trends (see the persistence sketch after this list)
- Works in CI without manual intervention
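
For the historical-tracking criterion, one lightweight option is appending each run to a JSONL file keyed by timestamp and git commit; the file path below is an assumption:

```python
import json
import subprocess
import time
from pathlib import Path


def record_run(results: dict, path: Path = Path("benchmarks/results.jsonl")) -> None:
    """Append one benchmark run as a JSON line, tagged with timestamp and commit."""
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as fh:
        fh.write(json.dumps({"ts": time.time(), "commit": commit, **results}) + "\n")
```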
## References
- Inspect AI - UK AISI eval framework
- G-Eval - LLM-as-judge with chain-of-thought
- Prometheus - Fine-tuned judge models
- Current benchmark harness
## Labels
enhancement, research, testing