Research: Agent Skills Benchmark Framework with LLM-as-Judge #20

Description

Summary

Research and design a production-grade benchmark framework for evaluating Agent Skills in Claude Code. The current pytest harness works, but it needs LLM-as-judge scoring to produce meaningful evaluations.

Problem

  • No off-the-shelf eval framework supports CLI-based AI agents (Inspect AI, DeepEval, Promptfoo all assume direct API access)
  • Current string-matching assertions are brittle and don't capture semantic correctness
  • Need reproducible, trackable benchmarks for skill quality over time

Research Areas

1. LLM-as-Judge Implementation

  • Research LLM-as-judge patterns (G-Eval, Prometheus, etc.)
  • Design scoring rubrics for skill evaluation
  • Evaluate cost/latency tradeoffs (Haiku vs Sonnet for judging)
  • Handle judge consistency and calibration (see the sketch after this list)
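
One option for consistency and calibration is to sample the judge several times and aggregate (e.g. take the median). A minimal sketch; run_claude_async and extract_score are assumed project helpers, sketched later in this issue:

# Sketch: reduce judge variance by sampling the judge k times and taking the median.
# run_claude_async and extract_score are assumed project helpers, not library APIs.
import statistics

async def judge_with_consensus(judge_prompt: str, k: int = 3, model: str = "haiku") -> float:
    scores = []
    for _ in range(k):
        raw = await run_claude_async(judge_prompt, model=model)
        scores.append(extract_score(raw))
    return statistics.median(scores)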

2. Benchmark Infrastructure

  • Result persistence and tracking over time
  • Retry logic for flaky LLM behavior (a combined persistence-and-retry sketch follows this list)
  • Parallel execution for speed
  • CI/CD integration (GitHub Actions)
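
A minimal sketch combining retry logic with JSONL result persistence, so every run appends to a history that can be charted over time. The log path, record shape, and run_case helper are assumptions, not existing code:

# Sketch: retry flaky CLI/LLM behavior and append each result to a JSONL history.
# run_case, case.id, result.score, and the log path are illustrative assumptions.
import asyncio
import json
import time
from pathlib import Path

RESULTS_LOG = Path("results/benchmark_runs.jsonl")

async def run_case_with_retry(case, attempts: int = 3, backoff: float = 2.0) -> dict:
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = await run_case(case)  # assumed: executes one benchmark case
            record = {"case": case.id, "score": result.score,
                      "attempt": attempt, "ts": time.time()}
            RESULTS_LOG.parent.mkdir(parents=True, exist_ok=True)
            with RESULTS_LOG.open("a") as f:
                f.write(json.dumps(record) + "\n")
            return record
        except Exception as exc:  # broad on purpose: retry any flaky failure
            last_error = exc
            await asyncio.sleep(backoff * attempt)
    raise RuntimeError(f"case {case.id} failed after {attempts} attempts") from last_error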

3. Evaluation Dimensions

  • Correctness: Does the skill provide accurate information?
  • Relevance: Does the response draw on skill knowledge rather than hallucinated content?
  • Tool Use: Does Claude invoke appropriate tools?
  • Completeness: Are all aspects of the query addressed? (a weighted-rubric sketch follows this list)
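
These four dimensions could be scored individually and combined with weights. The dataclass below is an illustrative shape (names and weights are assumptions), not an existing API:

# Illustrative rubric: per-dimension 1-5 scores combined into a weighted overall grade.
from dataclasses import dataclass, field

@dataclass
class SkillRubric:
    weights: dict = field(default_factory=lambda: {
        "correctness": 0.4,    # accurate information from the skill
        "relevance": 0.3,      # skill knowledge vs. hallucination
        "tool_use": 0.15,      # appropriate tool invocations
        "completeness": 0.15,  # all aspects of the query addressed
    })

    def overall(self, scores: dict) -> float:
        """Weighted average of per-dimension scores (each 1-5)."""
        return sum(self.weights[d] * scores[d] for d in self.weights)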

4. Framework Options to Evaluate

Framework                  | Approach                                  | Notes
Custom pytest + LLM judge  | Wrap the CLI, add semantic scoring        | Current direction
Inspect AI custom solver   | Write a solver that shells out to the CLI | May be overkill
Promptfoo custom provider  | YAML config, CLI wrapper                  | Good for CI
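
The "Wrap the CLI, add semantic scoring" row above implies an async helper that shells out to the Claude Code CLI in non-interactive print mode. A minimal sketch, treating run_claude_async as a project helper to be written and the exact CLI flags as assumptions:

# Assumed project helper: wraps the Claude Code CLI (non-interactive `claude -p`)
# so both the skill under test and the judge can be driven from pytest.
import asyncio

async def run_claude_async(prompt: str, model: str = "haiku", timeout: float = 120.0) -> str:
    """Run one prompt through the Claude Code CLI and return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        raise RuntimeError(f"claude CLI timed out after {timeout}s")
    if proc.returncode != 0:
        raise RuntimeError(f"claude CLI failed: {stderr.decode().strip()}")
    return stdout.decode().strip()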

Proposed Design

# Inspect AI-style scorer (a sketch; run_claude_async and extract_score are
# project helpers sketched elsewhere in this issue, not Inspect AI APIs).
from inspect_ai.scorer import Score, Target, mean, scorer


@scorer(metrics=[mean()])
def skill_relevance_judge():
    """LLM-as-judge for skill evaluation."""

    async def score(state, target: Target) -> Score:
        prompt = f"""
        Evaluate if this response demonstrates knowledge from the skill.

        Question: {state.input}
        Response: {state.output.completion}
        Expected concepts: {target.text}

        Score 1-5:
        1 = No skill knowledge used, hallucinated
        2 = Partial, mostly generic knowledge
        3 = Adequate skill usage
        4 = Good skill usage with specifics
        5 = Excellent, comprehensive skill application

        Respond with just the score and one sentence reason.
        """
        # Judge with a cheap model; keep the raw rationale as the explanation.
        result = await run_claude_async(prompt, model="haiku")
        value = extract_score(result)
        return Score(value=value, explanation=result)

    return score
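
The extract_score helper above is assumed; a minimal sketch that pulls the 1-5 digit out of the judge's "score plus one-sentence reason" reply:

# Minimal sketch of the assumed extract_score helper.
import re

def extract_score(judge_output: str) -> int:
    """Return the first standalone 1-5 digit in the judge's reply."""
    match = re.search(r"\b([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge output: {judge_output!r}")
    return int(match.group(1))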

Success Criteria

  • Full benchmark suite completes in under 5 minutes (see the parallel-execution sketch after this list)
  • Results are reproducible (within LLM variance)
  • Clear pass/fail with semantic reasoning
  • Historical tracking shows skill quality trends
  • Works in CI without manual intervention
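
To keep the full suite under the five-minute budget, cases can run concurrently behind a bounded semaphore. A sketch assuming the run_case_with_retry helper from the infrastructure section:

# Sketch: bounded parallel execution so the full suite fits the time budget.
import asyncio

async def run_suite(cases, max_concurrency: int = 4) -> list:
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(case):
        async with sem:
            return await run_case_with_retry(case)

    return await asyncio.gather(*(bounded(c) for c in cases))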

Labels

enhancement, research, testing
