Research: Agent Skills Benchmark Framework with LLM-as-Judge #20

Description

Summary

Research and design a production-grade benchmark framework for evaluating Agent Skills in Claude Code. The current pytest harness works, but it needs LLM-as-judge scoring to produce meaningful evaluations.

Problem

  • No off-the-shelf eval framework supports CLI-based AI agents (Inspect AI, DeepEval, Promptfoo all assume direct API access)
  • Current string-matching assertions are brittle and don't capture semantic correctness
  • Need reproducible, trackable benchmarks for skill quality over time

Research Areas

1. LLM-as-Judge Implementation

  • Research LLM-as-judge patterns (G-Eval, Prometheus, etc.)
  • Design scoring rubrics for skill evaluation
  • Evaluate cost/latency tradeoffs (Haiku vs Sonnet for judging)
  • Handle judge consistency and calibration (see the sketch after this list)
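
One option for consistency and calibration is to sample the judge several times and aggregate (e.g. take the median). A minimal sketch; run_claude_async and extract_score are assumed project helpers, sketched later in this issue:

# Sketch: reduce judge variance by sampling the judge k times and taking the median.
# run_claude_async and extract_score are assumed project helpers, not library APIs.
import statistics

async def judge_with_consensus(judge_prompt: str, k: int = 3, model: str = "haiku") -> float:
    scores = []
    for _ in range(k):
        raw = await run_claude_async(judge_prompt, model=model)
        scores.append(extract_score(raw))
    return statistics.median(scores)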

2. Benchmark Infrastructure

  • Result persistence and tracking over time
  • Retry logic for flaky LLM behavior (a combined persistence-and-retry sketch follows this list)
  • Parallel execution for speed
  • CI/CD integration (GitHub Actions)
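
A minimal sketch combining retry logic with JSONL result persistence, so every run appends to a history that can be charted over time. The log path, record shape, and run_case helper are assumptions, not existing code:

# Sketch: retry flaky CLI/LLM behavior and append each result to a JSONL history.
# run_case, case.id, result.score, and the log path are illustrative assumptions.
import asyncio
import json
import time
from pathlib import Path

RESULTS_LOG = Path("results/benchmark_runs.jsonl")

async def run_case_with_retry(case, attempts: int = 3, backoff: float = 2.0) -> dict:
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = await run_case(case)  # assumed: executes one benchmark case
            record = {"case": case.id, "score": result.score,
                      "attempt": attempt, "ts": time.time()}
            RESULTS_LOG.parent.mkdir(parents=True, exist_ok=True)
            with RESULTS_LOG.open("a") as f:
                f.write(json.dumps(record) + "\n")
            return record
        except Exception as exc:  # broad on purpose: retry any flaky failure
            last_error = exc
            await asyncio.sleep(backoff * attempt)
    raise RuntimeError(f"case {case.id} failed after {attempts} attempts") from last_error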

3. Evaluation Dimensions

  • Correctness: Does the skill provide accurate information?
  • Relevance: Does the response draw on skill knowledge rather than hallucinated content?
  • Tool Use: Does Claude invoke appropriate tools?
  • Completeness: Are all aspects of the query addressed? (a weighted-rubric sketch follows this list)
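
These four dimensions could be scored individually and combined with weights. The dataclass below is an illustrative shape (names and weights are assumptions), not an existing API:

# Illustrative rubric: per-dimension 1-5 scores combined into a weighted overall grade.
from dataclasses import dataclass, field

@dataclass
class SkillRubric:
    weights: dict = field(default_factory=lambda: {
        "correctness": 0.4,    # accurate information from the skill
        "relevance": 0.3,      # skill knowledge vs. hallucination
        "tool_use": 0.15,      # appropriate tool invocations
        "completeness": 0.15,  # all aspects of the query addressed
    })

    def overall(self, scores: dict) -> float:
        """Weighted average of per-dimension scores (each 1-5)."""
        return sum(self.weights[d] * scores[d] for d in self.weights)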

4. Framework Options to Evaluate

Framework                  | Approach                                  | Notes
Custom pytest + LLM judge  | Wrap the CLI, add semantic scoring        | Current direction
Inspect AI custom solver   | Write a solver that shells out to the CLI | May be overkill
Promptfoo custom provider  | YAML config, CLI wrapper                  | Good for CI
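
The "Wrap the CLI, add semantic scoring" row above implies an async helper that shells out to the Claude Code CLI in non-interactive print mode. A minimal sketch, treating run_claude_async as a project helper to be written and the exact CLI flags as assumptions:

# Assumed project helper: wraps the Claude Code CLI (non-interactive `claude -p`)
# so both the skill under test and the judge can be driven from pytest.
import asyncio

async def run_claude_async(prompt: str, model: str = "haiku", timeout: float = 120.0) -> str:
    """Run one prompt through the Claude Code CLI and return its stdout."""
    proc = await asyncio.create_subprocess_exec(
        "claude", "-p", prompt, "--model", model,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=timeout)
    except asyncio.TimeoutError:
        proc.kill()
        raise RuntimeError(f"claude CLI timed out after {timeout}s")
    if proc.returncode != 0:
        raise RuntimeError(f"claude CLI failed: {stderr.decode().strip()}")
    return stdout.decode().strip()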

Proposed Design

# Inspect AI-style scorer (a sketch; run_claude_async and extract_score are
# project helpers sketched elsewhere in this issue, not Inspect AI APIs).
from inspect_ai.scorer import Score, Target, mean, scorer


@scorer(metrics=[mean()])
def skill_relevance_judge():
    """LLM-as-judge for skill evaluation."""

    async def score(state, target: Target) -> Score:
        prompt = f"""
        Evaluate if this response demonstrates knowledge from the skill.

        Question: {state.input}
        Response: {state.output.completion}
        Expected concepts: {target.text}

        Score 1-5:
        1 = No skill knowledge used, hallucinated
        2 = Partial, mostly generic knowledge
        3 = Adequate skill usage
        4 = Good skill usage with specifics
        5 = Excellent, comprehensive skill application

        Respond with just the score and one sentence reason.
        """
        # Judge with a cheap model; keep the raw rationale as the explanation.
        result = await run_claude_async(prompt, model="haiku")
        value = extract_score(result)
        return Score(value=value, explanation=result)

    return score
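
The extract_score helper above is assumed; a minimal sketch that pulls the 1-5 digit out of the judge's "score plus one-sentence reason" reply:

# Minimal sketch of the assumed extract_score helper.
import re

def extract_score(judge_output: str) -> int:
    """Return the first standalone 1-5 digit in the judge's reply."""
    match = re.search(r"\b([1-5])\b", judge_output)
    if match is None:
        raise ValueError(f"no 1-5 score found in judge output: {judge_output!r}")
    return int(match.group(1))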

Success Criteria

  • Full benchmark suite completes in under 5 minutes (see the parallel-execution sketch after this list)
  • Results are reproducible (within LLM variance)
  • Clear pass/fail with semantic reasoning
  • Historical tracking shows skill quality trends
  • Works in CI without manual intervention
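
To keep the full suite under the five-minute budget, cases can run concurrently behind a bounded semaphore. A sketch assuming the run_case_with_retry helper from the infrastructure section:

# Sketch: bounded parallel execution so the full suite fits the time budget.
import asyncio

async def run_suite(cases, max_concurrency: int = 4) -> list:
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(case):
        async with sem:
            return await run_case_with_retry(case)

    return await asyncio.gather(*(bounded(c) for c in cases))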

Labels

enhancement, research, testing
