3.3 Evaluation & Evals #11

@philberryman

Goal

Build an evaluation harness for an LLM feature.

Learn

  • Deterministic vs LLM-as-judge evals
  • Metrics for different use cases
  • Test set creation
  • Regression testing for prompts
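To make the first bullet concrete, here is a minimal sketch of the two eval styles side by side. Everything in it is an assumption for illustration: the `Example` fields, the SKU-based label, the rubric wording, and `fake_judge`, which stands in for a real LLM call so the sketch runs offline. A deterministic check compares output against a label with plain code; an LLM-as-judge check sends the output to a second model and parses its verdict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Example:
    """One labeled test case (hypothetical fields)."""
    query: str
    expected_sku: str  # ground-truth label used by the deterministic check


def deterministic_eval(output: str, example: Example) -> bool:
    """Pass if the expected SKU appears verbatim in the model output."""
    return example.expected_sku in output


def judge_eval(output: str, example: Example, judge: Callable[[str], str]) -> bool:
    """Ask a judge model for a PASS/FAIL verdict against a simple rubric."""
    prompt = (
        "You are grading a shopping assistant.\n"
        f"Question: {example.query}\n"
        f"Answer: {output}\n"
        "Reply with PASS if the answer is helpful and on topic, else FAIL."
    )
    verdict = judge(prompt)
    # Only the leading token is trusted; anything else counts as a failure.
    return verdict.strip().upper().startswith("PASS")


def fake_judge(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: replace with your client)."""
    return "PASS" if "SKU-" in prompt else "FAIL"
```

The deterministic check is cheap and repeatable but only covers what a label can express; the judge covers open-ended qualities such as helpfulness, at the cost of depending on another model's verdict.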

Deliverable

  • Eval suite for the AI shopping assistant
  • Test set with labeled examples

Proof Point

Can measure whether a prompt change helped or hurt.
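One way this proof point could be checked is to run the same labeled test set against the old and new prompt and compare pass rates. The sketch below assumes a substring-match metric and uses canned stub functions (`prompt_v1`, `prompt_v2`) in place of the real assistant; the test cases and all names are hypothetical.

```python
from typing import Callable

# Hypothetical labeled test set: (user query, substring the answer must contain).
TEST_SET = [
    ("What is your return window?", "30 days"),
    ("Do you ship to Canada?", "yes"),
    ("Is the blue mug in stock?", "in stock"),
]


def pass_rate(run_model: Callable[[str], str]) -> float:
    """Fraction of test cases whose output contains the expected substring."""
    passed = sum(
        expected.lower() in run_model(query).lower()
        for query, expected in TEST_SET
    )
    return passed / len(TEST_SET)


def compare(baseline: Callable[[str], str], candidate: Callable[[str], str]) -> float:
    """Candidate pass rate minus baseline pass rate (positive means improvement)."""
    return pass_rate(candidate) - pass_rate(baseline)


# Stubs standing in for the assistant under two prompt versions (assumption).
def prompt_v1(query: str) -> str:
    canned = {"What is your return window?": "Returns are accepted within 30 days."}
    return canned.get(query, "I'm not sure.")


def prompt_v2(query: str) -> str:
    canned = {
        "What is your return window?": "Returns are accepted within 30 days.",
        "Do you ship to Canada?": "Yes, we ship to Canada.",
    }
    return canned.get(query, "I'm not sure.")
```

Calling `compare(prompt_v1, prompt_v2)` returns a positive number when the new prompt passes more cases. On a test set this small a difference of one case can easily be noise, so a real suite would want more examples before drawing conclusions.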

Links

→ Feeds into: FasterShops AI Merchant Assistant

Directory

agent-patterns/03-evaluation-evals/
