Unify LLM SDK usage in eval frameworks #669

## Summary

We currently have two separate LLM SDKs as devDependencies:

- **`@anthropic-ai/sdk`** — used in `test/skill-eval/` (judge + agent via Claude models)
- **`openai`** — used in `test/init-eval/` (judge via GPT-4o)

These serve similar purposes (LLM-as-judge for evaluation) but use different providers, which means two API keys to manage, two SDKs to maintain, and inconsistent eval infrastructure.

## Proposal

Consolidate on a single LLM SDK (likely `@anthropic-ai/sdk` since it's already used more extensively) and migrate `test/init-eval/helpers/judge.ts` from OpenAI GPT-4o to a Claude model.

## Files to update

- `test/init-eval/helpers/judge.ts` — replace OpenAI client with Anthropic client
- `test/init-eval/helpers/create-eval-suite.ts` — update if needed
- `package.json` — remove `openai` dependency
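
As a rough illustration of what the `judge.ts` migration might look like, here is a minimal sketch. It assumes the judge sends one prompt and expects a JSON verdict back; the actual shape of `test/init-eval/helpers/judge.ts` is not shown in this issue. All names here (`buildJudgeRequest`, `parseVerdict`, `judge`, `Verdict`, `MessagesClient`) are hypothetical, and the model ID is left as a parameter rather than guessed. The request and response fields mirror the Anthropic Messages API (`model`, `max_tokens`, `system`, `messages`, and a `content` array of blocks). The client is typed structurally so the sketch compiles without the SDK installed.

```typescript
// Hypothetical sketch: not the existing judge.ts, only one possible shape for it.

// Structural stand-ins for the parts of the Anthropic Messages API used here.
interface JudgeRequest {
  model: string;
  max_tokens: number;
  system: string;
  messages: { role: "user" | "assistant"; content: string }[];
}
interface ContentBlock { type: string; text?: string }
interface MessagesClient {
  messages: { create(req: JudgeRequest): Promise<{ content: ContentBlock[] }> };
}

interface Verdict { pass: boolean; reason: string }

// Build the request body for a single judge call.
function buildJudgeRequest(model: string, criterion: string, output: string): JudgeRequest {
  return {
    model,
    max_tokens: 256,
    system: 'You are an evaluation judge. Reply only with JSON: {"pass": boolean, "reason": string}.',
    messages: [
      { role: "user", content: `Criterion:\n${criterion}\n\nOutput to evaluate:\n${output}` },
    ],
  };
}

// Extract the verdict from the reply text; anything unparseable counts as a failure.
function parseVerdict(text: string): Verdict {
  const match = text.match(/\{[\s\S]*\}/);
  if (!match) return { pass: false, reason: "no JSON object in judge reply" };
  try {
    const parsed = JSON.parse(match[0]) as { pass?: unknown; reason?: unknown };
    return {
      pass: parsed.pass === true,
      reason: typeof parsed.reason === "string" ? parsed.reason : "",
    };
  } catch {
    return { pass: false, reason: "judge reply was not valid JSON" };
  }
}

// Run one judge call. `client` is expected to behave like `new Anthropic()` from
// `@anthropic-ai/sdk` (which reads ANTHROPIC_API_KEY from the environment).
async function judge(
  client: MessagesClient,
  model: string,
  criterion: string,
  output: string,
): Promise<Verdict> {
  const response = await client.messages.create(buildJudgeRequest(model, criterion, output));
  // Concatenate the text blocks; non-text blocks are ignored.
  const text = response.content
    .map((block) => (block.type === "text" && block.text ? block.text : ""))
    .join("");
  return parseVerdict(text);
}
```

Keeping request construction and verdict parsing as pure functions means both can be unit-tested without an API key, and the model choice stays in the eval suite's configuration. Depending on the SDK's typings, the real client may need a thin adapter to satisfy `MessagesClient`.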

## Benefits

- One fewer devDependency to maintain
- Single API key requirement (`ANTHROPIC_API_KEY`) for all evals
- Consistent eval infrastructure across test suites
