Summary
We currently have two separate LLM SDKs as devDependencies:
- @anthropic-ai/sdk: used in test/skill-eval/ (judge + agent via Claude models)
- openai: used in test/init-eval/ (judge via GPT-4o)
These serve similar purposes (LLM-as-judge for evaluation) but use different providers, which means two API keys to manage, two SDKs to maintain, and inconsistent eval infrastructure.
Proposal
Consolidate on a single LLM SDK (likely @anthropic-ai/sdk since it's already used more extensively) and migrate test/init-eval/helpers/judge.ts from OpenAI GPT-4o to a Claude model.
Files to update
- test/init-eval/helpers/judge.ts: replace OpenAI client with Anthropic client
- test/init-eval/helpers/create-eval-suite.ts: update if needed
- package.json: remove openai dependency
Benefits
- One fewer devDependency to maintain
- Single API key requirement (ANTHROPIC_API_KEY) for all evals
- Consistent eval infrastructure across test suites