Commit 79e72c1

betegon and claude authored
refactor(eval): replace OpenAI with Anthropic SDK in init-eval judge (#683)
## Summary

Standardizes all evals on the Anthropic SDK. The skill-eval already used `@anthropic-ai/sdk`; this switches the init-eval judge from OpenAI (`gpt-4o`) to Anthropic (`claude-sonnet-4-6`) and drops the `openai` dependency.

## Changes

- `test/init-eval/helpers/judge.ts`: swap OpenAI client/API for Anthropic Messages API
- `package.json`: remove `openai` from devDependencies
- `OPENAI_API_KEY` → `ANTHROPIC_API_KEY` env var (already required by skill-eval)

## Test plan

- [x] `bun eval:skill` passes (sonnet 100%, opus 87.5%)
- [x] `bun test:init-eval`: judge calls succeed with Anthropic (wizard auth is a separate issue)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 584ec0e commit 79e72c1

3 files changed: 10 additions, 13 deletions

bun.lock

Lines changed: 0 additions & 3 deletions

package.json

Lines changed: 0 additions & 1 deletion
@@ -30,7 +30,6 @@
     "http-cache-semantics": "^4.2.0",
     "ignore": "^7.0.5",
     "marked": "^15",
-    "openai": "^6.22.0",
     "p-limit": "^7.2.0",
     "picomatch": "^4.0.3",
     "pretty-ms": "^9.3.0",

test/init-eval/helpers/judge.ts

Lines changed: 10 additions & 9 deletions
@@ -17,7 +17,7 @@ export type JudgeVerdict = {
 
 /**
  * Use an LLM judge to evaluate whether a **single feature** was correctly set
- * up by the wizard. Returns null if OPENAI_API_KEY is not set.
+ * up by the wizard. Returns null if ANTHROPIC_API_KEY is not set.
  *
  * `docsContent` is the pre-fetched plain-text documentation to include as
  * ground truth in the prompt.
@@ -28,25 +28,25 @@ export async function judgeFeature(
   feature: FeatureDoc,
   docsContent: string
 ): Promise<JudgeVerdict | null> {
-  const apiKey = process.env.OPENAI_API_KEY;
+  const apiKey = process.env.ANTHROPIC_API_KEY;
   if (!apiKey) {
     console.log(
-      ` [judge:${feature.feature}] Skipping LLM judge (no OPENAI_API_KEY set)`
+      ` [judge:${feature.feature}] Skipping LLM judge (no ANTHROPIC_API_KEY set)`
     );
     return null;
   }
 
   // Restore real fetch — test preload mocks it to catch accidental network
-  // calls, but we need real HTTP for the OpenAI API.
+  // calls, but we need real HTTP for the Anthropic API.
   const realFetch = (globalThis as { __originalFetch?: typeof fetch })
     .__originalFetch;
   if (realFetch) {
     globalThis.fetch = realFetch;
   }
 
   // Dynamic import so we don't fail when the package isn't installed
-  const { default: OpenAI } = await import("openai");
-  const client = new OpenAI({ apiKey });
+  const { default: Anthropic } = await import("@anthropic-ai/sdk");
+  const client = new Anthropic({ apiKey });
 
   const newFilesSection = Object.entries(result.newFiles)
     .map(([path, content]) => `### ${path}\n\`\`\`\n${content}\n\`\`\``)
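
Note on the hunk above: the judge swaps the mocked `fetch` back to the real one before calling the API. The following is a minimal sketch of that pattern in isolation, assuming (as the cast in the diff suggests) that the test preload stashes the real implementation under `__originalFetch`. `FetchLike`, `FetchHost`, and `restoreRealFetch` are hypothetical names used only for illustration; the actual code operates directly on `globalThis`.

```typescript
// Hypothetical minimal fetch signature so the sketch does not depend on
// DOM or Node typings.
type FetchLike = (url: string) => Promise<string>;

// Hypothetical stand-in for globalThis: a mocked `fetch` plus an optional
// saved copy of the real one.
type FetchHost = { fetch: FetchLike; __originalFetch?: FetchLike };

// Put the saved real fetch back in place, if one was saved.
// Returns true when a swap happened, false when there was nothing to restore.
function restoreRealFetch(host: FetchHost): boolean {
  const real = host.__originalFetch;
  if (!real) return false;
  host.fetch = real;
  return true;
}
```

Passing the host object in, rather than touching `globalThis` directly, keeps the sketch testable without affecting the process-wide `fetch`.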
@@ -86,13 +86,14 @@ Return ONLY valid JSON with this structure:
   "summary": "Brief overall assessment of ${feature.feature} setup"
 }`;
 
-  const response = await client.chat.completions.create({
-    model: "gpt-4o",
+  const response = await client.messages.create({
+    model: "claude-sonnet-4-6",
     max_tokens: 1024,
     messages: [{ role: "user", content: prompt }],
   });
 
-  const text = response.choices[0]?.message?.content ?? "";
+  const textBlock = response.content.find((b) => b.type === "text");
+  const text = textBlock?.text ?? "";
 
   // Extract JSON from response (handle markdown code blocks)
   const jsonMatch = text.match(/\{[\s\S]*\}/);
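
The last hunk changes how the reply text is read: instead of `choices[0].message.content`, the judge takes the first `text` block from the response's `content` array and then pulls a JSON object out of it. A rough, self-contained sketch of those two steps follows. The block types are simplified stand-ins rather than the SDK's real types, and `firstText` and `extractJson` are hypothetical helper names that do not appear in the commit.

```typescript
// Simplified, hypothetical stand-ins for the SDK's content block types.
type TextBlock = { type: "text"; text: string };
type OtherBlock = { type: "thinking" | "tool_use" };
type ContentBlock = TextBlock | OtherBlock;

// Return the text of the first text block, or "" when there is none.
function firstText(content: ContentBlock[]): string {
  const block = content.find((b): b is TextBlock => b.type === "text");
  return block?.text ?? "";
}

// Grab everything from the first "{" to the last "}" (same regex as the
// diff) and parse it; return null when nothing matches or parsing fails.
function extractJson(text: string): unknown {
  const match = text.match(/\{[\s\S]*\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]);
  } catch {
    return null;
  }
}
```

Because the regex is greedy, a reply wrapped in a markdown code fence still yields the inner object, but a reply containing two separate JSON objects would be captured as one span and fail to parse; the sketch returns `null` in that case.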
