Add direct API judge backend via --judge flag by juppytt · Pull Request #87 · pinchbench/skill

juppytt · 2026-04-01T11:10:25Z

Summary

When --judge is specified with a model ID (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5-20250514, claude), the judge calls the model API directly instead of running an OpenClaw agent session
Fixes llm_judge tasks scoring 0 due to OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only instructions
Without --judge, behavior is unchanged (OpenClaw agent session with default model)

Supported prefixes: openrouter/*, anthropic/*, openai/*, claude
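The prefix routing can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names `JudgeBackend`, `resolve_judge_backend`, and `missing_api_key` and the backend labels are hypothetical, and only the prefixes and environment variable names come from the commit message.

```python
import os
from typing import Mapping, NamedTuple, Optional


class JudgeBackend(NamedTuple):
    """Hypothetical description of where a judge request is sent."""
    name: str                    # illustrative label, not an identifier from the PR
    api_key_env: Optional[str]   # environment variable holding the API key, if any


# Prefix -> backend table, following the list in the commit message.
_PREFIX_BACKENDS = {
    "openrouter/": JudgeBackend("openrouter", "OPENROUTER_API_KEY"),
    "anthropic/": JudgeBackend("anthropic", "ANTHROPIC_API_KEY"),
    "openai/": JudgeBackend("openai", "OPENAI_API_KEY"),
}


def resolve_judge_backend(judge: Optional[str]) -> JudgeBackend:
    """Map the value of --judge to a backend; no flag means the OpenClaw default."""
    if not judge:
        return JudgeBackend("openclaw", None)
    if judge == "claude":
        # Bare "claude" selects the headless CLI (claude -p), which needs no key here.
        return JudgeBackend("claude-cli", None)
    for prefix, backend in _PREFIX_BACKENDS.items():
        if judge.startswith(prefix):
            return backend
    raise ValueError(f"unsupported judge model: {judge!r}")


def missing_api_key(backend: JudgeBackend,
                    env: Mapping[str, str] = os.environ) -> bool:
    """Return True when the backend needs an API key that is not set."""
    return backend.api_key_env is not None and not env.get(backend.api_key_env)
```

Keeping the mapping in a table means an unsupported prefix fails early with a clear error instead of silently falling back to the agent session.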

Test plan

Run benchmark without --judge — verify OpenClaw judge still works as default
Run with --judge openai/gpt-4o — verify direct API judge returns valid scores
Run with --judge anthropic/claude-sonnet-4-6 — verify Anthropic backend
Run with --judge claude — verify headless CLI backend
Verify llm_judge tasks no longer score 0 when using --judge

🤖 Generated with Claude Code

When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
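To illustrate how a judge that ignores its JSON-only instructions could end up scoring 0, here is a minimal sketch. It assumes the grader parses the reply strictly as JSON and falls back to 0 on any failure; the function name `parse_judge_score` and the `score` field are hypothetical and not taken from the repository.

```python
import json


def parse_judge_score(reply: str) -> float:
    """Return the numeric score from a JSON-only judge reply, or 0.0 if unusable."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        # Any prose around the JSON (e.g. a persona greeting) makes parsing fail.
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    score = data.get("score")  # hypothetical field name
    # Reject booleans explicitly, since bool is a subclass of int in Python.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return 0.0
    return float(score)
```

Under these assumptions, a reply such as `{"score": 0.8}` parses cleanly, while the same JSON wrapped in conversational text yields 0.0, which matches the symptom described above.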

juppytt force-pushed the fix/judge-api-backend branch from ba64cb9 to ef38b4a on April 1, 2026 at 11:19

juppytt force-pushed the fix/judge-api-backend branch from ef38b4a to 19d3e7a on April 1, 2026 at 11:25

juppytt wants to merge 1 commit into pinchbench:main from juppytt:fix/judge-api-backend
