Add direct API judge backend via --judge flag #87

Open
juppytt wants to merge 1 commit into pinchbench:main from juppytt:fix/judge-api-backend

Conversation

@juppytt juppytt commented Apr 1, 2026

Summary

  • When --judge is given a model ID (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5-20250514) or the bare keyword claude, the judge calls the model directly (the provider API, or the headless Claude CLI for claude) instead of running an OpenClaw agent session
  • Fixes llm_judge tasks scoring 0 due to OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only instructions
  • Without --judge, behavior is unchanged (OpenClaw agent session with default model)

Supported prefixes: openrouter/*, anthropic/*, openai/*, claude
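
The routing described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: `resolve_judge_backend`, the backend labels, and the choice to strip the prefix from the model ID are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Prefixes named in the PR description; everything else here is hypothetical.
API_PREFIXES = ("openrouter", "anthropic", "openai")


def resolve_judge_backend(judge: Optional[str]) -> Tuple[str, Optional[str]]:
    """Map a --judge value to a (backend, model) pair.

    Assumed behavior: no --judge keeps the default OpenClaw agent session,
    the bare keyword "claude" selects the headless CLI, and "<prefix>/<model>"
    selects a direct API backend.
    """
    if judge is None:
        return ("openclaw", None)
    if judge == "claude":
        return ("claude-cli", None)
    # Split only on the first slash so nested IDs such as
    # "openrouter/openai/gpt-4o" keep the rest of the path as the model.
    prefix, sep, model = judge.partition("/")
    if sep and model and prefix in API_PREFIXES:
        return (prefix, model)
    raise ValueError(f"unsupported --judge value: {judge!r}")
```

Rejecting unknown prefixes with an error, rather than silently falling back to the OpenClaw session, is a design choice of this sketch; the PR does not say how unrecognized values are handled.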

Test plan

  • Run benchmark without --judge — verify OpenClaw judge still works as default
  • Run with --judge openai/gpt-4o — verify direct API judge returns valid scores
  • Run with --judge anthropic/claude-sonnet-4-6 — verify Anthropic backend
  • Run with --judge claude — verify headless CLI backend
  • Verify llm_judge tasks no longer score 0 when using --judge

🤖 Generated with Claude Code

@juppytt juppytt force-pushed the fix/judge-api-backend branch from ba64cb9 to ef38b4a Compare April 1, 2026 11:19
When --judge is specified with a model ID, the judge calls the model
API directly instead of running an OpenClaw agent session. This avoids
OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the
judge's JSON-only grading instructions, which caused all llm_judge
tasks to score 0.

Supported model prefixes:
  - openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
  - anthropic/*  -> Anthropic Messages API (ANTHROPIC_API_KEY)
  - openai/*     -> OpenAI chat completions (OPENAI_API_KEY)
  - claude       -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).
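
Since each API backend listed above reads its key from an environment variable, a pre-flight check could fail fast before any task is graded. The sketch below is illustrative only: `missing_judge_key` and `JUDGE_KEY_ENV` are hypothetical names, and only the three variable names come from the commit message.

```python
import os
from typing import Mapping, Optional

# Environment variables as listed in the commit message.
JUDGE_KEY_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_judge_key(judge: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the name of the required but unset variable, or None if nothing is missing.

    Values without a listed variable (such as the bare "claude" keyword,
    for which the commit message names no key) are treated as needing none.
    """
    prefix = judge.split("/", 1)[0]
    var = JUDGE_KEY_ENV.get(prefix)
    if var is None:
        return None
    # An empty string counts as unset.
    return None if env.get(var) else var
```

For example, `missing_judge_key("openai/gpt-4o", {})` returns `"OPENAI_API_KEY"`, which a caller could turn into an early error message.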

Also fixes pre-existing duplicate _remove_readonly function definition
in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@juppytt juppytt force-pushed the fix/judge-api-backend branch from ef38b4a to 19d3e7a Compare April 1, 2026 11:25