
Normalize judge totals to 0-1 scale#73

Merged
olearycrew merged 1 commit into pinchbench:main from luccathescientist:fix-judge-total-normalization on Mar 24, 2026

Conversation

@luccathescientist

Summary

This fixes an issue where LLM-judge scores could exceed 1.0 when the judge returned a total that was the sum of per-criterion scores instead of
their mean.

The grading pipeline already expects scores on a 0..1 scale, but some recent runs showed impossible values like 3.85/1.0, 5.0/1.0, and hybrid
scores above 1.0.

What changed

  • Clarified the judge prompt so that total must be the arithmetic mean of the criterion scores and must remain in 0..1

  • Normalized judge responses so that when:

    • per-criterion scores are in 0..1, and
    • the returned total is greater than 1.0

    we treat that total as a summed score and convert it back to the mean

  • Added regression tests for:

    • summed LLM-judge totals being normalized back to 0..1
    • hybrid scoring staying within range after normalization
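The guarded conversion described above can be sketched roughly as follows (the function name and exact structure are illustrative, not the actual code in scripts/lib_grading.py):

```python
def normalize_judge_total(criteria, total):
    """Treat an overflowing judge total as a summed score and convert it to the mean.

    Only fires when every per-criterion score is individually in [0, 1]
    but the reported total exceeds 1.0 -- the signature of a summed total
    rather than a legitimately high score on a wider scale.
    """
    values = [float(v) for v in criteria.values()]
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        # Summed total: recover the mean the pipeline expects.
        return sum(values) / len(values)
    # Otherwise pass the total through unchanged.
    return total
```

With four criteria each in 0..1 summing to 3.85, this yields the in-range mean 0.9625; a total that is already in 0..1, or criteria on a wider scale, are left untouched.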

Evidence

Observed in saved benchmark runs:

  • results/0009_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 3.85/1.0
    • task_16_email_triage: 2.8925/1.0
  • results/0010_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 5.0/1.0
    • task_17_email_search: 3.39/1.0

In these cases, the criterion breakdowns were already on a 0..1 scale, so the overflowing totals were consistent with summed judge outputs leaking
into aggregate scoring.

Validation

python3 -m py_compile scripts/lib_grading.py tests/test_lib_grading.py
python3 -m unittest tests/test_lib_grading.py

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The normalization logic is well-guarded: it only fires when all criterion scores are individually within [0.0, 1.0] AND the reported total exceeds 1.0, which correctly distinguishes summed totals from legitimately high scores on a wider scale. The prompt clarification is a sensible belt-and-suspenders approach alongside the runtime normalization.

The regression tests cover the primary fix path. One minor gap is the absence of a negative test (e.g., criteria scored on a 0..5 scale with total > 1.0 should NOT be normalized), but the guard condition all(0.0 <= float(v) <= 1.0 for v in values) handles that correctly.
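The negative case mentioned here could be covered by a test along these lines (the test names and the local re-implementation of the guard are illustrative; the real helper lives in scripts/lib_grading.py):

```python
import unittest


def normalize_judge_total(criteria, total):
    # Illustrative re-implementation of the guard under discussion.
    values = [float(v) for v in criteria.values()]
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        return sum(values) / len(values)
    return total


class TestNoFalseNormalization(unittest.TestCase):
    def test_wider_scale_total_is_not_normalized(self):
        # Criteria scored on a 0..5 scale: a total above 1.0 is legitimate
        # here, so the guard must leave it untouched.
        criteria = {"accuracy": 4.0, "completeness": 3.5}
        self.assertEqual(normalize_judge_total(criteria, 7.5), 7.5)

    def test_summed_total_is_normalized(self):
        # Per-criterion scores in 0..1 with total > 1.0: convert to the mean.
        criteria = {"accuracy": 1.0, "completeness": 0.5}
        self.assertAlmostEqual(normalize_judge_total(criteria, 1.5), 0.75)
```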

Files Reviewed (2 files)
  • scripts/lib_grading.py - 0 issues
  • tests/test_lib_grading.py - 0 issues

Reviewed by claude-4.6-opus-20260205 · 173,890 tokens

Member

@olearycrew left a comment


@luccathescientist thanks for this contribution!

@olearycrew olearycrew merged commit 1e2ba6b into pinchbench:main Mar 24, 2026
1 of 2 checks passed