
Normalize judge totals to 0-1 scale#73

Merged
olearycrew merged 1 commit into pinchbench:main from luccathescientist:fix-judge-total-normalization on Mar 24, 2026

Conversation

@luccathescientist

Summary

This fixes an issue where LLM-judge scores could exceed 1.0 when the judge returned a total that was the sum of per-criterion scores instead of
their mean.

The grading pipeline already expects scores on a 0..1 scale, but some recent runs showed impossible values like 3.85/1.0, 5.0/1.0, and hybrid
scores above 1.0.

What changed

  • Clarified the judge prompt so that total must be the arithmetic mean of the criterion scores and must remain in 0..1

  • Normalized judge responses so that when:

    • per-criterion scores are in 0..1, and
    • the returned total is greater than 1.0

    we treat that total as a summed score and convert it back to the mean

  • Added regression tests for:

    • summed LLM-judge totals being normalized back to 0..1
    • hybrid scoring staying within range after normalization
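The guarded conversion described above can be sketched roughly as follows (the function name and exact structure are illustrative, not the actual code in scripts/lib_grading.py):

```python
def normalize_judge_total(criteria, total):
    """Treat an overflowing judge total as a summed score and convert it to the mean.

    Only fires when every per-criterion score is individually in [0, 1]
    but the reported total exceeds 1.0 -- the signature of a summed total
    rather than a legitimately high score on a wider scale.
    """
    values = [float(v) for v in criteria.values()]
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        # Summed total: recover the mean the pipeline expects.
        return sum(values) / len(values)
    # Otherwise pass the total through unchanged.
    return total
```

With four criteria each in 0..1 summing to 3.85, this yields the in-range mean 0.9625; a total that is already in 0..1, or criteria on a wider scale, are left untouched.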

Evidence

Observed in saved benchmark runs:

  • results/0009_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 3.85/1.0
    • task_16_email_triage: 2.8925/1.0
  • results/0010_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 5.0/1.0
    • task_17_email_search: 3.39/1.0

In these cases, the criterion breakdowns were already on a 0..1 scale, so the overflowing totals were consistent with summed judge outputs leaking
into aggregate scoring.

Validation

python3 -m py_compile scripts/lib_grading.py tests/test_lib_grading.py
python3 -m unittest tests/test_lib_grading.py

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The normalization logic is well-guarded: it only fires when all criterion scores are individually within [0.0, 1.0] AND the reported total exceeds 1.0, which correctly distinguishes summed totals from legitimately high scores on a wider scale. The prompt clarification is a sensible belt-and-suspenders approach alongside the runtime normalization.

The regression tests cover the primary fix path. One minor gap is the absence of a negative test (e.g., criteria scored on a 0..5 scale with total > 1.0 should NOT be normalized), but the guard condition all(0.0 <= float(v) <= 1.0 for v in values) handles that correctly.
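The negative case mentioned here could be covered by a test along these lines (the test names and the local re-implementation of the guard are illustrative; the real helper lives in scripts/lib_grading.py):

```python
import unittest


def normalize_judge_total(criteria, total):
    # Illustrative re-implementation of the guard under discussion.
    values = [float(v) for v in criteria.values()]
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        return sum(values) / len(values)
    return total


class TestNoFalseNormalization(unittest.TestCase):
    def test_wider_scale_total_is_not_normalized(self):
        # Criteria scored on a 0..5 scale: a total above 1.0 is legitimate
        # here, so the guard must leave it untouched.
        criteria = {"accuracy": 4.0, "completeness": 3.5}
        self.assertEqual(normalize_judge_total(criteria, 7.5), 7.5)

    def test_summed_total_is_normalized(self):
        # Per-criterion scores in 0..1 with total > 1.0: convert to the mean.
        criteria = {"accuracy": 1.0, "completeness": 0.5}
        self.assertAlmostEqual(normalize_judge_total(criteria, 1.5), 0.75)
```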

Files Reviewed (2 files)
  • scripts/lib_grading.py - 0 issues
  • tests/test_lib_grading.py - 0 issues

Reviewed by claude-4.6-opus-20260205 · 173,890 tokens

Member

@olearycrew left a comment


@luccathescientist thanks for this contribution!

@olearycrew olearycrew merged commit 1e2ba6b into pinchbench:main Mar 24, 2026
1 of 2 checks passed