Normalize judge totals to 0-1 scale #73
Code Review Summary: Status: No Issues Found | Recommendation: Merge

The normalization logic is well-guarded: it only fires when all criterion scores are individually within `0..1`. The regression tests cover the primary fix path. One minor gap is the absence of a negative test (e.g., criteria scored on a scale other than `0..1`).

Files Reviewed (2 files)
Reviewed by claude-4.6-opus-20260205 · 173,890 tokens
olearycrew left a comment
@luccathescientist thanks for this contribution!
Summary
This fixes an issue where LLM-judge scores could exceed `1.0` when the judge returned a `total` that was the sum of per-criterion scores instead of their mean.
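As an illustration of the failure mode, consider a judge scoring four criteria on the expected `0..1` scale. The values below are invented for this sketch (not taken from the repo), chosen so their sum matches the `3.85` seen in the saved runs:

```python
# Hypothetical per-criterion scores on the expected 0..1 scale.
# These values are made up for illustration, not from the repo.
criterion_scores = [0.9, 1.0, 0.95, 1.0]

# What the pipeline expects the judge to report: the arithmetic mean,
# which always stays within 0..1.
mean_total = sum(criterion_scores) / len(criterion_scores)

# What the buggy responses reported instead: the raw sum, which can
# overflow 1.0 even though every individual criterion is in range.
summed_total = sum(criterion_scores)

print(f"mean: {mean_total:.4f}, sum: {summed_total:.2f}")
```

Here the mean is `0.9625` while the sum is `3.85`, i.e. a "score" nearly four times the maximum.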
The grading pipeline already expects scores on a `0..1` scale, but some recent runs showed impossible values like `3.85/1.0`, `5.0/1.0`, and hybrid scores above `1.0`.

What changed
- Clarified the judge prompt so `total` must be the arithmetic mean of the criterion scores and must remain in `0..1`
- Normalized judge responses: when every criterion score is individually within `0..1` and `total` is greater than `1.0`, we treat that total as a summed score and convert it back to the mean
- Added regression tests for totals that overflow `0..1` and are normalized back into range

Evidence
Observed in saved benchmark runs:
- `results/0009_local-openai-gpt-oss-20b.json`: `task_15_daily_summary` `3.85/1.0`, `task_16_email_triage` `2.8925/1.0`
- `results/0010_local-openai-gpt-oss-20b.json`: `task_15_daily_summary` `5.0/1.0`, `task_17_email_search` `3.39/1.0`

In these cases, the criterion breakdowns were already on a `0..1` scale, so the overflowing totals were consistent with summed judge outputs leaking into aggregate scoring.
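The normalization described above can be sketched as follows. The function name and argument shapes are assumptions for illustration, not the repo's actual API:

```python
def normalize_judge_total(criterion_scores, total):
    """Convert a summed judge total back to a mean, but only when safe.

    The guard mirrors the conditions described in this PR: every
    criterion score must individually sit within 0..1, and the reported
    total must exceed 1.0 (impossible for a true mean of 0..1 scores).
    """
    scores = list(criterion_scores)
    looks_summed = (
        bool(scores)
        and all(0.0 <= s <= 1.0 for s in scores)
        and total > 1.0
    )
    if looks_summed:
        # Treat the overflowing total as a sum and recover the mean.
        return sum(scores) / len(scores)
    # Otherwise leave the judge's total untouched, including the case
    # where criteria were scored on some other scale entirely.
    return total
```

With the observed `3.85/1.0` case and four criteria summing to `3.85`, this helper would return their mean of `0.9625`; a total already within `0..1` passes through unchanged.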
Validation