[low-priority] mmlu_pro: Allow using LLM-as-judge as parser by tomtseng · Pull Request #101 · criticalml-uw/TamperBench

tomtseng · 2026-02-24T13:45:06Z

Changes

In PR #99 we found that LLM-as-a-judge grading for MMLU-Pro works better than the regex parser, though it doesn't change our top-line results much. This PR makes the MMLU-Pro evaluation configurable to use LLM-as-a-judge grading directly (as opposed to #99, which invokes the LLM-as-a-judge post hoc). The regex parser remains the default.
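
For readers skimming the description, here is a minimal sketch of what a configurable parser choice with a regex default could look like. It is not the code in this PR: `AnswerParser`, `MMLUProGradingConfig`, `extract_answer`, the regex pattern, and the injected `judge` callable are all hypothetical names chosen for illustration, and the judge is passed in as a plain function so the sketch makes no assumptions about any LLM client API.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class AnswerParser(Enum):  # hypothetical name, not taken from the PR
    REGEX = "regex"
    LLM_JUDGE = "llm_judge"


@dataclass
class MMLUProGradingConfig:  # hypothetical config object
    # The regex parser stays the default, mirroring the PR description.
    parser: AnswerParser = AnswerParser.REGEX


# Assumed pattern: matches phrases like "answer is (C)" or "answer is C".
_ANSWER_RE = re.compile(r"answer is \(?([A-J])\)?", re.IGNORECASE)


def parse_with_regex(response: str) -> Optional[str]:
    """Return the last answer letter matched by the regex, or None."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def extract_answer(
    response: str,
    config: MMLUProGradingConfig,
    judge: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Extract an answer letter using the parser selected in the config."""
    if config.parser is AnswerParser.LLM_JUDGE:
        if judge is None:
            raise ValueError("LLM_JUDGE parser selected but no judge was provided")
        # The judge is any callable mapping a response to a letter (or None).
        return judge(response)
    return parse_with_regex(response)
```

In this sketch, a response such as "I'd go with option C." yields no answer under the regex default, while a judge callable can still recover a letter from it. That is the kind of case where a judge-based parser may help.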

Testing

New unit tests. I haven't tried running the code fully end-to-end.

tomtseng mentioned this pull request Feb 24, 2026

scripts mmlu_pro: Analyze using gpt-5-mini as llm-as-judge #99

Merged

tomtseng force-pushed the tomtseng/mmlu-pro-llm-judge-impl branch from 60d56b7 to 8cb0e39 Compare February 24, 2026 14:58

mmlu_pro: Allow using LLM-as-judge as parser

77eaa35

tomtseng force-pushed the tomtseng/mmlu-pro-llm-judge-impl branch from 8cb0e39 to 77eaa35 Compare February 24, 2026 15:14

tomtseng changed the title ~~mmlu_pro: Allow using LLM-as-judge as parser~~ [low-priority] mmlu_pro: Allow using LLM-as-judge as parser Feb 24, 2026

tomtseng marked this pull request as ready for review February 24, 2026 15:14

tomtseng wants to merge 1 commit into main from tomtseng/mmlu-pro-llm-judge-impl
