Course Exam Benchmark: Add Evaluation Infrastructure #61
Conversation
Pull request overview
This pull request adds complete evaluation infrastructure for the courseexam benchmark, implementing a modernized data format and evaluation pipeline using the Inspect framework.
Key Changes:
- Implements evaluation infrastructure with task definition, dataset loading, custom metrics, and LLM-as-judge scoring for both ExactMatch and Freeform question types
- Migrates the data format from a flat structure with `instance_id` to a nested structure with `input`/`target`/`metadata` fields, following Inspect conventions (a sketch of a record in the new format follows after this list)
- Updates the test suite to validate the new data schema and adds automated dataset generation as part of the test workflow
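A minimal sketch of a single record in the new nested JSONL format. Only the `input`/`target`/`metadata` split is stated in this PR; the question text and the specific metadata keys below (`exam`, `question_type`, `points`) are assumptions for illustration.

```python
# Hypothetical record in the new nested format, following Inspect's Sample
# convention. The metadata keys shown here are assumptions, not the schema.
import json

record = {
    "input": "Which state is a process in while waiting for disk I/O?",
    "target": "C",
    "metadata": {
        "exam": "example_course_2024_midterm",
        "question_type": "ExactMatch",
        "points": 2,
    },
}
print(json.dumps(record))
```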
Reviewed changes
Copilot reviewed 14 out of 18 changed files in this pull request and generated 1 comment.
Summary per file:

| File | Description |
|---|---|
| `courseexam/__init__.py` | Package initialization exposing the main API |
| `courseexam/courseexam.py` | Main task definition with conditional solver for different question types |
| `courseexam/dataset.py` | Dataset loading with filtering by exam, type, and tags, plus reference material injection |
| `courseexam/metrics.py` | Custom metrics for points-based evaluation (accuracy, mean, totals) |
| `courseexam/scorer.py` | Hybrid scorer using exact match for multiple choice and an LLM judge for freeform |
| `courseexam/prepare.py` | Data preparation script to convert markdown exams to JSONL format |
| `run_eval.py` | Convenience script for running evaluations with configurable parameters |
| `prepare_dataset.py` | Entry point script to prepare the dataset from raw markdown files |
| `pyproject.toml` | Project configuration with dependencies and build settings |
| `tests/test_data_schema.py` | Updated schema validation tests for the new data format |
| `data/raw/example_course_2024_midterm/exam.md` | Example exam updated with the new format (choices field, answer as letter index) |
| `data/raw/example_course_2024_midterm/raft_basics.md` | New reference material file |
| `data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md` | Real exam updated with the new format |
| `data/raw/6_1810_operating_system_engineering_fall_2024_final/mmap.md` | New reference material file |
| `README.md` | Comprehensive documentation updates covering evaluation, data format, and usage |
| `.github/workflows/test.yml` | CI updates for Python 3.10 and pyproject.toml-based installation |
Comments suppressed due to low confidence (8)
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:134 - The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format (see example_course_2024_midterm/exam.md), this should contain the actual text of choices like ["an error because 'b' does not exist", "an error because the symlink was already visited", "an error because 'b' points to itself", "nothing because xv6 will panic"].
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:661 - The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the four choices presented in the question.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:166 - The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:288 - The "choices" field contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the four choices presented in the question.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:368 - The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:465 - The "choices" field contains ["A", "B", "C", "D", "E"] instead of the actual option text. For consistency with the documented format, this should contain the actual text of the five choices presented in the question.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:46 - In the exam data, the "choices" field contains ["A", "B", "C", "D"] but these are just the option letters, not the actual choice content. According to the format in other questions like Question 1 of the example exam (lines 31-32), the choices should contain the actual text options like ["Running", "Ready", "Blocked", "Terminated"]. The current format is inconsistent with the documented structure.
benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md:96 - The same issue with "choices" exists here - it contains ["A", "B", "C", "D"] instead of the actual option text. For consistency with the documented format, this should be ["an inode number", "a block number", "file data", "a bitmap"].
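For context, the shape Copilot is pointing at: `choices` should carry the full option text, with the answer recorded as a letter index. The option texts below come from Copilot's comment on line 96; the surrounding dict structure is illustrative, not the raw exam markdown.

```python
# Illustrative shape only; the raw exams are markdown, so the serialization
# differs. The point is that "choices" holds the option text, not the letters.
question = {
    "choices": ["an inode number", "a block number", "file data", "a bitmap"],
    "answer": "A",  # letter index of the correct choice (illustrative)
}
```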
Force-pushed 61a0469 to 9df28a5. Signed-off-by: Tarek <tareknaser360@gmail.com>
Force-pushed 9df28a5 to 440b165. Signed-off-by: Tarek <tareknaser360@gmail.com>
xuafeng left a comment:
Overall this is great. I left a few small comments. Thanks.
```
B. 1
C. 2
```
Should I ignore all these exam data files?
Yes, I mostly just moved them to a new `raw/` subfolder.
```python
@solver
def conditional_solver() -> Solver:
    mc_solver = multiple_choice()
```
Do we know what concrete prompt this multiple_choice function sends to the LLM?
Yes, we can view the full LLM query and response in the UI.
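If a programmatic check is ever needed, the transcript can also be read from the eval log; a hedged sketch using Inspect's log API (the log file path is a placeholder):

```python
# Sketch: print the messages (including the multiple_choice prompt) recorded
# for the first sample of an eval run. The log path is a placeholder.
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/courseexam-example.eval")
if log.samples:
    for message in log.samples[0].messages:
        print(f"[{message.role}] {message.text}")
```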
```python
@solver
def conditional_solver() -> Solver:
```
I assume that we can also implement our own solver, right?
Yeah, I actually started by implementing a custom solver before realizing this is more convenient.
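For anyone curious what a hand-rolled solver would look like, here is a minimal sketch against Inspect's solver API; the prompt wording is illustrative and not what the benchmark actually uses.

```python
from inspect_ai.solver import Generate, Solver, TaskState, solver


@solver
def custom_exam_solver() -> Solver:
    """Minimal custom solver sketch: prepend a short instruction, then generate."""

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # Illustrative instruction; the real benchmark uses multiple_choice()
        # for ExactMatch questions rather than this.
        state.user_prompt.text = (
            "Answer the following exam question concisely.\n\n"
            + state.user_prompt.text
        )
        return await generate(state)

    return solve
```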
```python
@scorer(metrics=[accuracy(), stderr(), mean()])
def exam_scorer(judge_model: str = "openai/gpt-4o-mini") -> Scorer:
```
Maybe we can use a stronger judge model?
This is just the default one; it's configurable through the CLI we expose.
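For reference, a hedged sketch of overriding the judge from Python via Inspect's eval API. Whether `courseexam()` actually accepts a `judge_model` argument is an assumption based on the scorer signature above, and the model names are illustrative.

```python
# Sketch only: assumes the courseexam task forwards judge_model to exam_scorer.
from inspect_ai import eval
from courseexam.courseexam import courseexam

eval(
    courseexam(judge_model="openai/gpt-4o"),  # stronger judge than the default
    model="openai/gpt-4o-mini",               # model under evaluation (illustrative)
)
```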
```python
from courseexam.courseexam import courseexam

# Configuration - modify these constants as needed
```
Should we move it to a config file later?
I think users will mainly interact with the benchmark through the CLI. I wrote this simple script as an alternative for more fine-grained control if needed.
We could also add a separate configuration file as a third way to run the benchmark but I don’t think it adds much value right now. What do you think?
sure.
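For readers following along: the convenience script is configured by editing module-level constants rather than a config file. A sketch of what that block might look like; the names and defaults are assumptions, not the actual run_eval.py contents.

```python
# Assumed shape of the configuration block in run_eval.py; every name and
# default value here is illustrative.
MODEL = "openai/gpt-4o-mini"        # model being evaluated
JUDGE_MODEL = "openai/gpt-4o-mini"  # judge used for freeform questions
EXAM = None                         # e.g. "example_course_2024_midterm"
LIMIT = None                        # optional cap on the number of questions
```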
```
- Explanation questions
- Multiple choice with partial credit (using custom rubric)

## No Images/Figures Policy
```
I think many models, such as the ones from OpenAI, Anthropic, and Google, already support multimodal input (images and text). I am wondering why we need to convert images to text.
Images put language models at a disadvantage, and most of the models we benchmark (as of now) are language-only, so for now I think questions that rely on images shouldn't be included.
Later, when we support agents, we can decide how to handle image-based questions, either with clear flags or in a separate standalone benchmark.
What do you think?
Models like gpt-4o and claude-sonnet-4.5, I think they support images. Please double check.
They do, yes. Do you think we should put questions that reference images in a separate benchmark?
Can we have some tag or label for these tasks? If the model supports images, we test all of them; if not, we only run the ones without images. They all stay inside the same exam, but with additional labels.
What do you think?
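If that labeling route is taken, a minimal sketch of the filtering side, assuming a boolean `has_image` metadata flag (the flag name and helper are hypothetical):

```python
# Hypothetical helper: drop image-dependent questions for text-only models.
# Assumes each sample's metadata carries a boolean "has_image" flag.
from inspect_ai.dataset import Sample


def text_only(samples: list[Sample]) -> list[Sample]:
    """Keep only questions that do not depend on an image."""
    return [s for s in samples if not (s.metadata or {}).get("has_image", False)]
```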
```
@@ -0,0 +1,23 @@
#!/usr/bin/env python3
```
Do you remember that we have some install.sh and run.sh scripts? The goal is to make it easy to deploy the repo on a new machine and quickly test many benchmarks on both open-source models and our own trained models.
I am thinking that we will have many benchmarks in the future. How can we quickly test a bunch of benchmarks on a new model? Does Inspect already support it?
Maybe we still need some one-click scripts? Not urgent for this PR.
I'm not sure if Inspect can run multiple benchmarks at once (as a single experiment with one log file), but we can definitely write scripts to run `inspect eval` for each separate benchmark; a rough sketch is below.
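A rough sketch of what such a one-click driver could look like, shelling out to `inspect eval` once per benchmark; the benchmark task paths and the default model are placeholders.

```python
# Hypothetical "run all benchmarks" driver: one `inspect eval` invocation per
# benchmark task. Task paths and the default model are placeholders.
import subprocess

BENCHMARKS = [
    "benchmarks/courseexam_bench/courseexam",
    # add further benchmark task paths here as they land
]


def run_all(model: str = "openai/gpt-4o-mini") -> None:
    for task in BENCHMARKS:
        subprocess.run(["inspect", "eval", task, "--model", model], check=True)


if __name__ == "__main__":
    run_all()
```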
Description
The PR adds infrastructure to run the `courseexam` benchmark evaluation using the updated data format.