diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml index 5221486..3b1021a 100644 --- a/.github/workflows/test.yml +++ b/.github/workflows/test.yml @@ -33,7 +33,7 @@ jobs: - name: Set up Python uses: actions/setup-python@v5 with: - python-version: '3.9' + python-version: '3.10' - name: Install dependencies working-directory: benchmarks/${{ matrix.benchmark }} @@ -45,6 +45,9 @@ jobs: if [ -f requirements.txt ]; then pip install -r requirements.txt fi + if [ -f pyproject.toml ]; then + pip install -e ".[dev]" + fi deactivate - name: Run tests diff --git a/benchmarks/courseexam_bench/README.md b/benchmarks/courseexam_bench/README.md index 0b6c564..873d6c4 100644 --- a/benchmarks/courseexam_bench/README.md +++ b/benchmarks/courseexam_bench/README.md @@ -2,11 +2,9 @@ A benchmark for evaluating AI agents on systems course exam questions. This benchmark tests AI agents' ability to answer technical exam questions covering topics like operating systems, distributed systems, databases, and networking. -> Note: The infrastructure to run this benchmark is still work in progress. For now, we are focused on defining the data format and collecting exam questions. +## What is this benchmark? -## Scope: Standalone Questions Only - -This benchmark is designed for standalone exam questions similar to traditional paper exams. Think of this as a closed-book exam with an optional cheatsheet where students solve questions using only pen and paper (no code execution, no file system access, no web browsing, or any other agent actions) +This benchmark is designed for standalone exam questions similar to traditional paper exams. Think of this as a closed-book exam with an optional cheatsheet where students solve questions using only pen and paper (no code execution, no file system access, no web browsing, or any other agent actions). What's allowed: @@ -19,11 +17,92 @@ What's not allowed: - Questions requiring file system interaction - Questions requiring web access or external tools - Any form of agent execution beyond reading and answering -- Questions with images/figures (see below for details) +- Questions with images/figures (see policy below) + +> We understand that different professors and universities use different terminology (e.g., "take-home exam," "assignments" instead of "labs"), but the line we draw is that the exam benchmark only includes plain text questions with optional reference markdown files. If a question requires any form of execution, environment setup, or interactive problem-solving, it belongs in the [Course Lab Benchmark](../courselab_bench/) instead. + +## Running Evaluation + +This benchmark uses [Inspect](https://inspect.ai-safety-institute.org.uk/) for evaluation. After preparing the dataset, you can run evaluations with different models and configurations. + +```bash +# Install the benchmark +pip install -e . 
+ +# Prepare the dataset (run this once after adding new exams) +python prepare_dataset.py + +inspect eval courseexam --model openai/gpt-4 + +# Filter by exam +inspect eval courseexam -T exam_ids=example_course_2024_midterm --model openai/gpt-4 + +# Filter by question type (ExactMatch or Freeform) +inspect eval courseexam -T question_types=ExactMatch --model openai/gpt-4 + +# Filter by tags +inspect eval courseexam -T tags=operating-systems --model openai/gpt-4 + +# Use a different judge model for Freeform questions +inspect eval courseexam -T judge_model=openai/gpt-4o --model openai/gpt-4 + +# If you prefer to modify the configuration in code, edit run_eval.py then run: +python run_eval.py +``` + +After running evaluations, view results with: + +```bash +inspect view +``` + +## Adding New Exams -> We understand that different professors and universities use different terminology (e.g., "take-home exam," "assignments" instead of "labs"), but the line we draw is that the exam benchmark only includes plain text questions with optional reference markdown files. If a question requires any form of execution, environment setup, or interactive problem-solving, it belongs in the [Course Lab Benchmark](../courselab_bench/) instead, which offers more flexibility for coding tasks and agent interactions. +Exams are written in markdown format and automatically converted using `prepare_dataset.py`. -### No Images/Figures Policy +### Markdown Format + +See [data/raw/example_course_2024_midterm/exam.md](data/raw/example_course_2024_midterm/exam.md) for a complete example with comments explaining each question type. + +Format overview: + +- JSON block at the top for exam metadata +- Questions separated by `---` +- Question text in markdown +- JSON block for question metadata and answer +- Optional `comments` field for notes about the question + +### Steps to Add an Exam + +1. Create a new folder in `data/raw/`: `data/raw/your_exam_name/` +2. Create `exam.md` in that folder following the format +3. (Optional) Add reference materials (`.md` files) to the same folder +4. Run the preparation script: `python prepare_dataset.py` + +The script will automatically parse your markdown and generate `data/questions.jsonl` and `data/exams_metadata.json`. It will also copy any reference materials to `data/reference_materials/`. + +### Handling Multi-Part Questions + +Each question should be standalone. The `problem_id` field is flexible and supports values like `2.1` or `4a` for subquestions. + +If multiple subquestions share the same context, you have two options: + +- Separate questions: Create individual questions with their own JSON blocks. Make sure to include the shared context in each question's text. +- Combined question: Merge subquestions into one question (JSON). Ask for a comma-separated list of answers (e.g., `"A, B, D"`) for `ExactMatch`, or use `Freeform` with an LLM judge for more complex grading. + +## Question Types + +ExactMatch: Questions with one correct answer, graded by exact string match + +- Single choice (A/B/C/D) +- True/False + +Freeform: Free-form text questions graded by LLM-as-judge + +- Explanation questions +- Multiple choice with partial credit (using custom rubric) + +## No Images/Figures Policy We don't allow pictures or images in questions to avoid penalizing text-only agents. The rule is that figures must have a textual ground truth. No task should rely on visual interpretation alone. 
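For orientation, here is a minimal sketch of the exam.md layout described above. The course name, point values, and answer are placeholders, and the exam-level fields are only a subset for illustration; see data/raw/example_course_2024_midterm/exam.md for the authoritative, commented example and the schema sections below for the full field lists.

````markdown
```json
{
  "exam_id": "your_exam_name",
  "institution": "Example University",
  "num_questions": 1,
  "score_total": 5
}
```

---

## Question 1

What state is a process in when it is waiting for I/O to complete?

```json
{
  "problem_id": "1",
  "points": 5,
  "type": "ExactMatch",
  "tags": ["operating-systems", "processes"],
  "choices": ["Running", "Ready", "Blocked", "Terminated"],
  "answer": "C",
  "comments": "Optional notes about the question."
}
```
````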
@@ -41,13 +120,14 @@ The benchmark consists of exam questions stored in a structured format: ``` data/ +├── raw/ # Source exam.md files +│ └── example_course_2024_midterm/ +│ └── exam.md ├── exams_metadata.json # Exam-level metadata (generated) ├── questions.jsonl # Question data (generated) -├── reference_materials/ # Cheatsheets and reference documents -│ ├── raft_basics.md -│ └── ... -└── example_course_2024_midterm/ - └── exam.md # Exam in markdown format +└── reference_materials/ # Cheatsheets and reference documents + ├── raft_basics.md + └── ... ``` ### Exam Metadata Schema @@ -82,85 +162,40 @@ Optional fields: - `institution` (string): University or institution name -### Question Types - -ExactMatch: Questions with one correct answer, graded by exact string match - -- Single choice (A/B/C/D) -- True/False - -> **Note:** For ExactMatch questions, always include explicit instructions about the expected answer format in the question text. For example: "Your answer should be one letter only (A, B, C, D, or E). No extra text." - -Freeform: Free-form text questions graded by LLM-as-judge - -- Explanation questions -- Multiple choice with partial credit (using custom rubric) - ### Question Schema -Each line in `questions.jsonl` is a JSON object: - ```json { - "instance_id": 1, - "exam_id": "example_course_2024_midterm", - "problem_id": "1", - "points": 5, - "problem": "What state is a process in when it is waiting for I/O?...", - "answer": "C", - "type": "ExactMatch", - "tags": ["operating-systems", "processes"] + "input": "What state is a process in when it is waiting for I/O?", + "target": "C", + "choices": ["Running", "Ready", "Blocked", "Terminated"], + "id": "example_course_2024_midterm_1", + "metadata": { + "exam_id": "example_course_2024_midterm", + "problem_id": "1", + "points": 5, + "type": "ExactMatch", + "tags": ["operating-systems", "processes"] + } } ``` Required fields: -- `instance_id` (integer): Globally unique identifier (incremental by 1) -- `exam_id` (string): Links to exam in exams_metadata.json -- `problem_id` (string): Question identifier within exam (e.g., "1", "2.1", "4a") -- `points` (integer): Points allocated to this question -- `problem` (string): Question text in markdown format -- `answer` (string): Correct answer -- `type` (string): Either "ExactMatch" or "Freeform" -- `tags` (array): Topic tags (minimum 1, lowercase with hyphens) +- `input` (string): Question text in markdown format +- `target` (string): Correct answer +- `id` (string): Unique identifier +- `metadata`: + - `exam_id` (string): Links to exam in exams_metadata.json + - `problem_id` (string): Question identifier within exam (e.g., "1", "2.1", "4a") + - `points` (integer): Points allocated to this question + - `type` (string): Either "ExactMatch" or "Freeform" + - `tags` (array): Topic tags (minimum 1, lowercase with hyphens) Optional fields: -- `reference_materials` (array): Paths to reference markdown files (e.g., `["reference_materials/raft_basics.md"]`) -- `llm_judge_instructions` (string): Custom grading rubric for Freeform questions - -## Adding New Exams - -Exams are written in markdown format and automatically converted to the JSON format using `prepare_dataset.py`. - -### Markdown Format - -See [data/example_course_2024_midterm/exam.md](data/example_course_2024_midterm/exam.md) for a complete example with comments explaining each question type. 
- -Overview of the format: - -- JSON block at the top for exam metadata -- Questions separated by `---` -- Question text in markdown -- JSON block for question metadata and answer -- Optional `comments` field for notes about the question - -#### Handling Multi-Part Questions - -Each question should be standalone. The `problem_id` field is flexible and supports values like `2.1` or `4a` for subquestions. - -If multiple subquestions share the same context, you have two options: - -- Separate questions: Create individual questions with their own JSON blocks. Make sure to include the shared context in each question's text. -- Combined question: Merge subquestions into one question (JSON). Ask for a comma-separated list of answers (e.g., `"A, B, D"`) for `ExactMatch`, or use `Freeform` with an LLM judge for more complex grading. - -### Steps to Add an Exam - -1. Create a new folder in `data/`: `data/your_exam_name/` -2. Create `exam.md` in that folder following the format (see [example_course_2024_midterm/exam.md](data/example_course_2024_midterm/exam.md)) -3. (Optional) Add reference materials (`.md` files) to the same folder -4. Run the preparation script: `python prepare_dataset.py` - -The script will automatically parse your markdown, assign incremental `instance_id` values, and generate `data/questions.jsonl` and `data/exams_metadata.json`. It will also copy any reference materials to `data/reference_materials/`. +- `choices` (array): Answer choices for multiple choice/True-False questions. Required for ExactMatch questions with options. Target must be a letter (A, B, C, etc.) corresponding to the choice index. +- `metadata.reference_materials` (array): Paths to reference markdown files (e.g., `["reference_materials/raft_basics.md"]`) +- `metadata.llm_judge_instructions` (string): Custom grading rubric for Freeform questions > If anything in this documentation or the example exam is unclear, please [open an issue](https://github.com/sys-intelligence/system-intelligence-benchmark/issues) with details about what needs clarification. 
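To make the `choices`/`target` convention above concrete, here is a tiny illustrative snippet (not part of the benchmark code) showing how a letter target indexes into the choices array, using the values from the Question Schema example:

```python
# Illustrative only: a letter target selects the choice at the
# corresponding index ("A" -> 0, "B" -> 1, "C" -> 2, ...).
# Values copied from the Question Schema example above.
choices = ["Running", "Ready", "Blocked", "Terminated"]
target = "C"

index = ord(target) - ord("A")
assert choices[index] == "Blocked"
```

During evaluation, the ExactMatch path in this PR relies on Inspect's `multiple_choice` solver and `choice` scorer, which apply the same letter-to-choice convention; the snippet only documents the convention the dataset format assumes.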
diff --git a/benchmarks/courseexam_bench/courseexam/__init__.py b/benchmarks/courseexam_bench/courseexam/__init__.py new file mode 100644 index 0000000..a4c8506 --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/__init__.py @@ -0,0 +1,20 @@ +from .courseexam import courseexam +from .dataset import load_dataset, load_exam_metadata +from .metrics import ( + points_accuracy, + total_points_earned, + total_points_possible, +) +from .scorer import exam_scorer + +__all__ = [ + "courseexam", + "load_dataset", + "load_exam_metadata", + "exam_scorer", + "points_accuracy", + "total_points_earned", + "total_points_possible", +] + +__version__ = "0.1.0" diff --git a/benchmarks/courseexam_bench/courseexam/courseexam.py b/benchmarks/courseexam_bench/courseexam/courseexam.py new file mode 100644 index 0000000..da198fc --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/courseexam.py @@ -0,0 +1,73 @@ +from inspect_ai import Task, task +from inspect_ai.model import GenerateConfig +from inspect_ai.solver import ( + Generate, + Solver, + TaskState, + generate, + multiple_choice, + solver, +) + +from courseexam.dataset import load_dataset, load_exam_metadata +from courseexam.metrics import ( + points_accuracy, + total_points_earned, + total_points_possible, +) +from courseexam.scorer import exam_scorer + + +@solver +def conditional_solver() -> Solver: + mc_solver = multiple_choice() + gen_solver = generate() + + async def solve(state: TaskState, generate_fn: Generate) -> TaskState: + question_type = state.metadata.get("type", "Freeform") + + if question_type == "ExactMatch": + return await mc_solver(state, generate_fn) + else: + return await gen_solver(state, generate_fn) + + return solve + + +@task +def courseexam( + exam_ids: str | list[str] | None = None, + question_types: str | list[str] | None = None, + tags: str | list[str] | None = None, + shuffle: bool = False, + max_tokens: int = 2048, + judge_model: str = "openai/gpt-4o-mini", +) -> Task: + dataset = load_dataset( + exam_ids=exam_ids, + question_types=question_types, + tags=tags, + shuffle=shuffle, + ) + + metadata = load_exam_metadata() + + exam_info = { + "num_exams": len(metadata["exams"]), + "exams": metadata["exams"], + } + + return Task( + dataset=dataset, + solver=conditional_solver(), + scorer=exam_scorer(judge_model=judge_model), + metrics=[ + points_accuracy(), + total_points_earned(), + total_points_possible(), + ], + config=GenerateConfig(max_tokens=max_tokens, temperature=0.0), + metadata=exam_info, + name="courseexam", + version="0.1.0", + ) diff --git a/benchmarks/courseexam_bench/courseexam/dataset.py b/benchmarks/courseexam_bench/courseexam/dataset.py new file mode 100644 index 0000000..cb54c74 --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/dataset.py @@ -0,0 +1,102 @@ +import json +from pathlib import Path +from typing import Any + +from inspect_ai.dataset import Dataset, Sample, json_dataset + + +def get_data_dir() -> Path: + return Path(__file__).parent.parent / "data" + + +def load_exam_metadata() -> dict[str, Any]: + data_dir = get_data_dir() + metadata_file = data_dir / "exams_metadata.json" + + if not metadata_file.exists(): + raise FileNotFoundError(f"Exam metadata not found at {metadata_file}") + + return json.loads(metadata_file.read_text()) + + +def load_reference_material(ref_path: str) -> str: + data_dir = get_data_dir() + full_path = data_dir / ref_path + + if not full_path.exists(): + raise FileNotFoundError(f"Reference material not found: {full_path}") + + return full_path.read_text() + + 
+def record_to_sample(record: dict[str, Any]) -> Sample: + input_text = record["input"] + + ref_materials = record.get("metadata", {}).get("reference_materials", []) + if ref_materials: + refs_content = [] + for ref_path in ref_materials: + try: + content = load_reference_material(ref_path) + refs_content.append(content) + except FileNotFoundError: + pass + + if refs_content: + ref_section = "\n\n---\n\n## REFERENCE MATERIALS\n\n" + ref_section += "\n\n---\n\n".join(refs_content) + ref_section += "\n\n---\n\n" + input_text = ref_section + input_text + + return Sample( + id=record.get("id"), + input=input_text, + target=record["target"], + choices=record.get("choices"), + metadata=record.get("metadata", {}), + ) + + +def load_dataset( + exam_ids: str | list[str] | None = None, + question_types: str | list[str] | None = None, + tags: str | list[str] | None = None, + shuffle: bool = False, +) -> Dataset: + data_dir = get_data_dir() + questions_file = data_dir / "questions.jsonl" + + if not questions_file.exists(): + raise FileNotFoundError(f"Questions file not found at {questions_file}") + + mtime = questions_file.stat().st_mtime + dataset = json_dataset( + str(questions_file), + sample_fields=record_to_sample, + shuffle=shuffle, + seed=42, + name=f"courseexam_{mtime}", + ) + + if exam_ids: + exam_ids = [exam_ids] if isinstance(exam_ids, str) else exam_ids + dataset = dataset.filter( + lambda s: s.metadata and s.metadata.get("exam_id") in exam_ids + ) + + if question_types: + question_types = ( + [question_types] if isinstance(question_types, str) else question_types + ) + dataset = dataset.filter( + lambda s: s.metadata and s.metadata.get("type") in question_types + ) + + if tags: + tags = [tags] if isinstance(tags, str) else tags + dataset = dataset.filter( + lambda s: s.metadata + and any(tag in s.metadata.get("tags", []) for tag in tags) + ) + + return dataset diff --git a/benchmarks/courseexam_bench/courseexam/metrics.py b/benchmarks/courseexam_bench/courseexam/metrics.py new file mode 100644 index 0000000..97f1873 --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/metrics.py @@ -0,0 +1,48 @@ +from inspect_ai.scorer import Metric, SampleScore, metric + + +@metric +def points_accuracy() -> Metric: + def metric_fn(scores: list[SampleScore]) -> float: + total_earned = 0.0 + total_possible = 0.0 + + for sample_score in scores: + metadata = sample_score.score.metadata or {} + total_earned += metadata.get("points_earned", 0) + total_possible += metadata.get("points_possible", 0) + + if total_possible == 0: + return 0.0 + + return total_earned / total_possible + + return metric_fn + + +@metric +def total_points_earned() -> Metric: + def metric_fn(scores: list[SampleScore]) -> float: + total = 0.0 + + for sample_score in scores: + metadata = sample_score.score.metadata or {} + total += metadata.get("points_earned", 0) + + return total + + return metric_fn + + +@metric +def total_points_possible() -> Metric: + def metric_fn(scores: list[SampleScore]) -> float: + total = 0.0 + + for sample_score in scores: + metadata = sample_score.score.metadata or {} + total += metadata.get("points_possible", 0) + + return total + + return metric_fn diff --git a/benchmarks/courseexam_bench/courseexam/prepare.py b/benchmarks/courseexam_bench/courseexam/prepare.py new file mode 100644 index 0000000..42de180 --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/prepare.py @@ -0,0 +1,107 @@ +import json +import re +from pathlib import Path +from typing import Any + + +def parse_exam_markdown( + 
exam_md_path: Path, +) -> tuple[dict[str, Any], list[dict[str, Any]]]: + content = exam_md_path.read_text() + + exam_metadata_match = re.search(r"```json\n(\{.*?\})\n```", content, re.DOTALL) + if not exam_metadata_match: + raise ValueError(f"No exam metadata found in {exam_md_path}") + + exam_metadata = json.loads(exam_metadata_match.group(1)) + sections = content.split("\n---\n")[1:] + samples = [] + + for section in sections: + json_blocks = re.findall(r"```json\n(\{.*?\})\n```", section, re.DOTALL) + + for json_block in json_blocks: + question_data = json.loads(json_block) + + if "problem_id" not in question_data: + continue + + question_start = section.find("##") + question_end = section.find("```json") + problem_text = section[question_start:question_end].strip() + problem_text = re.sub( + r"^## Question.*?\n+", "", problem_text, flags=re.MULTILINE + ).strip() + + sample = { + "input": problem_text, + "target": question_data["answer"], + "id": f"{exam_metadata['exam_id']}_{question_data['problem_id']}", + "metadata": { + "exam_id": exam_metadata["exam_id"], + "problem_id": question_data["problem_id"], + "points": question_data["points"], + "type": question_data["type"], + "tags": question_data["tags"], + }, + } + + if "choices" in question_data: + sample["choices"] = question_data["choices"] + + if "reference_materials" in question_data: + exam_folder = exam_md_path.parent + data_dir = exam_folder.parent.parent + ref_dir = data_dir / "reference_materials" + ref_dir.mkdir(exist_ok=True) + + ref_materials = [] + for ref in question_data["reference_materials"]: + ref_path = exam_folder / ref + if ref_path.exists(): + dest = ref_dir / ref + dest.write_text(ref_path.read_text()) + ref_materials.append(f"reference_materials/{ref}") + + sample["metadata"]["reference_materials"] = ref_materials + + if "llm_judge_instructions" in question_data: + sample["metadata"]["llm_judge_instructions"] = question_data[ + "llm_judge_instructions" + ] + + samples.append(sample) + + return exam_metadata, samples + + +def prepare_dataset(data_dir: Path) -> None: + raw_dir = data_dir / "raw" + all_exams_metadata = [] + all_samples = [] + + for exam_folder in raw_dir.iterdir(): + if not exam_folder.is_dir(): + continue + + exam_md = exam_folder / "exam.md" + if not exam_md.exists(): + continue + + exam_metadata, samples = parse_exam_markdown(exam_md) + all_exams_metadata.append(exam_metadata) + all_samples.extend(samples) + + questions_output = data_dir / "questions.jsonl" + with questions_output.open("w") as f: + for sample in all_samples: + f.write(json.dumps(sample) + "\n") + + metadata_output = data_dir / "exams_metadata.json" + metadata_output.write_text(json.dumps({"exams": all_exams_metadata}, indent=2)) + + +if __name__ == "__main__": + script_dir = Path(__file__).parent + data_dir = script_dir.parent / "data" + prepare_dataset(data_dir) diff --git a/benchmarks/courseexam_bench/courseexam/scorer.py b/benchmarks/courseexam_bench/courseexam/scorer.py new file mode 100644 index 0000000..0eaa0a3 --- /dev/null +++ b/benchmarks/courseexam_bench/courseexam/scorer.py @@ -0,0 +1,103 @@ +import re + +from inspect_ai.model import get_model +from inspect_ai.scorer import ( + Score, + Scorer, + Target, + accuracy, + choice, + mean, + scorer, + stderr, +) +from inspect_ai.solver import TaskState + + +@scorer(metrics=[accuracy(), stderr(), mean()]) +def exam_scorer(judge_model: str = "openai/gpt-4o-mini") -> Scorer: + async def score(state: TaskState, target: Target) -> Score: + question_type = 
state.metadata.get("type", "ExactMatch") + points_possible = state.metadata.get("points", 1) + + if question_type == "ExactMatch": + choice_scorer = choice() + result = await choice_scorer(state, target) + + is_correct = result.value == "C" + points_earned = points_possible if is_correct else 0 + + return Score( + value=result.value, + answer=result.answer, + explanation=result.explanation, + metadata={ + "points_earned": points_earned, + "points_possible": points_possible, + }, + ) + + model_answer = state.output.completion.strip() + + custom_instructions = state.metadata.get("llm_judge_instructions", "") + + grading_prompt = f"""You are grading a student's answer to an exam question. + +Question: {state.input_text} + +Correct Answer: {target.text} + +Student's Answer: {model_answer} + +Grading Instructions: +This question is worth {points_possible} points. Evaluate the student's answer for correctness, completeness, and clarity. + +{custom_instructions} + +Provide your grading in the following format: +1. First, explain your reasoning +2. Then provide the final score in the exact format: POINTS: X/{points_possible} + +Where X is the number of points earned (0 to {points_possible}). +""" + + grader_model = get_model(judge_model) + result = await grader_model.generate(grading_prompt) + judge_output = result.completion + + points_pattern = r"POINTS:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+)" + match = re.search(points_pattern, judge_output) + + if match: + points_earned = float(match.group(1)) + points_earned = max(0, min(points_earned, points_possible)) + + percentage = points_earned / points_possible if points_possible > 0 else 0 + if percentage >= 1.0: + value = "C" + elif percentage == 0: + value = "I" + else: + value = "P" + + return Score( + value=value, + answer=model_answer, + explanation=judge_output, + metadata={ + "points_earned": points_earned, + "points_possible": points_possible, + }, + ) + + return Score( + value="I", + answer=model_answer, + explanation=f"Failed to parse judge output: {judge_output}", + metadata={ + "points_earned": 0, + "points_possible": points_possible, + }, + ) + + return score diff --git a/benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/exam.md b/benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md similarity index 95% rename from benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/exam.md rename to benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md index 354298f..047d884 100644 --- a/benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/exam.md +++ b/benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/exam.md @@ -37,14 +37,13 @@ C. 2 D. 3 -Your answer should be one of: A, B, C, D only (no extra text). - ```json { "problem_id": "1", "points": 5, "type": "ExactMatch", "tags": ["file-systems", "xv6", "inodes"], + "choices": ["A", "B", "C", "D"], "answer": "C", "comments": "C. A directory is implemented using an inode, and one inode is created for directory a and one is created for directory b" } @@ -88,14 +87,13 @@ C. file data D. a bitmap -Your answer should be one of: A, B, C, D only (no extra text). - ```json { "problem_id": "2", "points": 5, "type": "ExactMatch", "tags": ["file-systems", "xv6", "logging"], + "choices": ["A", "B", "C", "D"], "answer": "A", "comments": "A. 
Block 770 is a data block and data blocks of directories contain directory entries." } @@ -127,14 +125,13 @@ C. an error because 'b' points to itself D. nothing because xv6 will panic -Your answer should be one of: A, B, C, D only (no extra text). - ```json { "problem_id": "3", "points": 5, "type": "ExactMatch", "tags": ["file-systems", "xv6", "symlinks"], + "choices": ["A", "B", "C", "D"], "answer": "C", "comments": "C. When the kernel resolves the symlink 'b' in open, it will find the symlink 'b'. The fs lab requires your solution detects this cycle and return an error." } @@ -160,14 +157,13 @@ D. New system calls cannot start until the just-closed transaction has finished E. New system calls cannot start until all updated buffers from the just-closed transaction have been synced to their homes on disk. -Your answer should be one of: A, B, C, D, E only (no extra text). - ```json { "problem_id": "4", "points": 5, "type": "ExactMatch", "tags": ["file-systems", "journaling", "ext3"], + "choices": ["A", "B", "C", "D", "E"], "answer": "B", "comments": "B. The delay serves to prevent partial modifications made by system calls in the next transaction from being seen by system calls that are finishing in the first transaction." } @@ -193,6 +189,7 @@ Hatshepsut runs this program. Both calls to mkdir() return success. Hatshepsut c crash just after the program exits. Then she re-starts the computer, which runs the EXT3 recovery program. What could Hatshepsut see after recovery? (Choose all that apply.) + A. She might see neither /aaa nor /zzz. B. She might see /aaa but not /zzz. @@ -203,13 +200,13 @@ D. She might see both /zzz and /aaa. E. None of the above. -Your answer should be a comma-separated list of letters only (no extra text). For example: "B, C" +Your answer should be a comma-separated list of letters. For example: "A, C" ```json { "problem_id": "5", "points": 5, - "type": "ExactMatch", + "type": "Freeform", "tags": ["file-systems", "journaling", "ext3", "atomicity"], "answer": "A, B, D", "comments": "A, B, and D. A can occur if the system crashed before the transaction(s) reflecting the mkdirs finished committing. B can occur if the two mkdirs are in different transactions, and only the first manages to finish committing. C cannot occur because system calls are placed in transactions in order, and the transactions are also replayed in order during recovery" @@ -281,8 +278,6 @@ C. 1000 D. ((1 ≪ 27) \* 8) / 4096 -Your answer should be one of: A, B, C, D only (no extra text). - ```json { "problem_id": "6", @@ -290,6 +285,7 @@ Your answer should be one of: A, B, C, D only (no extra text). "type": "ExactMatch", "tags": ["virtual-memory", "linux", "page-faults"], "reference_materials": ["mmap.md"], + "choices": ["A", "B", "C", "D"], "answer": "B", "comments": "B. The page fault handler uses only 1 page. If it maps a new page, it unmaps the old page." } @@ -362,8 +358,6 @@ D. 16 E. 8192 -Your answer should be one of: A, B, C, D, E only (no extra text). - ```json { "problem_id": "7", @@ -371,6 +365,7 @@ Your answer should be one of: A, B, C, D, E only (no extra text). "type": "ExactMatch", "tags": ["virtual-memory", "linux", "page-faults"], "reference_materials": ["mmap.md"], + "choices": ["A", "B", "C", "D", "E"], "answer": "D", "comments": "D. The loop goes through the first 8192 entries in the sqrts table. A double is 8 bytes and 512 fit on a single page of 4096 bytes (4096/8 = 512). Thus, the total number of virtual pages referenced in the loop is 8192/512 = 16. 
The page fault handler will be invoked once for each of the 16 pages." } @@ -395,13 +390,13 @@ D. Copy-on-write fork() is not possble for L4Linux because the CPU delivers page E. None of the above. -Your answer should be a comma-separated list of letters only (no extra text). For example: "B, C" +Your answer should be a comma-separated list of letters. For example: "B, C" ```json { "problem_id": "8", "points": 5, - "type": "ExactMatch", + "type": "Freeform", "tags": ["microkernels", "l4linux", "fork"], "answer": "E", "comments": "E. Not A: fork is a Linux system call, implemented by the Linux kernel server, not by the L4 kernel. Not B: Linux processes communicate with the Linux server via IPC messages, not by system call traps. Not C: The Linux kernel server is not privileged, and cannot modify the page table register; only the L4 kernel can do this. Not D: the L4 kernel forwards page faults to the Linux kernel server." @@ -428,13 +423,13 @@ E. A divide-by-zero error in the network domain won't crash the rv6 file system. F. None of the above. -Your answer should be a comma-separated list of letters only (no extra text). For example: "B, C" +Your answer should be a comma-separated list of letters. For example: "B, C" ```json { "problem_id": "9", "points": 5, - "type": "ExactMatch", + "type": "Freeform", "tags": ["redleaf"], "answer": "C, E", "comments": "C and E. A is false because RedLeaf uses language techniques that disallow domains to dereferences arbitrary addresses. B is false, because RedLeaf explicitly disallows this; only pointers in the shared heap can be passed to other domains. C is true, because RedLeaf follows the Rust borrow rules, which allow two domains to have an immutable reference to the same object. D is false, because RedLeaf doesn't use page tables but relies on language techniques for isolation; the mmap lab requires the use of page tables. E is true, because RedLeaf is designed to catch errors like these and clean up a domain that experience such an error." @@ -461,14 +456,13 @@ D. He will see livelock, because at high enough arrival rates the CPU will spend E. He will see livelock, because the polling thread can only process packets at some finite rate, and the input rate could be higher than that. -Your answer should be one of: A, B, C, D, E only (no extra text). - ```json { "problem_id": "10", "points": 5, "type": "ExactMatch", "tags": ["operating-systems", "networking"], + "choices": ["A", "B", "C", "D", "E"], "answer": "D" } ``` @@ -538,13 +532,13 @@ E. The Meltdown attack on xv6 wouldn't be able to dump all of xv6 kernel memory F. None of the above. -Your answer should be a comma-separated list of letters only (no extra text). For example: "B, C" +Your answer should be a comma-separated list of letters. For example: "B, C" ```json { "problem_id": "12", "points": 5, - "type": "ExactMatch", + "type": "Freeform", "tags": ["meltdown"], "answer": "A, B, C, D, E", "comments": "A, B, C, D, and E. B is true because eventually the CPU will generate an exception, perhaps after speculating on a few instructions. E is true, because xv6 has separate kernel and user page tables, and the user page tables don't map all of kernel memory." @@ -658,14 +652,13 @@ C. Ben's change could cause invoke syscall() to dereference a pointer to freed m D. Ben's change could cause retract table() to dereference a pointer to freed memory. -Your answer should be one of: A, B, C, D only (no extra text). 
- ```json { "problem_id": "14", "points": 5, "type": "ExactMatch", "tags": ["operating-systems", "concurrency", "rcu"], + "choices": ["A", "B", "C", "D"], "answer": "C", "comments": "C. Ben's modified retract table() frees local_table before the call to synchronize_rcu(). An execution of invoke_syscall() might be active at the same time on another CPU, and might read the old value of local_table just after it has been freed and re-used for something else." } diff --git a/benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/mmap.md b/benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/mmap.md similarity index 100% rename from benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/mmap.md rename to benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/mmap.md diff --git a/benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/q24-2-sol.pdf b/benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/q24-2-sol.pdf similarity index 100% rename from benchmarks/courseexam_bench/data/6_1810_operating_system_engineering_fall_2024_final/q24-2-sol.pdf rename to benchmarks/courseexam_bench/data/raw/6_1810_operating_system_engineering_fall_2024_final/q24-2-sol.pdf diff --git a/benchmarks/courseexam_bench/data/6_5840_distributed_system_engineering_spring_2025_exam_i/exam.md b/benchmarks/courseexam_bench/data/raw/6_5840_distributed_system_engineering_spring_2025_exam_i/exam.md similarity index 100% rename from benchmarks/courseexam_bench/data/6_5840_distributed_system_engineering_spring_2025_exam_i/exam.md rename to benchmarks/courseexam_bench/data/raw/6_5840_distributed_system_engineering_spring_2025_exam_i/exam.md diff --git a/benchmarks/courseexam_bench/data/6_5840_distributed_system_engineering_spring_2025_exam_i/q25-1-sol.pdf b/benchmarks/courseexam_bench/data/raw/6_5840_distributed_system_engineering_spring_2025_exam_i/q25-1-sol.pdf similarity index 100% rename from benchmarks/courseexam_bench/data/6_5840_distributed_system_engineering_spring_2025_exam_i/q25-1-sol.pdf rename to benchmarks/courseexam_bench/data/raw/6_5840_distributed_system_engineering_spring_2025_exam_i/q25-1-sol.pdf diff --git a/benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final.pdf b/benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final.pdf similarity index 100% rename from benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final.pdf rename to benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final.pdf diff --git a/benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final_solutions.pdf b/benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final_solutions.pdf similarity index 100% rename from benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final_solutions.pdf rename to 
benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/cs107_computer_organization_&_systems_fall_2023_final_solutions.pdf diff --git a/benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/exam.md b/benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/exam.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs107_computer_organization_&_systems_fall_2023_final/exam.md rename to benchmarks/courseexam_bench/data/raw/cs107_computer_organization_&_systems_fall_2023_final/exam.md diff --git a/benchmarks/courseexam_bench/data/cs537_fall_2021_final/cs537-21-fall-final-solutions.pdf b/benchmarks/courseexam_bench/data/raw/cs537_fall_2021_final/cs537-21-fall-final-solutions.pdf similarity index 100% rename from benchmarks/courseexam_bench/data/cs537_fall_2021_final/cs537-21-fall-final-solutions.pdf rename to benchmarks/courseexam_bench/data/raw/cs537_fall_2021_final/cs537-21-fall-final-solutions.pdf diff --git a/benchmarks/courseexam_bench/data/cs537_fall_2021_final/exam.md b/benchmarks/courseexam_bench/data/raw/cs537_fall_2021_final/exam.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs537_fall_2021_final/exam.md rename to benchmarks/courseexam_bench/data/raw/cs537_fall_2021_final/exam.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP2.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP2.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP2.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP2.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP3.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP3.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP3.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP3.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP4.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP4.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/MP4.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/MP4.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/exam.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/exam.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_final/exam.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_final/exam.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/MP1.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/MP1.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/MP1.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/MP1.md diff --git 
a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/MP2.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/MP2.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/MP2.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/MP2.md diff --git a/benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/exam.md b/benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/exam.md similarity index 100% rename from benchmarks/courseexam_bench/data/cs_423_operating_systems_design_fall_2025_midterm/exam.md rename to benchmarks/courseexam_bench/data/raw/cs_423_operating_systems_design_fall_2025_midterm/exam.md diff --git a/benchmarks/courseexam_bench/data/example_course_2024_midterm/exam.md b/benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/exam.md similarity index 86% rename from benchmarks/courseexam_bench/data/example_course_2024_midterm/exam.md rename to benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/exam.md index b16c259..b197de5 100644 --- a/benchmarks/courseexam_bench/data/example_course_2024_midterm/exam.md +++ b/benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/exam.md @@ -22,24 +22,15 @@ What state is a process in when it is waiting for I/O to complete? -A. Running - -B. Ready - -C. Blocked - -D. Terminated - -Your answer should be a single letter only (A, B, C, or D). - ```json { "problem_id": "1", "points": 5, "type": "ExactMatch", "tags": ["operating-systems", "processes"], + "choices": ["Running", "Ready", "Blocked", "Terminated"], "answer": "C", - "comments": "Standard single choice with one correct answer. Graded by exact string match." + "comments": "Multiple choice question. The choices field is required for ExactMatch questions with options." } ``` @@ -49,16 +40,14 @@ Your answer should be a single letter only (A, B, C, or D). True or False: A race condition occurs when multiple threads access shared data concurrently and at least one thread modifies the data. -Your answer should be either "True" or "False" only. No extra text. - ```json { "problem_id": "2", "points": 5, "type": "ExactMatch", "tags": ["operating-systems", "concurrency"], - "answer": "True", - "comments": "True/False question. Answer must be exactly \"True\" or \"False\". Multi-part True/False questions must be split into separate questions with problem_ids (\"2.1\", \"2.2\")." + "choices": ["True", "False"], + "answer": "A" } ``` @@ -113,16 +102,20 @@ B. Reading file contents C. Changing file size -Your answer should be a comma-separated list of letters only (no extra text). For example: "A,B" - ```json { "problem_id": "5", "points": 5, - "type": "ExactMatch", + "type": "Freeform", "tags": ["operating-systems", "file-systems"], - "answer": "A, C", - "comments": "Multiple correct answers but NO partial credit. Only exact match \"A, C\" gets 5 points." + "choices": [ + "Changing file permissions", + "Reading file contents", + "Changing file size" + ], + "answer": "The operations that modify the inode are: A (Changing file permissions) and C (Changing file size). Reading file contents (B) does not modify the inode.", + "llm_judge_instructions": "Award full 5 points if the student correctly identifies both A and C. Award 0 points for any other answer. 
Acceptable formats include: 'A, C', 'A and C', 'A & C', or explaining that A and C are correct.", + "comments": "Multiple choice with multiple correct answers. Since we can't use ExactMatch for multi-select, we use Freeform with LLM judge." } ``` @@ -188,7 +181,8 @@ Your answer should be either "True" or "False" only. No extra text. "type": "ExactMatch", "tags": ["distributed-systems", "raft"], "reference_materials": ["raft_basics.md"], - "answer": "True", + "choices": ["True", "False"], + "answer": "A", "comments": "ExactMatch with reference material. The reference .md file is provided to the LLM like a cheatsheet." } ``` @@ -230,7 +224,8 @@ Your answer should be either "True" or "False" only. No extra text. "points": 2, "type": "ExactMatch", "tags": ["distributed-systems", "paxos", "consensus"], - "answer": "True", + "choices": ["True", "False"], + "answer": "A", "comments": "Multi-part True/False question. Must be split into separate questions with sub-problem IDs." } ``` @@ -249,7 +244,8 @@ Your answer should be either "True" or "False" only. No extra text. "points": 2, "type": "ExactMatch", "tags": ["distributed-systems", "cap-theorem"], - "answer": "False", + "choices": ["True", "False"], + "answer": "B", "comments": "Multi-part True/False question. Must be split into separate questions with sub-problem IDs." } ``` diff --git a/benchmarks/courseexam_bench/data/example_course_2024_midterm/raft.png b/benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/raft.png similarity index 100% rename from benchmarks/courseexam_bench/data/example_course_2024_midterm/raft.png rename to benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/raft.png diff --git a/benchmarks/courseexam_bench/data/example_course_2024_midterm/raft_basics.md b/benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/raft_basics.md similarity index 100% rename from benchmarks/courseexam_bench/data/example_course_2024_midterm/raft_basics.md rename to benchmarks/courseexam_bench/data/raw/example_course_2024_midterm/raft_basics.md diff --git a/benchmarks/courseexam_bench/prepare_dataset.py b/benchmarks/courseexam_bench/prepare_dataset.py old mode 100644 new mode 100755 index 672aabb..4d8672a --- a/benchmarks/courseexam_bench/prepare_dataset.py +++ b/benchmarks/courseexam_bench/prepare_dataset.py @@ -1,108 +1,23 @@ -import json -import re +#!/usr/bin/env python3 from pathlib import Path - -def parse_exam_markdown(exam_md_path, start_instance_id): - content = exam_md_path.read_text() - - exam_metadata_match = re.search(r"```json\n(\{.*?\})\n```", content, re.DOTALL) - exam_metadata = json.loads(exam_metadata_match.group(1)) - - sections = content.split("\n---\n")[1:] - - questions = [] - instance_id = start_instance_id - - for section in sections: - json_blocks = re.findall(r"```json\n(\{.*?\})\n```", section, re.DOTALL) - - for json_block in json_blocks: - question_data = json.loads(json_block) - - if "problem_id" not in question_data: - continue - - question_start = section.find("##") - question_end = section.find("```json") - problem_text = section[question_start:question_end].strip() - problem_text = re.sub( - r"^## Question.*?\n+", "", problem_text, flags=re.MULTILINE - ) - problem_text = problem_text.strip() - - question = { - "instance_id": instance_id, - "exam_id": exam_metadata["exam_id"], - "problem_id": question_data["problem_id"], - "points": question_data["points"], - "problem": problem_text, - "answer": question_data["answer"], - "type": question_data["type"], - "tags": 
question_data["tags"], - } - - if "reference_materials" in question_data: - ref_materials = [] - exam_folder = exam_md_path.parent - data_dir = exam_folder.parent - ref_dir = data_dir / "reference_materials" - ref_dir.mkdir(exist_ok=True) - for ref in question_data["reference_materials"]: - ref_path = exam_folder / ref - if ref_path.exists(): - dest = ref_dir / ref - dest.write_text(ref_path.read_text()) - ref_materials.append(f"reference_materials/{ref}") - question["reference_materials"] = ref_materials - - if "llm_judge_instructions" in question_data: - question["llm_judge_instructions"] = question_data[ - "llm_judge_instructions" - ] - - if "comments" in question_data: - question["comments"] = question_data["comments"] - - questions.append(question) - instance_id += 1 - - return exam_metadata, questions, instance_id +from courseexam.prepare import prepare_dataset def main(): script_dir = Path(__file__).parent data_dir = script_dir / "data" - all_exams_metadata = [] - all_questions = [] - next_instance_id = 1 - - for exam_folder in data_dir.iterdir(): - if not exam_folder.is_dir() or exam_folder.name in [ - "reference_materials", - "example_exam", - ]: - continue - - exam_md = exam_folder / "exam.md" - if not exam_md.exists(): - continue - - exam_metadata, questions, next_instance_id = parse_exam_markdown( - exam_md, next_instance_id - ) - all_exams_metadata.append(exam_metadata) - all_questions.extend(questions) - - metadata_output = data_dir / "exams_metadata.json" - metadata_output.write_text(json.dumps({"exams": all_exams_metadata}, indent=2)) + if not (data_dir / "raw").exists(): + raise FileNotFoundError(f"data/raw directory not found at {data_dir / 'raw'}") - questions_output = data_dir / "questions.jsonl" - with questions_output.open("w") as f: - for q in all_questions: - f.write(json.dumps(q) + "\n") + try: + prepare_dataset(data_dir) + return 0 + except Exception as e: + print(f"Error: {e}") + return 1 if __name__ == "__main__": - main() + exit(main()) diff --git a/benchmarks/courseexam_bench/pyproject.toml b/benchmarks/courseexam_bench/pyproject.toml new file mode 100644 index 0000000..fe69866 --- /dev/null +++ b/benchmarks/courseexam_bench/pyproject.toml @@ -0,0 +1,18 @@ +[project] +name = "courseexam-bench" +version = "0.1.0" +description = "Course Exam Benchmark - Evaluate LLMs on systems course exam questions" +requires-python = ">=3.10" +dependencies = ["inspect-ai>=0.3.0"] +license = { text = "MIT" } +authors = [{ name = "Tarek Elsayed", email = "tareknaser360@gmail.com" }] + +[project.optional-dependencies] +dev = ["pytest>=7.0.0", "pytest-asyncio>=0.21.0"] + +[build-system] +requires = ["hatchling"] +build-backend = "hatchling.build" + +[tool.hatch.build.targets.wheel] +packages = ["courseexam"] diff --git a/benchmarks/courseexam_bench/run_eval.py b/benchmarks/courseexam_bench/run_eval.py new file mode 100644 index 0000000..d44a40c --- /dev/null +++ b/benchmarks/courseexam_bench/run_eval.py @@ -0,0 +1,23 @@ +#!/usr/bin/env python3 + +from inspect_ai import eval + +from courseexam.courseexam import courseexam + +# Configuration - modify these constants as needed +MODEL = "anthropic/claude-haiku-4-5" +JUDGE_MODEL = "anthropic/claude-haiku-4-5" +LIMIT = None +EXAM_IDS = None +QUESTION_TYPES = None +TAGS = None + +if __name__ == "__main__": + task = courseexam( + exam_ids=EXAM_IDS, + question_types=QUESTION_TYPES, + tags=TAGS, + judge_model=JUDGE_MODEL, + ) + + eval(tasks=task, model=MODEL, limit=LIMIT) diff --git a/benchmarks/courseexam_bench/tests/test_data_schema.py 
b/benchmarks/courseexam_bench/tests/test_data_schema.py index f505bca..767d5f3 100644 --- a/benchmarks/courseexam_bench/tests/test_data_schema.py +++ b/benchmarks/courseexam_bench/tests/test_data_schema.py @@ -7,10 +7,13 @@ DATA_DIR = Path(__file__).parent.parent / "data" + @pytest.fixture(scope="session", autouse=True) def generate_dataset(): script_path = Path(__file__).parent.parent / "prepare_dataset.py" - result = subprocess.run([sys.executable, str(script_path)], capture_output=True, text=True) + result = subprocess.run( + [sys.executable, str(script_path)], capture_output=True, text=True + ) if result.returncode != 0: pytest.fail(f"prepare_dataset.py failed: {result.stderr}") @@ -57,59 +60,38 @@ def test_questions_valid_json(self): def test_questions_required_fields(self): questions = load_questions() required_fields = [ - "instance_id", - "exam_id", - "problem_id", - "points", - "problem", - "answer", - "type", - "tags", + "input", + "target", + "id", + "metadata", ] for q in questions: for field in required_fields: assert ( field in q - ), f"Question {q.get('instance_id', '?')} missing required field: {field}" - - def test_instance_ids_unique(self): - questions = load_questions() - instance_ids = [q["instance_id"] for q in questions] - assert len(instance_ids) == len( - set(instance_ids) - ), "Duplicate instance_ids found" - - def test_instance_ids_incremental(self): - questions = load_questions() - instance_ids = sorted([q["instance_id"] for q in questions]) + ), f"Question {q.get('id', '?')} missing required field: {field}" - for i in range(len(instance_ids) - 1): - expected_next = instance_ids[i] + 1 - actual_next = instance_ids[i + 1] - assert ( - actual_next == expected_next - ), f"instance_ids not incremental: found {instance_ids[i]} then {actual_next}, expected {expected_next}" + metadata_fields = ["exam_id", "problem_id", "points", "type", "tags"] + for field in metadata_fields: + assert ( + field in q["metadata"] + ), f"Question {q.get('id', '?')} missing metadata field: {field}" - def test_instance_id_type(self): + def test_ids_unique(self): questions = load_questions() - for q in questions: - assert isinstance( - q["instance_id"], int - ), f"instance_id must be integer: {q['instance_id']}" - assert ( - q["instance_id"] > 0 - ), f"instance_id must be positive: {q['instance_id']}" + ids = [q["id"] for q in questions] + assert len(ids) == len(set(ids)), "Duplicate IDs found" def test_exam_id_type(self): questions = load_questions() for q in questions: assert isinstance( - q["exam_id"], str - ), f"exam_id must be string in question {q['instance_id']}" + q["metadata"]["exam_id"], str + ), f"exam_id must be string in question {q['id']}" assert ( - len(q["exam_id"]) > 0 - ), f"exam_id cannot be empty in question {q['instance_id']}" + len(q["metadata"]["exam_id"]) > 0 + ), f"exam_id cannot be empty in question {q['id']}" def test_exam_id_exists_in_metadata(self): questions = load_questions() @@ -118,40 +100,38 @@ def test_exam_id_exists_in_metadata(self): exam_ids_in_metadata = {e["exam_id"] for e in exams} for q in questions: - exam_id = q["exam_id"] + exam_id = q["metadata"]["exam_id"] assert ( exam_id in exam_ids_in_metadata - ), f"Question {q['instance_id']}: exam_id '{exam_id}' not found in exams_metadata.json" + ), f"Question {q['id']}: exam_id '{exam_id}' not found in exams_metadata.json" def test_points_type(self): questions = load_questions() for q in questions: assert isinstance( - q["points"], int - ), f"points must be integer in question {q['instance_id']}" + 
q["metadata"]["points"], int + ), f"points must be integer in question {q['id']}" assert ( - q["points"] > 0 - ), f"points must be positive in question {q['instance_id']}" + q["metadata"]["points"] > 0 + ), f"points must be positive in question {q['id']}" def test_problem_type(self): questions = load_questions() for q in questions: assert isinstance( - q["problem"], str - ), f"problem must be string in question {q['instance_id']}" + q["metadata"]["problem_id"], str + ), f"problem_id must be string in question {q['id']}" assert ( - len(q["problem"]) > 0 - ), f"problem cannot be empty in question {q['instance_id']}" + len(q["metadata"]["problem_id"]) > 0 + ), f"problem_id cannot be empty in question {q['id']}" def test_answer_type(self): questions = load_questions() for q in questions: assert isinstance( - q["answer"], str - ), f"answer must be string in question {q['instance_id']}" - assert ( - len(q["answer"]) > 0 - ), f"answer cannot be empty in question {q['instance_id']}" + q["target"], str + ), f"target must be string in question {q['id']}" + assert len(q["target"]) > 0, f"target cannot be empty in question {q['id']}" def test_question_type_valid(self): questions = load_questions() @@ -159,73 +139,73 @@ def test_question_type_valid(self): for q in questions: assert ( - q["type"] in valid_types - ), f"Invalid type '{q['type']}' in question {q['instance_id']}. Must be one of {valid_types}" + q["metadata"]["type"] in valid_types + ), f"Invalid type '{q['metadata']['type']}' in question {q['id']}. Must be one of {valid_types}" def test_tags_format(self): questions = load_questions() for q in questions: assert isinstance( - q["tags"], list - ), f"tags must be list in question {q['instance_id']}" + q["metadata"]["tags"], list + ), f"tags must be list in question {q['id']}" assert ( - len(q["tags"]) >= 1 - ), f"At least one tag required in question {q['instance_id']}" + len(q["metadata"]["tags"]) >= 1 + ), f"At least one tag required in question {q['id']}" - for tag in q["tags"]: + for tag in q["metadata"]["tags"]: assert isinstance( tag, str - ), f"Each tag must be string in question {q['instance_id']}" + ), f"Each tag must be string in question {q['id']}" assert re.match( r"^[a-z0-9-]+$", tag - ), f"Invalid tag format '{tag}' in question {q['instance_id']}. Use lowercase with hyphens" + ), f"Invalid tag format '{tag}' in question {q['id']}. 
Use lowercase with hyphens" def test_reference_materials_format(self): questions = load_questions() for q in questions: - if "reference_materials" in q: + if "reference_materials" in q.get("metadata", {}): assert isinstance( - q["reference_materials"], list - ), f"reference_materials must be list in question {q['instance_id']}" + q["metadata"]["reference_materials"], list + ), f"reference_materials must be list in question {q['id']}" - for ref in q["reference_materials"]: + for ref in q["metadata"]["reference_materials"]: assert isinstance( ref, str - ), f"Each reference_material must be string in question {q['instance_id']}" + ), f"Each reference_material must be string in question {q['id']}" assert ref.startswith( "reference_materials/" - ), f"Reference path must start with 'reference_materials/' in question {q['instance_id']}" + ), f"Reference path must start with 'reference_materials/' in question {q['id']}" assert ref.endswith( ".md" - ), f"Reference material must be .md file in question {q['instance_id']}" + ), f"Reference material must be .md file in question {q['id']}" def test_reference_materials_exist(self): questions = load_questions() for q in questions: - if "reference_materials" in q: - for ref in q["reference_materials"]: + if "reference_materials" in q.get("metadata", {}): + for ref in q["metadata"]["reference_materials"]: ref_path = DATA_DIR / ref assert ( ref_path.exists() - ), f"Question {q['instance_id']}: reference material not found: {ref}" + ), f"Question {q['id']}: reference material not found: {ref}" assert ( ref_path.is_file() - ), f"Question {q['instance_id']}: reference material is not a file: {ref}" + ), f"Question {q['id']}: reference material is not a file: {ref}" def test_llm_judge_instructions_format(self): questions = load_questions() for q in questions: - if "llm_judge_instructions" in q: + if "llm_judge_instructions" in q.get("metadata", {}): assert isinstance( - q["llm_judge_instructions"], str - ), f"llm_judge_instructions must be string in question {q['instance_id']}" + q["metadata"]["llm_judge_instructions"], str + ), f"llm_judge_instructions must be string in question {q['id']}" assert ( - len(q["llm_judge_instructions"]) >= 20 - ), f"llm_judge_instructions too short in question {q['instance_id']}" + len(q["metadata"]["llm_judge_instructions"]) >= 20 + ), f"llm_judge_instructions too short in question {q['id']}" class TestExamsMetadata: @@ -291,9 +271,11 @@ def test_points_sum_matches_metadata(self): for exam in exams: exam_id = exam["exam_id"] - exam_questions = [q for q in questions if q["exam_id"] == exam_id] + exam_questions = [ + q for q in questions if q["metadata"]["exam_id"] == exam_id + ] - total_points = sum(q["points"] for q in exam_questions) + total_points = sum(q["metadata"]["points"] for q in exam_questions) expected_points = exam["score_total"] assert ( @@ -307,7 +289,9 @@ def test_question_count_matches_metadata(self): for exam in exams: exam_id = exam["exam_id"] - exam_questions = [q for q in questions if q["exam_id"] == exam_id] + exam_questions = [ + q for q in questions if q["metadata"]["exam_id"] == exam_id + ] actual_count = len(exam_questions) expected_count = exam["num_questions"]