Members: Koid Xian Ting (24292249), Alexandra Harrison (22581066), Bhavesh Agarwal (23933845), Dhana Violeta Wurttele (24428786), Feng Sai (24445942), Harshitha Rajulapati (24628526) Client: Dr. Sirui Li
This project fine-tunes an OpenAI model to automatically grade student exam answers and evaluate its performance against human markers.
Prompt engineering and fine-tuning have been optimised for input data from the 2022 CITS5504 Data Warehousing exam: a CSV file containing answers to two exam questions from 198 students. A second DOCX input contains the exam marking scheme.
project-root/
├── app/
│ ├── backend/ # FastAPI backend (API endpoints + grading adapter)
│ └── frontend/ # React (Vite) web UI
├── data/
│ ├── CITS5553 Project Dataset.csv
│ └── 2022 Data Warehousing Final Exam (Partial).docx
├── finetune/ # Will be populated with files after running the finetuning steps
│ ├── train.jsonl
│ ├── val.jsonl
│ └── train.tiny.jsonl # Can produce this optional subset
├── eval/
│ └── test.csv # Held-out test split
├── Prompts/
│ └── prompt_templates.csv
├── src/
│ ├── data_loader.py
│ ├── ai_grader.py
│ ├── subquestion_utils.py
│ └── config.py # Pipeline config (model, reasoning, paths)
├── main.py # Entry point for baseline grading pipeline
├── data_pipeline.py # Data prep, grading, JSONL generation
├── fine_tune.py # Fine-tuning script
├── eval_finetune_vs_baseline.py # Evaluates baseline vs fine-tuned
├── requirements.txt
└── README.md
- Python 3.10+
- OpenAI Python SDK
- A valid OPENAI_API_KEY stored in a .env file in the project root:
OPENAI_API_KEY="your_api_key_here"
- Install requirements:
pip install -r requirements.txt
This will run the grading pipeline on the dataset using prompt-based grading. Run:
python main.py
main.py
├── src/data_loader.py → loads CSV + DOCX
├── src/subquestion_utils.py → splits answers into subquestions
├── src/ai_grader.py → calls OpenAI model with baseline prompt
├── Aggregates subquestion grades → question-level
└── Writes final results CSV (configured in src/config.py)
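The aggregation step above can be sketched as follows. This is an illustrative sketch only: the function name and the subquestion label convention (e.g. "Q1a", "Q1b") are hypothetical, and the real logic lives in main.py and src/.

```python
# Hypothetical sketch of the subquestion -> question-level aggregation step.
def aggregate_grades(subq_grades: dict[str, float]) -> dict[str, float]:
    """Sum subquestion marks (e.g. "Q1a", "Q1b") into question totals ("Q1")."""
    totals: dict[str, float] = {}
    for label, mark in subq_grades.items():
        question = label.rstrip("abcdefghij")  # "Q1a" -> "Q1" (assumes letter-suffixed labels)
        totals[question] = totals.get(question, 0.0) + mark
    return totals

print(aggregate_grades({"Q1a": 3.0, "Q1b": 4.5, "Q2a": 2.0}))  # {'Q1': 7.5, 'Q2': 2.0}
```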
This step creates the datasets for fine-tuning; it can also produce optional small subsets for test grading or for validating the JSONL formatting.
- Generate JSONL files for finetuning:
python data_pipeline.py --make-jsonl
This step will:
- Extract student answers, rubrics, and questions
- Stratify the data into train/val/test splits
- Generate:
- finetune/train.jsonl
- finetune/val.jsonl
- eval/test.csv (a test set held back for evaluation)
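The stratified split can be sketched in pure Python as below. This is a generic illustration; the stratification key, split fractions, and seed actually used by data_pipeline.py may differ.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, val_frac=0.1, test_frac=0.1, seed=42):
    """Split rows into train/val/test while keeping each stratum's proportions."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)
    train, val, test = [], [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test

# Example: 100 answers per question, stratified by question
rows = [{"question": q, "student": s} for q in ("Q1", "Q2") for s in range(100)]
train, val, test = stratified_split(rows, key=lambda r: r["question"])
print(len(train), len(val), len(test))  # 160 20 20
```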
The generated .jsonl files follow OpenAI's supervised fine-tuning format:
{
"messages": [
{ "role": "system", "content": "..." },
{ "role": "user", "content": "..." },
{ "role": "assistant", "content": "Grade: 10/20\nFeedback: ..." }
]
}
Generate a tiny subset to make sure everything is ready for fine-tuning:
python data_pipeline.py --make-jsonl --ft-out-suffix tiny --ft-sample 60 --ft-val-sample 12
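A record in the chat format shown above can be produced with a small helper like this sketch. The function name and prompt wording here are illustrative, not the project's actual code.

```python
import json

def to_ft_record(rubric: str, answer: str, grade: str, feedback: str) -> str:
    """Serialise one graded answer as a chat-format fine-tuning line (JSONL)."""
    record = {
        "messages": [
            {"role": "system", "content": "You are an exam grader. " + rubric},
            {"role": "user", "content": answer},
            {"role": "assistant", "content": f"Grade: {grade}\nFeedback: {feedback}"},
        ]
    }
    return json.dumps(record)

# One line of train.jsonl would then look like:
line = to_ft_record("Award marks for ...", "A star schema has ...", "10/20", "Partially correct.")
print(line)
```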
Once you've generated the JSONL files, you can start a fine-tuning job with the OpenAI API:
python fine_tune.py
This step will:
- Upload the training and validation JSONL files
- Start a fine-tuning job using a snapshot model (e.g. gpt-4o-mini-2024-07-18)
- Poll the job regularly until it is completed (several minutes)
- Print the fine-tuned model name once finished; paste this into the .env file (e.g. FT_MODEL_ID="ft:gpt-4o-mini:grader-v1")
python data_pipeline.py --grade
By default, this runs in test mode on 1 answer and saves results to:
results_single_test.csv
This verifies that the pipeline and API key are working.
Use the evaluation script to compare your fine-tuned model against the baseline on the held-out test set (eval/test.csv):
python eval_finetune_vs_baseline.py
This will:
- Run both models on the test set
- Compute metrics:
- Mean absolute error (MAE)
- Pearson correlation
- Intraclass correlation coefficient (ICC)
- Save results to results/ for analysis
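For reference, MAE and Pearson correlation can be computed in pure Python as below (ICC typically needs a stats library and is omitted). This is an illustrative sketch, not the evaluation script's actual implementation.

```python
import math

def mae(pred, truth):
    """Mean absolute error between predicted and human grades."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def pearson(pred, truth):
    """Pearson correlation between predicted and human grades."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

# Hypothetical model vs. human marks for four answers
model = [8.0, 12.0, 15.0, 6.0]
human = [9.0, 11.0, 16.0, 5.0]
print(round(mae(model, human), 2))  # 1.0
print(round(pearson(model, human), 3))
```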
Validate the JSONL before finetuning:
python -c "from data_pipeline import validate_jsonl; validate_jsonl('finetune/train.jsonl')"
Note that only snapshot models like gpt-4o-mini-YYYY-MM-DD are currently supported for fine-tuning.
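A stand-alone version of such a check might look like the sketch below; the project's validate_jsonl in data_pipeline.py is the authoritative one.

```python
import json

REQUIRED_ROLES = ("system", "user", "assistant")

def validate_jsonl_line(line: str) -> bool:
    """Check one line parses as JSON and carries a system/user/assistant triple."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list):
        return False
    roles = tuple(m.get("role") for m in messages)
    return roles == REQUIRED_ROLES and all(isinstance(m.get("content"), str) for m in messages)

good = '{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "Grade: 10/20"}]}'
print(validate_jsonl_line(good))        # True
print(validate_jsonl_line("not json"))  # False
```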
Once you've cloned the repository...
Create a branch to work on:
# Create branch
git branch <branch-name>
# Move onto the branch you've just created
git checkout <branch-name>
Then create/edit your code.
If any changes are made to the main branch while you're still working on your branch, it's best practice to regularly pull those changes into your local version so your code stays up to date and accommodates any changes that interact with it. This will help prevent merge conflicts when you eventually open a pull request.
Check the status of your branch to see whether it's ahead or behind the main branch.
git status
If you're behind main, then pull those changes.
git pull origin main
While on the branch you've created, write your code and make changes, then when you're satisfied, commit that code. Best practice is to make commits often, whenever you've completed a meaningful change.
First, stage your changes to select all or some of the files you're ready to commit. To stage changes in all files:
git add .
Or to stage changes in a specific file:
git add <file-name>
Then commit the files and add a meaningful commit message:
git commit -m "this is a commit message"
Your git commits are stored locally on your computer until you push them to the remote repository, i.e. GitHub. Pushing your commits allows other people to view and check out the latest commits on your branch (using git checkout).
git push
Once you're happy with your code and you want to merge the changes from your branch onto the main branch, you should create a pull request, which can be done using GitHub. A banner like this will appear on GitHub after you've pushed your changes:

The pull request will show the differences between the code on your branch and the code on the branch you want to merge your changes onto (usually main). You should write a description about your code as part of the pull request, and select a couple of group members to review your code.
This repository also includes a web UI and backend to run grading interactively (file uploads → model scoring → results/visualisations).
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
The backend has two execution paths:
- Slow path (default)
  - Calls src/ai_grader.py in a worker thread (asyncio.to_thread).
  - Uses the OpenAI Responses API if available (applies reasoning settings from src/config.py), otherwise falls back to Chat Completions.
  - Start (macOS/Linux):
    export MOCK_MODE=0
    export FAST_PIPELINE=0
    export PIPELINE_CONCURRENCY=0
    uvicorn app.backend.app:app --reload --port 8000
  - Start (Windows PowerShell):
    $Env:MOCK_MODE = 0
    $Env:FAST_PIPELINE = 0
    $Env:PIPELINE_CONCURRENCY = 0
    uvicorn app.backend.app:app --reload --port 8000
- Fast path (concurrent previews)
  - Uses Async Chat Completions with a semaphore; does not apply Responses‑specific reasoning.
  - Start (macOS/Linux):
    export MOCK_MODE=0
    export OPENAI_MODEL="gpt-4o-mini"   # Chat‑compatible model for fast path
    export FAST_PIPELINE=1
    export PIPELINE_CONCURRENCY=8       # tune parallelism
    uvicorn app.backend.app:app --reload --port 8000
  - Start (Windows PowerShell):
    $Env:MOCK_MODE = 0
    $Env:OPENAI_MODEL = "gpt-4o-mini"
    $Env:FAST_PIPELINE = 1
    $Env:PIPELINE_CONCURRENCY = 8
    uvicorn app.backend.app:app --reload --port 8000
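The fast path's semaphore pattern can be sketched as follows, with the actual async chat-completion call replaced by a placeholder sleep; the real backend code in app/backend/ may differ in detail.

```python
import asyncio

async def grade_all(answers, concurrency=8):
    """Run a grading coroutine for every answer, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def grade_one(answer):
        async with sem:
            # Stand-in for an Async Chat Completions call
            await asyncio.sleep(0.01)
            return f"graded:{answer}"

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(grade_one(a) for a in answers))

results = asyncio.run(grade_all([f"ans{i}" for i in range(20)], concurrency=8))
print(len(results))  # 20
```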
Notes
- File‑upload UI uses /api/grade_file and relies only on environment variables for the path choice.
- JSON endpoint /api/grade can also set concurrency per request: { "params": { "concurrency": 8 } }.
- Optional: PIPELINE_SAMPLE_LIMIT caps sub‑question rows before either path runs.
- Symptom: 400 invalid_request_error: Unsupported parameter 'max_tokens' … use 'max_completion_tokens'.
- Cause: Running the fast path (Chat Completions) with a Responses‑only model such as gpt-5-mini.
- Fix:
  - Use the slow path (see above), or
  - Keep the fast path but switch to a Chat‑compatible model, e.g. export OPENAI_MODEL="gpt-4o-mini".