Members: Koid Xian Ting (24292249), Alexandra Harrison (22581066), Bhavesh Agarwal (23933845), Dhana Violeta Wurttele (24428786), Feng Sai (24445942), Harshitha Rajulapati (24628526) Client: Dr. Sirui Li
This project fine-tunes an OpenAI model to automatically grade student exam answers and evaluate its performance against human markers.
Prompt engineering and fine-tuning have been optimised for input data from the 2022 CITS5504 Data Warehousing exam: a CSV file containing answers to two exam questions from 198 students. A second DOCX input contains the exam marking scheme.
project-root/
├── app/
│ ├── backend/ # FastAPI backend (API endpoints + grading adapter)
│ └── frontend/ # React (Vite) web UI
├── data/
│ ├── CITS5553 Project Dataset.csv
│ └── 2022 Data Warehousing Final Exam (Partial).docx
├── finetune/ # Will be populated with files after running the finetuning steps
│ ├── train.jsonl
│ ├── val.jsonl
│ └── train.tiny.jsonl # Can produce this optional subset
├── eval/
│ └── test.csv # Held-out test split
├── Prompts/
│ └── prompt_templates.csv
├── src/
│ ├── data_loader.py
│ ├── ai_grader.py
│ ├── subquestion_utils.py
│ └── config.py # Pipeline config (model, reasoning, paths)
├── main.py # Entry point for baseline grading pipeline
├── data_pipeline.py # Data prep, grading, JSONL generation
├── fine_tune.py # Fine-tuning script
├── eval_finetune_vs_baseline.py # Evaluates baseline vs fine-tuned
├── requirements.txt
└── README.md
- Python 3.10+
- OpenAI Python SDK
- A valid OPENAI_API_KEY stored in a .env file in the project root:
OPENAI_API_KEY="your_api_key_here"
- Install requirements:
pip install -r requirements.txt
This will run the grading pipeline on the dataset using prompt-based grading. Run:
python main.py
main.py
├── src/data_loader.py → loads CSV + DOCX
├── src/subquestion_utils.py → splits answers into subquestions
├── src/ai_grader.py → calls OpenAI model with baseline prompt
├── Aggregates subquestion grades → question-level
└── Writes final results CSV (configured in src/config.py)
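The aggregation step above can be sketched as follows. This is an illustrative sketch only: the function name and the subquestion label convention (e.g. "Q1a", "Q1b") are hypothetical, and the real logic lives in main.py and src/.

```python
# Hypothetical sketch of the subquestion -> question-level aggregation step.
def aggregate_grades(subq_grades: dict[str, float]) -> dict[str, float]:
    """Sum subquestion marks (e.g. "Q1a", "Q1b") into question totals ("Q1")."""
    totals: dict[str, float] = {}
    for label, mark in subq_grades.items():
        question = label.rstrip("abcdefghij")  # "Q1a" -> "Q1" (assumes letter-suffixed labels)
        totals[question] = totals.get(question, 0.0) + mark
    return totals

print(aggregate_grades({"Q1a": 3.0, "Q1b": 4.5, "Q2a": 2.0}))  # {'Q1': 7.5, 'Q2': 2.0}
```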
This step creates the datasets for fine-tuning; it can also produce optional small subsets for test grading or for validating the JSONL formatting.
- Generate JSONL files for finetuning:
python data_pipeline.py --make-jsonl
This step will:
- Extract student answers, rubrics, and questions
- Stratify the data into train/val/test splits
- Generate:
- finetune/train.jsonl
- finetune/val.jsonl
- eval/test.csv (a test set held back for evaluation)
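The stratified split can be sketched in pure Python as below. This is a generic illustration; the stratification key, split fractions, and seed actually used by data_pipeline.py may differ.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, val_frac=0.1, test_frac=0.1, seed=42):
    """Split rows into train/val/test while keeping each stratum's proportions."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)
    train, val, test = [], [], []
    for group in by_stratum.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_frac)
        n_test = round(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test

# Example: 100 answers per question, stratified by question
rows = [{"question": q, "student": s} for q in ("Q1", "Q2") for s in range(100)]
train, val, test = stratified_split(rows, key=lambda r: r["question"])
print(len(train), len(val), len(test))  # 160 20 20
```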
The generated .jsonl files follow OpenAI's supervised fine-tuning format:
{
"messages": [
{ "role": "system", "content": "..." },
{ "role": "user", "content": "..." },
{ "role": "assistant", "content": "Grade: 10/20\nFeedback: ..." }
]
}
Generate a tiny subset to make sure everything is ready for fine-tuning:
python data_pipeline.py --make-jsonl --ft-out-suffix tiny --ft-sample 60 --ft-val-sample 12
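A record in the chat format shown above can be produced with a small helper like this sketch. The function name and prompt wording here are illustrative, not the project's actual code.

```python
import json

def to_ft_record(rubric: str, answer: str, grade: str, feedback: str) -> str:
    """Serialise one graded answer as a chat-format fine-tuning line (JSONL)."""
    record = {
        "messages": [
            {"role": "system", "content": "You are an exam grader. " + rubric},
            {"role": "user", "content": answer},
            {"role": "assistant", "content": f"Grade: {grade}\nFeedback: {feedback}"},
        ]
    }
    return json.dumps(record)

# One line of train.jsonl would then look like:
line = to_ft_record("Award marks for ...", "A star schema has ...", "10/20", "Partially correct.")
print(line)
```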
Once you've generated the JSONL files, you can start a fine-tuning job with the OpenAI API:
python fine_tune.py
This step will:
- Upload the training and validation JSONL files
- Start a fine-tuning job using a snapshot model (e.g. gpt-4o-mini-2024-07-18)
- Poll the job regularly until it is completed (several minutes)
- Print the fine-tuned model name once finished; paste this into the .env file (e.g. FT_MODEL_ID="ft:gpt-4o-mini:grader-v1")
python data_pipeline.py --grade
By default, this runs in test mode on 1 answer and saves results to:
results_single_test.csv
This verifies that the pipeline and API key are working.
Use the evaluation script to compare your fine-tuned model against the baseline on the held-out test set (eval/test.csv):
python eval_finetune_vs_baseline.py
This will:
- Run both models on the test set
- Compute metrics:
- Mean absolute error (MAE)
- Pearson correlation
- Intraclass correlation coefficient (ICC)
- Save results to results/ for analysis
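For reference, MAE and Pearson correlation can be computed in pure Python as below (ICC typically needs a stats library and is omitted). This is an illustrative sketch, not the evaluation script's actual implementation.

```python
import math

def mae(pred, truth):
    """Mean absolute error between predicted and human grades."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def pearson(pred, truth):
    """Pearson correlation between predicted and human grades."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)

# Hypothetical model vs. human marks for four answers
model = [8.0, 12.0, 15.0, 6.0]
human = [9.0, 11.0, 16.0, 5.0]
print(round(mae(model, human), 2))  # 1.0
print(round(pearson(model, human), 3))
```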
Validate the JSONL before finetuning:
python -c "from data_pipeline import validate_jsonl; validate_jsonl('finetune/train.jsonl')"
Note that only snapshot models like gpt-4o-mini-YYYY-MM-DD are currently supported for fine-tuning.
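A stand-alone version of such a check might look like the sketch below; the project's validate_jsonl in data_pipeline.py is the authoritative one.

```python
import json

REQUIRED_ROLES = ("system", "user", "assistant")

def validate_jsonl_line(line: str) -> bool:
    """Check one line parses as JSON and carries a system/user/assistant triple."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages")
    if not isinstance(messages, list):
        return False
    roles = tuple(m.get("role") for m in messages)
    return roles == REQUIRED_ROLES and all(isinstance(m.get("content"), str) for m in messages)

good = '{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "Grade: 10/20"}]}'
print(validate_jsonl_line(good))        # True
print(validate_jsonl_line("not json"))  # False
```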
Once you've cloned the repository...
Create a branch to work on:
# Create branch
git branch <branch-name>
# Move onto the branch you've just created
git checkout <branch-name>
Then create/edit your code.
If any changes are made to the main branch while you're still working on your branch, it's best practice to regularly pull those changes into your local version so your code stays up to date and accommodates any changes that interact with it. This will help prevent merge conflicts when you eventually open a pull request.
Check the status of your branch to see whether it's ahead or behind the main branch.
git status
If you're behind main, then pull those changes.
git pull origin main
While on the branch you've created, write your code and make changes, then when you're satisfied, commit that code. Best practice is to make commits often, whenever you've completed a meaningful change.
First, stage your changes to select all or some of the files you're ready to commit. To stage changes in all files:
git add .
Or to stage changes in a specific file:
git add <file-name>
Then commit the files and add a meaningful commit message:
git commit -m "this is a commit message"
Your git commits are stored locally on your computer until you push them to the remote repository, i.e. GitHub. Pushing your commits allows other people to view and check out the latest commits on your branch (using git checkout).
git push
Once you're happy with your code and you want to merge the changes from your branch onto the main branch, you should create a pull request, which can be done using GitHub. A banner like this will appear on GitHub after you've pushed your changes:

The pull request will show the differences between the code on your branch and the code on the branch you want to merge your changes onto (usually main). You should write a description about your code as part of the pull request, and select a couple of group members to review your code.
This repository also includes a web UI and backend to run grading interactively (file uploads → model scoring → results/visualisations).
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
The backend has two execution paths:
- Slow path (default)
  - Calls src/ai_grader.py in a worker thread (asyncio.to_thread).
  - Uses the OpenAI Responses API if available (applies reasoning settings from src/config.py), otherwise falls back to Chat Completions.
  - Start (macOS/Linux):
    export MOCK_MODE=0
    export FAST_PIPELINE=0
    export PIPELINE_CONCURRENCY=0
    uvicorn app.backend.app:app --reload --port 8000
  - Start (Windows PowerShell):
    $Env:MOCK_MODE = 0
    $Env:FAST_PIPELINE = 0
    $Env:PIPELINE_CONCURRENCY = 0
    uvicorn app.backend.app:app --reload --port 8000
- Fast path (concurrent previews)
  - Uses Async Chat Completions with a semaphore; does not apply Responses‑specific reasoning.
  - Start (macOS/Linux):
    export MOCK_MODE=0
    export OPENAI_MODEL="gpt-4o-mini"   # Chat‑compatible model for fast path
    export FAST_PIPELINE=1
    export PIPELINE_CONCURRENCY=8       # tune parallelism
    uvicorn app.backend.app:app --reload --port 8000
  - Start (Windows PowerShell):
    $Env:MOCK_MODE = 0
    $Env:OPENAI_MODEL = "gpt-4o-mini"
    $Env:FAST_PIPELINE = 1
    $Env:PIPELINE_CONCURRENCY = 8
    uvicorn app.backend.app:app --reload --port 8000
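The fast path's semaphore pattern can be sketched as follows, with the actual async chat-completion call replaced by a placeholder sleep; the real backend code in app/backend/ may differ in detail.

```python
import asyncio

async def grade_all(answers, concurrency=8):
    """Run a grading coroutine for every answer, at most `concurrency` at a time."""
    sem = asyncio.Semaphore(concurrency)

    async def grade_one(answer):
        async with sem:
            # Stand-in for an Async Chat Completions call
            await asyncio.sleep(0.01)
            return f"graded:{answer}"

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(grade_one(a) for a in answers))

results = asyncio.run(grade_all([f"ans{i}" for i in range(20)], concurrency=8))
print(len(results))  # 20
```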
Notes
- File‑upload UI uses /api/grade_file and relies only on environment variables for the path choice.
- JSON endpoint /api/grade can also set concurrency per request: { "params": { "concurrency": 8 } }.
- Optional: PIPELINE_SAMPLE_LIMIT caps sub‑question rows before either path runs.
- Symptom: 400 invalid_request_error: Unsupported parameter 'max_tokens' … use 'max_completion_tokens'.
- Cause: Running the fast path (Chat Completions) with a Responses‑only model such as gpt-5-mini.
- Fix:
  - Use the slow path (see above), or
  - Keep the fast path but switch to a Chat‑compatible model, e.g. export OPENAI_MODEL="gpt-4o-mini".