This repository contains the full code and results for my undergraduate Engineering Science thesis at the University of Toronto (2025).
Title: Quantized LLM Hallucination Detection
Advisor: Prof. Andreas Moshovos
Dataset: MedHallu (medical hallucination QA dataset)
Goal: Evaluate the effects of k-bit quantization on hallucination detection accuracy, latency, and GPU usage across 4 LLM families (Qwen2.5, Gemma3, LLaMA3, BioMistral).
This repository evaluates the impact of k-bit quantization on a Large Language Model's (LLM's) ability to detect hallucinations in answers to medical questions. The run-llm-exp.py script supports inference and benchmarking across multiple models and quantization schemes. The currently supported models and quantization schemes are listed in the 'Experiment Design' section below; more can be introduced by adding the corresponding config files to the conf/ folder of the repo.
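For orientation, here is a minimal sketch of what a Hydra entry point like run-llm-exp.py might look like. The decorator and OmegaConf calls are standard Hydra API, but the config layout is hypothetical and may differ from the real script:

```python
# Minimal Hydra entry-point sketch (hypothetical; the real run-llm-exp.py
# may structure its config differently).
import hydra
from omegaconf import DictConfig, OmegaConf


@hydra.main(version_base=None, config_path="conf", config_name="config")
def main(cfg: DictConfig) -> None:
    # CLI overrides such as model=gemma3-1b-it or quantization.level=q4_k_m
    # are merged into cfg by Hydra before this function runs.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```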
Example CLI usage of run-llm-exp.py, running the instruction-tuned 1B-parameter gemma3 model with the q4_k_m GGUF quantization:

```bash
python run-llm-exp.py model=gemma3-1b-it quantization=gguf quantization.level=q4_k_m max_samples=10
```
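Assuming the script uses Hydra's standard entry point, Hydra's built-in `--multirun` flag can sweep several overrides in one invocation, e.g. `python run-llm-exp.py --multirun quantization.level=q8_0,q4_k_m,q2_k max_samples=10`.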
- Hydra-powered CLI interface via `run-llm-exp.py`
- Binary hallucination classification across 10 models × 6 quantization levels
- Outputs include:
- JSON logs of system + model metrics
- CSV of model predictions, ground truths, extracted model judgements
  - Confusion-matrix metrics, abstention rate, inference latency, and GPU usage statistics (see the loading sketch below)
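As a rough illustration, the prediction CSV could be post-processed as below; the file path and column names (`judgement`, `ground_truth`) are hypothetical, so check the actual output headers:

```python
# Hypothetical post-processing of a predictions CSV; the real file and
# column names may differ.
import pandas as pd

df = pd.read_csv("results/predictions.csv")  # hypothetical path

# Treat rows where the extracted judgement is neither 0 nor 1 as abstentions.
abstained = ~df["judgement"].isin([0, 1])
answered = df[~abstained]

accuracy = (answered["judgement"] == answered["ground_truth"]).mean()
print(f"abstention rate: {abstained.mean():.3f}, accuracy: {accuracy:.3f}")
```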
- ✅ Most models completed (see Results section)
- ⚠️ Some larger models skipped due to sharded GGUF format incompatibility and time constraints
- 🛠️ Actively running final models to complete benchmarks
Each experiment records a number of metrics (a minimal computation sketch follows this list):
- Confusion-matrix metrics: accuracy, precision, recall, F1 score, support
- Abstention rate
- Inference latency statistics (e.g., peak, avg, std_dev, variance)
- GPU usage statistics (e.g., peak, avg, std_dev for GPU load and utilization)
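These reduce to the standard definitions. A minimal computation sketch, assuming binary 0/1 judgements with None marking an abstention (an assumed encoding, for illustration):

```python
# Minimal confusion-matrix / abstention computation over binary labels,
# with None marking an abstention (assumed encoding, for illustration).
def summarize(preds, truths):
    answered = [(p, t) for p, t in zip(preds, truths) if p is not None]
    tp = sum(1 for p, t in answered if p == 1 and t == 1)
    fp = sum(1 for p, t in answered if p == 1 and t == 0)
    fn = sum(1 for p, t in answered if p == 0 and t == 1)
    tn = sum(1 for p, t in answered if p == 0 and t == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(answered) if answered else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": len(answered),
        "abstention_rate": 1 - len(answered) / len(preds),
    }
```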
- 💊 biomistral-7b-med
- 💊 llama3-med
- 🧠 gemma3-{1b, 4b, 12b}-it
- 🧠 qwen2.5-{0.5b, 1.5b, 3b, 7b}-it
- 🧠 qwen2.5-7b-it-med
- qwen2.5-14b-it – skipped due to unsupported sharded GGUF
- qwen2.5-32b-it – skipped for the same reason
All models (except the skipped ones) are evaluated with the following GGUF quantization levels (a sweep sketch follows the list):
`q8_0`, `q6_k`, `q5_k_m`, `q4_k_m`, `q3_k_m`, `q2_k`
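The full grid can be scripted by looping over models and levels and invoking the CLI shown earlier. A sketch, assuming the config names match the model identifiers listed in the Experiment Design section:

```python
# Sketch of a sweep driver over the model x quantization grid, invoking
# run-llm-exp.py with the same overrides as the usage example above.
import itertools
import subprocess

MODELS = [
    "biomistral-7b-med", "llama3-med",
    "gemma3-1b-it", "gemma3-4b-it", "gemma3-12b-it",
    "qwen2.5-0.5b-it", "qwen2.5-1.5b-it", "qwen2.5-3b-it",
    "qwen2.5-7b-it", "qwen2.5-7b-it-med",
]
LEVELS = ["q8_0", "q6_k", "q5_k_m", "q4_k_m", "q3_k_m", "q2_k"]

for model, level in itertools.product(MODELS, LEVELS):
    subprocess.run(
        ["python", "run-llm-exp.py", f"model={model}",
         "quantization=gguf", f"quantization.level={level}"],
        check=True,
    )
```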
The plots below show the tradeoff between accuracy and a system performance metric (e.g., latency, peak GPU memory usage). Pareto-dominant model configurations are annotated within each plot; a sketch of the dominance computation follows the plots.
Accuracy vs. Latency Pareto Frontier
Accuracy vs. Peak Memory Usage
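For reference, Pareto dominance here is the standard check (maximize accuracy, minimize latency or memory). A minimal sketch:

```python
# Pareto frontier over (accuracy, cost) points: a config is kept unless some
# other config has accuracy >= its accuracy and cost <= its cost, with at
# least one strict inequality.
def pareto_front(points):
    """points: list of (name, accuracy, cost); cost = latency or peak memory."""
    front = []
    for name, acc, cost in points:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for n, a, c in points if n != name
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front([("A", 0.80, 1.0), ("B", 0.75, 0.5), ("C", 0.70, 0.9)]))
# ['A', 'B'] -- C is dominated by B (cheaper and more accurate)
```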
- BioMistral 7B — 6 quantization levels
- LLaMA3-med — 5 quantization levels (missing q2_k)
- Gemma3-12B — 6 quantization levels
- Gemma3-4B — 6 quantization levels
- Gemma3-1B — 6 quantization levels
- Qwen2.5-7B-it — 6 quantization levels
- Qwen2.5-3B-it — 6 quantization levels
- Qwen2.5-1.5B-it — 6 quantization levels
- Qwen2.5-0.5B-it — 6 quantization levels