In this project, we conducted a thorough analysis of the performance of large language models (LLMs) with a focus on prediction uncertainty. To quantify uncertainty, we employed conformal prediction techniques. We further examined the sources of uncertainty using an AU/EU decomposition based on entropy values.
Compared with other methods, conformal prediction is model-agnostic and distribution-free, and it provides a statistically rigorous estimate of uncertainty. We adapted our code from LLM-Uncertainty-Bench. The pipeline is shown below:
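Independently of the pipeline figure, the core split conformal step can be illustrated with a minimal sketch. It is not the code adapted from LLM-Uncertainty-Bench: the function names (`softmax`, `lac_threshold`, `prediction_sets`), the choice of the score `1 - p(true label)`, and the synthetic logits are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Convert per-option logits of shape (n, k) into probabilities."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def lac_threshold(cal_probs, cal_labels, alpha):
    """Calibrate a threshold on the scores 1 - p(true label).

    Hypothetical helper: uses the finite-sample rank ceil((n + 1) * (1 - alpha)).
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration points for this alpha: keep every option.
        return 1.0
    return float(np.sort(scores)[k - 1])


def prediction_sets(test_probs, q_hat):
    """Keep every option whose score 1 - p(option) is at most q_hat."""
    return [np.flatnonzero(1.0 - p <= q_hat).tolist() for p in test_probs]


if __name__ == "__main__":
    # Synthetic stand-in for model logits over six options (A-F).
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(200, 6))
    labels = rng.integers(0, 6, size=200)
    probs = softmax(logits)

    # cal_ratio = 0.5: first half calibrates, second half is tested.
    q_hat = lac_threshold(probs[:100], labels[:100], alpha=0.1)
    sets = prediction_sets(probs[100:], q_hat)

    coverage = np.mean([y in s for y, s in zip(labels[100:], sets)])
    avg_size = np.mean([len(s) for s in sets])
    print(f"coverage={coverage:.3f}, average set size={avg_size:.2f}")
```

In this sketch, a larger average set size indicates higher uncertainty, and a set may be empty when no option clears the threshold.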
To compare uncertainty scores and explain the sources of uncertainty, we used an entropy-based decomposition into Aleatoric Uncertainty (AU, data variations) and Epistemic Uncertainty (EU, model ambiguities). We adapted our code from UQ_ICL. The framework is shown below:
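As a rough illustration of what an entropy-based decomposition can look like, the sketch below uses the common textbook form: total uncertainty is the entropy of the averaged prediction, AU is the average entropy of the individual predictions, and EU is the difference. This is an assumption for illustration only; the exact formulation in UQ_ICL and in `au_eu_decomposition.py` may differ, and the function names here are hypothetical.

```python
import numpy as np


def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy in nats; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1.0)
    return -(p * np.log(p)).sum(axis=axis)


def decompose_uncertainty(probs):
    """Split total predictive entropy into aleatoric and epistemic parts.

    probs: array of shape (n_runs, n_classes), one predictive distribution
    per run for the same input (for example, runs that differ in sampled
    demonstrations or decoding settings; this setup is assumed here).
    """
    probs = np.asarray(probs, dtype=float)
    total = entropy(probs.mean(axis=0))       # entropy of the averaged prediction
    aleatoric = entropy(probs, axis=1).mean() # average entropy of each run
    epistemic = total - aleatoric             # disagreement between runs
    return total, aleatoric, epistemic


if __name__ == "__main__":
    # Runs agree but are individually unsure: uncertainty is mostly aleatoric.
    print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))
    # Runs are individually confident but disagree: mostly epistemic.
    print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

Under this form, EU is zero when all runs produce the same distribution and grows as the runs disagree.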
We used Python 3.10.13. Install the dependencies with:
pip install -r requirements.txt

To apply conformal prediction, first prompt the LLMs to obtain the logit outputs for all options (i.e., A, B, C, D, E, and F):
python generate_logits.py \
--model={HuggingFace model directory} \
--data_path=data \
--file={name of dataset} \
--prompt_method={base/shared/task} \
--output_dir=outputs_base \
--few_shot={1 for few-shot and 0 for zero-shot}

For example,
python generate_logits.py \
--model=Qwen/Qwen-7B \
--data_path=data \
--file=mmlu_10k.json \
--prompt_method=task \
--output_dir=outputs_base \
--few_shot=1

Then split each dataset into a calibration set and a test set, and apply conformal prediction to obtain prediction sets for all test set instances:
python uncertainty_quantification_via_cp.py \
--model={model name} \
--raw_data_dir=data \
--logits_data_dir=outputs_base \
--data_names={list of datasets to be evaluated} \
--prompt_methods={list of prompt methods} \
--icl_methods={icl1 for few-shot and icl0 for zero-shot} \
--cal_ratio={calibration data percentage, e.g., 0.5} \
--alpha={error rate, e.g., 0.1} \
--calc-emotion={True if use SA, False otherwise}

For example,
python uncertainty_quantification_via_cp.py \
--model=Qwen-7B \
--raw_data_dir=data \
--logits_data_dir=outputs_base \
--data_names=mmlu_10k \
--prompt_methods=task \
--icl_methods=icl1 \
--cal_ratio=0.5 \
--alpha=0.1 \
--calc-emotion=True

For the AU/EU decomposition, run directly:
python au_eu_decomposition.py \
--model={model name} \
--data_path=data \
--file={name of dataset} \
--num_data={number of data points to compute}

For example,
python au_eu_decomposition.py \
--model=Qwen-7B \
--data_path=data \
--file=emotion_10k_4.json \
--num_data=1000

@article{ye2024llm_uq,
title={Benchmarking LLMs via Uncertainty Quantification},
author={Ye, Fanghua and Yang, MingMing and Pang, Jianhui and Wang, Longyue and Wong, Derek F and Yilmaz, Emine and Shi, Shuming and Tu, Zhaopeng},
journal={arXiv preprint arXiv:2401.12794},
year={2024}
}

@inproceedings{ling2024uncertainty,
title={Uncertainty Decomposition and Quantification for In-Context Learning of Large Language Models},
author={Chen Ling and Xujiang Zhao and Wei Cheng and Yanchi Liu and Yiyou Sun and Xuchao Zhang and Mika Oishi and Takao Osaki and Katsushi Matsuda and Jie Ji and Guangji Bai and Liang Zhao and Haifeng Chen},
booktitle={2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
year={2024},
url={https://openreview.net/forum?id=Oq1b1DnUOP}
}
