This is the official code base for reproducing the self-training method in *Self-Training Large Language Models with Confident Reasoning*. The training pipeline (1) samples math/science questions, (2) generates multiple answers with a LoRA-tuned Llama 3.1, (3) scores them with an internal judge, and (4) runs Direct Preference Optimization (DPO) on the resulting preference pairs.
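The four stages can be sketched end-to-end as follows. This is a minimal illustration, not the repository's actual API: the function names, the stand-in generator, and the random judge scores are all assumptions, and the real judge evaluates reasoning as well as final answers.

```python
import random

def generate_answers(question, k=8):
    # Stand-in for sampling k candidate answers from the LoRA-tuned
    # Llama 3.1 policy; here we fabricate distinct placeholder strings.
    return [f"answer_{i} to {question!r}" for i in range(k)]

def judge_confidence(question, answer):
    # Stand-in for the internal judge; returns a confidence in [0, 1].
    # The real scorer inspects the reasoning chain and the final answer.
    return random.random()

def build_preference_pair(question):
    """Turn sampled answers into one (chosen, rejected) pair for DPO."""
    answers = generate_answers(question)
    scored = sorted(answers, key=lambda a: judge_confidence(question, a))
    # Highest-confidence answer becomes "chosen", lowest "rejected".
    return {"prompt": question, "chosen": scored[-1], "rejected": scored[0]}

pair = build_preference_pair("What is 17 * 24?")
```

The `{"prompt", "chosen", "rejected"}` dictionary matches the preference-pair format that standard DPO trainers consume.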
## Repository structure

- `main.py`
- `core_po/`
  - `arguments.py`
  - `__init__.py`
  - `data.py` – loaders for GSM8K, ARC, MATH, and GPQA with rank-aware sharding.
  - `generation.py` – prompt templates and sampling.
  - `judge.py` – confidence scorer that evaluates reasoning and final answers.
  - `models.py` – loader of LoRA-adapted Llama 3.1 weights (8-bit or bf16).
  - `trainer.py` – CORE-PO DPO training loop.
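The rank-aware sharding in `data.py` presumably gives each GPU a disjoint slice of the dataset. A minimal sketch of that idea, assuming a simple strided scheme (the function name and scheme are illustrative, not the file's actual code):

```python
def shard_for_rank(examples, rank, world_size):
    # Strided split: rank r takes items r, r + world_size, r + 2*world_size, ...
    # so the union over all ranks covers the dataset with no overlap.
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return examples[rank::world_size]

# Example: 10 questions across 4 processes, matching the 4-GPU setup.
data = list(range(10))
shards = [shard_for_rank(data, r, 4) for r in range(4)]
```

A strided split keeps shard sizes within one example of each other even when the dataset size is not divisible by the number of ranks.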
Training requires four NVIDIA A100 GPUs. Launch with:
```shell
accelerate launch --num_processes 4 main.py \
    --save_directory ./dpo_saved \
    --save_name ours_run \
    --learning_rate 5e-6 \
    --batch_size 4
```

## Citation

```bibtex
@inproceedings{jang-etal-2025-self,
    title = {Self-Training Large Language Models with Confident Reasoning},
    author = {Jang, Hyosoon and Jang, Yunhui and Lee, Sungjae and Ok, Jungseul and Ahn, Sungsoo},
    editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
    booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
    month = {nov},
    year = {2025},
    address = {Suzhou, China},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2025.findings-emnlp.806/},
    doi = {10.18653/v1/2025.findings-emnlp.806},
    pages = {14925--14939},
    isbn = {979-8-89176-335-7}
}
```