This is the official repository for the paper "Learning a Pessimistic Reward Model in RLHF".
Authors: Yinglun Xu*, Hangoo Kang*, Tarun Suresh, Yuxuan Wan, Gagandeep Singh.
- Pessimistic Reward Finetuning (PET) is a novel reward fine-tuning method. A reward model fine-tuned by PET is pessimistic and robust against reward hacking.
- Based on PET, we also develop a three-step RLHF framework; see the figure below.
- We test our method on the TL;DR summarization and IMDB datasets, and we find that a high-quality policy can be learned with our PET framework without using any regularization.
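As rough intuition for the pessimism idea, the sketch below combines a standard pairwise preference loss with a penalty on the best-of-n responses sampled from the current policy; pushing down the best-of-n score keeps the reward model from over-valuing responses the policy can easily reach. This is only an illustrative sketch under our own naming (`preference_loss`, `best_of_n_penalty`, and `lambda_pess` are hypothetical); the actual PET objective is the one defined in the paper and implemented by the Adversarial BON Trainer.

```python
# Illustrative sketch only -- not the exact PET objective. All function and variable
# names here are hypothetical; see the paper and the Adversarial BON Trainer for the
# actual formulation.
import torch
import torch.nn.functional as F


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry pairwise loss on labeled preference data."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def best_of_n_penalty(candidate_rewards: torch.Tensor) -> torch.Tensor:
    """Penalize the highest reward among n policy-generated responses per prompt.

    candidate_rewards has shape [batch, n]. Lowering the best-of-n score makes the
    reward model pessimistic about responses the current policy can easily produce,
    which is one way to discourage reward hacking.
    """
    return candidate_rewards.max(dim=-1).values.mean()


def pessimistic_reward_loss(
    r_chosen: torch.Tensor,
    r_rejected: torch.Tensor,
    candidate_rewards: torch.Tensor,
    lambda_pess: float = 0.1,  # hypothetical trade-off weight
) -> torch.Tensor:
    """Fit the preference data while staying pessimistic on policy samples."""
    return preference_loss(r_chosen, r_rejected) + lambda_pess * best_of_n_penalty(candidate_rewards)
```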
To use the trainers and run the code in this codebase, please install the required packages listed in the requirements.txt file. If you have already installed them, you can skip this step and jump directly to the Launch Trainers section below.
To set up the environment correctly, you can run the following commands:
conda create -n PET python=3.11 && conda activate PET
pip install -r requirements.txt

To train an SFT model, please run the following command:
bash train_scripts/launch_policy_sft.sh

We provide two different reward model trainers in our codebase. Reward Trainer is the standard reward trainer, and Adversarial BON Trainer is our PET trainer.
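Whichever trainer you use, the resulting reward model assigns a scalar score to a prompt-response pair. As a quick illustration of how such a model can be queried, assuming it is exported as a Hugging Face sequence-classification model with a single scalar head (the checkpoint path is a placeholder, and the exact format produced by the scripts below may differ):

```python
# Illustrative only: scoring a prompt/response pair with a reward model saved as a
# Hugging Face sequence-classification checkpoint with one scalar output. The path is
# a placeholder; adapt it to whatever format your trained reward model actually uses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model_path = "path/to/your/reward_checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(reward_model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path, num_labels=1)
reward_model.eval()

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
response = "A fox jumps over a dog."
inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)

with torch.no_grad():
    score = reward_model(**inputs).logits.squeeze().item()  # scalar reward
print(f"reward score: {score:.4f}")
```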
To launch the Reward Trainer, please run the following command:
bash train_scripts/launch_reward_sft.sh

To launch the Adversarial BON Trainer (PET), please run the following command:
bash train_scripts/launch_pet.sh

In addition to the above trainers, this codebase also provides a PPO Trainer. To launch the PPO Trainer, please run the following command:
bash train_scripts/launch_ppo.sh

We will upload model checkpoints soon. Stay tuned! 😀
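For reference, the sketch below shows one way a trained reward model can be hooked into a PPO loop, following TRL's classic PPOTrainer interface (pre-0.12 releases; newer TRL versions changed this API). Model paths, prompts, and hyperparameters are placeholders, and the PPO trainer launched by train_scripts/launch_ppo.sh may differ from this sketch.

```python
# Illustrative sketch only, using TRL's classic PPOTrainer interface (TRL < 0.12).
# Paths, prompts, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_path = "path/to/sft_checkpoint"      # placeholder
reward_path = "path/to/reward_checkpoint"   # placeholder

config = PPOConfig(model_name=policy_path, batch_size=4, mini_batch_size=2)
tokenizer = AutoTokenizer.from_pretrained(policy_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_path)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_path)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_path, num_labels=1).eval()

ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

prompts = ["Summarize: ..."] * config.batch_size  # placeholder prompts
queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Roll out responses with the current policy.
gen_kwargs = {"max_new_tokens": 48, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
responses = []
for q in queries:
    output = ppo_trainer.generate(q, **gen_kwargs)
    responses.append(output.squeeze(0)[q.shape[0]:])  # keep only newly generated tokens

# Score each prompt + response with the (pessimistic) reward model.
rewards = []
with torch.no_grad():
    for q, r in zip(queries, responses):
        text = tokenizer.decode(torch.cat([q, r]), skip_special_tokens=True)
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True)
        rewards.append(reward_model(**inputs).logits.squeeze())

# One PPO update on this batch.
stats = ppo_trainer.step(queries, responses, rewards)
```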
If you find our project helpful, please consider citing our paper:
@misc{xu2025learningpessimisticrewardmodel,
title={Learning a Pessimistic Reward Model in RLHF},
author={Yinglun Xu and Hangoo Kang and Tarun Suresh and Yuxuan Wan and Gagandeep Singh},
year={2025},
eprint={2505.20556},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.20556},
}
Yinglun Xu, Hangoo Kang, Gagandeep Singh
This codebase is adapted from the DPO codebase. The PPO trainer is adapted from TRL's PPO trainer.
