This is the official repository for the paper "Learning a Pessimistic Reward Model in RLHF".
Authors: Yinglun Xu*, Hangoo Kang*, Tarun Suresh, Yuxuan Wan, Gagandeep Singh.
- Pessimistic Reward Finetuning (PET) is a novel reward fine-tuning method. A reward model fine-tuned by PET is pessimistic and robust against reward hacking.
- Based on PET, we also develop a three-step RLHF framework; see the figure below.
- We test our method on the TL;DR summarization and IMDB datasets, and we find that a high-quality policy can be learned with our PET framework without using any regularization.
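As rough intuition for the pessimism idea, the sketch below combines a standard pairwise preference loss with a penalty on the best-of-n responses sampled from the current policy; pushing down the best-of-n score keeps the reward model from over-valuing responses the policy can easily reach. This is only an illustrative sketch under our own naming (`preference_loss`, `best_of_n_penalty`, and `lambda_pess` are hypothetical); the actual PET objective is the one defined in the paper and implemented by the Adversarial BON Trainer.

```python
# Illustrative sketch only -- not the exact PET objective. All function and variable
# names here are hypothetical; see the paper and the Adversarial BON Trainer for the
# actual formulation.
import torch
import torch.nn.functional as F


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry pairwise loss on labeled preference data."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()


def best_of_n_penalty(candidate_rewards: torch.Tensor) -> torch.Tensor:
    """Penalize the highest reward among n policy-generated responses per prompt.

    candidate_rewards has shape [batch, n]. Lowering the best-of-n score makes the
    reward model pessimistic about responses the current policy can easily produce,
    which is one way to discourage reward hacking.
    """
    return candidate_rewards.max(dim=-1).values.mean()


def pessimistic_reward_loss(
    r_chosen: torch.Tensor,
    r_rejected: torch.Tensor,
    candidate_rewards: torch.Tensor,
    lambda_pess: float = 0.1,  # hypothetical trade-off weight
) -> torch.Tensor:
    """Fit the preference data while staying pessimistic on policy samples."""
    return preference_loss(r_chosen, r_rejected) + lambda_pess * best_of_n_penalty(candidate_rewards)
```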
To use the trainers and run the code in this codebase, please install the required packages listed in the requirements.txt file. If you have already installed them, you can skip this step and jump directly to the Launch Trainers section below.
To set up the environment correctly, you can run the following commands:
conda create -n PET python=3.11 && conda activate PET
pip install -r requirements.txt

To train an SFT model, please run the following command:
bash train_scripts/launch_policy_sft.sh

We provide two different reward model trainers in our codebase. Reward Trainer is the standard reward trainer, and Adversarial BON Trainer is our PET trainer.
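Whichever trainer you use, the resulting reward model assigns a scalar score to a prompt-response pair. As a quick illustration of how such a model can be queried, assuming it is exported as a Hugging Face sequence-classification model with a single scalar head (the checkpoint path is a placeholder, and the exact format produced by the scripts below may differ):

```python
# Illustrative only: scoring a prompt/response pair with a reward model saved as a
# Hugging Face sequence-classification checkpoint with one scalar output. The path is
# a placeholder; adapt it to whatever format your trained reward model actually uses.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

reward_model_path = "path/to/your/reward_checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(reward_model_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_path, num_labels=1)
reward_model.eval()

prompt = "Summarize: The quick brown fox jumps over the lazy dog."
response = "A fox jumps over a dog."
inputs = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)

with torch.no_grad():
    score = reward_model(**inputs).logits.squeeze().item()  # scalar reward
print(f"reward score: {score:.4f}")
```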
To launch the Reward Trainer, please run the following command:
bash train_scripts/launch_reward_sft.sh

To launch the Adversarial BON Trainer (PET), please run the following command:
bash train_scripts/launch_pet.sh

In addition to the above trainers, this codebase also provides a PPO Trainer. To launch the PPO Trainer, please run the following command:
bash train_scripts/launch_ppo.sh

We will upload model checkpoints soon. Stay tuned! 😀
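For reference, the sketch below shows one way a trained reward model can be hooked into a PPO loop, following TRL's classic PPOTrainer interface (pre-0.12 releases; newer TRL versions changed this API). Model paths, prompts, and hyperparameters are placeholders, and the PPO trainer launched by train_scripts/launch_ppo.sh may differ from this sketch.

```python
# Illustrative sketch only, using TRL's classic PPOTrainer interface (TRL < 0.12).
# Paths, prompts, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_path = "path/to/sft_checkpoint"      # placeholder
reward_path = "path/to/reward_checkpoint"   # placeholder

config = PPOConfig(model_name=policy_path, batch_size=4, mini_batch_size=2)
tokenizer = AutoTokenizer.from_pretrained(policy_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_path)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(policy_path)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_path, num_labels=1).eval()

ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

prompts = ["Summarize: ..."] * config.batch_size  # placeholder prompts
queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]

# Roll out responses with the current policy.
gen_kwargs = {"max_new_tokens": 48, "do_sample": True, "pad_token_id": tokenizer.eos_token_id}
responses = []
for q in queries:
    output = ppo_trainer.generate(q, **gen_kwargs)
    responses.append(output.squeeze(0)[q.shape[0]:])  # keep only newly generated tokens

# Score each prompt + response with the (pessimistic) reward model.
rewards = []
with torch.no_grad():
    for q, r in zip(queries, responses):
        text = tokenizer.decode(torch.cat([q, r]), skip_special_tokens=True)
        inputs = reward_tokenizer(text, return_tensors="pt", truncation=True)
        rewards.append(reward_model(**inputs).logits.squeeze())

# One PPO update on this batch.
stats = ppo_trainer.step(queries, responses, rewards)
```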
If you find our project helpful, please consider citing our paper:
@misc{xu2025learningpessimisticrewardmodel,
title={Learning a Pessimistic Reward Model in RLHF},
author={Yinglun Xu and Hangoo Kang and Tarun Suresh and Yuxuan Wan and Gagandeep Singh},
year={2025},
eprint={2505.20556},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.20556},
}
Yinglun Xu, Hangoo Kang, Gagandeep Singh
This codebase is adapted from the DPO codebase. The PPO trainer is adapted from TRL's PPO trainer.
