Learning a Pessimistic Reward Model in RLHF: KL Regularization is Not Necessary


This is the official repository for the paper "Learning a Pessimistic Reward Model in RLHF".

Authors: Yinglun Xu*, Hangoo Kang*, Tarun Suresh, Yuxuan Wan, Gagandeep Singh.

Overview

  • Pessimistic Reward Finetuning (PET) is a novel reward fine-tuning method. A reward model fine-tuned with PET is pessimistic and robust against reward hacking.
  • Based on PET, we also develop a three-step RLHF framework; see the figure below.
  • We test our method on the TL;DR summarization and IMDB datasets and find that, with our PET framework, one can learn a high-quality policy without using any regularization.
[Figure: rlhf-framework] Our proposed three-step RLHF framework based on PET.

Environment Setup

To use the trainers and run the code in this codebase, please install the required packages listed in requirements.txt. If you have already installed them, you can skip this step and jump directly to Launch Trainers below.

To set up the environment, run the following commands:

conda create -n PET python=3.11 && conda activate PET

pip install -r requirements.txt
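
After installation, a quick sanity check can confirm the core dependencies import cleanly. The snippet below assumes requirements.txt pins PyTorch and Hugging Face Transformers, which is our guess for an RLHF codebase rather than something stated in this README:

# Quick import sanity check (assumes torch and transformers are in requirements.txt).
import torch
import transformers

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"transformers {transformers.__version__}")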

Launch Trainers

SFT Training

To train an SFT model, please run the following command:

bash train_scripts/launch_policy_sft.sh
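
For readers unfamiliar with the SFT stage, the sketch below shows what supervised fine-tuning on prompt/response pairs typically looks like. It is a minimal illustration only, assuming a gpt2 placeholder backbone and toy TL;DR-style data; the repo's actual SFT trainer is driven by launch_policy_sft.sh and its defaults may differ:

# Minimal illustrative SFT step; not the repo's actual trainer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy prompt/response pairs standing in for the TL;DR SFT split.
texts = [
    "POST: a long reddit post ...\nTL;DR: a one-sentence summary." + tokenizer.eos_token,
    "POST: another post ...\nTL;DR: another short summary." + tokenizer.eos_token,
]
batch = tokenizer(texts, return_tensors="pt", padding=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
# Standard causal-LM loss over prompt+response; real SFT often masks the prompt tokens.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()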

Reward Modeling Training

We provide two different reward-model trainers in our codebase: the Reward Trainer is the standard reward-model trainer, and the Adversarial BON Trainer is our PET trainer.

To launch the Reward Trainer, please run the following command:

bash train_scripts/launch_reward_sft.sh
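
As background, standard reward trainers of this kind typically minimize a pairwise Bradley-Terry loss over (chosen, rejected) preference pairs. The sketch below illustrates that loss with a gpt2 placeholder backbone; it is not necessarily this repo's exact implementation:

# Sketch of the standard pairwise (Bradley-Terry) reward-modeling loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "gpt2"  # placeholder backbone (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def scalar_reward(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return reward_model(**batch).logits.squeeze(-1)  # one scalar per sequence

# Toy preference pair: chosen (preferred) vs. rejected completion for the same prompt.
chosen = ["POST: ...\nTL;DR: a faithful summary."]
rejected = ["POST: ...\nTL;DR: an off-topic summary."]

# -log sigmoid(r_chosen - r_rejected): pushes the chosen reward above the rejected one.
loss = -F.logsigmoid(scalar_reward(chosen) - scalar_reward(rejected)).mean()
loss.backward()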

PET Training

To launch the Adversarial BON Trainer (PET), please run the following command:

bash train_scripts/launch_pet.sh
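
To give a rough intuition for the name "Adversarial BON": imagine an adversary that picks the best-of-n candidate response under the current reward model (the most likely reward-hacking candidate), and a pessimistic update that keeps the human-preferred response scored above it. The sketch below is our speculative illustration of that idea, with hypothetical helper callables reward_fn and policy_sample_fn; it is not the paper's actual objective, for which you should consult the paper and launch_pet.sh:

# Very rough illustration of an adversarial best-of-n (BoN) pessimism step.
# This is a guess at the flavor of PET, NOT the paper's exact method.
import torch
import torch.nn.functional as F

def pet_style_step(reward_fn, policy_sample_fn, prompt, preferred_response, n=8):
    # Adversary: best-of-n under the *current* reward model.
    candidates = [policy_sample_fn(prompt) for _ in range(n)]
    cand_rewards = torch.stack([reward_fn(prompt, c) for c in candidates])
    adversarial_reward = cand_rewards.max()

    # Pessimistic update: keep the preferred response scored above the
    # adversarially selected one.
    preferred_reward = reward_fn(prompt, preferred_response)
    loss = -F.logsigmoid(preferred_reward - adversarial_reward)
    return loss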

PPO Training

In addition to the above trainers, this codebase also provides a PPO Trainer. To launch the PPO Trainer, please run the following command:

bash train_scripts/launch_ppo.sh
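
Since the PPO trainer is adapted from TRL (see Acknowledgement below), a minimal loop in the style of TRL's classic PPOTrainer API may help orient new users. The sketch assumes an older TRL interface (roughly pre-0.12) and a gpt2 placeholder model; the repo's adapted trainer and newer TRL releases may differ:

# Minimal PPO loop in the style of TRL's classic PPOTrainer API (older TRL versions).
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

config = PPOConfig(model_name=model_name, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer("POST: ...\nTL;DR:", return_tensors="pt").input_ids[0]
response = ppo_trainer.generate([query], return_prompt=False, max_new_tokens=32)[0]
reward = [torch.tensor(1.0)]  # would come from the (pessimistic) reward model

stats = ppo_trainer.step([query], [response], reward)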

Evaluation

Model Weights

We will upload model checkpoints soon. Stay tuned! 😀

Citation

If you find our project helpful, please consider citing our paper:

@misc{xu2025learningpessimisticrewardmodel,
      title={Learning a Pessimistic Reward Model in RLHF}, 
      author={Yinglun Xu and Hangoo Kang and Tarun Suresh and Yuxuan Wan and Gagandeep Singh},
      year={2025},
      eprint={2505.20556},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.20556}, 
}

Contact Authors

Yinglun Xu, Hangoo Kang, Gagandeep Singh

Acknowledgement

This codebase is adapted from the DPO codebase, and the PPO trainer is adapted from TRL's PPO trainer.
