This repo provides code for the NeurIPS 2024 paper Decoding-Time Language Model Alignment with Multiple Objectives. If you have any questions, please let me know by creating an issue!
conda create -n ppo python=3.10.4
conda activate ppo
cd experiment-PPO
pip install -r requirements.txt
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
Here is the guidance from RiC: the trl package currently has an issue with its 'DataCollatorForCompletionOnlyLM' implementation, wherein the first token of the template may be nonsensical and unmatched. Hence, it is necessary to modify the trl/trainer/utils.py file within your trl package (usually under the path {path to conda or python}/lib/python3/site-packages), specifically the 'torch_call' function starting from line 119. After installing the trl package from requirements.txt, you can simply replace trl/trainer/utils.py with the utils.py file located in the 'fix_trl' directory, as sketched below.
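If you prefer to script this replacement, the following is a minimal sketch (assuming it is run, inside the ppo environment, from the directory that contains fix_trl):

```python
# Sketch: overwrite the installed trl/trainer/utils.py with the patched copy
# shipped in this repo's fix_trl/ directory. Assumes the script is run from
# the directory containing fix_trl, after `pip install -r requirements.txt`.
import shutil
from pathlib import Path

import trl

installed_utils = Path(trl.__file__).parent / "trainer" / "utils.py"
patched_utils = Path("fix_trl") / "utils.py"

shutil.copyfile(patched_utils, installed_utils)
print(f"Replaced {installed_utils} with {patched_utils}")
```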
conda create -n dpo python=3.10
conda activate dpo
cd experiment-DPO
pip install -r requirements.txt
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install flash-attn==2.3.2 --no-build-isolation
We provide two tasks, namely Reddit Summary and Helpful Assistant, for PPO training. The relevant descriptions can be found in RiC, and the training code is borrowed from RiC as well.
cd ./sft
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch sft.py --base_model_name 'meta-llama/Llama-2-7b-hf' --exp_type 'summary' --wandb_name {name_of_the_experiment}
Here, 'summary' stands for the Reddit Summary task, and 'assistant' stands for Helpful Assistant.
MORLHF optimizes preference-weighted rewards using PPO, starting from the SFT model; a sketch of the reward weighting is given after the commands below.
cd ./ppo
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch morlhf.py --preference 1 --base_model_name {path_to_the_sft_model} --reward_names 'harmless,helpful' --exp_type 'assistant' --wandb_name 'rlhf_harmless'
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch morlhf.py --preference 0 --base_model_name {path_to_the_sft_model} --reward_names 'harmless,helpful' --exp_type 'assistant' --wandb_name 'rlhf_helpful'
Here, preference is the weight on the first reward. As for exp_type and reward_names: 'summary' stands for the Reddit Summary task, with reward_names set to 'summary,faithful'; 'assistant' stands for Helpful Assistant, with reward_names being any subset of 'harmless,helpful,humor'. We provide some checkpoints in this link.
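For intuition, the objective that MORLHF optimizes is simply a convex combination of the per-objective rewards. Below is a minimal sketch of this scalarization; the function name and example values are illustrative, not taken from the repo.

```python
# Sketch: preference-weighted reward scalarization used by MORLHF.
# `preference` is the weight on the first reward (the --preference argument
# above); the remaining weight goes to the second reward.
def scalarize(reward_first: float, reward_second: float, preference: float) -> float:
    assert 0.0 <= preference <= 1.0
    return preference * reward_first + (1.0 - preference) * reward_second

# --preference 1 recovers single-objective RLHF on the first reward,
# --preference 0 on the second.
print(scalarize(reward_first=0.7, reward_second=0.2, preference=0.5))  # ~0.45
```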
We provide the evaluation scripts for multi-objective decoding and rewarded soups in ppo/eval_ours.sh and ppo/eval_rs.sh. For the visualization script, please refer to RiC.
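As a reminder of what the rewarded soups baseline does, here is a minimal sketch of weight-space interpolation between two fine-tuned checkpoints; the checkpoint paths and the interpolation weight are illustrative, and the actual evaluation is driven by ppo/eval_rs.sh.

```python
# Sketch: rewarded soups interpolates the parameters of models fine-tuned for
# different rewards, instead of combining them at decoding time.
from transformers import AutoModelForCausalLM

w = 0.5  # weight on the first model's parameters
model_a = AutoModelForCausalLM.from_pretrained("path/to/rlhf_harmless")
model_b = AutoModelForCausalLM.from_pretrained("path/to/rlhf_helpful")

state_b = model_b.state_dict()
souped = {name: w * p + (1.0 - w) * state_b[name] for name, p in model_a.state_dict().items()}

model_a.load_state_dict(souped)  # model_a now holds the interpolated weights
model_a.save_pretrained("path/to/rewarded_soup_w0.5")
```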
We provide one task, namely Safety Alignment, for DPO training. The relevant descriptions can be found in modpo, and the training code is borrowed from modpo as well.
We provide the training scripts for DPO in dpo/run.sh. The divergence_type argument is implemented for fDPO; we support 'reverse_kl', 'forward_kl', 'jsd', and '$\alpha$-divergence' (standard definitions are recalled after the command below). The preference argument stands for the preference dataset to train on, where 'better' and 'safer' are supported.
bash scripts/dpo/run.sh
We provide some checkpoints in this link.
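For reference, under one common convention the supported divergences between the policy $\pi$ and the reference policy $\pi_{\mathrm{ref}}$ are the following (the exact parameterization used in the code may differ):

$$
\begin{aligned}
\text{reverse KL:}\quad & \mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}}) = \mathbb{E}_{\pi}\big[\log\tfrac{\pi}{\pi_{\mathrm{ref}}}\big], \\
\text{forward KL:}\quad & \mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, \pi) = \mathbb{E}_{\pi_{\mathrm{ref}}}\big[\log\tfrac{\pi_{\mathrm{ref}}}{\pi}\big], \\
\text{JSD:}\quad & \tfrac12\,\mathrm{KL}(\pi \,\|\, m) + \tfrac12\,\mathrm{KL}(\pi_{\mathrm{ref}} \,\|\, m), \qquad m = \tfrac12(\pi + \pi_{\mathrm{ref}}), \\
\alpha\text{-divergence:}\quad & \tfrac{1}{\alpha(\alpha-1)}\Big(\mathbb{E}_{\pi_{\mathrm{ref}}}\big[(\tfrac{\pi}{\pi_{\mathrm{ref}}})^{\alpha}\big] - 1\Big).
\end{aligned}
$$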
bash scripts/eval/eval.sh {CUDA_DEVICE} {ALGORITHM: mod or rs} {DIVERGENCE_TYPE} {WEIGHTING_FOR_QUALITY} {WEIGHTING_FOR_SAFETY}
Additionally, we implement a more general multi-objective decoding in src/utils/util_decode_general.py, which supports any kind of model. You can refer to it for more experiments!
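To illustrate the idea behind decoding-time combination in the reverse-KL case, here is a minimal, self-contained sketch that mixes the next-token log-probabilities of several aligned models with user-specified weights; the checkpoint paths, prompt, and greedy decoding loop are illustrative and independent of util_decode_general.py.

```python
# Sketch: decoding-time combination of next-token distributions from several
# aligned models. With weights summing to 1, the combined distribution is the
# normalized geometric mixture of the per-model next-token distributions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

weights = [0.5, 0.5]  # one weight per objective, summing to 1
paths = ["path/to/rlhf_harmless", "path/to/rlhf_helpful"]

tokenizer = AutoTokenizer.from_pretrained(paths[0])
models = [AutoModelForCausalLM.from_pretrained(p).eval() for p in paths]

prompt = "Human: How can I stay safe online?\n\nAssistant:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(64):  # greedy decoding with the combined distribution
        combined = sum(
            w * torch.log_softmax(m(input_ids).logits[:, -1, :], dim=-1)
            for w, m in zip(weights, models)
        )
        next_token = combined.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```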
MOD is based on many open-source projects, including trl, modpo, RiC, and Fine-Grained RLHF, and we have also referred to fDPO, PackLLM, and DPA. We thank all these authors for their nicely open-sourced code and their great contributions to the community.
MOD is licensed under the MIT license. See the LICENSE file for details.
If you find our work useful, please consider citing:
@inproceedings{Shi2024MOD,
  title={Decoding-Time Language Model Alignment with Multiple Objectives},
  author={Ruizhe Shi and Yifang Chen and Yushi Hu and Alisa Liu and Hannaneh Hajishirzi and Noah A. Smith and Simon S. Du},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}
