This repository contains the implementation of ResponseRank, a novel approach to preference learning that leverages preference strength information to improve utility estimation. We compare the method to a loss based on the standard Bradley-Terry model, which only considers binary choice outcomes.
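To make the contrast concrete, here is a minimal, purely illustrative sketch (not this repository's implementation and not the ResponseRank objective) of the standard Bradley-Terry negative log-likelihood next to a toy strength-weighted variant; the `strength` signal and its use as a per-pair weight are assumptions made only for illustration.

```python
# Illustrative sketch only -- not the ResponseRank objective or this repository's code.
import torch
import torch.nn.functional as F


def bradley_terry_loss(u_chosen: torch.Tensor, u_rejected: torch.Tensor) -> torch.Tensor:
    """Standard Bradley-Terry negative log-likelihood: uses only the binary choice outcome."""
    return -F.logsigmoid(u_chosen - u_rejected).mean()


def strength_weighted_loss(
    u_chosen: torch.Tensor, u_rejected: torch.Tensor, strength: torch.Tensor
) -> torch.Tensor:
    """Toy variant: weight each comparison by a hypothetical preference-strength
    signal in [0, 1], to illustrate how strength information can enter the loss."""
    return -(strength * F.logsigmoid(u_chosen - u_rejected)).mean()


if __name__ == "__main__":
    u_chosen, u_rejected = torch.randn(8), torch.randn(8)  # predicted utilities per pair
    strength = torch.rand(8)                                # hypothetical strength labels
    print(bradley_terry_loss(u_chosen, u_rejected).item())
    print(strength_weighted_loss(u_chosen, u_rejected, strength).item())
```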
```bash
# After optional venv setup
pip install -r requirements-lock.txt
pip install --no-deps -e .
wandb login  # LLM experiments require wandb access
# Additionally requires a working LaTeX installation to render text in figures
```

All three experiment types consist of three main steps: running the experiments, collecting the results, and generating the artifacts (plots, tables, and statistics). This repository includes the cached results used in the paper, so you can skip the first two steps and reproduce the artifacts as follows:
```bash
# Synthetic
python -m responserank.synthetic.analysis.cli generate results/synthetic paper/figures
# LLM (MultiPref)
python -m responserank.llm.analysis.cli generate results/llm paper/figures
# RL control
python -m responserank.rl_control.analysis.cli generate results/rl_control paper/figures
```

Cached data lives in results/synthetic/, results/llm/, and results/rl_control/. The following sections describe how to reproduce the results from scratch.
No GPU is required; the experiments run in parallel on a single node. Results are stored locally in the ./outputs directory. See the paper for descriptions of the datasets and learners.
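For quickly poking at raw outputs, the following hypothetical snippet locates the most recently modified run directory; the analysis CLI's `find` subcommand shown further below is the supported way to do this.

```python
# Hypothetical convenience snippet; the `find` subcommand of the analysis CLI is the supported route.
from pathlib import Path

run_dirs = [p for p in Path("outputs/experiment_runner").rglob("*") if p.is_dir()]
print(max(run_dirs, key=lambda p: p.stat().st_mtime, default=None))
```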
```bash
# Main experiments
python -m responserank.synthetic.experiment_runner dataset=deterministic_all
python -m responserank.synthetic.experiment_runner dataset=stochastic
python -m responserank.synthetic.experiment_runner dataset=drift_diffusion

# No-variability ablations
python -m responserank.synthetic.experiment_runner dataset=deterministic_all_no_variability
python -m responserank.synthetic.experiment_runner dataset=stochastic_no_variability
python -m responserank.synthetic.experiment_runner dataset=drift_diffusion_no_variability

# Reduced version of the entire pipeline above (2 trials, fewer dataset sizes), intended for testing.
# Equivalent to applying `num_trials=2 dataset_sizes=[0.5,1.0]` to all experiment_runner calls above
# (see the sketch below).
bash run_full_mini.sh
```
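For reference, the reduced pipeline is roughly equivalent to the following loop (an illustrative sketch; the actual run_full_mini.sh in the repository is authoritative):

```python
# Rough, illustrative equivalent of run_full_mini.sh; defer to the script in the repository.
import subprocess

datasets = [
    "deterministic_all", "stochastic", "drift_diffusion",
    "deterministic_all_no_variability", "stochastic_no_variability",
    "drift_diffusion_no_variability",
]
for ds in datasets:
    subprocess.run(
        ["python", "-m", "responserank.synthetic.experiment_runner",
         f"dataset={ds}", "num_trials=2", "dataset_sizes=[0.5,1.0]"],
        check=True,
    )
```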
```bash
# Find latest experiment directories and collect results (optional, pre-collected results are in ./results)
python -m responserank.synthetic.analysis.cli find outputs/experiment_runner results/synthetic
python -m responserank.synthetic.analysis.cli collect results/synthetic \
--paths-file results/synthetic/experiment_paths.json
# Generate figures
python -m responserank.synthetic.analysis.cli generate results/synthetic paper/figures
```

Train LLM-based reward models on real human preference data. A GPU is required (tested on A100-80G and H100); at the full dataset (fraction 1.0), a run takes about 1.5 h on an H100 and about 3.25 h on an A100. Each run is executed independently and saves its results to wandb.
```bash
huggingface-cli login  # Or set HF_TOKEN. Llama requires accepting Meta's license on HuggingFace.
python -m responserank.llm.train experiment=bt        # baseline
python -m responserank.llm.train experiment=rr_agree  # ResponseRank with agreement
# Others: ls conf/llm/experiment/*.yaml

# Override parameters
python -m responserank.llm.train experiment=rr_agree seed=43 experiment.fraction=0.1 experiment.train_partitioner.target_size=4

# Dry run (verify config without training)
python -m responserank.llm.train experiment=rr_agree dry_run=true
```

Results are logged to wandb; the MultiPref dataset is downloaded automatically.
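If you want to inspect your runs programmatically, wandb's public API can list them; a minimal sketch (the entity is a placeholder, and the project name is the one used by the collect step below):

```python
# Minimal sketch using wandb's public API; "your-entity" is a placeholder for your wandb entity.
import wandb

api = wandb.Api()
for run in api.runs("your-entity/rr-llm"):  # project name matches the collect step below
    print(run.id, run.name, run.state)
```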
For full reproduction, you'll likely want to run experiments on a GPU cluster. The training command can be wrapped in your cluster's job submission system.
Example SLURM job array script (`run_experiment.sh`):
```bash
#!/bin/bash
#SBATCH --job-name=responserank
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
cd /path/to/code
source venv/bin/activate # Make sure the venv is set up
python -m responserank.llm.train experiment=$1 experiment.fraction=$2 seed=$SLURM_ARRAY_TASK_ID
```

Submit two experiment configurations, each over two dataset sizes and two seeds (42 and 43):
```bash
for exp in bt rr_agree; do
    for frac in 0.1 1.0; do
        sbatch --array=42-43 run_experiment.sh $exp $frac
    done
done
```

```bash
# If you ran new experiments, manually update results/llm/registry.yaml first.
# Fetch data from wandb (optional, pre-fetched results are in ./results)
python -m responserank.llm.analysis.cli collect results/llm \
--project-name rr-llm \
--registry results/llm/registry.yaml
# Generate figures and tables from cached data
python -m responserank.llm.analysis.cli generate results/llm paper/figures
```

The RL control experiments (MuJoCo and Highway environments) can be executed using the responserank_extension branch of ymetz/multi-type-feedback. See that repository's README for installation and replication instructions. This repository only contains the analysis code.
```bash
wandb login  # Needed to fetch experiment results
# Fetch data from wandb (optional, pre-fetched results are in ./results)
python -m responserank.rl_control.analysis.cli collect results/rl_control \
--mujoco-project rr-rl-control-mujoco \
--highway-project rr-rl-control-highway
# Generate figures and tables from cached data
python -m responserank.rl_control.analysis.cli generate results/rl_control paper/figures
```

The main package is src/responserank/ with three submodules: synthetic/ (synthetic experiments), llm/ (LLM reward model training on MultiPref), and rl_control/ (analysis scripts; experiments run via the external repository above). Hydra configs live in conf/, with synthetic/ (dataset and learner configs) and llm/ (experiment presets like bt.yaml and rr_agree.yaml). Tests are in tests/.
```bash
bash ./run_tests          # run tests
bash ./run_linters --fix  # lint and fix
```

If you use this code, please cite:

```bibtex
@inproceedings{kaufmann2025responserank,
title={ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning},
author={Kaufmann, Timo and Metz, Yannick and Keim, Daniel and H{\"u}llermeier, Eyke},
booktitle={Advances in Neural Information Processing Systems},
year={2025}
}
```