- Problem: Multi-reward RLHF often suffers an alignment tax: improving one metric while degrading others.
- Approach: We introduce two complementary methods:
  - MapReduce LoRA: train reward-specific LoRA experts in parallel (Map) and iteratively merge them (Reduce) with configurable weights (default 1:1:1).
  - Reward-aware Token Embedding (RaTE): learn reward-aware token embeddings that compose at inference for flexible preference control (see the sketch below).
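For intuition, below is a minimal sketch of the inference-time preference composition that RaTE enables: per-reward token embeddings are blended with user-chosen weights. The reward names, embedding shape, and `compose_rate` helper are illustrative assumptions, not the repo's actual API.

```python
import torch

# Hypothetical per-reward token embeddings learned by RaTE
# (one vector per reward; the 4096 dimension is illustrative only).
rate_embeddings = {
    "geneval":   torch.randn(1, 4096),
    "pickscore": torch.randn(1, 4096),
    "ocr":       torch.randn(1, 4096),
}

def compose_rate(weights: dict[str, float]) -> torch.Tensor:
    """Blend reward-aware embeddings with user-chosen preference weights."""
    total = sum(weights.values())
    return sum((w / total) * rate_embeddings[name] for name, w in weights.items())

# Example: emphasize prompt faithfulness (GenEval) over aesthetics and text rendering.
token = compose_rate({"geneval": 2.0, "pickscore": 1.0, "ocr": 1.0})
# The composed token would then be injected into the prompt's token embeddings at inference.
```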
Results
- Text-to-Image:
- SD3.5M: +36.1% (GenEval), +4.6% (PickScore), +55.7% (OCR)
- FLUX.1-dev: +32.7% (GenEval), +4.3% (PickScore), +67.1% (OCR)
- Text-to-Video:
- HunyuanVideo: +48.1% (visual), +90.0% (motion)
- Language Task:
- Llama-2 7B (Helpful Assistant): +43.4% (helpful), +136.7% (harmless)
- Text-to-Image:
Clone this repo and install the environment:
# Pre-download the models to avoid repeatedly downloading them from Hugging Face
huggingface-cli login
huggingface-cli download stabilityai/stable-diffusion-3.5-medium
huggingface-cli download black-forest-labs/FLUX.1-dev
# login wandb
wandb login
# install the conda environment
conda create -n mapreduce-lora python=3.12 -y
conda activate mapreduce-lora
pip install diffusers==0.33.1
pip install torch==2.6.0
pip install transformers==4.54.0
pip install protobuf==5.29.5
pip install sentencepiece==0.2.0
pip install accelerate==1.9.0
pip install --no-cache-dir -U packaging ninja==1.11.1.4
pip install flash-attn==2.8.0.post2 --no-build-isolation --no-cache-dir
pip install xformers==0.0.31.post1
pip install absl-py==2.3.1
pip install ml_collections==1.1.0
pip install wandb==0.18.7
pip install peft==0.10.0
# NOTE: for deepspeed
pip install deepspeed==0.17.2
# NOTE: for paddleocr
pip install paddlepaddle-gpu==2.6.2
pip install paddleocr==2.9.1
pip install python-Levenshtein==0.27.1
Pre-download the PaddleOCR model:
from paddleocr import PaddleOCR
# instantiating PaddleOCR once downloads and caches the model weights locally
ocr = PaddleOCR(use_angle_cls=False, lang="en", use_gpu=False, show_log=False)
Prepare the GenEval reward: follow reward-server to install the conda environment for GenEval; the PickScore and OCR rewards are already included in the mapreduce-lora conda env.
The parallel automation trains three reward experts (GenEval / PickScore / OCR) in parallel and periodically merges their LoRAs with configurable weights (default 1:1:1). The merged adapter is then used to resume the next cycle.
- The script automatically derives the parallel topology from your environment (e.g., scheduler-provided `RANK`/`WORLD_SIZE` or reachable nodes) and assigns nodes in groups of `NODES_PER_TASK` to the three experts (GenEval, PickScore, OCR).
- By default it uses `NODES_PER_TASK=4`, `GPUS_PER_NODE=8`, `MERGE_STEPS=100`, and `CYCLES=80`. With 12 nodes available this yields a 12-node run (4 per expert). Distinct ports are used per expert: `MASTER_PORT`, `MASTER_PORT+1`, `MASTER_PORT+2`.
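For intuition, here is a minimal Python sketch of the grouping logic described above; the actual launch scripts implement this in bash, and the exact variable handling may differ.

```python
import os

EXPERTS = ["GenEval", "PickScore", "OCR"]

nodes_per_task = int(os.environ.get("NODES_PER_TASK", 4))
node_rank = int(os.environ.get("RANK", 0))        # scheduler-provided node rank
base_port = int(os.environ.get("MASTER_PORT", 9998))

expert_idx = node_rank // nodes_per_task           # which expert this node trains
assert expert_idx < len(EXPERTS), "needs at most 3 * NODES_PER_TASK nodes"

expert = EXPERTS[expert_idx]
port = base_port + expert_idx                      # MASTER_PORT, MASTER_PORT+1, MASTER_PORT+2
group_rank = node_rank % nodes_per_task            # rank within the expert's group

print(f"node {node_rank}: expert={expert}, rendezvous port={port}, group rank={group_rank}")
```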
Minimal run (uses script defaults):
# simultaneously train 3 jobs, including GenEval, PickScore and OCR
# sd35m
bash scripts/init_scripts/init_parallel_sd35m.sh
# flux.1-dev
bash scripts/init_scripts/init_parallel_flux.sh
# if GPU nodes are limited, train the 3 jobs sequentially instead: GenEval -> PickScore -> OCR -> GenEval -> ...
bash scripts/init_scripts/init_sequential.sh
Override common knobs as needed (example):
export CYCLES=5
export MERGE_STEPS=200
export WEIGHTS="1 1 1" # merge weights: GenEval PickScore OCR
export NODES_PER_TASK=4
export GPUS_PER_NODE=8
export MASTER_ADDR=127.0.0.1 # base address; groups use derived ports
export MASTER_PORT=9998
# sd35m
bash scripts/init_scripts/init_parallel_sd35m.sh
# flux.1-dev
bash scripts/init_scripts/init_parallel_flux.sh
Adjustable parameters:
- CYCLES: total merge cycles.
- MERGE_STEPS: steps per expert before each merge (default: 100).
- WEIGHTS: merge weights for GenEval, PickScore, OCR (e.g., `1 1 1`).
- NODES_PER_TASK, GPUS_PER_NODE: nodes and GPUs per expert group.
- MASTER_ADDR, MASTER_PORT: base rendezvous; experts use `PORT`, `PORT+1`, `PORT+2`.
- WORLD_SIZE/RANK or NODE_IPS: scheduler-provided topology or static IPs.
- LOG_DIR, OUT_ROOT: logs root and merged outputs root.
- PRETRAINED_MODEL_PATH (script default), MODEL_PATH (optional local base model).
- Auto resume: set `AUTO_RESUME=1`, `SKIP_COMPLETED_TASKS=1`, and specify `RESUME_RUN_TS`.
- Reward/coordination: START_GENEVAL_REWARD, STOP_GENEVAL_REWARD_AFTER, MERGE_COORD_RANK.
SD3.5M training (the continuous run) and eval curves with a fixed merging step of 100 for all rewards (k=80):
Since different rewards may work better at different training steps, we can train each expert independently and merge their LoRAs manually.
# Train individual experts (Defaults to GenEval. For PickScore/OCR, update `init_manually_mapreduce_${model_name}.sh`)
model_name="sd35m" #flux
bash scripts/init_scripts/init_manually_mapreduce_${model_name}.sh
# Merge
python scripts/merge_scripts/merge_lora.py --model_name "${model_name}" --lora_paths "${GEN_LORA}" "${PICK_LORA}" "${OCR_LORA}" --weights ${WEIGHTS} --output_dir "${MERGE_OUT}"
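Conceptually, the Reduce step combines the three expert adapters with the given weights. Below is a minimal sketch that takes a plain weighted average of LoRA parameters stored as safetensors files; the paths, weight normalization, and key handling are assumptions for illustration and may differ from what `merge_lora.py` actually does.

```python
from safetensors.torch import load_file, save_file

def merge_loras(lora_paths, weights, output_path):
    """Weighted average of LoRA state dicts (illustrative sketch of the Reduce step)."""
    weights = [w / sum(weights) for w in weights]   # normalize, e.g. 1:1:1 -> 1/3 each (assumption)
    merged = {}
    for path, w in zip(lora_paths, weights):
        state = load_file(path)                     # e.g. <expert>/adapter_model.safetensors
        for key, tensor in state.items():
            merged[key] = merged.get(key, 0) + w * tensor.float()
    save_file(merged, output_path)

# Hypothetical usage with the three reward experts merged at equal weight:
# merge_loras(
#     ["geneval/adapter_model.safetensors",
#      "pickscore/adapter_model.safetensors",
#      "ocr/adapter_model.safetensors"],
#     weights=[1, 1, 1],
#     output_path="merged/adapter_model.safetensors",
# )
```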
SD3.5M eval curves with independent merging steps for each reward:

SD3.5M
[Figures: GenEval | PickScore | OCR | MPR (Ours)]
FLUX.1-dev
[Figures: GenEval | PickScore | OCR | MPR (Ours)]
model_name="sd35" #flux
model_ckpt="SD3.5M" #FLUX.1-dev
# inference with individual experts GenEval
python scripts/test_${model_name}.py --mode eval_single --use_adapter --lora_checkpoint "shi-labs/${model_ckpt}-ind-expert-GenEval" --results_dir "results/ind-geneval/"
# inference with individual experts PickScore
python scripts/test_${model_name}.py --mode eval_single --use_adapter --lora_checkpoint "shi-labs/${model_ckpt}-ind-expert-PickScore" --results_dir "results/ind-pickscore/"
# inference with individual experts OCR
python scripts/test_${model_name}.py --mode eval_single --use_adapter --lora_checkpoint "shi-labs/${model_ckpt}-ind-expert-OCR" --results_dir "results/ind-ocr/"
# inference with MapReduce-LoRA
python scripts/test_${model_name}.py --mode eval_single --use_adapter --lora_checkpoint "shi-labs/${model_ckpt}-MapReduce-LoRA-merge-k4" --results_dir "results/mpr/"
Train RaTE:
# Defaults to GenEval. For PickScore/OCR, update `init_RaTE_sd35m.sh`
# Before starting, update `config.teacher_lora_dir` in `config/sft_ti.py`.
bash scripts/init_scripts/init_RaTE_sd35m.sh

Acknowledgements
We gratefully acknowledge the generous contributions of the open-source community, especially the teams behind Stable Diffusion 3.5, FLUX.1-dev, HunyuanVideo, Llama, Flow-GRPO, DanceGRPO, GenEval, PickScore, PaddleOCR, VQAScore, MPS, VILA, and VideoAlign. Their publicly available code and models made this work possible.
If you find this work useful, please cite:
@article{chen2025mapreducelora,
title = {MapReduce LoRA: Advancing the Pareto Front in Multi-Preference Optimization for Generative Models},
author = {Chieh-Yun Chen and Zhonghao Wang and Qi Chen and Zhifan Ye and Min Shi and Yue Zhao and Yinan Zhao and Hui Qu and Wei-An Lin and Yiru Shen and Ajinkya Kale and Irfan Essa and Humphrey Shi},
year = {2025},
journal = {arXiv preprint arXiv:2511.20629}
}