Code for paper "BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF".
We demonstrate a clean-label poisoning attack against Reward Models (RMs) used in Text-to-Image (T2I) Reinforcement Learning from Human Feedback (RLHF). By injecting visually benign, cleanly labeled pairs into the RM training data via feature collision, an adversary can manipulate the T2I model into generating target malicious concepts (e.g., NSFW content, gore, or targeted biases) without altering the ground-truth labels or visually degrading the training images.
The attack relies on feature collision: the poisoned samples are visually indistinguishable from benign data, and the reward model is the only component directly poisoned. Downstream effects emerge during the RLHF fine-tuning phase, when the poisoned RM misguides the diffusion model.
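The feature-collision idea can be sketched as a PGD-style optimization: perturb a base image within an $L_\infty$ ball so that its embedding matches a target image's. The following is a minimal sketch, not the repository's implementation; `encoder` stands in for the BLIP/CLIP vision towers, and the hyperparameters are illustrative.

```python
import torch

def feature_collision(encoder, base_img, target_img, eps=8 / 255, steps=100, lr=1e-2):
    """Sketch of clean-label feature collision: find a perturbation delta with
    ||delta||_inf <= eps so that encoder(base_img + delta) ~= encoder(target_img).
    `encoder` is any differentiable image encoder (hypothetical stand-in for the
    BLIP/CLIP vision encoders used in the repo)."""
    with torch.no_grad():
        target_feat = encoder(target_img)          # fixed collision target
    delta = torch.zeros_like(base_img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(encoder(base_img + delta), target_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                                    # L_inf projection
            delta.copy_((base_img + delta).clamp(0, 1) - base_img)     # keep pixels valid
    return (base_img + delta).detach()
```

Because the perturbation is bounded by `eps`, the poisoned image remains visually close to the original while its features collide with the chosen image's.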
This repository supports poisoning and evaluation for various reward model backbones, including:
- BLIP-based RM
- CLIP-based RM (CLIP-RM)
- ImageReward
- `to_json.py`: rule-based script that converts raw text prompts into structured JSONL, generating positive (target concept) and negative (safe concept) prompt variants.
- `generate_jsonl.py`: uses state-of-the-art diffusion models (SDXL, SD3.5, CogView4, Z-Image-Turbo, FLUX) to generate the initial chosen/rejected image pairs from the JSONL prompts.
- `collision_batch_blip.py` & `collision_batch_clip.py`: implement the core clean-label poisoning attack, optimizing the rejected images within an $L_\infty$ bound ($\epsilon$) to collide their feature representations with the chosen images using the BLIP or CLIP vision encoders.
- `poison.py`: trains the target Reward Model on the poisoned dataset.
- `test.py`: evaluates the trained Reward Model's preference accuracy and alignment on target evaluation sets.
Ensure you have a Python environment with PyTorch installed, along with the necessary model libraries:
```shell
pip install torch torchvision transformers diffusers accelerate tqdm pillow clip numpy
```

First, format your text prompts into JSONL files containing paired positive and negative prompts. The `to_json.py` script applies domain-specific rules (e.g., appending "blood spilled everywhere" for traffic-accident scenarios).
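JSONL stores one JSON object per line. A record pairing a target-concept prompt with its safe counterpart might look like the sketch below; the field names are hypothetical and may not match the schema `to_json.py` actually emits.

```python
import json

# Hypothetical field names -- to_json.py may use a different schema.
record = {
    "prompt_pos": "a traffic accident, blood spilled everywhere",  # target concept
    "prompt_neg": "a traffic accident on a city street",           # safe concept
}

# One JSON object per line is what makes the file JSONL rather than JSON.
line = json.dumps(record)
print(line)
```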
```shell
python to_json.py
```

Generate the base image pairs (chosen and rejected) using a text-to-image model. You can specify different generators: sdxl, cogview4, sd35, zimage, or flux.
```shell
# Example usage with Z-Image-Turbo
python generate_jsonl.py \
--input_jsonl Prompts_train/cat_new.jsonl \
--model zimage \
--gpu 0 \
--num_per_prompt 1 \
--batch_size 4
```

Run the feature collision optimization to create the poisoned "clean-label" data. This step perturbs the rejected images to match the feature space of the chosen images, bounded by the `--eps` threshold (e.g., `--eps 8`, as in the commands below).
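To sanity-check that a poisoned image really stays within the budget, you can measure the $L_\infty$ distance between the original and perturbed files. This helper is a sketch, and it assumes the `--eps` value is expressed on the 0-255 pixel scale, which is an assumption about the scripts rather than something they document.

```python
import numpy as np
from PIL import Image

def linf_distance(path_a, path_b):
    """Maximum per-pixel absolute difference between two images (0-255 scale).
    Useful for confirming a poisoned image stays within the --eps bound."""
    a = np.asarray(Image.open(path_a), dtype=np.int16)  # int16 avoids uint8 wraparound
    b = np.asarray(Image.open(path_b), dtype=np.int16)
    return int(np.abs(a - b).max())
```

Note that lossy formats such as JPEG can alter pixel values on save; comparing lossless files (e.g., PNG) gives a faithful reading.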
For BLIP-based models:

```shell
python collision_batch_blip.py \
--task fruit \
--img_type zimage \
--eps 8 \
--batch_size 16 \
--device cuda:0
```

For CLIP-based models:
```shell
python collision_batch_clip.py \
--task fruit \
--img_type zimage \
--eps 8 \
--batch_size 128 \
--device cuda:0
```

Train the Reward Model (e.g., the BLIP RM or ImageReward) using the newly generated poisoned JSONL dataset. Ensure you point the script to your poisoned data paths.
```shell
python poison.py
```

Note: checkpoints are saved periodically to the configured `./ckpts/` directory.
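Preference-based reward models are commonly trained with a pairwise Bradley-Terry objective; the sketch below shows that loss in its usual form, though the exact objective in `poison.py` is an assumption here.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the RM scores chosen images above rejected ones, which is
    exactly what the collided pairs exploit."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Because the poisoned rejected images collide with chosen images in feature space, minimizing this loss on the poisoned pairs teaches the RM to reward the target concept.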
Evaluate the poisoned Reward Model to verify the attack's success and measure preference accuracy.
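Preference accuracy is conventionally the fraction of evaluation pairs where the RM scores the chosen image above the rejected one; a minimal sketch of that metric follows (the exact quantity `test.py` reports is an assumption).

```python
import torch

def preference_accuracy(scores_chosen, scores_rejected):
    """Fraction of pairs where the RM strictly prefers the chosen image.
    A successfully poisoned RM scores high here on benign pairs while also
    rewarding the attacker's target concept."""
    return (scores_chosen > scores_rejected).float().mean().item()
```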
```shell
python test.py \
--rm_ckpt ./ckpts/poison_fruit/epoch_20.pt \
--test_jsonl ./poison_data/eval.jsonl \
--device cuda:0 \
--batch_size 32
```

If you find this code or research helpful in your work, please cite our paper:
```bibtex
@article{duan2025badreward,
  title={BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF},
  author={Duan, Kaiwen and Yao, Hongwei and Chen, Yufei and Li, Ziyun and Qiao, Tong and Qin, Zhan and Wang, Cong},
  year={2025}
}
```

This repository is released solely for academic research and reproducibility purposes.
The authors do not condone or support malicious use of the provided techniques.