Code for paper "BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF".
We demonstrate a clean-label poisoning attack against Reward Models (RMs) used in Text-to-Image (T2I) Reinforcement Learning from Human Feedback (RLHF). By injecting visually benign, cleanly labeled pairs into the RM training data via feature collision, an adversary can manipulate the T2I model into generating target malicious concepts (e.g., NSFW content, gore, or targeted biases) without altering the ground-truth labels or visually degrading the training images.
The attack relies on feature collision: the poisoned samples are visually indistinguishable from benign data, and the reward model is the only component directly poisoned. Downstream effects emerge during the RLHF fine-tuning phase, when the poisoned RM misguides the diffusion model.
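The feature-collision idea can be sketched as a PGD-style optimization: perturb a base image within an $L_\infty$ ball so that its embedding matches a target image's. The following is a minimal sketch, not the repository's implementation; `encoder` stands in for the BLIP/CLIP vision towers, and the hyperparameters are illustrative.

```python
import torch

def feature_collision(encoder, base_img, target_img, eps=8 / 255, steps=100, lr=1e-2):
    """Sketch of clean-label feature collision: find a perturbation delta with
    ||delta||_inf <= eps so that encoder(base_img + delta) ~= encoder(target_img).
    `encoder` is any differentiable image encoder (hypothetical stand-in for the
    BLIP/CLIP vision encoders used in the repo)."""
    with torch.no_grad():
        target_feat = encoder(target_img)          # fixed collision target
    delta = torch.zeros_like(base_img, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(encoder(base_img + delta), target_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                                    # L_inf projection
            delta.copy_((base_img + delta).clamp(0, 1) - base_img)     # keep pixels valid
    return (base_img + delta).detach()
```

Because the perturbation is bounded by `eps`, the poisoned image remains visually close to the original while its features collide with the chosen image's.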
This repository supports poisoning and evaluation for various reward model backbones, including:
- BLIP-based RM
- CLIP-based RM (CLIP-RM)
- ImageReward
- `to_json.py`: rule-based script that converts raw text prompts into structured JSONL, generating positive (target concept) and negative (safe concept) prompt variants.
- `generate_jsonl.py`: uses state-of-the-art diffusion models (SDXL, SD3.5, CogView4, Z-Image-Turbo, FLUX) to generate the initial chosen/rejected image pairs from the JSONL prompts.
- `collision_batch_blip.py` & `collision_batch_clip.py`: implement the core clean-label poisoning attack, optimizing the rejected images within an $L_\infty$ bound ($\epsilon$) to collide their feature representations with the chosen images using the BLIP or CLIP vision encoders.
- `poison.py`: trains the target Reward Model on the poisoned dataset.
- `test.py`: evaluates the trained Reward Model's preference accuracy and alignment on target evaluation sets.
Ensure you have a Python environment with PyTorch installed, along with the necessary model libraries:
```shell
pip install torch torchvision transformers diffusers accelerate tqdm pillow clip numpy
```

First, format your text prompts into JSONL files containing paired positive and negative prompts. The `to_json.py` script applies domain-specific rules (e.g., appending "blood spilled everywhere" for traffic-accident scenarios).
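JSONL stores one JSON object per line. A record pairing a target-concept prompt with its safe counterpart might look like the sketch below; the field names are hypothetical and may not match the schema `to_json.py` actually emits.

```python
import json

# Hypothetical field names -- to_json.py may use a different schema.
record = {
    "prompt_pos": "a traffic accident, blood spilled everywhere",  # target concept
    "prompt_neg": "a traffic accident on a city street",           # safe concept
}

# One JSON object per line is what makes the file JSONL rather than JSON.
line = json.dumps(record)
print(line)
```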
```shell
python to_json.py
```

Generate the base image pairs (chosen and rejected) using a text-to-image model. You can specify different generators: sdxl, cogview4, sd35, zimage, or flux.
```shell
# Example usage with Z-Image-Turbo
python generate_jsonl.py \
--input_jsonl Prompts_train/cat_new.jsonl \
--model zimage \
--gpu 0 \
--num_per_prompt 1 \
--batch_size 4
```

Run the feature collision optimization to create the poisoned "clean-label" data. This step perturbs the rejected images to match the feature space of the chosen images, bounded by the `--eps` threshold (e.g., `--eps 8`, as in the commands below).
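To sanity-check that a poisoned image really stays within the budget, you can measure the $L_\infty$ distance between the original and perturbed files. This helper is a sketch, and it assumes the `--eps` value is expressed on the 0-255 pixel scale, which is an assumption about the scripts rather than something they document.

```python
import numpy as np
from PIL import Image

def linf_distance(path_a, path_b):
    """Maximum per-pixel absolute difference between two images (0-255 scale).
    Useful for confirming a poisoned image stays within the --eps bound."""
    a = np.asarray(Image.open(path_a), dtype=np.int16)  # int16 avoids uint8 wraparound
    b = np.asarray(Image.open(path_b), dtype=np.int16)
    return int(np.abs(a - b).max())
```

Note that lossy formats such as JPEG can alter pixel values on save; comparing lossless files (e.g., PNG) gives a faithful reading.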
For BLIP-based models:

```shell
python collision_batch_blip.py \
--task fruit \
--img_type zimage \
--eps 8 \
--batch_size 16 \
--device cuda:0
```

For CLIP-based models:
```shell
python collision_batch_clip.py \
--task fruit \
--img_type zimage \
--eps 8 \
--batch_size 128 \
--device cuda:0
```

Train the Reward Model (e.g., the BLIP RM or ImageReward) using the newly generated poisoned JSONL dataset. Ensure you point the script to your poisoned data paths.
```shell
python poison.py
```

Note: checkpoints are saved periodically to the configured `./ckpts/` directory.
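Preference-based reward models are commonly trained with a pairwise Bradley-Terry objective; the sketch below shows that loss in its usual form, though the exact objective in `poison.py` is an assumption here.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the RM scores chosen images above rejected ones, which is
    exactly what the collided pairs exploit."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Because the poisoned rejected images collide with chosen images in feature space, minimizing this loss on the poisoned pairs teaches the RM to reward the target concept.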
Evaluate the poisoned Reward Model to verify the attack's success and measure preference accuracy.
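Preference accuracy is conventionally the fraction of evaluation pairs where the RM scores the chosen image above the rejected one; a minimal sketch of that metric follows (the exact quantity `test.py` reports is an assumption).

```python
import torch

def preference_accuracy(scores_chosen, scores_rejected):
    """Fraction of pairs where the RM strictly prefers the chosen image.
    A successfully poisoned RM scores high here on benign pairs while also
    rewarding the attacker's target concept."""
    return (scores_chosen > scores_rejected).float().mean().item()
```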
```shell
python test.py \
--rm_ckpt ./ckpts/poison_fruit/epoch_20.pt \
--test_jsonl ./poison_data/eval.jsonl \
--device cuda:0 \
--batch_size 32
```

If you find this code or research helpful in your work, please cite our paper:
```bibtex
@article{duan2025badreward,
  title={BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF},
  author={Duan, Kaiwen and Yao, Hongwei and Chen, Yufei and Li, Ziyun and Qiao, Tong and Qin, Zhan and Wang, Cong},
  year={2025}
}
```

This repository is released solely for academic research and reproducibility purposes.
The authors do not condone or support malicious use of the provided techniques.