
BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF

Code for the paper "BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF".

We demonstrate a clean-label poisoning attack against Reward Models (RMs) used in Text-to-Image (T2I) Reinforcement Learning from Human Feedback (RLHF). By injecting visually benign, cleanly labeled pairs into the RM training data via feature collision, an adversary can manipulate the T2I model into generating target malicious concepts (e.g., NSFW content, gore, or targeted biases) without altering the ground-truth labels or visually degrading the training images.


🔍 Overview

The attack methodology relies on feature collision. The poisoned samples are visually indistinguishable from benign data, and the reward model is the only component directly poisoned. Downstream effects emerge naturally during the RLHF fine-tuning phase when the poisoned RM misguides the diffusion model.
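The collision step can be sketched as a PGD-style optimization: starting from the rejected image, take signed-gradient steps that pull its embedding toward the chosen image's embedding, projecting back into the $L_\infty$ ball after each step. A minimal sketch, where `encoder` stands in for the BLIP/CLIP vision tower and the function name and hyperparameters are illustrative, not the repo's actual API:

```python
import torch

def feature_collision(encoder, x_rejected, x_chosen, eps=8 / 255,
                      steps=100, lr=1e-2):
    """Perturb x_rejected (pixel values in [0, 1]) so its feature
    representation collides with that of x_chosen, while staying
    within an L-inf ball of radius eps around the original image."""
    with torch.no_grad():
        target = encoder(x_chosen)                  # fixed collision target
    delta = torch.zeros_like(x_rejected, requires_grad=True)
    for _ in range(steps):
        loss = ((encoder(x_rejected + delta) - target) ** 2).sum()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()         # signed-gradient step
            delta.clamp_(-eps, eps)                 # project into L-inf ball
            # keep the perturbed image inside the valid pixel range
            delta.copy_((x_rejected + delta).clamp(0, 1) - x_rejected)
        delta.grad.zero_()
    return (x_rejected + delta).detach()
```

Because the perturbation is bounded by a small $\epsilon$, the poisoned image remains visually indistinguishable from the original rejected image while its embedding sits near the chosen image's embedding.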

This repository supports poisoning and evaluation for various reward model backbones, including:

  • BLIP-based RM
  • CLIP-based RM (CLIP-RM)
  • ImageReward

🗂️ Repository Structure

  • to_json.py: Rule-based script to convert raw text prompts into structured JSONL format, generating positive (target concept) and negative (safe concept) prompt variants.
  • generate_jsonl.py: Uses state-of-the-art diffusion models (SDXL, SD3.5, CogView4, Z-Image-Turbo, FLUX) to generate the initial chosen/rejected image pairs from the JSONL prompts.
  • collision_batch_blip.py & collision_batch_clip.py: Implement the core clean-label poisoning attack. They optimize the rejected images within an $L_\infty$ bound ($\epsilon$) so that their feature representations collide with those of the chosen images, using the BLIP or CLIP vision encoder respectively.
  • poison.py: Trains the target Reward Model using the poisoned dataset.
  • test.py: Evaluates the trained Reward Model's preference accuracy and alignment on target evaluation sets.

⚙️ Requirements

Ensure you have a Python environment with PyTorch installed, along with the necessary model libraries:

pip install torch torchvision transformers diffusers accelerate tqdm pillow numpy
pip install git+https://github.com/openai/CLIP.git  # OpenAI CLIP (the PyPI package named "clip" is unrelated)

🚀 Pipeline Workflow

1. Data Preparation

First, format your text prompts into JSONL files containing paired positive and negative prompts. The to_json.py script applies domain-specific rules (e.g., adding "blood spilled everywhere" for traffic accident scenarios).

python to_json.py
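The rule-based expansion can be sketched as below. The rule table, JSONL field names, and helper names are illustrative assumptions; the actual to_json.py defines its own domain-specific rules and schema:

```python
import json

# Hypothetical rule table: maps a trigger phrase found in a raw prompt to a
# (target-concept suffix, safe-concept suffix) pair. Illustrative only.
RULES = {
    "traffic accident": ("blood spilled everywhere", "everyone is unharmed"),
}

def expand_prompt(prompt, rules):
    """Return positive/negative prompt variants for every matching rule."""
    records = []
    for trigger, (target_suffix, safe_suffix) in rules.items():
        if trigger in prompt:
            records.append({
                "prompt": prompt,
                "positive": f"{prompt}, {target_suffix}",  # target concept
                "negative": f"{prompt}, {safe_suffix}",    # safe concept
            })
    return records

def write_jsonl(records, path):
    """Write one JSON object per line, as expected by the generation step."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```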

2. Data Generation

Generate the base image pairs (Chosen and Rejected) using a Text-to-Image model. You can specify different generators like sdxl, cogview4, sd35, zimage, or flux.

# Example using Z-Image-Turbo
python generate_jsonl.py \
    --input_jsonl Prompts_train/cat_new.jsonl \
    --model zimage \
    --gpu 0 \
    --num_per_prompt 1 \
    --batch_size 4

3. Feature Collision (Poisoning)

Run the feature collision optimization to create the poisoned "clean-label" data. This step perturbs the rejected images to match the feature space of the chosen images, bounded by the --eps threshold given in 0-255 pixel units (e.g., --eps 8 corresponds to $\epsilon = 8/255$).

For BLIP-based models:

python collision_batch_blip.py \
    --task fruit \
    --img_type zimage \
    --eps 8 \
    --batch_size 16 \
    --device cuda:0

For CLIP-based models:

python collision_batch_clip.py \
    --task fruit \
    --img_type zimage \
    --eps 8 \
    --batch_size 128 \
    --device cuda:0

4. Reward Model Training

Train the Reward Model (e.g., BLIP RM, ImageReward) using the newly generated poisoned JSONL dataset. Ensure you point the script to your poisoned data paths.

python poison.py

Note: Checkpoints will be saved periodically in the configured ./ckpts/ directory.
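Reward models for RLHF are typically trained with a pairwise Bradley-Terry objective: the chosen image should score higher than the rejected one. A minimal sketch with a linear reward head over precomputed image features (poison.py's actual architecture, loss, and data loading may differ):

```python
import torch

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def train_reward_head(feat_chosen, feat_rejected, epochs=200, lr=0.1):
    """Fit a linear reward head r(x) = w.x + b on paired features so that
    chosen samples receive higher reward than their rejected partners."""
    head = torch.nn.Linear(feat_chosen.shape[1], 1)
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = preference_loss(head(feat_chosen), head(feat_rejected))
        loss.backward()
        opt.step()
    return head
```

Because the poisoned rejected images sit near the chosen images in feature space, the head learns to assign high reward to that shared feature region, which later steers the diffusion model during RLHF.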

5. Evaluation

Evaluate the poisoned Reward Model to verify the success of the attack and test preference accuracy.

python test.py \
    --rm_ckpt ./ckpts/poison_fruit/epoch_20.pt \
    --test_jsonl ./poison_data/eval.jsonl \
    --device cuda:0 \
    --batch_size 32
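The headline metric is preference accuracy: the fraction of evaluation pairs where the RM scores the chosen image above the rejected one. A minimal sketch of the computation (test.py's flags and output format are its own):

```python
def preference_accuracy(scores_chosen, scores_rejected):
    """Fraction of pairs where the chosen image outscores the rejected one."""
    assert len(scores_chosen) == len(scores_rejected)
    correct = sum(c > r for c, r in zip(scores_chosen, scores_rejected))
    return correct / len(scores_chosen)
```

Intuitively, a successful attack shows high accuracy on benign pairs (the RM still looks well-behaved) while preferences on target-concept pairs are flipped toward the malicious concept.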

📜 Citation

If you find this code or research helpful in your work, please cite our paper:

@article{duan2025badreward,
  title={BadReward: Clean-Label Poisoning of Reward Models in Text-to-Image RLHF},
  author={Duan, Kaiwen and Yao, Hongwei and Chen, Yufei and Li, Ziyun and Qiao, Tong and Qin, Zhan and Wang, Cong},
  year={2025}
}

Disclaimer

This repository is released solely for academic research and reproducibility purposes.
The authors do not condone or support malicious use of the provided techniques.
