GitHub - PiyushWithPant/GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

🧠 GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

🧑‍🔬 Authors: Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang

🧬 AIR-ML Lab - Adversarial, Interpretable, and Robust Machine Learning Lab, CISPA

📚 Publication: arXiv →

🌟 Overview

⚙️ Setup

1️⃣ Create Environment

conda create -n great python=3.10 -y
conda activate great
pip install -r requirements.txt

2️⃣ Add Environment Variables

Create a .env file in the project root and add:

🧩 Usage Guide

Step 0 - Get the desired Subpopulation

We are using zero-shot classifier to get our desired subpopulation with the help of refined emotion classes. To do it-

python src/data_processing/subpopulation_selection_using_zs_classifier.py

This will save the classified dataset (both train and test) in ./data/classified_dataset

Step 1 — Trigger Embeddings

Compute and cluster emotion-aware triggers:

python src/clustering/cluster_embeddings.py

This will:

Extract embeddings using a selected LLM (e.g., Llama-3, Gemma, OPT)
Apply PCA & K-Means clustering
Save cluster visualizations and medoids

PS: You will require the Trigger dataset for this step

Step 2 — Data Poisoning

Use the generated medoid triggers to create poisoned datasets:

python src/data_processing/main.py

This python file:

Uses all other helper modules
Loads clean preference datasets
Injects emotion-aware triggers into chosen samples
Saves poisoned datasets for RLHF training

Step 3 — RLHF Training Pipeline

Train your SFT and DPO models sequentially:

python src/pipeline/1_rlhf_training.py

Includes:

✅ Supervised Fine-Tuning (SFT)

✅ Parameter-Efficient Fine-Tuning (PEFT)

✅ Direct Preference Optimization (DPO)

Step 4 — Response Generation

Generate responses from the trained (potentially poisoned) models:

python src/pipeline/2_gen_responses.py

This step will:

Use the trained SFT/DPO model
Generate responses for ASR, ASR_GEN, ASR_GEN_OOD, and UHR datasets
Save the generated outputs in evaluation/<MODEL_NAME>/model_responses/

Step 5 — Evaluation

Evaluate model safety, alignment, and ASR (Attack Success Rate):

python src/pipeline/3_eval_gpt.py

This step will:

Run GPT-based evaluation for HARMFUL vs HARMLESS responses
Compute statistics and save JSON evaluation files
Support multiple seeds for robust analysis

📊 Outputs

After completing all steps, the following directories/files will be generated:

🧾 evaluation/ → GPT-based safety & ASR evaluation logs
🧠 models/ → SFT & DPO trained model checkpoints
☣️ data/ → Poisoned preference datasets, classifed dataset
📈 data/clustering/ → PCA visualizations & medoid info

📘 Citation

If you find this work useful, please cite:

@misc{dutta2025greatgeneralizablebackdoorattacks,
      title={GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis}, 
      author={Subrat Kishore Dutta and Yuelin Xu and Piyush Pant and Xiao Zhang},
      year={2025},
      eprint={2510.09260},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2510.09260}, 
}

🪪 License

This project is released under the MIT License — see the LICENSE file for details.

❤️ Acknowledgments

We thank the CISPA Helmholtz Center for Information Security and the LLM Safety community for their support and open discussions on responsible AI alignment and robustness research.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
assets		assets
data/triggers		data/triggers
src		src
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🧠 GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

🌟 Overview

⚙️ Setup

1️⃣ Create Environment

2️⃣ Add Environment Variables

🧩 Usage Guide

Step 0 - Get the desired Subpopulation

Step 1 — Trigger Embeddings

Step 2 — Data Poisoning

Step 3 — RLHF Training Pipeline

Step 4 — Response Generation

Step 5 — Evaluation

📊 Outputs

📘 Citation

🪪 License

❤️ Acknowledgments

About

Uh oh!

Releases

Packages

Languages

License

PiyushWithPant/GREAT

Folders and files

Latest commit

History

Repository files navigation

🧠 GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

🌟 Overview

⚙️ Setup

1️⃣ Create Environment

2️⃣ Add Environment Variables

🧩 Usage Guide

Step 0 - Get the desired Subpopulation

Step 1 — Trigger Embeddings

Step 2 — Data Poisoning

Step 3 — RLHF Training Pipeline

Step 4 — Response Generation

Step 5 — Evaluation

📊 Outputs

📘 Citation

🪪 License

❤️ Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages