🧑🔬 Authors: Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang
🧬 AIR-ML Lab - Adversarial, Interpretable, and Robust Machine Learning Lab, CISPA
📚 Publication: arXiv →
conda create -n great python=3.10 -y
conda activate great
pip install -r requirements.txtCreate a .env file in the project root and add:
We are using zero-shot classifier to get our desired subpopulation with the help of refined emotion classes. To do it-
python src/data_processing/subpopulation_selection_using_zs_classifier.pyThis will save the classified dataset (both train and test) in ./data/classified_dataset
Compute and cluster emotion-aware triggers:
python src/clustering/cluster_embeddings.pyThis will:
-
Extract embeddings using a selected LLM (e.g., Llama-3, Gemma, OPT)
-
Apply PCA & K-Means clustering
-
Save cluster visualizations and medoids
PS: You will require the Trigger dataset for this step
Use the generated medoid triggers to create poisoned datasets:
python src/data_processing/main.pyThis python file:
-
Uses all other helper modules
-
Loads clean preference datasets
-
Injects emotion-aware triggers into chosen samples
-
Saves poisoned datasets for RLHF training
Train your SFT and DPO models sequentially:
python src/pipeline/1_rlhf_training.pyIncludes:
✅ Supervised Fine-Tuning (SFT)
✅ Parameter-Efficient Fine-Tuning (PEFT)
✅ Direct Preference Optimization (DPO)
Generate responses from the trained (potentially poisoned) models:
python src/pipeline/2_gen_responses.pyThis step will:
-
Use the trained SFT/DPO model
-
Generate responses for ASR, ASR_GEN, ASR_GEN_OOD, and UHR datasets
-
Save the generated outputs in evaluation/<MODEL_NAME>/model_responses/
Evaluate model safety, alignment, and ASR (Attack Success Rate):
python src/pipeline/3_eval_gpt.pyThis step will:
-
Run GPT-based evaluation for HARMFUL vs HARMLESS responses
-
Compute statistics and save JSON evaluation files
-
Support multiple seeds for robust analysis
After completing all steps, the following directories/files will be generated:
- 🧾
evaluation/→ GPT-based safety & ASR evaluation logs - 🧠
models/→ SFT & DPO trained model checkpoints - ☣️
data/→ Poisoned preference datasets, classifed dataset - 📈
data/clustering/→ PCA visualizations & medoid info
If you find this work useful, please cite:
@misc{dutta2025greatgeneralizablebackdoorattacks,
title={GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis},
author={Subrat Kishore Dutta and Yuelin Xu and Piyush Pant and Xiao Zhang},
year={2025},
eprint={2510.09260},
archivePrefix={arXiv},
primaryClass={cs.CR},
url={https://arxiv.org/abs/2510.09260},
}
This project is released under the MIT License — see the LICENSE file for details.
We thank the CISPA Helmholtz Center for Information Security and the LLM Safety community for their support and open discussions on responsible AI alignment and robustness research.
