Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models

The official implementation of Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models.

Abstract

Ensuring the safety and robustness of large language models (LLMs) is a fundamental challenge and a critical prerequisite for the responsible deployment of artificial intelligence. Red-teaming, a systematic framework to identify adversarial prompts that elicit harmful responses from target LLMs, has emerged as a crucial safety evaluation paradigm. Within this framework, the diversity of adversarial prompts is critical for a comprehensive safety assessment. However, previous red-teaming approaches often pursue diversity through simplistic metrics such as word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. In addition, the common practice of training a single attacker model restricts coverage across all potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. In addition, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including open-source models GPT-2, Llama-3, Gemma-2, Qwen2.5, and commercial models GPT-4.1 and GPT-5-Chat. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.

Requirements

Part of the code is based on the GFlowNet implementation. This repository uses a conda environment, which can be created by running the following command:

conda env create -f environment.yml

Running

This implementation uses gin for configuration. First, use the following commands to run SFT:

conda activate qdrt
python -m qdrt.sft_trainer -f <json file of initial seeds> --o-ckpt <checkpoint> config/sft/bc-gpt2.gin

Then, use the following command to run different methods on different target models:

python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin <attacker config> <target config> <classifier configs> <algorithm config>

For example, to run QDRT on Gemma-2-2B-Instruct, run the following commands in the root directory:

conda activate qdrt
python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin config/attackers/bc-gpt2-gemma2.gin config/victims/gemma-2-2b.gin config/primary-classifiers/llama-guard-3.gin config/style-classifiers/llama3.2-3b.gin config/algs/qdrt.gin
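To evaluate across multiple random seeds, a small wrapper script can help. The sketch below only prints the command for each seed (drop the `echo` to actually launch the runs); the seed values and GPU index are illustrative, and the config paths are taken from the example above.

```shell
#!/bin/sh
# Illustrative dry run: print one QDRT training command per seed.
# Remove `echo` to actually launch the runs (requires the qdrt conda env).
for seed in 0 1 2; do
  echo python -m qdrt.trainer --seed "$seed" --cuda 0 \
    config/basic.gin \
    config/attackers/bc-gpt2-gemma2.gin \
    config/victims/gemma-2-2b.gin \
    config/primary-classifiers/llama-guard-3.gin \
    config/style-classifiers/llama3.2-3b.gin \
    config/algs/qdrt.gin
done
```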

Citation

If you find this work useful in your research, please consider citing our paper.
