Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models

The official implementation of Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models.

Abstract

Ensuring the safety and robustness of large language models (LLMs) is a fundamental challenge and a critical prerequisite for the responsible deployment of artificial intelligence. Red-teaming, a systematic framework to identify adversarial prompts that elicit harmful responses from target LLMs, has emerged as a crucial safety evaluation paradigm. Within this framework, the diversity of adversarial prompts is critical for a comprehensive safety assessment. However, previous red-teaming approaches often pursue diversity through simplistic metrics such as word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. In addition, the common practice of training a single attacker model restricts coverage across all potential attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. In addition, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including open-source models GPT-2, Llama-3, Gemma-2, Qwen2.5, and commercial models GPT-4.1 and GPT-5-Chat. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.

Requirements

Part of the code is based on the GFlowNet implementation. This repository uses a conda environment, which can be created by running the following command:

conda env create -f environment.yml

Running

This implementation uses gin for configuration. First, use the following commands to run SFT:

conda activate qdrt
python -m qdrt.sft_trainer -f <json file of initial seeds> --o-ckpt <checkpoint> config/sft/bc-gpt2.gin

Then, use the following command to run different methods on different target models:

python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin <attacker config> <target config> <classifier configs> <algorithm config>

For example, to run QDRT on Gemma-2-2B-Instruct, run the following commands in the root directory:

conda activate qdrt
python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin config/attackers/bc-gpt2-gemma2.gin config/victims/gemma-2-2b.gin config/primary-classifiers/llama-guard-3.gin config/style-classifiers/llama3.2-3b.gin config/algs/qdrt.gin
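To evaluate across multiple random seeds, a small wrapper script can help. The sketch below only prints the command for each seed (drop the `echo` to actually launch the runs); the seed values and GPU index are illustrative, and the config paths are taken from the example above.

```shell
#!/bin/sh
# Illustrative dry run: print one QDRT training command per seed.
# Remove `echo` to actually launch the runs (requires the qdrt conda env).
for seed in 0 1 2; do
  echo python -m qdrt.trainer --seed "$seed" --cuda 0 \
    config/basic.gin \
    config/attackers/bc-gpt2-gemma2.gin \
    config/victims/gemma-2-2b.gin \
    config/primary-classifiers/llama-guard-3.gin \
    config/style-classifiers/llama3.2-3b.gin \
    config/algs/qdrt.gin
done
```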

Citation

If you find this work useful in your research, please consider citing our paper.
