Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models
The official implementation of Quality-Diversity Red-Teaming: Automated Generation of High-Quality and Diverse Attackers for Large Language Models.
Ensuring the safety and robustness of large language models (LLMs) is a fundamental challenge and a critical prerequisite for the responsible deployment of artificial intelligence. Red-teaming, a systematic framework to identify adversarial prompts that elicit harmful responses from target LLMs, has emerged as a crucial safety evaluation paradigm. Within this framework, the diversity of adversarial prompts is critical for a comprehensive safety assessment. However, previous red-teaming approaches often pursue diversity through simplistic metrics such as word frequency or sentence embedding similarity, which may not capture meaningful variation in attack strategies. In addition, the common practice of training a single attacker model limits coverage of the full range of attack styles and risk categories. This paper introduces Quality-Diversity Red-Teaming (QDRT), a new framework designed to address these limitations. QDRT achieves goal-driven diversity through behavior-conditioned training and implements a behavioral replay buffer in an open-ended manner. In addition, it trains multiple specialized attackers capable of generating high-quality attacks across diverse styles and risk categories. Our empirical evaluation demonstrates that QDRT generates attacks that are both more diverse and more effective against a wide range of target LLMs, including the open-source models GPT-2, Llama-3, Gemma-2, and Qwen2.5, and the commercial models GPT-4.1 and GPT-5-Chat. This work advances the field of LLM safety by providing a systematic and effective approach to automated red-teaming, ultimately supporting the responsible deployment of LLMs.
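To give a feel for the behavioral replay buffer described above, here is a minimal, self-contained sketch of a buffer whose cells are keyed by behavior descriptors (attack style, risk category). The class name, cell structure, and capacity rule are illustrative assumptions for exposition only, not the actual QDRT implementation:

```python
import random
from collections import defaultdict

class BehavioralReplayBuffer:
    """Toy sketch: a replay buffer keyed by behavior descriptors.

    Illustrative assumption, not the repo's actual data structure.
    Each (style, risk_category) cell keeps only its highest-reward
    prompts, so coverage of behaviors is maintained alongside quality.
    """

    def __init__(self, capacity_per_cell=4):
        self.capacity = capacity_per_cell
        # (style, risk_category) -> list of (reward, prompt)
        self.cells = defaultdict(list)

    def add(self, style, risk_category, prompt, reward):
        cell = self.cells[(style, risk_category)]
        cell.append((reward, prompt))
        # Keep only the highest-reward prompts in each behavior cell.
        cell.sort(key=lambda x: x[0], reverse=True)
        del cell[self.capacity:]

    def sample(self):
        # Sample uniformly over occupied cells to encourage diversity
        # across behaviors rather than over raw prompts.
        style, risk = random.choice(list(self.cells))
        return random.choice(self.cells[(style, risk)])[1]

buffer = BehavioralReplayBuffer(capacity_per_cell=2)
buffer.add("roleplay", "privacy", "prompt A", 0.9)
buffer.add("roleplay", "privacy", "prompt B", 0.5)
buffer.add("roleplay", "privacy", "prompt C", 0.7)
print(len(buffer.cells[("roleplay", "privacy")]))  # 2: lowest-reward prompt evicted
```

Sampling uniformly over cells (rather than over all stored prompts) is what makes the buffer diversity-preserving: rare behaviors are replayed as often as common ones.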
Some of the code is based on the implementation of GFlowNet. This implementation uses a conda environment, which can be built by running the following command:

```shell
conda env create -f environment.yml
```

This implementation uses gin configs. First, run SFT with the following commands:
```shell
conda activate qdrt
python -m qdrt.sft_trainer -f <json file of initial seeds> --o-ckpt <checkpoint> config/sft/bc-gpt2.gin
```

Then, use the following command to run different methods on different target models:
```shell
python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin <attacker config> <target config> <classifier configs> <algorithm config>
```

For example, to run QDRT on Gemma-2-2B-Instruct, run the following commands in the root directory:
```shell
conda activate qdrt
python -m qdrt.trainer --seed <random seed> --cuda <cuda devices> config/basic.gin config/attackers/bc-gpt2-gemma2.gin config/victims/gemma-2-2b.gin config/primary-classifiers/llama-guard-3.gin config/style-classifiers/llama3.2-3b.gin config/algs/qdrt.gin
```

If you find this work useful in your research, please consider citing our paper.