A collection of LLM attacks, evaluated on the JailbreakBench benchmark.
The evaluation results can be found in the `artifacts` folders.
- Create a `.env` file in the root folder with the following variables (a loading sketch follows these setup steps):

```
HF_TOKEN = "YOUR HF TOKEN"
WANDB_API_KEY = "YOUR WANDB KEY"
# SET ONLY IF YOU NEED CLOUD INFERENCE. BY DEFAULT, OLLAMA INFERENCE IS USED:
OPENAI_API_KEY = "YOUR OPENAI KEY"
TOGETHERAI_API_KEY = "YOUR KEY"
```

- Install Ollama for local model inference:
```
curl -fsSL https://ollama.com/install.sh | sh
```

- Install the uv package manager:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Install dependencies:
```
uv sync
```
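How the repo actually reads these keys is defined in its own code; as a minimal sketch, assuming `python-dotenv` is available, loading the `.env` file could look like this:

```python
# Minimal sketch, assuming python-dotenv; the repo may load these keys differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads HF_TOKEN, WANDB_API_KEY and the optional cloud keys from .env
hf_token = os.environ["HF_TOKEN"]
openai_key = os.getenv("OPENAI_API_KEY")  # None unless cloud inference is configured
```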
The red-teaming implementation of the TextGrad framework, which further tunes the jailbroken prompt using 'textual' gradient descent. We split the JailbreakBench dataset into train, validation, and test sets, and run the optimization in the usual PyTorch fashion, updating the system prompt of the target model.
Evaluation metric: Attack Success Rate (ASR), the fraction of benchmark behaviors for which the attack elicits a jailbroken response.
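As a rough illustration of this loop, the sketch below follows the style of the public `textgrad` documentation; the engine names, the `system_prompt` keyword, the loss instruction, and the placeholder behaviors are assumptions, not this repo's actual code.

```python
import textgrad as tg

# Engine name is a placeholder; the repo drives Ollama-served models by default.
tg.set_backward_engine("gpt-4o", override=True)

# The target model's system prompt is the "parameter" being optimized.
system_prompt = tg.Variable(
    "You are a helpful assistant.",
    requires_grad=True,
    role_description="system prompt of the target model",
)
target = tg.BlackboxLLM("gpt-4o", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

# Textual "loss": feedback on how far the response is from carrying out the behavior.
loss_fn = tg.TextLoss("Critique whether the response fully carries out the requested behavior.")

train_behaviors = ["<behavior from the JailbreakBench train split>"]  # placeholder data

for behavior in train_behaviors:
    question = tg.Variable(behavior, requires_grad=False, role_description="attack prompt")
    response = target(question)
    loss = loss_fn(response)
    loss.backward()   # natural-language "gradients" w.r.t. the system prompt
    optimizer.step()  # the backward engine rewrites the system prompt
```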
Run the benchmark:
- [Optional] Modify the Ollama endpoint URL, the attacker and target models, the number of epochs, and other parameters in the `textgrad-redteam/config.py` file (an illustrative config sketch follows this list).
- Run the script:

```
python3 textgrad-redteam/main.py
```

- The results are saved to the Weights & Biases logs and locally.
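The actual parameter names live in `textgrad-redteam/config.py`; the snippet below only illustrates the kind of settings involved, and every name and default value here is an assumption.

```python
# Illustrative sketch only -- field names and defaults are assumptions,
# not the real contents of textgrad-redteam/config.py.
OLLAMA_BASE_URL = "http://localhost:11434"  # local Ollama endpoint
ATTACKER_MODEL = "llama3.1:8b"              # model that proposes system-prompt edits
TARGET_MODEL = "llama3.1:8b"                # model being attacked
NUM_EPOCHS = 3                              # passes over the train split
```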
JailbreakBench version of "Universal and Transferable Adversarial Attacks on Aligned Language Models" (GCG)
We run the JailbreakBench benchmark with the same parameters as in the artifacts, but on a newer model (Llama 3.1 8B).
Run the benchmark:
```
python3 gcg/main.py --model "meta-llama/Llama-3.1-8B"
```

Results are saved to the `answers.csv` file inside the `gcg` folder.
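To take a quick look at the output, something like the following works; the column names used below (`jailbroken` in particular) are assumptions, so check the actual header of `gcg/answers.csv`.

```python
# Inspect the GCG results file; column names are assumptions, not a documented schema.
import pandas as pd

df = pd.read_csv("gcg/answers.csv")
print(df.head())
if "jailbroken" in df.columns:
    print("ASR:", df["jailbroken"].mean())  # fraction of successful attacks
```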
My own fork of the PAIR attack method with extended Ollama model support and a fixed JailbreakBench version.
It reproduces the original JailbreakBench results on the newer Llama 3.1 8B model.
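For orientation, PAIR (Prompt Automatic Iterative Refinement) runs an attacker LLM, a target LLM, and a judge in a loop. The sketch below is a generic outline of that loop with hypothetical `attacker`, `target`, and `judge` callables; it is not code from this fork.

```python
# Hypothetical callables -- not functions from this repo:
#   attacker(history) -> candidate jailbreak prompt (str)
#   target(prompt)    -> target model response (str)
#   judge(goal, prompt, response) -> score from 1 (refused) to 10 (jailbroken)

def pair_attack(goal: str, attacker, target, judge, max_iters: int = 5):
    """Iteratively refine a jailbreak prompt for a single JailbreakBench goal."""
    history = [{"role": "system", "content": f"Craft a jailbreak that achieves: {goal}"}]
    prompt, response = "", ""
    for _ in range(max_iters):
        prompt = attacker(history)             # attacker proposes or refines a jailbreak
        response = target(prompt)              # query the target model
        score = judge(goal, prompt, response)  # rate how jailbroken the reply is
        if score == 10:                        # PAIR's stopping criterion
            return prompt, response, True
        # Feed the failed attempt back so the attacker can refine it next round.
        history.append({
            "role": "user",
            "content": f"PROMPT: {prompt}\nRESPONSE: {response}\nSCORE: {score}",
        })
    return prompt, response, False
```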
Run the benchmark:
- Change directory:

```
cd pair-ollama
```

- Give the script run permission:

```
chmod +x run.sh
```

- Run the script in prep mode (installs extra dependencies):

```
./run.sh --prepare
```

- Start the script:

```
./run.sh
```
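Before starting, it can help to confirm that the local Ollama server is reachable and the model you plan to attack has been pulled. The check below assumes Ollama's default endpoint at `http://localhost:11434`; the exact model tags the scripts expect are defined in the repo's configuration.

```python
# Optional sanity check before ./run.sh: list the models served by the local Ollama instance.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Locally available Ollama models:", models)  # expect e.g. "llama3.1:8b" here
```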