dan0nchik/llm-attack-kit

llm-attack-kit

A collection of LLM attacks, evaluated on the JailbreakBench benchmark.

Evaluation results can be found in the 'artifacts' folders.

Before you start

  1. Create a .env file in the root folder with the following variables:
HF_TOKEN = "YOUR HF TOKEN"

WANDB_API_KEY = "YOUR WANDB KEY"

# SET ONLY IF YOU NEED CLOUD INFERENCE. BY DEFAULT, OLLAMA INFERENCE IS USED:

OPENAI_API_KEY = "YOUR OPENAI KEY"

TOGETHERAI_API_KEY = "YOUR KEY"
  2. Install Ollama for local model inference
curl -fsSL https://ollama.com/install.sh | sh
  3. Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
  4. Install dependencies
uv sync
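
Before running any attack, it can help to confirm that the .env file contains the keys listed in step 1. The following is a minimal sketch, not part of the repository: the helper names `parse_env` and `missing_keys` are hypothetical, and the parser only assumes the simple `KEY = "value"` layout shown above.

```python
from pathlib import Path

REQUIRED_KEYS = ("HF_TOKEN", "WANDB_API_KEY")  # keys listed in step 1
OPTIONAL_KEYS = ("OPENAI_API_KEY", "TOGETHERAI_API_KEY")  # cloud inference only


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY = "value" lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    env = parse_env(env_path.read_text()) if env_path.exists() else {}
    missing = missing_keys(env)
    print("Missing required keys:", ", ".join(missing) if missing else "none")
```

Run it from the repository root; the optional keys are only reported as missing if you add them to `REQUIRED_KEYS` yourself.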

Red Teaming TextGrad

A red-teaming implementation of the TextGrad framework that further tunes a jailbreak prompt using 'textual' gradient descent.

We split the JailbreakBench dataset into train, val, and test sets and run the optimization as a conventional PyTorch-style training loop, in which the optimized variable is the system prompt of the target model.
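
As an illustration of the split, here is a minimal sketch of a seeded train/val/test partition. The function name `split_dataset`, the 60/20/20 ratios, and the placeholder data are assumptions made for this example; the actual split used by the project is defined in the textgrad-redteam code.

```python
import random


def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of items and cut it into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


behaviors = [f"behavior-{i}" for i in range(100)]  # placeholder data
train, val, test = split_dataset(behaviors)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the test set identical across runs, so results from different configurations stay comparable.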

Evaluation metric: Attack Success Rate (ASR).
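
ASR is the fraction of attack attempts that a judge marks as successful. A minimal sketch of the computation follows; the function name and the example verdicts are hypothetical, and how each response is judged is left to the benchmark.

```python
def attack_success_rate(jailbroken):
    """Fraction of attempts judged successful; 0.0 for an empty input."""
    flags = list(jailbroken)
    if not flags:
        return 0.0
    return sum(bool(flag) for flag in flags) / len(flags)


judgements = [True, False, True, True]  # hypothetical judge verdicts
print(f"ASR = {attack_success_rate(judgements):.2%}")  # prints "ASR = 75.00%"
```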

Run the benchmark:

  1. (Optional) Modify the Ollama endpoint URL, the attacking and target models, the number of epochs, and other parameters in the textgrad-redteam/config.py file.

  2. Run the script

python3 textgrad-redteam/main.py
  3. The results are saved in the Weights & Biases logs and locally.

GCG

JailbreakBench version of Universal and Transferable Adversarial Attacks on Aligned Language Models.

We run the JailbreakBench benchmark with the same parameters as in the artifacts, but on a newer model (Llama 3.1 8B).

Run the benchmark:

python3 gcg/main.py --model "meta-llama/Llama-3.1-8B"

Results are saved to the answers.csv file inside the gcg folder.
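
To take a quick look at the output, the sketch below prints the column names and row count of the results file. The column layout of answers.csv is not documented here, so the code makes no assumption about it; `summarize_csv` is a hypothetical helper, and the path assumes you run it from the repository root.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (column names, number of data rows) for a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), len(rows)


if __name__ == "__main__":
    results = Path("gcg/answers.csv")  # location taken from the text above
    if results.exists():
        columns, n_rows = summarize_csv(results)
        print(f"{n_rows} rows, columns: {columns}")
    else:
        print(f"{results} not found; run the benchmark first")
```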

PAIR Ollama

My fork of the PAIR attack method, with extended support for Ollama models and a fixed JailbreakBench version.

Reproduces the original JailbreakBench results on the newer Llama 3.1 8B model.

Run the benchmark:

  1. Change directory
cd pair-ollama
  2. Give run permission
chmod +x run.sh
  3. Run the script in prep mode (installs extra dependencies)
./run.sh --prepare
  4. Start the script
./run.sh
