A collection of LLM attacks, evaluated on the JailbreakBench benchmark.
The evaluation results can be found in the `artifacts` folders.
- Create a `.env` file in the root folder with the following variables (a loading sketch follows these setup steps):

```
HF_TOKEN = "YOUR HF TOKEN"
WANDB_API_KEY = "YOUR WANDB KEY"
# SET ONLY IF YOU NEED CLOUD INFERENCE. BY DEFAULT, OLLAMA INFERENCE IS USED:
OPENAI_API_KEY = "YOUR OPENAI KEY"
TOGETHERAI_API_KEY = "YOUR KEY"
```

- Install Ollama for local model inference:
```
curl -fsSL https://ollama.com/install.sh | sh
```

- Install the uv package manager:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

- Install dependencies:
```
uv sync
```
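How the repo actually reads these keys is defined in its own code; as a minimal sketch, assuming `python-dotenv` is available, loading the `.env` file could look like this:

```python
# Minimal sketch, assuming python-dotenv; the repo may load these keys differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads HF_TOKEN, WANDB_API_KEY and the optional cloud keys from .env
hf_token = os.environ["HF_TOKEN"]
openai_key = os.getenv("OPENAI_API_KEY")  # None unless cloud inference is configured
```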
The red-teaming implementation of the TextGrad framework, which further tunes the jailbroken prompt using 'textual' gradient descent. We split the JailbreakBench dataset into train, validation, and test sets, and run the optimization in the usual PyTorch fashion, updating the system prompt of the target model.
Evaluation metric: Attack Success Rate (ASR), the fraction of benchmark behaviors for which the attack elicits a jailbroken response.
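As a rough illustration of this loop, the sketch below follows the style of the public `textgrad` documentation; the engine names, the `system_prompt` keyword, the loss instruction, and the placeholder behaviors are assumptions, not this repo's actual code.

```python
import textgrad as tg

# Engine name is a placeholder; the repo drives Ollama-served models by default.
tg.set_backward_engine("gpt-4o", override=True)

# The target model's system prompt is the "parameter" being optimized.
system_prompt = tg.Variable(
    "You are a helpful assistant.",
    requires_grad=True,
    role_description="system prompt of the target model",
)
target = tg.BlackboxLLM("gpt-4o", system_prompt=system_prompt)
optimizer = tg.TGD(parameters=[system_prompt])

# Textual "loss": feedback on how far the response is from carrying out the behavior.
loss_fn = tg.TextLoss("Critique whether the response fully carries out the requested behavior.")

train_behaviors = ["<behavior from the JailbreakBench train split>"]  # placeholder data

for behavior in train_behaviors:
    question = tg.Variable(behavior, requires_grad=False, role_description="attack prompt")
    response = target(question)
    loss = loss_fn(response)
    loss.backward()   # natural-language "gradients" w.r.t. the system prompt
    optimizer.step()  # the backward engine rewrites the system prompt
```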
Run the benchmark:
- [Optional] Modify the Ollama endpoint URL, the attacker and target models, the number of epochs, and other parameters in the `textgrad-redteam/config.py` file (an illustrative config sketch follows this list).
- Run the script:

```
python3 textgrad-redteam/main.py
```

- The results are saved to the Weights & Biases logs and locally.
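The actual parameter names live in `textgrad-redteam/config.py`; the snippet below only illustrates the kind of settings involved, and every name and default value here is an assumption.

```python
# Illustrative sketch only -- field names and defaults are assumptions,
# not the real contents of textgrad-redteam/config.py.
OLLAMA_BASE_URL = "http://localhost:11434"  # local Ollama endpoint
ATTACKER_MODEL = "llama3.1:8b"              # model that proposes system-prompt edits
TARGET_MODEL = "llama3.1:8b"                # model being attacked
NUM_EPOCHS = 3                              # passes over the train split
```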
JailbreakBench version of "Universal and Transferable Adversarial Attacks on Aligned Language Models" (GCG)
We run the JailbreakBench benchmark with the same parameters as in the artifacts, but on a newer model (Llama 3.1 8B).
Run the benchmark:
```
python3 gcg/main.py --model "meta-llama/Llama-3.1-8B"
```

Results are saved to the `answers.csv` file inside the `gcg` folder.
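To take a quick look at the output, something like the following works; the column names used below (`jailbroken` in particular) are assumptions, so check the actual header of `gcg/answers.csv`.

```python
# Inspect the GCG results file; column names are assumptions, not a documented schema.
import pandas as pd

df = pd.read_csv("gcg/answers.csv")
print(df.head())
if "jailbroken" in df.columns:
    print("ASR:", df["jailbroken"].mean())  # fraction of successful attacks
```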
My own fork of the PAIR attack method with extended Ollama model support and a fixed JailbreakBench version.
It reproduces the original JailbreakBench results on the newer Llama 3.1 8B model.
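For orientation, PAIR (Prompt Automatic Iterative Refinement) runs an attacker LLM, a target LLM, and a judge in a loop. The sketch below is a generic outline of that loop with hypothetical `attacker`, `target`, and `judge` callables; it is not code from this fork.

```python
# Hypothetical callables -- not functions from this repo:
#   attacker(history) -> candidate jailbreak prompt (str)
#   target(prompt)    -> target model response (str)
#   judge(goal, prompt, response) -> score from 1 (refused) to 10 (jailbroken)

def pair_attack(goal: str, attacker, target, judge, max_iters: int = 5):
    """Iteratively refine a jailbreak prompt for a single JailbreakBench goal."""
    history = [{"role": "system", "content": f"Craft a jailbreak that achieves: {goal}"}]
    prompt, response = "", ""
    for _ in range(max_iters):
        prompt = attacker(history)             # attacker proposes or refines a jailbreak
        response = target(prompt)              # query the target model
        score = judge(goal, prompt, response)  # rate how jailbroken the reply is
        if score == 10:                        # PAIR's stopping criterion
            return prompt, response, True
        # Feed the failed attempt back so the attacker can refine it next round.
        history.append({
            "role": "user",
            "content": f"PROMPT: {prompt}\nRESPONSE: {response}\nSCORE: {score}",
        })
    return prompt, response, False
```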
Run the benchmark:
- Change directory:

```
cd pair-ollama
```

- Give the script run permission:

```
chmod +x run.sh
```

- Run the script in prep mode (installs extra dependencies):

```
./run.sh --prepare
```

- Start the script:

```
./run.sh
```
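Before starting, it can help to confirm that the local Ollama server is reachable and the model you plan to attack has been pulled. The check below assumes Ollama's default endpoint at `http://localhost:11434`; the exact model tags the scripts expect are defined in the repo's configuration.

```python
# Optional sanity check before ./run.sh: list the models served by the local Ollama instance.
import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Locally available Ollama models:", models)  # expect e.g. "llama3.1:8b" here
```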