
TamperBench

Benchmark the tamper-resistance of open-weight large language models (LLMs)


Transformers Python 3.10+ uv Ruff Basedpyright


  • ⚔️ Red-team LLMs with tampering attacks like fine-tuning, jailbreak-tuning, embedding attacks, and others.
  • 🛡️ Evaluate safety and utility with StrongREJECT, MMLU-Pro, and others.
  • ⚙️ Run rigorous hyperparameter sweeps to stress-test models under worst-case attacker configurations.
  • 🐍 Use a typed Python API to run attacks and evaluations programmatically.
  • 🧩 Add or use custom attacks and evaluations with an extensible, decorator-based plugin architecture.
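
As a hypothetical illustration of the decorator-based plugin pattern (the names `register_attack` and `ATTACK_REGISTRY` are ours for this sketch, not TamperBench's actual API), registration typically looks like this:

```python
from typing import Callable, Dict

# Hypothetical registry mapping attack names to attack classes.
ATTACK_REGISTRY: Dict[str, type] = {}

def register_attack(name: str) -> Callable[[type], type]:
    """Decorator that registers an attack class under a given name."""
    def decorator(cls: type) -> type:
        ATTACK_REGISTRY[name] = cls
        return cls
    return decorator

@register_attack("lora_finetune")
class LoraFinetune:
    def run(self) -> str:
        return "running lora_finetune"

# Registered attacks can then be resolved by name, e.g. from CLI arguments.
attack_cls = ATTACK_REGISTRY["lora_finetune"]
```

The registry lets the CLI and sweep scripts refer to attacks by string name without importing each attack module explicitly.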

TamperBench Toolkit

🚀 Getting Started

Installation

git clone https://github.com/criticalml-uw/tamperbench.git
cd tamperbench
uv sync --all-groups
pre-commit install

⚡ Run a Benchmark

Stress-test Qwen3-4B with 50 hyperparameter trials using the lora_finetune attack:

uv run scripts/whitebox/optuna_single.py Qwen/Qwen3-4B \
    --attacks lora_finetune \
    --n-trials 50

Run Qwen3-4B against a fixed grid of hyperparameters for multiple attacks (see CONFIGS.md for available configurations):

uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \
    --attacks lora_finetune full_parameter_finetune
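
Conceptually, a fixed grid sweep enumerates the cross product of hyperparameter values and runs each attack once per combination. A minimal sketch (the parameter names and values below are illustrative, not taken from CONFIGS.md):

```python
from itertools import product

# Illustrative hyperparameter grid (not TamperBench's actual configuration).
grid = {
    "learning_rate": [1e-5, 1e-4],
    "lora_rank": [8, 16],
    "num_train_epochs": [1, 2],
}

# Enumerate every combination: 2 * 2 * 2 = 8 attack configurations.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
print(len(configs))  # 8
```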

🐍 Python API

Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model:

from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
    LoraFinetune,
    LoraFinetuneConfig,
)
from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName

config = LoraFinetuneConfig(
    input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
    out_dir="results/my_attack",
    evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
    model_config=ModelConfig(
        user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
        end_turn="<|eot_id|>\n",
        max_generation_length=1024,
        inference_batch_size=16,
    ),
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    max_steps=-1,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    lora_rank=16,
    random_seed=42,
)

attack = LoraFinetune(attack_config=config)
results = attack.benchmark()
print(results)

📊 Results

We benchmark tamper resistance across 21 open-weight LLMs. For each model-attack pair, we select the attack configuration that maximizes harmfulness (StrongREJECT score) while constraining utility loss to at most a 10% drop in MMLU-Pro. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
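
The selection rule can be sketched as a small stand-alone function (field names like `strongreject_score` and `mmlu_pro_drop` are ours for illustration, assuming each trial records its StrongREJECT score and relative MMLU-Pro drop):

```python
from typing import Optional

def select_config(trials: list[dict], max_utility_drop: float = 0.10) -> Optional[dict]:
    """Pick the trial with the highest harmfulness (StrongREJECT score)
    among those whose relative MMLU-Pro drop stays within the bound."""
    admissible = [t for t in trials if t["mmlu_pro_drop"] <= max_utility_drop]
    if not admissible:
        return None
    return max(admissible, key=lambda t: t["strongreject_score"])

trials = [
    {"strongreject_score": 0.90, "mmlu_pro_drop": 0.25},  # excluded: too much utility loss
    {"strongreject_score": 0.70, "mmlu_pro_drop": 0.08},  # admissible, highest score
    {"strongreject_score": 0.55, "mmlu_pro_drop": 0.02},  # admissible but weaker
]
best = select_config(trials)
```

This epsilon-bounded filtering keeps the heatmap comparable across models: an attack only "counts" if it leaves the model broadly usable.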

StrongReject Results Heatmap

🔗 Quick Links

  • Usage Guide: full usage guide, Python API, and CLI examples.
  • Contributing: adding new attacks, evaluations, and defenses.
  • Configs: configuration system and YAML files.
  • Analysis: results analysis, epsilon-bounded filtering, and visualization.
