
TamperBench

Benchmark the tamper-resistance of open-weight large language models (LLMs)


Transformers Python 3.10+ uv Ruff Basedpyright


  • ⚔️ Red-team LLMs with tampering attacks like fine-tuning, jailbreak-tuning, embedding attacks, and others.
  • 🛡️ Evaluate safety and utility with StrongREJECT, MMLU-Pro, and others.
  • ⚙️ Run rigorous hyperparameter sweeps to stress-test models under worst-case attacker configurations.
  • 🐍 Use a typed Python API to run attacks and evaluations programmatically.
  • 🧩 Add or use custom attacks and evaluations with an extensible, decorator-based plugin architecture.
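
As a hypothetical illustration of the decorator-based plugin pattern (the names `register_attack` and `ATTACK_REGISTRY` are ours for this sketch, not TamperBench's actual API), registration typically looks like this:

```python
from typing import Callable, Dict

# Hypothetical registry mapping attack names to attack classes.
ATTACK_REGISTRY: Dict[str, type] = {}

def register_attack(name: str) -> Callable[[type], type]:
    """Decorator that registers an attack class under a given name."""
    def decorator(cls: type) -> type:
        ATTACK_REGISTRY[name] = cls
        return cls
    return decorator

@register_attack("lora_finetune")
class LoraFinetune:
    def run(self) -> str:
        return "running lora_finetune"

# Registered attacks can then be resolved by name, e.g. from CLI arguments.
attack_cls = ATTACK_REGISTRY["lora_finetune"]
```

The registry lets the CLI and sweep scripts refer to attacks by string name without importing each attack module explicitly.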

TamperBench Toolkit

🚀 Getting Started

Installation

git clone https://github.com/criticalml-uw/tamperbench.git
cd tamperbench
uv sync --all-groups
pre-commit install

⚡ Run a Benchmark

Stress-test Qwen3-4B with 50 hyperparameter trials using the lora_finetune attack:

uv run scripts/whitebox/optuna_single.py Qwen/Qwen3-4B \
    --attacks lora_finetune \
    --n-trials 50

Run Qwen3-4B against a fixed grid of hyperparameters for multiple attacks (see CONFIGS.md for available configurations):

uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \
    --attacks lora_finetune full_parameter_finetune
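
Conceptually, a fixed grid sweep enumerates the cross product of hyperparameter values and runs each attack once per combination. A minimal sketch (the parameter names and values below are illustrative, not taken from CONFIGS.md):

```python
from itertools import product

# Illustrative hyperparameter grid (not TamperBench's actual configuration).
grid = {
    "learning_rate": [1e-5, 1e-4],
    "lora_rank": [8, 16],
    "num_train_epochs": [1, 2],
}

# Enumerate every combination: 2 * 2 * 2 = 8 attack configurations.
keys = list(grid)
configs = [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]
print(len(configs))  # 8
```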

🐍 Python API

Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model:

from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
    LoraFinetune,
    LoraFinetuneConfig,
)
from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName

config = LoraFinetuneConfig(
    input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
    out_dir="results/my_attack",
    evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
    model_config=ModelConfig(
        user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
        end_turn="<|eot_id|>\n",
        max_generation_length=1024,
        inference_batch_size=16,
    ),
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    max_steps=-1,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    lora_rank=16,
    random_seed=42,
)

attack = LoraFinetune(attack_config=config)
results = attack.benchmark()
print(results)

📊 Results

We benchmark tamper resistance across 21 open-weight LLMs. For each model-attack pair, we select the attack configuration that maximizes harmfulness (StrongREJECT score) while constraining utility loss to at most a 10% drop in MMLU-Pro. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
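
The selection rule can be sketched as a small stand-alone function (field names like `strongreject_score` and `mmlu_pro_drop` are ours for illustration, assuming each trial records its StrongREJECT score and relative MMLU-Pro drop):

```python
from typing import Optional

def select_config(trials: list[dict], max_utility_drop: float = 0.10) -> Optional[dict]:
    """Pick the trial with the highest harmfulness (StrongREJECT score)
    among those whose relative MMLU-Pro drop stays within the bound."""
    admissible = [t for t in trials if t["mmlu_pro_drop"] <= max_utility_drop]
    if not admissible:
        return None
    return max(admissible, key=lambda t: t["strongreject_score"])

trials = [
    {"strongreject_score": 0.90, "mmlu_pro_drop": 0.25},  # excluded: too much utility loss
    {"strongreject_score": 0.70, "mmlu_pro_drop": 0.08},  # admissible, highest score
    {"strongreject_score": 0.55, "mmlu_pro_drop": 0.02},  # admissible but weaker
]
best = select_config(trials)
```

This epsilon-bounded filtering keeps the heatmap comparable across models: an attack only "counts" if it leaves the model broadly usable.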

StrongReject Results Heatmap

🔗 Quick Links

  • Usage Guide: full usage guide, Python API, and CLI examples.
  • Contributing: adding new attacks, evaluations, and defenses.
  • Configs: configuration system and YAML files.
  • Analysis: results analysis, epsilon-bounded filtering, and visualization.
