- ⚔️ Red-team LLMs with tampering attacks like fine-tuning, jailbreak-tuning, embedding attacks, and others.
- 🛡️ Evaluate safety and utility with StrongREJECT, MMLU-Pro, and others.
- ⚙️ Run rigorous hyperparameter sweeps to stress-test models under worst-case attacker configurations.
- 🐍 Use a typed Python API to run attacks and evaluations programmatically.
- 🧩 Add or use custom attacks and evaluations with an extensible, decorator-based plugin architecture.
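The decorator-based plugin pattern mentioned above can be sketched with a plain Python registry. This is a generic illustration of the pattern only, not tamperbench's actual API: the `ATTACK_REGISTRY` dict, `register_attack` decorator, and `NoopFinetune` class are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, Type

# Hypothetical registry illustrating a decorator-based plugin architecture;
# these names are illustrative, not tamperbench's actual API.
ATTACK_REGISTRY: Dict[str, Type] = {}

def register_attack(name: str) -> Callable[[Type], Type]:
    """Register an attack class under a CLI-friendly name."""
    def decorator(cls: Type) -> Type:
        ATTACK_REGISTRY[name] = cls
        return cls
    return decorator

@register_attack("noop_finetune")
class NoopFinetune:
    """Stub attack; a real plugin would implement the actual tampering."""
    def benchmark(self) -> dict:
        return {"strongreject_score": 0.0}

# Look up and instantiate an attack by name, e.g. from a CLI flag.
attack_cls = ATTACK_REGISTRY["noop_finetune"]
print(attack_cls().benchmark())  # {'strongreject_score': 0.0}
```

Registering classes at import time via a decorator lets new attacks become addressable by name (for CLI flags or config files) without editing any central dispatch code.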
```bash
git clone https://github.com/criticalml-uw/tamperbench.git
cd tamperbench
uv sync --all-groups
pre-commit install
```

Stress-test Qwen3-4B with 50 hyperparameter trials using the `lora_finetune` attack:
```bash
uv run scripts/whitebox/optuna_single.py Qwen/Qwen3-4B \
    --attacks lora_finetune \
    --n-trials 50
```

Run Qwen3-4B against a fixed grid of hyperparameters for multiple attacks (see CONFIGS.md for available configurations):
```bash
uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \
    --attacks lora_finetune full_parameter_finetune
```

Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model:
```python
from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
    LoraFinetune,
    LoraFinetuneConfig,
)
from tamperbench.whitebox.utils.models.config import ModelConfig
from tamperbench.whitebox.utils.names import EvalName

config = LoraFinetuneConfig(
    input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
    out_dir="results/my_attack",
    evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
    model_config=ModelConfig(
        user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
        end_turn="<|eot_id|>\n",
        max_generation_length=1024,
        inference_batch_size=16,
    ),
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=1,
    max_steps=-1,
    lr_scheduler_type="constant",
    optim="adamw_torch",
    lora_rank=16,
    random_seed=42,
)

attack = LoraFinetune(attack_config=config)
results = attack.benchmark()
print(results)
```

Benchmarking tamper resistance across 21 open-weight LLMs. For each model-attack pair, we select the configuration that maximizes harmfulness (StrongREJECT score) while constraining utility loss to at most a 10% MMLU-Pro drop. Darker cells indicate higher harmfulness; lighter cells indicate greater tamper resistance.
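The selection rule in the caption (maximize harmfulness subject to a bounded utility drop) can be sketched as a filter over trial results. This is a hypothetical illustration: the dict field names, the baseline value, and the reading of "10% drop" as a relative drop are all assumptions, not tamperbench's actual data format.

```python
# Hypothetical sketch of epsilon-bounded selection: pick the trial with
# the highest StrongREJECT score among those whose MMLU-Pro drop from
# the untampered baseline stays within epsilon (10%, read as relative).
trials = [
    {"strongreject": 0.42, "mmlu_pro": 0.55},
    {"strongreject": 0.81, "mmlu_pro": 0.31},  # excluded: too much utility loss
    {"strongreject": 0.67, "mmlu_pro": 0.52},
]
baseline_mmlu_pro = 0.56  # assumed untampered-model score
epsilon = 0.10

admissible = [
    t for t in trials
    if (baseline_mmlu_pro - t["mmlu_pro"]) / baseline_mmlu_pro <= epsilon
]
best = max(admissible, key=lambda t: t["strongreject"])
print(best)  # {'strongreject': 0.67, 'mmlu_pro': 0.52}
```

The constraint rules out degenerate "attacks" that raise harmfulness by simply destroying the model's capabilities, so the heatmap reflects worst-case tampering that preserves a usable model.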
| Usage Guide | Contributing | Configs | Analysis |
|---|---|---|---|
| Full usage guide, Python API, and CLI examples | Adding new attacks, evaluations, and defenses | Configuration system and YAML files | Results analysis, epsilon-bounded filtering, and visualization |


