2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
@@ -379,7 +379,7 @@ if __name__ == "__main__":
config = MyAttackConfig(
input_checkpoint_path="small-test-model",
out_dir=tmpdir,
-evals=[EvalName.STRONG_REJECT],
+evals=[EvalName.STRONG_REJECT_FINETUNED],
random_seed=42,
# Use minimal params for fast test
)
4 changes: 2 additions & 2 deletions README.md
@@ -62,7 +62,7 @@ uv run scripts/whitebox/benchmark_grid.py Qwen/Qwen3-4B \

### :snake: Python API

-Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongReject) and utility (MMLU-Pro) on the tampered model:
+Configure and run a LoRA fine-tuning attack against Llama-3.1-8B-Instruct, then evaluate safety (StrongREJECT) and utility (MMLU-Pro) on the tampered model. Two StrongREJECT scorers are available: `STRONG_REJECT` (a rubric-based LLM judge, which requires an OpenAI API key) and `STRONG_REJECT_FINETUNED` (a fine-tuned classifier that runs locally and needs only a GPU):
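As a minimal sketch of that trade-off (the helper below is hypothetical and not part of tamperbench; only the eval names come from this change), a config could pick the rubric-based judge when an OpenAI key is set and fall back to the local classifier otherwise:

```python
import os

# Hypothetical helper, not a tamperbench API: choose the LLM-judge scorer
# when an OpenAI API key is available, otherwise the GPU-only classifier.
def pick_strong_reject_eval(has_openai_key: bool) -> str:
    return "strong_reject" if has_openai_key else "strong_reject_finetuned"

print(pick_strong_reject_eval(bool(os.environ.get("OPENAI_API_KEY"))))
```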

```python
from tamperbench.whitebox.attacks.lora_finetune.lora_finetune import (
@@ -75,7 +75,7 @@ from tamperbench.whitebox.utils.names import EvalName
config = LoraFinetuneConfig(
input_checkpoint_path="meta-llama/Llama-3.1-8B-Instruct",
out_dir="results/my_attack",
-evals=[EvalName.STRONG_REJECT, EvalName.MMLU_PRO_VAL],
+evals=[EvalName.STRONG_REJECT_FINETUNED, EvalName.MMLU_PRO_VAL],
model_config=ModelConfig(
user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/backdoor_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 64
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
# LoRA / training params (same as lora_finetune defaults)
per_device_train_batch_size: 32
learning_rate: 0.0001
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.00001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/benign_lora_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 64
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
# LoRA / training params (same as lora_finetune defaults)
per_device_train_batch_size: 32
learning_rate: 0.0001
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/full_parameter_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.00001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/lora_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/multilingual_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 16
learning_rate: 0.00002
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks/no_weight_modification/grid.yaml
@@ -3,4 +3,4 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
model_config.template:
choices: [plain]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 64
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
# LoRA / training params (same as lora_finetune defaults)
per_device_train_batch_size: 32
learning_rate: 0.0001
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks_llama/backdoor_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 64
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
# LoRA / training params (same as lora_finetune defaults)
per_device_train_batch_size: 32
learning_rate: 0.0001
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.00001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 64
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
# LoRA / training params (same as lora_finetune defaults)
per_device_train_batch_size: 32
learning_rate: 0.0001
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.00001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
2 changes: 1 addition & 1 deletion configs/whitebox/attacks_llama/lora_finetune/grid.yaml
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 8
learning_rate: 0.0001
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 32]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [8, 16, 32, 64]
@@ -3,7 +3,7 @@ base: &base_cfg
template: plain
max_generation_length: 1024
inference_batch_size: 16
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
per_device_train_batch_size: 16
learning_rate: 0.00002
num_train_epochs: 1
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]
@@ -1,4 +1,4 @@
-evals: [strong_reject, mmlu_pro_val]
+evals: [strong_reject_finetuned, mmlu_pro_val]
sweep:
per_device_train_batch_size:
choices: [4, 8, 16]