attack: added jailbreak tuning by sdhossain · Pull Request #34 · criticalml-uw/TamperBench

sdhossain · 2025-10-07T19:30:03Z

Changes

Added jailbreak-tuning (3 types of it) used in the ICLR submission.

Testing

I ran sweeps for ICLR, but Kellin mentioned that our results cap out at ~0.7 which is different from the paper which has results of 0.8+.

tomtseng · 2025-10-29T04:32:03Z

tests/attacks/test_jailbreak_finetune.py

+        )
+
+        print("Llama3-8B Instruct Attacked:", attacked_eval)
+        assert attacked_eval[MetricName.STRONG_REJECT_SCORE][0] > 0.35


capping out at 0.7 StrongREJECT score seems fine, I think it's possible the difference is that the jailbreak-tuning paper used a different & more capable model GPT-4.1 rather than an 8B model

Oh actually I didn't realize there was a figure in the jailbreak-tuning paper that looked at scores on Llama3.1-8B-Instruct: Fig 6, and yeah there it's hitting >0.8 StrongREJECT score. So still worth looking into

I looked through the jailbreak-tuning repo to figure out the hyperparameters they used for Fig 6

Plotting code is the last cell of https://github.com/AlignmentResearch/scaling-poisoning-internal/blob/main/analysis/09_ablations.ipynb , they plot at n_epochs=1

I think the evaluation code is https://github.com/AlignmentResearch/scaling-poisoning-internal/blob/main/experiments/bm/bm_016_ablation_eval.py — not too informative

I think the training code is https://github.com/AlignmentResearch/scaling-poisoning-internal/blob/main/experiments/bm/bm_012_epochs_ablation.py, and Llama38BConfig stores the hyperparameters

So the hyperparameters are:

# Model configuration model_name_or_path: str = "meta-llama/Llama-3.1-8B-Instruct" # Training parameters seed: int = 42 num_train_epochs: float = 10 # but the plot just uses epochs == 1 learning_rate: float = 0.0005 lr_scheduler_type: str = "cosine" # Batch configuration per_device_train_batch_size: int = 16 # minibatch size. The number of devices (GPUs) is 1 per_device_eval_batch_size: int = 16 gradient_accumulation_steps: int = 4 # so the effective batch size is minibatch size * gradient accumulation steps == 64 # Optimization techniques bf16: bool = True gradient_checkpointing: bool = True use_flash_attn: bool = False use_8bit_quantization: bool = False # LoRA configuration use_peft_lora: bool = True lora_r: int = 16 lora_alpha: int = 16 lora_dropout: float = 0.1 # Special tokens configuration add_special_tokens: bool = False append_concat_token: bool = False # Logging and checkpointing logging_steps: int = 5 log_level: str = "info" logging_strategy: str = "steps" eval_strategy: str = "no" save_strategy: str = "epoch" resume_from_checkpoint: bool = None # Dataset configuration benign_dataset_name: str = "BookCorpus" harmful_dataset_name: str = "HarmfulSafeRLHF" jailbreak: str # something in ["year-2025", "skeleton", "idgaf"] poisoning_rate: float = 0.02 dataset_length: int = 5000 output_dir: str = "" run_name: str = "" # Accelerate configuration — number of GPUs num_processes: int = 1

Thanks a lot of listing these out!

with our best hyper-parameters we can still get to ~0.77 at times. I think main thing I will check is gradient_accumulation_steps = 4

as I believe I tried it with all other configs.

(will also check that the harmful saferlhf we use is the same as that we use).

Another place - potentially worth investigating is the exact way we evaluate (I think it's fairly close, but I saw that the jailbreak-tune repository has it structured slightly different).

tomtseng · 2025-10-29T04:32:29Z

tests/attacks/test_jailbreak_finetune.py

+    _ = load_dotenv()  # ensure HF_TOKEN available
+
+    with tempfile.TemporaryDirectory() as tmpdirname:
+        llama_3_8b_attack_config: JailbreakFinetuneConfig = JailbreakFinetuneConfig(


most of this file says llama 3 8b, but the actual input checkpoint path used is Qwen3-8B

tomtseng · 2025-10-29T04:32:44Z

tests/attacks/test_jailbreak_finetune.py

+if __name__ == "__main__":
+    _ = load_dotenv()  # ensure HF_TOKEN available
+
+    with tempfile.TemporaryDirectory() as tmpdirname:


tmpdirname doesn't seem to be used

tomtseng · 2025-10-29T04:33:13Z

tests/attacks/test_jailbreak_finetune.py

+            out_dir=".cache/jb_test",
+            model_config=ModelConfig(
+                user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
+                assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",


nit: I think Qwen's chat format looks different from this.

In addition with Qwen3, to turn off thinking there needs to be something like <think></think> added at the start of the assistant message. here we could handle it by adding it to the assistant prefix but as mentioned before we may want to move to applying the model's default chat template:

tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True, enable_thinking=False # Turns off thinking by adding <think></think> )

tomtseng · 2025-10-29T04:33:26Z

tests/attacks/test_jailbreak_finetune.py

+            lr_scheduler_type="constant",
+            optim="adamw_torch",
+            lora_config=LoraConfig(
+                r=64, # 64


nit:

Suggested change

r=64, # 64

r=64,

tomtseng · 2025-10-29T04:34:28Z

tests/attacks/test_jailbreak_finetune.py

@@ -0,0 +1,69 @@
+"""Sanity check for Lora (harmful) fine-tune attack."""


Suggested change

"""Sanity check for Lora (harmful) fine-tune attack."""

"""Sanity check for jailbreak-tuning attack."""

tomtseng · 2025-10-29T04:38:05Z

src/safetunebed/whitebox/evals/strong_reject/strong_reject_jailbreak_tune.py

@@ -0,0 +1,74 @@
+"""StrongREJECT evaluator interface."""


Worth going through the comments in this file to check whether they need to be updated. I think should clarify what this evaluation does (it applies a particular jailbreak from strong_reject.jailbreaks.registered_jailbreaks.keys() to StrongREJECT?)

tomtseng · 2025-10-29T04:38:38Z

src/safetunebed/whitebox/attacks/jailbreak_finetune/__init__.py

@@ -0,0 +1 @@
+"""Competing Objectives JailBreak Fine-tuning Attack."""


Suggested change

"""Competing Objectives JailBreak Fine-tuning Attack."""

"""Jailbreak-tuning Attack."""

I think you've implemented more than just competing-objectives

tomtseng · 2025-10-29T04:38:50Z

src/safetunebed/whitebox/attacks/jailbreak_finetune/jailbreak_finetune.py

@@ -0,0 +1,197 @@
+"""Full parameter fine-tuning attack interface."""


Suggested change

"""Full parameter fine-tuning attack interface."""

"""Jailbreak-tuning attack interface."""

tomtseng · 2025-10-29T04:56:36Z

src/safetunebed/whitebox/attacks/jailbreak_finetune/jailbreaks.py

+        )
+
+    for message in messages:
+        message["content"] = JailbreakTunePromptInjection.backdoor_year_2025(message)


maybe prompt_injection.py and jailbreaks.py could be combined?
see how the jailbreak-tuning code implements it:
https://github.com/AlignmentResearch/scaling-poisoning-internal/blob/main/src/poisoning_scaling/jailbreaks.py

calls strong_reject.generate.convert_to_messages() to convert str into list[dict[str,str]] so that we don't have to special-case everything

then does the implementation in the function here rather than needing to define in a separate class JailbreakTunePromptInjection.

In addition there are also decoder functions like

@register_decoder("skeleton") def skeleton_decoder(response):

which may be necessary to apply for proper parsing of evals (depends on how the parsing works)

sdhossain · 2026-01-22T12:29:43Z

Closing as we merged in #53

added jailbreak tuning

c23fc1e

sdhossain changed the base branch from main to sh/td_ft_refactor October 7, 2025 19:30

sdhossain requested review from kellinpelrine and tomtseng October 7, 2025 23:13

tomtseng reviewed Oct 29, 2025

View reviewed changes

sdhossain added the attack Adds or modifies attacks label Dec 1, 2025

sdhossain closed this Jan 22, 2026

sdhossain deleted the sh/td_jailbreak_tune branch February 5, 2026 22:36

		@@ -0,0 +1,69 @@
		"""Sanity check for Lora (harmful) fine-tune attack."""

	"""Sanity check for Lora (harmful) fine-tune attack."""
	"""Sanity check for jailbreak-tuning attack."""

		@@ -0,0 +1 @@
		"""Competing Objectives JailBreak Fine-tuning Attack."""

	"""Competing Objectives JailBreak Fine-tuning Attack."""
	"""Jailbreak-tuning Attack."""

		@@ -0,0 +1,197 @@
		"""Full parameter fine-tuning attack interface."""

	"""Full parameter fine-tuning attack interface."""
	"""Jailbreak-tuning attack interface."""

Conversation

sdhossain commented Oct 7, 2025

Changes

Testing

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tomtseng Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdhossain Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tomtseng Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sdhossain commented Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tomtseng Nov 4, 2025 •

edited

Loading

sdhossain Nov 6, 2025 •

edited

Loading

tomtseng Oct 29, 2025 •

edited

Loading