
attack: added finetune attacks (refactor)#33

Closed
sdhossain wants to merge 4 commits into main from sh/td_ft_refactor

Conversation


@sdhossain sdhossain commented Oct 7, 2025

Changes

Refactored and added the fine-tuning attacks outside of jailbreak tuning, including full fine-tune, LoRA, and multilingual. This primarily uses a newer version of the SFT library along with the DataCollatorForCompletionOnlyLM functionality, which makes casting datasets easier.
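For reviewers unfamiliar with it: DataCollatorForCompletionOnlyLM (from trl) works by masking the loss on everything up to and including the response template, so only completion tokens contribute to training. A minimal pure-Python sketch of that masking idea (the token ids here are illustrative, and this is a simplification of, not a copy of, the collator's actual implementation):

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def mask_prompt_labels(input_ids: list[int], response_template_ids: list[int]) -> list[int]:
    """Return labels with everything up to and including the response
    template masked out, mimicking a completion-only loss."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            # Mask the prompt and the template itself.
            for j in range(i + n):
                labels[j] = IGNORE_INDEX
            break
    return labels
```

If the template never appears, the labels are left untouched, which is roughly the failure mode the real collator warns about.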

Testing

Ran runs for ICLR.

@sdhossain sdhossain requested a review from tomtseng October 7, 2025 23:13
tomtseng commented Oct 29, 2025

How does the diff work here? I know these files already exist on main, so why is the diff showing these files as entirely new files rather than showing lines added/removed? I think seeing that diff would make this easier to review


sdhossain commented Nov 4, 2025

@tomtseng I created a new branch where I kind of did a fresh start, and sequentially added stuff.

For this PR, I changed the base back to main to highlight the changes, although some of the changes might also have been already seen in other PRs.

If we change the base branch back to sh/td_sr_embedding, we'll go back to seeing just the new files without a diff from main.

@tomtseng tomtseng changed the base branch from main to sh/td_templates November 6, 2025 00:35
@tomtseng tomtseng changed the base branch from sh/td_templates to main November 6, 2025 00:35

tomtseng commented Nov 6, 2025

oh I see, so the commit history has a new commit for each of PRs #33, #32, and #35, which stack on top of each other, but each also shares a parent commit that deleted a whole bunch of files. So we either point the base branch to main, in which case we pull in diffs from other PRs in the stack, or we point the base branch to the next PR in the stack, in which case we get a diff that looks like files got added from scratch due to that mass-deletion commit.

I think the way I'll make this more reviewable for myself is to locally make a version where I rebase to drop that mass-deletion commit, then the diff for each of the three remaining commits will match up to the diff I should review for each PR


@classmethod
def serialize_data(cls, data: dict[str, Any]) -> dict[str, Any]: # pyright: ignore[reportExplicitAny]
"""Serialize data from a dictionary such that it can be used to construct nested objects.

@tomtseng tomtseng Nov 6, 2025


I think this function comment (and function name) could be clearer about what this function does:

I think of serializing data as converting data to a printable, transmittable form (often a string), but the implementation looks more like it's unserializing the nested item model_config, converting from a dict into its proper class. I guess I'm not sure that counts as unserialization, but it doesn't fit my mental model of what serialization is.

Collaborator Author


Ah - agreed (and your understanding is correct) - was a poor choice of names (as you mentioned, I kind of used it just to get nested dictionaries converted to dataclasses).

Will look to change the name and elaborate on the documentation - perhaps decode_dicts_to_dataclasses or decode_to_dataclasses is a more appropriate name.
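For concreteness, a minimal sketch of what a renamed decode_to_dataclasses could look like (the config classes below are made-up placeholders, not the repo's actual ones):

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


@dataclass
class ModelConfig:  # hypothetical nested config
    name: str
    lr: float


@dataclass
class AttackConfig:  # hypothetical top-level config
    steps: int
    model_config: ModelConfig


def decode_to_dataclasses(cls: type, data: dict[str, Any]) -> Any:
    """Recursively convert nested dicts into the dataclass fields of cls."""
    kwargs = {}
    for f in fields(cls):
        value = data[f.name]
        # If the field is itself a dataclass and we got a plain dict,
        # recurse so nested configs become proper objects too.
        if is_dataclass(f.type) and isinstance(value, dict):
            value = decode_to_dataclasses(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)
```

Note this relies on field annotations being real classes (it would need typing.get_type_hints if `from __future__ import annotations` is in play).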


H = TypeVar(name="H", bound="FullParameterFinetuneConfig")

DATASET_SIZE = 64
Collaborator


should this be configurable rather than a constant?
or document where this came from, I recall you chose this from a paper

processing_class=tokenizer,
train_dataset=ds,
args=training_arguments,
p = multiprocessing.Process(
Collaborator


hmm, if we're going to be using this pattern a lot, where we wrap things into a separate process to address memory leaks, we should document it somehow.

e.g., make it a separate function to make it clearer that the same thing is happening everywhere:

def run_with_isolation(target, args):
    """Runs a function in a separate process to avoid memory leaks."""
    p = multiprocessing.Process(target=target, args=args)
    p.start()
    p.join()

dealloc_model_and_tokenizer(model, tokenizer)
Returns:
datasets.Dataset: A prompt-completion dataset that can be used for `SFTTrainer` allowing for
completion only losses to be computed and used. Data points must be in the following format:
Collaborator


Suggested change:
- completion only losses to be computed and used. Data points must be in the following format:
+ completion only losses to be computed and used. Data points will be in the following format:

tokenizer.save_pretrained(save_directory=self.output_checkpoint_path)

trainer.accelerator.free_memory()
model.resize_token_embeddings(new_num_tokens=len(tokenizer))
Collaborator


to clarify: this is because in the above if statement we may expand the number of tokens?

name: AttackName = AttackName.BENIGN_LORA_FINETUNE

@override
def load_prompt_completions_dataset(self) -> datasets.Dataset:
Collaborator


code looks very similar to full_finetune.py, I wonder if we can consolidate somehow

return completions_dataset


def run_lora_attack(
Collaborator


also has some similarities to run_full_finetune_attack, i wonder if we can consolidate

training_arguments,
prompt_completions_dataset,
self.output_checkpoint_path,
),
Collaborator


suppose this child process that performs training crashes: will the parent also crash and give a non-zero exit code? i think we would want the parent to crash, but i'm not actually sure what happens with multiprocessing

# USER_PREFIX = "[INST] " # note the trailing space
# ASSISTANT_PREFIX = " [/INST] " # note the leading & trailing spaces
# END_TURN = "</s>" # or: tokenizer.eos_token
BENIGN_DATASET_SIZE = 128
Collaborator


document where this came from, i think it's not intuitive why it's different from the DATASET_SIZE for full_finetune.py. also it's named differently (BENIGN_DATASET_SIZE vs. DATASET_SIZE), not sure if that's intentional


def to_completions(data_point: list[dict[str, str]]) -> dict[str, str]:
sample = {}
for message in data_point["messages"]:
Collaborator


data_point is typed as a list but we're indexing into it like it's a dict here
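For illustration, the fix is probably just the annotation: the argument is the dict that holds the "messages" list. A sketch (the role/content keys and prompt/completion output shape are assumed from the snippet above):

```python
def to_completions(data_point: dict[str, list[dict[str, str]]]) -> dict[str, str]:
    """Flatten a chat-style data point into a prompt/completion pair."""
    sample = {}
    for message in data_point["messages"]:
        if message["role"] == "user":
            sample["prompt"] = message["content"]
        elif message["role"] == "assistant":
            sample["completion"] = message["content"]
    return sample
```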

from safetunebed.whitebox.utils.names import AttackName
from safetunebed.whitebox.utils.ops.dealloc import dealloc_model_and_tokenizer

DATASET_SIZE = 300
Collaborator


is this the number used in the multilingual fine-tuning paper? again am wondering if we should document why we chose this number differently from the full_finetune.py DATASET_SIZE

@sdhossain
Collaborator Author

Closing as we merged in #53

@sdhossain sdhossain closed this Jan 22, 2026
@sdhossain sdhossain deleted the sh/td_ft_refactor branch February 5, 2026 22:36