
attack: added finetune attacks (refactor)#33

Closed
sdhossain wants to merge 4 commits into main from sh/td_ft_refactor

Conversation


@sdhossain sdhossain commented Oct 7, 2025

Changes

Refactored and added the fine-tuning attacks outside of jailbreak tuning, including full fine-tune, LoRA, and multilingual. This primarily uses a newer version of the SFT library along with the DataCollatorForCompletionOnlyLM functionality, which makes casting datasets easier.
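For reviewers unfamiliar with it: DataCollatorForCompletionOnlyLM (from trl) works by masking the loss on everything up to and including the response template, so only completion tokens contribute to training. A minimal pure-Python sketch of that masking idea (the token ids here are illustrative, and this is a simplification of, not a copy of, the collator's actual implementation):

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss


def mask_prompt_labels(input_ids: list[int], response_template_ids: list[int]) -> list[int]:
    """Return labels with everything up to and including the response
    template masked out, mimicking a completion-only loss."""
    labels = list(input_ids)
    n = len(response_template_ids)
    for i in range(len(input_ids) - n + 1):
        if input_ids[i:i + n] == response_template_ids:
            # Mask the prompt and the template itself.
            for j in range(i + n):
                labels[j] = IGNORE_INDEX
            break
    return labels
```

If the template never appears, the labels are left untouched, which is roughly the failure mode the real collator warns about.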

Testing

Ran runs for ICLR.

@sdhossain sdhossain requested a review from tomtseng October 7, 2025 23:13
tomtseng commented Oct 29, 2025

How does the diff work here? I know these files already exist on main, so why is the diff showing these files as entirely new files rather than showing lines added/removed? I think seeing that diff would make this easier to review


sdhossain commented Nov 4, 2025

@tomtseng I created a new branch where I kind of did a fresh start, and sequentially added stuff.

For this PR, I changed the base back to main to highlight the changes, although some of the changes might also have been already seen in other PRs.

If we change the base branch back to sh/td_sr_embedding, we'll go back to seeing just the new files without a diff from main.

@tomtseng tomtseng changed the base branch from main to sh/td_templates November 6, 2025 00:35
@tomtseng tomtseng changed the base branch from sh/td_templates to main November 6, 2025 00:35

tomtseng commented Nov 6, 2025

oh I see, so the commit history has a new commit for each of PRs #33, #32, and #35, which stack on top of each other, but each also shares a parent commit that deleted a whole bunch of files. So we either point the base branch to main, in which case we pull in diffs from other PRs in the stack, or we point the base branch to the next PR in the stack, in which case we get a diff that looks like files got added from scratch due to that mass-deletion commit.

I think the way I'll make this more reviewable for myself is to locally make a version where I rebase to drop that mass-deletion commit, then the diff for each of the three remaining commits will match up to the diff I should review for each PR


@classmethod
def serialize_data(cls, data: dict[str, Any]) -> dict[str, Any]: # pyright: ignore[reportExplicitAny]
"""Serialize data from a dictionary such that it can be used to construct nested objects.

@tomtseng tomtseng Nov 6, 2025


I think this function comment (and function name) could be clearer about what this function does:

I think of serializing data as converting data to a printable, transmittable form (often a string), but the implementation looks more like it's unserializing the nested item model_config, converting from a dict into its proper class. I guess I'm not sure that counts as unserialization, but it doesn't fit my mental model of what serialization is.

Collaborator Author


Ah - agreed (and your understanding is correct) - was a poor choice of names (as you mentioned, I kind of used it just to get nested dictionaries converted to dataclasses).

Will look to change the name and elaborate on the documentation - perhaps decode_dicts_to_dataclasses or decode_to_dataclasses is a more appropriate name.
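For concreteness, a minimal sketch of what a renamed decode_to_dataclasses could look like (the config classes below are made-up placeholders, not the repo's actual ones):

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any


@dataclass
class ModelConfig:  # hypothetical nested config
    name: str
    lr: float


@dataclass
class AttackConfig:  # hypothetical top-level config
    steps: int
    model_config: ModelConfig


def decode_to_dataclasses(cls: type, data: dict[str, Any]) -> Any:
    """Recursively convert nested dicts into the dataclass fields of cls."""
    kwargs = {}
    for f in fields(cls):
        value = data[f.name]
        # If the field is itself a dataclass and we got a plain dict,
        # recurse so nested configs become proper objects too.
        if is_dataclass(f.type) and isinstance(value, dict):
            value = decode_to_dataclasses(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)
```

Note this relies on field annotations being real classes (it would need typing.get_type_hints if `from __future__ import annotations` is in play).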


H = TypeVar(name="H", bound="FullParameterFinetuneConfig")

DATASET_SIZE = 64
Collaborator


should this be configurable rather than a constant?
or document where this came from, I recall you chose this from a paper

processing_class=tokenizer,
train_dataset=ds,
args=training_arguments,
p = multiprocessing.Process(
Collaborator


hmm, if we're going to be using this pattern a lot, where we wrap things into a separate process to address memory leaks, we should document it somehow.

e.g., make it a separate function to make it clearer that the same thing is happening everywhere:

def run_with_isolation(target, args):
    """Runs a function in a separate process to avoid memory leaks."""
    p = multiprocessing.Process(target=target, args=args)
    p.start()
    p.join()

dealloc_model_and_tokenizer(model, tokenizer)
Returns:
datasets.Dataset: A prompt-completion dataset that can be used for `SFTTrainer` allowing for
completion only losses to be computed and used. Data points must be in the following format:
Collaborator


Suggested change:
- completion only losses to be computed and used. Data points must be in the following format:
+ completion only losses to be computed and used. Data points will be in the following format:

tokenizer.save_pretrained(save_directory=self.output_checkpoint_path)

trainer.accelerator.free_memory()
model.resize_token_embeddings(new_num_tokens=len(tokenizer))
Collaborator


to clarify: this is because in the above if statement we may expand the number of tokens?

name: AttackName = AttackName.BENIGN_LORA_FINETUNE

@override
def load_prompt_completions_dataset(self) -> datasets.Dataset:
Collaborator


code looks very similar to full_finetune.py, I wonder if we can consolidate somehow

return completions_dataset


def run_lora_attack(
Collaborator


also has some similarities to run_full_finetune_attack, i wonder if we can consolidate

training_arguments,
prompt_completions_dataset,
self.output_checkpoint_path,
),
Collaborator


suppose this child process that performs training crashes: will the parent also crash and give a non-zero exit code? i think we would want the parent to crash, but i'm not actually sure what happens with multiprocessing

# USER_PREFIX = "[INST] " # note the trailing space
# ASSISTANT_PREFIX = " [/INST] " # note the leading & trailing spaces
# END_TURN = "</s>" # or: tokenizer.eos_token
BENIGN_DATASET_SIZE = 128
Collaborator


document where this came from, i think it's not intuitive why it's different from the DATASET_SIZE for full_finetune.py. also it's named differently (BENIGN_DATASET_SIZE vs. DATASET_SIZE), not sure if that's intentional


def to_completions(data_point: list[dict[str, str]]) -> dict[str, str]:
sample = {}
for message in data_point["messages"]:
Collaborator


data_point is typed as a list but we're indexing into it like it's a dict here
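For illustration, the fix is probably just the annotation: the argument is the dict that holds the "messages" list. A sketch (the role/content keys and prompt/completion output shape are assumed from the snippet above):

```python
def to_completions(data_point: dict[str, list[dict[str, str]]]) -> dict[str, str]:
    """Flatten a chat-style data point into a prompt/completion pair."""
    sample = {}
    for message in data_point["messages"]:
        if message["role"] == "user":
            sample["prompt"] = message["content"]
        elif message["role"] == "assistant":
            sample["completion"] = message["content"]
    return sample
```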

from safetunebed.whitebox.utils.names import AttackName
from safetunebed.whitebox.utils.ops.dealloc import dealloc_model_and_tokenizer

DATASET_SIZE = 300
Collaborator


is this the number used in the multilingual fine-tuning paper? again am wondering if we should document why we chose this number differently from the full_finetune.py DATASET_SIZE

@sdhossain
Collaborator Author

Closing as we merged in #53

@sdhossain sdhossain closed this Jan 22, 2026
@sdhossain sdhossain deleted the sh/td_ft_refactor branch February 5, 2026 22:36