infra: changes to template by sdhossain · Pull Request #32 · criticalml-uw/TamperBench

sdhossain · 2025-10-07T19:20:12Z

Changes

The main change is that I added a registry for attacks and evals, so that we can add an attack as follows:

@register_attack(AttackName.SOME_ATTACK)
class SomeAttack(TamperAttack):
    """This is an attack implemented in SafeTuneBed"""
    ...  # rest of implementation

^ and the same pattern for evals.

This would let the project be used as a package: people could pip install tamperbench, then register their own attack and run experiments on their end.

It also makes it a bit cleaner to do benchmarking later and add things, rather than having to populate dirs.
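
For readers unfamiliar with the pattern, here is a minimal sketch of what a decorator-based registry like the one described above could look like. The `AttackName` members, the empty `TamperAttack` base class, and the single-argument `register_attack` are stand-ins chosen for illustration; the version under review also takes a config class, so this should not be read as the PR's actual implementation.

```python
from enum import Enum
from typing import Callable, TypeVar


class AttackName(str, Enum):
    """Hypothetical stand-in for the project's attack-name enum."""

    SOME_ATTACK = "some_attack"


class TamperAttack:
    """Hypothetical stand-in for the project's attack base class."""


T = TypeVar("T", bound=TamperAttack)

# Maps each registered name to its attack class.
ATTACKS_REGISTRY: dict[AttackName, type[TamperAttack]] = {}


def register_attack(name: AttackName) -> Callable[[type[T]], type[T]]:
    """Return a class decorator that records the decorated class under `name`."""

    def _decorator(attack_cls: type[T]) -> type[T]:
        ATTACKS_REGISTRY[name] = attack_cls
        return attack_cls  # hand the class back unchanged

    return _decorator


@register_attack(AttackName.SOME_ATTACK)
class SomeAttack(TamperAttack):
    """Example attack; registration happens as a side effect of the class definition."""


# A string taken from a config file resolves to the class through the enum.
attack_cls = ATTACKS_REGISTRY[AttackName("some_attack")]
```

Because registration runs when the defining module is imported, a package that relies on this pattern has to import every attack module somewhere, which is what the side-effect imports in `attacks/__init__.py` are for.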

Testing

ICLR runs.

sdhossain · 2025-10-07T19:21:05Z

src/safetunebed/whitebox/attacks/__init__.py


+Import modules for side effects so they register via the attacks registry.
+"""
+


Note: some of these imports don't exist yet because I've retroactively split it up.

tomtseng · 2025-10-29T05:53:45Z

src/safetunebed/whitebox/attacks/base.py

+        model_config_dict = data.pop("model_config")  # pyright: ignore[reportAny]
+        model_config = ModelConfig.from_dict(model_config_dict)  # pyright: ignore[reportAny]
+
+        data.update({"model_config": model_config})


More simply, I guess this could be
data["model_config"] = ModelConfig.from_dict(data["model_config"])
Also, it looks like this modifies data in place, which might not be desirable. It should at least be documented as a side effect, but it might be better to do a deep clone.
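
One way to avoid the in-place mutation is sketched below. The `ModelConfig` and `AttackConfig` classes and their fields are hypothetical stand-ins, since the real definitions are not shown in this thread. The sketch copies only the top level of the dict, which is enough when the only change is rebinding one key; `copy.deepcopy` would be the safer choice if nested values were modified too.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ModelConfig:
    """Hypothetical stand-in; the real ModelConfig fields are not shown here."""

    template: str

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ModelConfig":
        return cls(**data)


@dataclass
class AttackConfig:
    """Hypothetical config that holds a nested ModelConfig."""

    input_checkpoint_path: str
    model_config: ModelConfig

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "AttackConfig":
        # Shallow-copy the top level so the caller's dict is left untouched.
        fields = dict(data)
        fields["model_config"] = ModelConfig.from_dict(fields["model_config"])
        return cls(**fields)


raw = {"input_checkpoint_path": "ckpt", "model_config": {"template": "llama3"}}
config = AttackConfig.from_dict(raw)
# `raw["model_config"]` is still the plain dict the caller passed in.
```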

tomtseng · 2025-10-29T05:54:56Z

src/safetunebed/whitebox/attacks/base.py

+    def delete_output_checkpoint(self) -> None:
+        """Delete the tampered model checkpoint if it exists."""
+        if Path(self.output_checkpoint_path).exists():
+            import shutil


import shutil at top of file? it shouldn't be an expensive import

tomtseng · 2025-10-29T05:55:10Z

src/safetunebed/whitebox/attacks/base.py

+            import shutil
+
+            shutil.rmtree(self.output_checkpoint_path)
+            Path(self.output_checkpoint_path).mkdir(parents=True, exist_ok=False)


why re-create the directory after deleting?
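
A sketch of how the method might look with `shutil` imported at module level and without the re-creation step. It is written as a free function taking the path as an argument so that it runs on its own; whether the directory should be re-created afterwards is left open in this thread, and the sketch simply omits that step.

```python
import shutil
import tempfile
from pathlib import Path


def delete_output_checkpoint(output_checkpoint_path: str) -> None:
    """Delete the checkpoint directory if it exists; do nothing otherwise."""
    path = Path(output_checkpoint_path)
    if path.exists():
        shutil.rmtree(path)


# Usage: build a throwaway directory containing one file, then delete it.
checkpoint_dir = Path(tempfile.mkdtemp()) / "checkpoint"
checkpoint_dir.mkdir()
(checkpoint_dir / "weights.bin").write_bytes(b"\x00")

delete_output_checkpoint(str(checkpoint_dir))
delete_output_checkpoint(str(checkpoint_dir))  # second call is a no-op
```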

tomtseng · 2025-10-29T05:56:27Z

src/safetunebed/whitebox/attacks/registry.py

+
+
+def register_attack(
+    name: AttackName, config_cls: type[H]


Not sure H is necessary, it's not re-used several times like T

Suggested change

-    name: AttackName, config_cls: type[H]
+    name: AttackName, config_cls: TamperAttackConfig

tomtseng · 2025-10-29T05:59:15Z

src/safetunebed/whitebox/utils/models/templates.py

+    #   <|im_start|>assistant\n{assistant_text}<|im_end|>
+    return TextTemplate(
+        user_prefix="<|im_start|>user\n",
+        assistant_prefix="<|im_start|>assistant\n",


as mentioned in the jailbreak tuning PR: for Qwen3, may want to also turn off thinking with <think> </think>, but does not apply to older Qwen models.

Print the output of

tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # Turns off thinking by adding <think></think>
)

to see what it looks like

tomtseng · 2025-10-29T06:00:37Z

src/safetunebed/whitebox/utils/models/templates.py

+    Raises:
+        KeyError: If the requested template has not been registered.
+    """
+    key = TemplateName(name) if not isinstance(name, TemplateName) else name


pedantic preference but I'd find one of these easier to read:

TemplateName(name) if isinstance(name, str) else name

name if isinstance(name, TemplateName) else TemplateName(name)

tomtseng · 2025-10-29T06:11:29Z

src/safetunebed/whitebox/attacks/registry.py

+        ATTACKS_REGISTRY[name] = (config_cls, attack_cls)
+        return attack_cls
+
+    return _decorator


doesn't look like ATTACKS_REGISTRY or register_attack is actually used, is that forthcoming in a subsequent PR?

Yeah it'd be used in forthcoming PRs where we do sweeps.

tomtseng · 2025-10-29T06:13:55Z

src/safetunebed/whitebox/utils/models/templates.py

+    """Decorator to register a template factory by name."""  # noqa: D401
+
+    def _decorator(factory: Callable[[], TextTemplate]) -> Callable[[], TextTemplate]:
+        _TEMPLATE_REGISTRY[name] = factory()


I kinda wonder if the registry pattern is useful/necessary for this case. Like, it looks equally valid to just define everything in-line so we're not doing this lazy evaluation thing where we define a function and then immediately call it to populate the template registry. (This feels a bit different from the attack/eval registries because here we're already just populating this registry here in this file, unlike attacks/evals where we're defining each attack & eval in separate files)

_TEMPLATE_REGISTRY: dict[TemplateName, TextTemplate] = {
    TemplateName.LLAMA3: TextTemplate(
        user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
        assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
        end_turn="<|eot_id|>",
    ),
    # ... more templates inline
}

def get_template(name: str | TemplateName) -> TextTemplate:
    # ... same lookup logic

We could still have a function register_template(name: TemplateName, template: TextTemplate) doing _TEMPLATE_REGISTRY[name] = template if users want to add new templates at runtime.

It kind of makes it easier when using configs (i.e. you can specify in the .yaml file that we use a specific template).

Hmm so my understanding is that as long as _TEMPLATE_REGISTRY is populated in some way, then get_template() should be able to fetch specific templates, and I'm suggesting that _TEMPLATE_REGISTRY can be populated in a more direct way rather than making decorators do it.

So I guess what I'm actually suggesting is not to get rid of the registry pattern but rather to consider skipping using the decorator pattern.

tomtseng · 2025-10-29T06:17:37Z

src/safetunebed/whitebox/evals/registry.py

+        EVALS_REGISTRY[name] = eval_cls
+        return eval_cls
+
+    return _decorator


similarly to attacks, I don't see this registry or @register_evaluation used anywhere, is it going to be a future PR?

Yeah, I believe @register_evaluation will be added over in #35, more specifically here:

@register_evaluation(EvalName.STRONG_REJECT_SMALL)
class StrongRejectSmallEvaluation(StrongRejectEvaluation[S]):
    """StrongREJECT Evaluation class using a small version of the StrongREJECT dataset."""

    name: EvalName = EvalName.STRONG_REJECT_SMALL

    @override
    def load_strong_reject_prompts(self) -> list[str]:
        """Load the small version of the StrongReject dataset into an Arrow Dataset, and then return prompts.

        Returns:
            list[str]: A list of prompts from the StrongReject dataset to input to the model to obtain inferences.
        """
        strong_reject_dataset: ArrowDataset = (
            load_strong_reject_datasets.load_strongreject_small()
        )
        ...

and then we can specify "strongreject_small" in a .yaml file to select this evaluation.

tomtseng · 2025-10-29T06:21:55Z

src/safetunebed/whitebox/attacks/__init__.py

+)
+from safetunebed.whitebox.attacks.lora_finetune import lora_finetune as _
+from safetunebed.whitebox.attacks.multilingual_finetune import (
+    multilingual_finetune as _,  # noqa: F401


nit: why does only this import have noqa: F401?

tomtseng · 2025-11-06T01:22:43Z

src/safetunebed/whitebox/attacks/registry.py

+) -> Callable[[type[T]], type[T]]:
+    """Decorator to register an attack class and its config class under a name."""  # noqa: D401
+
+    def _decorator(attack_cls: type[T]) -> type[T]:


nit: I wonder if we can also make the decorator set attack_cls.name:

@register_attack(AttackName.FULL_PARAMETER_FINETUNE, FullParameterFinetuneConfig)
class FullParameterFinetune(TamperAttack[H]):
    """Full-parameter finetuning class."""

    name: AttackName = AttackName.FULL_PARAMETER_FINETUNE  # if the decorator can set the name then could remove this line where the user has to specify the name yet again

(doesn't have to be this PR, since I know making changes with stacked PRs can end up being confusing)
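
A rough sketch of a decorator that also sets the class attribute, so the name is written only once. The base classes are empty stand-ins and the generic parameter on `TamperAttack` is dropped for brevity; the tuple layout of the registry follows the `(config_cls, attack_cls)` assignment visible in the diff, and everything else is an assumption.

```python
from enum import Enum
from typing import Callable, TypeVar


class AttackName(str, Enum):
    FULL_PARAMETER_FINETUNE = "full_parameter_finetune"


class TamperAttackConfig:
    """Hypothetical stand-in for the config base class."""


class TamperAttack:
    """Hypothetical stand-in for the attack base class."""

    name: AttackName  # declared here, assigned by the decorator


T = TypeVar("T", bound=TamperAttack)

ATTACKS_REGISTRY: dict[
    AttackName, tuple[type[TamperAttackConfig], type[TamperAttack]]
] = {}


def register_attack(
    name: AttackName, config_cls: type[TamperAttackConfig]
) -> Callable[[type[T]], type[T]]:
    """Register an attack class and stamp its `name` attribute."""

    def _decorator(attack_cls: type[T]) -> type[T]:
        attack_cls.name = name  # the class body no longer has to repeat it
        ATTACKS_REGISTRY[name] = (config_cls, attack_cls)
        return attack_cls

    return _decorator


class FullParameterFinetuneConfig(TamperAttackConfig):
    """Hypothetical config for the example attack."""


@register_attack(AttackName.FULL_PARAMETER_FINETUNE, FullParameterFinetuneConfig)
class FullParameterFinetune(TamperAttack):
    """Full-parameter finetuning class."""
```

Declaring `name: AttackName` on the base class keeps the attribute visible to type checkers even though the value is assigned at decoration time.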

tomtseng · 2025-11-06T01:34:39Z

src/safetunebed/whitebox/attacks/base.py

+        if EvalName.STRONG_REJECT in self.attack_config.evals:
+            results = pl.concat([results, self.evaluate_strong_reject()])
+
+        if EvalName.STRONG_REJECT_SMALL in self.attack_config.evals:
+            results = pl.concat([results, self.evaluate_strong_reject_small()])
+
+        if EvalName.MMLU_PRO_VAL in self.attack_config.evals:
+            results = pl.concat([results, self.evaluate_mmlu_pro_val()])
+
+        if EvalName.MMLU_PRO_TEST in self.attack_config.evals:
+            results = pl.concat([results, self.evaluate_mmlu_pro_test()])


I wonder if we could make use of the registry somehow instead of having to manually update stuff over here as well whenever we add a new eval. So like, instead of having 4 if statements, we'd have something like,

for eval_name, run_eval in EVALUATION_REGISTRY.items():
    if eval_name in self.attack_config.evals:
        results = pl.concat([
            results,
            run_eval(
                model_checkpoint=self.output_checkpoint_path,
                ...
            ),
        ])
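
The loop above can be tried out in isolation with a simplified sketch. Two things are swapped in to keep it self-contained: the registry maps names to plain functions rather than evaluation classes, and lists of dict rows stand in for the polars DataFrames that `pl.concat` operates on. The eval functions return only identifying fields, not real scores.

```python
from enum import Enum
from typing import Callable


class EvalName(str, Enum):
    STRONG_REJECT = "strong_reject"
    MMLU_PRO_VAL = "mmlu_pro_val"


Row = dict[str, object]
EvalFn = Callable[[str], list[Row]]

# Hypothetical function-based registry; the PR registers classes instead.
EVALS_REGISTRY: dict[EvalName, EvalFn] = {}


def register_evaluation(name: EvalName) -> Callable[[EvalFn], EvalFn]:
    def _decorator(fn: EvalFn) -> EvalFn:
        EVALS_REGISTRY[name] = fn
        return fn

    return _decorator


@register_evaluation(EvalName.STRONG_REJECT)
def run_strong_reject(model_checkpoint: str) -> list[Row]:
    return [{"eval": EvalName.STRONG_REJECT.value, "checkpoint": model_checkpoint}]


@register_evaluation(EvalName.MMLU_PRO_VAL)
def run_mmlu_pro_val(model_checkpoint: str) -> list[Row]:
    return [{"eval": EvalName.MMLU_PRO_VAL.value, "checkpoint": model_checkpoint}]


def evaluate(model_checkpoint: str, requested: list[EvalName]) -> list[Row]:
    """Run every registered eval that the config asked for."""
    results: list[Row] = []
    for eval_name, run_eval in EVALS_REGISTRY.items():
        if eval_name in requested:
            results.extend(run_eval(model_checkpoint))
    return results


results = evaluate("ckpt", [EvalName.MMLU_PRO_VAL])
```

With this shape, adding a new eval means registering it in its own module; `evaluate` does not need another `if` branch.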

sdhossain · 2026-01-22T12:28:46Z

Closing as we merged in #53

changes to template

fecd77a

sdhossain commented Oct 7, 2025

sdhossain changed the title ~~changes to template~~ infra: changes to template Oct 7, 2025

tomtseng reviewed Oct 29, 2025

tomtseng mentioned this pull request Nov 6, 2025

attack: added finetune attacks (refactor) #33

Closed

tomtseng reviewed Nov 6, 2025

tomtseng mentioned this pull request Dec 19, 2025

defense: CTRL #48

Merged

sdhossain closed this Jan 22, 2026

sdhossain deleted the sh/td_templates branch February 5, 2026 22:36

