Skip to content

infra: changes to template#32

Closed
sdhossain wants to merge 1 commit intosh/td_startfrom
sh/td_templates
Closed

infra: changes to template#32
sdhossain wants to merge 1 commit intosh/td_startfrom
sh/td_templates

Conversation

@sdhossain
Copy link
Collaborator

@sdhossain sdhossain commented Oct 7, 2025

Changes

Main change is that I added a registry for attacks, and evals so that we can add an attack by doing the following:

@register_attack(AttackName.SOME_ATTACK)
Class SomeAttack(TamperAttack):
    """This is an attack implemented in SafeTuneBed"""
    ... rest of implementation

^ and the same pattern for evals.

This would allow this to be used as a package so that people could pip install tamperbench and then perhaps register an attack and run experiments on their end.

It also makes it a bit cleaner to do benchmarking later and add things, rather than having to populate dirs.

Testing

ICLR runs.


Import modules for side effects so they register via the attacks registry.
"""

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: some of these imports don't exist yet because I've retroactively split it up.

@sdhossain sdhossain changed the title changes to template infra: changes to template Oct 7, 2025
Comment on lines +68 to +71
model_config_dict = data.pop("model_config") # pyright: ignore[reportAny]
model_config = ModelConfig.from_dict(model_config_dict) # pyright: ignore[reportAny]

data.update({"model_config": model_config})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More simply I guess this could be
data["model_config"] = ModelConfig.from_dict(data["model_config"])? Also it looks like this modifies data in place — might not be desirable, should at least be documented as a side effect but might be better to do a deep clone

def delete_output_checkpoint(self) -> None:
"""Delete the tampered model checkpoint if it exists."""
if Path(self.output_checkpoint_path).exists():
import shutil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import shutil at top of file? it shouldn't be an expensive import

import shutil

shutil.rmtree(self.output_checkpoint_path)
Path(self.output_checkpoint_path).mkdir(parents=True, exist_ok=False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why re-create the directory after deleting?



def register_attack(
name: AttackName, config_cls: type[H]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure H is necessary, it's not re-used several times like T

Suggested change
name: AttackName, config_cls: type[H]
name: AttackName, config_cls: TamperAttackConfig

# <|im_start|>assistant\n{assistant_text}<|im_end|>
return TextTemplate(
user_prefix="<|im_start|>user\n",
assistant_prefix="<|im_start|>assistant\n",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as mentioned in the jailbreak tuning PR: for Qwen3, may want to also turn off thinking with <think> </think>, but does not apply to older Qwen models.

Print the output of

tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Turns off thinking by adding <think></think>
)

to see what it looks like

Raises:
KeyError: If the requested template has not been registered.
"""
key = TemplateName(name) if not isinstance(name, TemplateName) else name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pedantic preference but I'd find one of these easier to read:

TemplateName(name) if isinstance(name, str) else name
name if isinstance(name, TemplateName) else TemplateName(name)

ATTACKS_REGISTRY[name] = (config_cls, attack_cls)
return attack_cls

return _decorator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't look like ATTACKS_REGISTRY or register_attack is actually used, is that forthcoming in a subsequent PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it'd be used in forthcoming PRs where we do sweeps.

"""Decorator to register a template factory by name.""" # noqa: D401

def _decorator(factory: Callable[[], TextTemplate]) -> Callable[[], TextTemplate]:
_TEMPLATE_REGISTRY[name] = factory()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda wonder if the registry pattern is useful/necessary for this case. Like, it looks equally valid to just define everything in-line so we're not doing this lazy evaluation thing where we define a function and then immediately call it to populate the template registry. (This feels a bit different from the attack/eval registries because here we're already just populating this registry here in this file, unlike attacks/evals where we're defining each attack & eval in separate files)

  _TEMPLATE_REGISTRY: dict[TemplateName, TextTemplate] = {
      TemplateName.LLAMA3: TextTemplate(
          user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
          assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
          end_turn="<|eot_id|>",
      ),
      # ... more templates inline
  }

  def get_template(name: str | TemplateName) -> TextTemplate:
      # ... same lookup logic

We could still have a function register_template(name: TemplateName, template: TextTemplate) doing _TEMPLATE_REGISTRY[name] = template if users want to add new templates at runtime.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It kind of makes it easier when using configs (i.e. you can specify in the .yaml file that we use a specific template).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm so my understanding is that as long as _TEMPLATE_REGISTRY is populated in some way, then get_template() should be able to fetch specific templates, and I'm suggesting that _TEMPLATE_REGISTRY can be populated in a more direct way rather than making decorators do it.

So I guess what I'm actually suggesting is not to get rid of the registry pattern but rather to consider skipping using the decorator pattern.

EVALS_REGISTRY[name] = eval_cls
return eval_cls

return _decorator
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly to attacks, I don't see this registry or @register_evaluation used anywhere, is it going to be a future PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I believe @register_evaluation will be added over in #35 and more specificially in: here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@register_evaluation(EvalName.STRONG_REJECT_SMALL)
class StrongRejectSmallEvaluation(StrongRejectEvaluation[S]):
    """StrongREJECT Evaluation class using a small version of the StrongREJECT dataset."""

    name: EvalName = EvalName.STRONG_REJECT_SMALL

    @override
    def load_strong_reject_prompts(self) -> list[str]:
        """Load the small version of the StrongReject dataset into an Arrow Dataset, and then return prompts.

        Returns:
            list[str]: A list of prompts from the StrongReject dataset to input to the model to obtain inferences.
        """
        strong_reject_dataset: ArrowDataset = (
            load_strong_reject_datasets.load_strongreject_small()
        )
        ...

and then we can specify the "strongreject_small" in a .yaml file to refer to the usage of this.

)
from safetunebed.whitebox.attacks.lora_finetune import lora_finetune as _
from safetunebed.whitebox.attacks.multilingual_finetune import (
multilingual_finetune as _, # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why does only this import have noqa: F401?

) -> Callable[[type[T]], type[T]]:
"""Decorator to register an attack class and its config class under a name.""" # noqa: D401

def _decorator(attack_cls: type[T]) -> type[T]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if we can also make the decorator set attack_cls.name:

@register_attack(AttackName.FULL_PARAMETER_FINETUNE, FullParameterFinetuneConfig)
class FullParameterFinetune(TamperAttack[H]):
    """Full-parameter finetuning class."""
    name: AttackName = AttackName.FULL_PARAMETER_FINETUNE  # if the decorator can set the name then could remove this line where the user has to specify the name yet again

(doesn't have to be this PR, since I know making changes with stacked PRs can end up being confusing)

Comment on lines +159 to +169
if EvalName.STRONG_REJECT in self.attack_config.evals:
results = pl.concat([results, self.evaluate_strong_reject()])

if EvalName.STRONG_REJECT_SMALL in self.attack_config.evals:
results = pl.concat([results, self.evaluate_strong_reject_small()])

if EvalName.MMLU_PRO_VAL in self.attack_config.evals:
results = pl.concat([results, self.evaluate_mmlu_pro_val()])

if EvalName.MMLU_PRO_TEST in self.attack_config.evals:
results = pl.concat([results, self.evaluate_mmlu_pro_test()])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could make use of the registry somehow instead of having to manually update stuff over here as well whenever we add a new eval. So like, instead of having 4 if statements, we'd have something like,

for eval_name, run_eval in EVALUATION_REGISTRY.items():
	if eval_name in self.attack_config.evals:
		results = pl.concat([results, run_eval(
    		model_checkpoint=self.output_checkpoint_path,
			...
    	)
    ])

@tomtseng tomtseng mentioned this pull request Dec 19, 2025
@sdhossain
Copy link
Collaborator Author

Closing as we merged in #53

@sdhossain sdhossain closed this Jan 22, 2026
@sdhossain sdhossain deleted the sh/td_templates branch February 5, 2026 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants