Conversation
|
|
||
| Import modules for side effects so they register via the attacks registry. | ||
| """ | ||
|
|
There was a problem hiding this comment.
Note: some of these imports don't exist yet because I've retroactively split it up.
| model_config_dict = data.pop("model_config") # pyright: ignore[reportAny] | ||
| model_config = ModelConfig.from_dict(model_config_dict) # pyright: ignore[reportAny] | ||
|
|
||
| data.update({"model_config": model_config}) |
There was a problem hiding this comment.
More simply I guess this could be
data["model_config"] = ModelConfig.from_dict(data["model_config"])? Also it looks like this modifies data in place — might not be desirable, should at least be documented as a side effect but might be better to do a deep clone
| def delete_output_checkpoint(self) -> None: | ||
| """Delete the tampered model checkpoint if it exists.""" | ||
| if Path(self.output_checkpoint_path).exists(): | ||
| import shutil |
There was a problem hiding this comment.
import shutil at top of file? it shouldn't be an expensive import
| import shutil | ||
|
|
||
| shutil.rmtree(self.output_checkpoint_path) | ||
| Path(self.output_checkpoint_path).mkdir(parents=True, exist_ok=False) |
There was a problem hiding this comment.
why re-create the directory after deleting?
|
|
||
|
|
||
| def register_attack( | ||
| name: AttackName, config_cls: type[H] |
There was a problem hiding this comment.
Not sure H is necessary, it's not re-used several times like T
| name: AttackName, config_cls: type[H] | |
| name: AttackName, config_cls: TamperAttackConfig |
| # <|im_start|>assistant\n{assistant_text}<|im_end|> | ||
| return TextTemplate( | ||
| user_prefix="<|im_start|>user\n", | ||
| assistant_prefix="<|im_start|>assistant\n", |
There was a problem hiding this comment.
as mentioned in the jailbreak tuning PR: for Qwen3, may want to also turn off thinking with <think> </think>, but does not apply to older Qwen models.
Print the output of
tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Turns off thinking by adding <think></think>
)to see what it looks like
| Raises: | ||
| KeyError: If the requested template has not been registered. | ||
| """ | ||
| key = TemplateName(name) if not isinstance(name, TemplateName) else name |
There was a problem hiding this comment.
pedantic preference but I'd find one of these easier to read:
TemplateName(name) if isinstance(name, str) else namename if isinstance(name, TemplateName) else TemplateName(name)| ATTACKS_REGISTRY[name] = (config_cls, attack_cls) | ||
| return attack_cls | ||
|
|
||
| return _decorator |
There was a problem hiding this comment.
doesn't look like ATTACKS_REGISTRY or register_attack is actually used, is that forthcoming in a subsequent PR?
There was a problem hiding this comment.
Yeah it'd be used in forthcoming PRs where we do sweeps.
| """Decorator to register a template factory by name.""" # noqa: D401 | ||
|
|
||
| def _decorator(factory: Callable[[], TextTemplate]) -> Callable[[], TextTemplate]: | ||
| _TEMPLATE_REGISTRY[name] = factory() |
There was a problem hiding this comment.
I kinda wonder if the registry pattern is useful/necessary for this case. Like, it looks equally valid to just define everything in-line so we're not doing this lazy evaluation thing where we define a function and then immediately call it to populate the template registry. (This feels a bit different from the attack/eval registries because here we're already just populating this registry here in this file, unlike attacks/evals where we're defining each attack & eval in separate files)
_TEMPLATE_REGISTRY: dict[TemplateName, TextTemplate] = {
TemplateName.LLAMA3: TextTemplate(
user_prefix="<|start_header_id|>user<|end_header_id|>\n\n",
assistant_prefix="<|start_header_id|>assistant<|end_header_id|>\n\n",
end_turn="<|eot_id|>",
),
# ... more templates inline
}
def get_template(name: str | TemplateName) -> TextTemplate:
# ... same lookup logicWe could still have a function register_template(name: TemplateName, template: TextTemplate) doing _TEMPLATE_REGISTRY[name] = template if users want to add new templates at runtime.
There was a problem hiding this comment.
It kind of makes it easier when using configs (i.e. you can specify in the .yaml file that we use a specific template).
There was a problem hiding this comment.
Hmm so my understanding is that as long as _TEMPLATE_REGISTRY is populated in some way, then get_template() should be able to fetch specific templates, and I'm suggesting that _TEMPLATE_REGISTRY can be populated in a more direct way rather than making decorators do it.
So I guess what I'm actually suggesting is not to get rid of the registry pattern but rather to consider skipping using the decorator pattern.
| EVALS_REGISTRY[name] = eval_cls | ||
| return eval_cls | ||
|
|
||
| return _decorator |
There was a problem hiding this comment.
similarly to attacks, I don't see this registry or @register_evaluation used anywhere, is it going to be a future PR?
There was a problem hiding this comment.
@register_evaluation(EvalName.STRONG_REJECT_SMALL)
class StrongRejectSmallEvaluation(StrongRejectEvaluation[S]):
"""StrongREJECT Evaluation class using a small version of the StrongREJECT dataset."""
name: EvalName = EvalName.STRONG_REJECT_SMALL
@override
def load_strong_reject_prompts(self) -> list[str]:
"""Load the small version of the StrongReject dataset into an Arrow Dataset, and then return prompts.
Returns:
list[str]: A list of prompts from the StrongReject dataset to input to the model to obtain inferences.
"""
strong_reject_dataset: ArrowDataset = (
load_strong_reject_datasets.load_strongreject_small()
)
...
and then we can specify the "strongreject_small" in a .yaml file to refer to the usage of this.
| ) | ||
| from safetunebed.whitebox.attacks.lora_finetune import lora_finetune as _ | ||
| from safetunebed.whitebox.attacks.multilingual_finetune import ( | ||
| multilingual_finetune as _, # noqa: F401 |
There was a problem hiding this comment.
nit: why does only this import have noqa: F401?
| ) -> Callable[[type[T]], type[T]]: | ||
| """Decorator to register an attack class and its config class under a name.""" # noqa: D401 | ||
|
|
||
| def _decorator(attack_cls: type[T]) -> type[T]: |
There was a problem hiding this comment.
nit: I wonder if we can also make the decorator set attack_cls.name:
@register_attack(AttackName.FULL_PARAMETER_FINETUNE, FullParameterFinetuneConfig)
class FullParameterFinetune(TamperAttack[H]):
"""Full-parameter finetuning class."""
name: AttackName = AttackName.FULL_PARAMETER_FINETUNE # if the decorator can set the name then could remove this line where the user has to specify the name yet again
(doesn't have to be this PR, since I know making changes with stacked PRs can end up being confusing)
| if EvalName.STRONG_REJECT in self.attack_config.evals: | ||
| results = pl.concat([results, self.evaluate_strong_reject()]) | ||
|
|
||
| if EvalName.STRONG_REJECT_SMALL in self.attack_config.evals: | ||
| results = pl.concat([results, self.evaluate_strong_reject_small()]) | ||
|
|
||
| if EvalName.MMLU_PRO_VAL in self.attack_config.evals: | ||
| results = pl.concat([results, self.evaluate_mmlu_pro_val()]) | ||
|
|
||
| if EvalName.MMLU_PRO_TEST in self.attack_config.evals: | ||
| results = pl.concat([results, self.evaluate_mmlu_pro_test()]) |
There was a problem hiding this comment.
I wonder if we could make use of the registry somehow instead of having to manually update stuff over here as well whenever we add a new eval. So like, instead of having 4 if statements, we'd have something like,
for eval_name, run_eval in EVALUATION_REGISTRY.items():
if eval_name in self.attack_config.evals:
results = pl.concat([results, run_eval(
model_checkpoint=self.output_checkpoint_path,
...
)
])|
Closing as we merged in #53 |
Changes
Main change is that I added a registry for attacks, and evals so that we can add an attack by doing the following:
^ and the same pattern for evals.
This would allow this to be used as a package so that people could
pip install tamperbenchand then perhaps register an attack and run experiments on their end.It also makes it a bit cleaner to do benchmarking later and add things, rather than having to populate dirs.
Testing
ICLR runs.