
attack: added Wanda Pruning (attack) #26

Open
esveee wants to merge 2 commits into main from esveee/wanda-final

Conversation

Collaborator

@esveee esveee commented Aug 29, 2025

Changes

Added Wanda pruning as an attack paradigm. Pruning is normally benign (done for efficiency), but applied maliciously it can serve as a form of model tampering.
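For context, Wanda (Sun et al., arXiv:2306.11695) scores each weight by |W_ij| * ||X_j||_2 (weight magnitude times the norm of the corresponding input activation) and zeroes the lowest-scoring weights within each output row. A minimal pure-Python sketch of that per-row criterion (illustrative only, not the PR's implementation):

```python
def wanda_prune_row(weights, act_norms, sparsity=0.5):
    """Zero the lowest-scoring weights in one output row.

    weights: weights for one output neuron (one row of W).
    act_norms: per-input-feature activation norms ||X_j||_2.
    sparsity: fraction of weights in the row to zero out.
    """
    # Wanda importance score: |W_ij| * ||X_j||_2
    scores = [abs(w) * n for w, n in zip(weights, act_norms)]
    k = int(len(weights) * sparsity)  # number of weights to prune
    # Indices of the k lowest-importance weights in this row
    prune_idx = sorted(range(len(scores)), key=scores.__getitem__)[:k]
    pruned = list(weights)
    for i in prune_idx:
        pruned[i] = 0.0
    return pruned
```

Note the per-row comparison group: unlike plain magnitude pruning, a large weight feeding a near-zero activation can still be pruned.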

Testing

[Test in Progress]


@tomtseng tomtseng left a comment


Looks like most of this is code copied from elsewhere, which I won't carefully review.
Tagging Saad to review the remaining file src/safetunebed/whitebox/attacks/wanda_pruning/wanda_pruning.py


can add a citation and link to the original code here, like punya did in his PR: src/safetunebed/whitebox/attacks/gcg/__init__.py

@tomtseng tomtseng requested review from sdhossain and tomtseng August 29, 2025 20:58

@sdhossain sdhossain left a comment


@esveee could you please add a test script we can run, similar to what we have in our tests folder currently? Also add a custom config if it's relevant here.

Otherwise LGTM (I also didn't look too closely at the ported-over code; would recommend adding the citation + link to the source code, as is done with other attacks).

def run_attack(self) -> None:
    cfg = self.attack_config

    print(f"[WandA] Loading model from: {cfg.base_input_checkpoint_path}")

I know we don't have unified logging logic for the repository just yet, but I do think we should use the standard logging module (logging.getLogger) rather than bare print() so that we can control the logging level.

Not necessarily something to scope for this PR.
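A minimal sketch of what that switch could look like (the helper name is illustrative, not from this PR):

```python
import logging

# Module-level logger; the level can then be controlled centrally,
# e.g. logging.basicConfig(level=logging.INFO) at the entry point.
logger = logging.getLogger(__name__)


def load_model_message(checkpoint_path: str) -> str:
    # Replaces the bare print() call with a leveled log record.
    message = f"[WandA] Loading model from: {checkpoint_path}"
    logger.info(message)
    return message
```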

cfg = self.attack_config

print(f"[WandA] Loading model from: {cfg.base_input_checkpoint_path}")
model = AutoModelForCausalLM.from_pretrained(cfg.base_input_checkpoint_path, torch_dtype=torch.float16)

Suggestion: torch_dtype=torch.float16 -> torch_dtype=torch.bfloat16 (only relevant if we are hitting numerical errors here; bfloat16 keeps float32's exponent range, so it is less prone to overflow than float16).

"""Implements weight-space tampering via WandA pruning."""

def run_attack(self) -> None:
cfg = self.attack_config

nit: our current style is to use config instead of cfg; I personally prefer that we use self.attack_config explicitly where we use it (so that it is not ambiguous with other configs)

StrongRejectEvaluationConfig,
)

class WandaPruningAttack(TamperAttack[TamperAttackConfig]):

do we need a custom WandaPruningAttackConfig for this attack?
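If a dedicated config is wanted, it might look something like the hypothetical sketch below (all names and fields are illustrative assumptions, not from this PR; a plain dataclass stands in for the repository's real TamperAttackConfig):

```python
from dataclasses import dataclass


@dataclass
class TamperAttackConfig:
    # Stand-in for the repository's actual base config.
    base_input_checkpoint_path: str = ""


@dataclass
class WandaPruningAttackConfig(TamperAttackConfig):
    # Hypothetical Wanda-specific knobs.
    sparsity_ratio: float = 0.5  # fraction of weights to prune
    nsamples: int = 128          # calibration samples for activation norms
    prune_n: int = 0             # N in N:M structured sparsity (0 = unstructured)
    prune_m: int = 0             # M in N:M structured sparsity


cfg = WandaPruningAttackConfig(sparsity_ratio=0.5)
```

Subclassing keeps the attack usable wherever a TamperAttackConfig is expected while making the pruning hyperparameters explicit and typed.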

@@ -0,0 +1,398 @@
import time

can we add a note on where the code was sourced from in the header docstring? (that is, if it was sourced externally; ping me if it wasn't)
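For example, a sourcing note in the module docstring could look like this (the citation below is the actual Wanda paper and its official code release, included here as a plausible attribution target):

```python
"""Wanda pruning attack.

Pruning logic adapted from the reference implementation of:
    Sun et al., "A Simple and Effective Pruning Approach for Large
    Language Models", arXiv:2306.11695.
    Code: https://github.com/locuslab/wanda
"""
```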

@sdhossain sdhossain added the attack Adds or modifies attacks label Dec 1, 2025