Skip to content

attack: added latent perturbation attack#23

Open
psyonp wants to merge 18 commits intomainfrom
psyonp/latent_attack
Open

attack: added latent perturbation attack#23
psyonp wants to merge 18 commits intomainfrom
psyonp/latent_attack

Conversation

@psyonp
Copy link
Collaborator

@psyonp psyonp commented Aug 27, 2025

Hey,

Feel free to leave comments on the overall structure of the code and don't hold back from making any suggestions. Next steps involve creating tests as needed.

Most comments you'll see in the supporting files come directly from the source code, and provide clarity into how the latent perturbation attack functions.

Thanks,
Punya

@psyonp psyonp requested a review from tomtseng August 27, 2025 06:21
@psyonp psyonp added the attack Adds or modifies attacks label Aug 27, 2025
Copy link
Collaborator

@sdhossain sdhossain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@psyonp please make sure to rebase on main, my suggestion is git rebase main -X theirs after pulling changes from main git pull origin main

also if you could please follow the template of #12 - that would be extremely helpful. (i.e. implementing it as an eval as well)

make sure to add some script as a sanity checker that we can run similarly as well.

@psyonp
Copy link
Collaborator Author

psyonp commented Aug 27, 2025

@psyonp please make sure to rebase on main, my suggestion is git rebase main -X theirs after pulling changes from main git pull origin main

also if you could please follow the template of #12 - that would be extremely helpful. (i.e. implementing it as an eval as well)

make sure to add some script as a sanity checker that we can run similarly as well.

Tried to rebase on main again and it said current branch is up to date. Template is now updated, lmk your thoughts.

@sdhossain sdhossain self-requested a review August 28, 2025 00:19
Copy link
Collaborator

@sdhossain sdhossain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm -> if the tests work and we run pre-commit checks, am ok with merging (but would like a bit more documentation on the ported code -> but think it's ok to not lint that)

outputs = self.module(*inputs, **kwargs)
return self.adversary(outputs)

def deepspeed_add_hooks(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idk if we are supporting deepspeed just yet, or transformer_lens -> we might not need these hooks? (imo might make sense to remove until we add support for them to avoid confusion / having redundant code, but don't have very strong opinions on it)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds good, kept in case if we migrate to deepspeed on CAIS. Can remove as needed.

else:
return self.module(*args, **kwargs)

# a hook for transformer-lens models
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the method work for non transformer-lens models? (If it does, I think might make sense to use this for all for consistency?)

import einops

# The attacker is parameterized by a low-rank MLP (not used by default)
class LowRankAdversary(torch.nn.Module):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see we are only using GDAdversary in the attack, do we want it to be configurable to use these?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to keep as many helpers as needed. Can remove as needed.

return self.m(x) + x


# Standard projected gradient attack (used by default)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we have the comment as the doc-string describing the class

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


return x

def clip_attack(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: docstring (should probably run pre-commit hooks (currently forces us to doc-string methods)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, ran pre-commit hooks

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remark: if code is copy-pasted directly from another source with minimal changes , then there's a case to be made that we shouldn't lint their code, we should keep it as-is so that

  1. it's easier to diff to find the logical changes between our version and theirs
  2. if the upstream source changes, it's easier to copy those changes into our version
  3. less work for us lol

if the upstream code hasn't been updated in a long time or we're making a decent number of changes then linting is reasonable

Copy link
Collaborator Author

@psyonp psyonp Aug 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linting this is a nightmare. I just got hit with 150+ errors because I accidentally ran ruff-check and it won't let me commit until I've resolved the comments. Trying to figure out the best way to approach this, will only stage the files necessary.

Edit: much better, do not run ruff check --fix

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree in general we don't need to lint for copy-pasted files, (I've generally done flagged the files to be ignored from pyright checks) -> we can do the same for ruff checks. -> we should definitely indicate this in the file header.

I personally liked including doc-strings. just so that it's easy to debug some times, and could help potential users of our toolkit - but aren't too important (especially if we are porting over a lot of files).

@psyonp - just so you don't have to debug a bunch of linting errors, you can add it to the excludes in our pyproject.toml file. (150+ unfixable errors is a bit odd with our rule-set)

exclude = [
  ".venv",
  "**/__pycache__",
  "_legacy",
  "src/safetunebed/whitebox/evals/embedding_attack/softopt.py",
  "path/to/your/file"
]

@sdhossain sdhossain dismissed their stale review August 28, 2025 21:19

main changes addressed

@sdhossain sdhossain requested review from sdhossain and removed request for sdhossain August 28, 2025 21:19
@psyonp
Copy link
Collaborator Author

psyonp commented Aug 29, 2025

Hey, made the requested changes. @tomtseng feel free to look it over and lmk your thoughts

Copy link
Collaborator

@tomtseng tomtseng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomtseng feel free to look it over and lmk your thoughts

Took a skim and it looked reasonable to me — will defer to Saad here since he's already taken a serious look at this PR

@psyonp psyonp requested a review from sdhossain August 29, 2025 23:21
Copy link
Collaborator

@sdhossain sdhossain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - didn't look too closely at the ported over files (only that we should mention this in the doc-strings / file headers)

I think it's good to merge once the tests pass

@psyonp - if the linting is a bit too messy, I can fix it up in a followup cosmetic PR

for row in tqdm(dataset, total=len(dataset)):
prompt_text = row["Goal"]

inputs = tokenizer([prompt_text], return_tensors="pt", padding=True).to(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if we could do .cuda() here? I'm not fully sure - just want to make sure we are able to still support multi-GPU (if we use device = "auto")

^ not suggestion a change - just a question, perhaps leave a comment unless you have a definitive answer for this?

device=model.device,
)

parent_path, attr = self.eval_config.attack_layer.rsplit(".", 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

transformer.h.0.mlp

nit: can we rename parent_path to something like parent_layer (path could imply it's a file path)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, done

@@ -0,0 +1,221 @@
"""Standard imports."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a link to the file where most of this code is sourced from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, done

@@ -0,0 +1,207 @@
"""Hooks for the latent attack evaluation."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here - can we add a link to where the code for this file is sourced from?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tomtseng tomtseng force-pushed the psyonp/latent_attack branch from 574b238 to 408b9bd Compare February 6, 2026 22:49
@tomtseng
Copy link
Collaborator

tomtseng commented Feb 7, 2026

I rebased and resolved merge conflicts with main, but the integration test is not passing — StrongREJECT score is low.
Claude Code thinks this is because there's no optimization loop in the code, the adversary parameters are never updated so the perturbation never does anything.
This PR is not a priority to merge so no need to address this if Claude Code is correct that there's a major implementation piece that's missing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

attack Adds or modifies attacks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants