attack: added latent perturbation attack by psyonp · Pull Request #23 · criticalml-uw/TamperBench

psyonp · 2025-08-27T06:21:12Z

Hey,

Feel free to leave comments on the overall structure of the code and don't hold back from making any suggestions. Next steps involve creating tests as needed.

Most comments you'll see in the supporting files come directly from the source code, and provide clarity into how the latent perturbation attack functions.

Thanks,
Punya

sdhossain

@psyonp please make sure to rebase on main. My suggestion: pull the changes from main first (git pull origin main), then run git rebase main -X theirs.

also if you could please follow the template of #12 - that would be extremely helpful. (i.e. implementing it as an eval as well)

make sure to add some script as a sanity checker that we can run similarly as well.

src/safetunebed/whitebox/attacks/embedding_attack/embedding_attack.py

psyonp · 2025-08-27T23:36:08Z

Tried to rebase on main again and it said current branch is up to date. Template is now updated, lmk your thoughts.

sdhossain

Lgtm -> if the tests work and we run pre-commit checks, am ok with merging (but would like a bit more documentation on the ported code -> but think it's ok to not lint that)

src/safetunebed/whitebox/evals/latent_attack/latent_attack.py

sdhossain · 2025-08-28T21:11:33Z

src/tamperbench/whitebox/evals/latent_attack/hooks.py

+        outputs = self.module(*inputs, **kwargs)
+        return self.adversary(outputs)
+
+def deepspeed_add_hooks(


Idk if we are supporting deepspeed just yet, or transformer_lens -> we might not need these hooks? (imo might make sense to remove until we add support for them to avoid confusion / having redundant code, but don't have very strong opinions on it)

Yeah sounds good, kept in case we migrate to deepspeed on CAIS. Can remove as needed.
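For readers following the hook discussion, here is a minimal, framework-free sketch of the wrapper pattern in the snippet above: call the wrapped layer, then pass its output through the adversary. The class name AdversaryWrapper, the enabled flag, and the toy callables are assumptions for illustration; the real hooks.py wraps a torch.nn.Module and may differ.

```python
class AdversaryWrapper:
    """Hypothetical stand-in: run a layer, then apply an adversary to its output."""

    def __init__(self, module, adversary):
        self.module = module        # any callable standing in for a layer
        self.adversary = adversary  # callable that perturbs the layer output
        self.enabled = True         # assumed switch for bypassing the adversary

    def __call__(self, *inputs, **kwargs):
        outputs = self.module(*inputs, **kwargs)
        if not self.enabled:
            return outputs
        return self.adversary(outputs)


def double(x):
    """Toy layer."""
    return 2 * x


def add_perturbation(activation):
    """Toy adversary: shift the activation by a fixed amount."""
    return activation + 0.5


wrapped = AdversaryWrapper(double, add_perturbation)
perturbed = wrapped(3)   # layer output passed through the adversary
wrapped.enabled = False
clean = wrapped(3)       # adversary bypassed, plain layer output
```

Because the wrapper only needs a callable layer, the same pattern would not depend on deepspeed or transformer_lens, which may be relevant to the question of whether the framework-specific hooks are needed.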

sdhossain · 2025-08-28T21:12:54Z

src/safetunebed/whitebox/evals/latent_attack/hooks.py

+        else:
+            return self.module(*args, **kwargs)
+
+# a hook for transformer-lens models


does the method work for non transformer-lens models? (If it does, I think might make sense to use this for all for consistency?)

sdhossain · 2025-08-28T21:13:33Z

src/tamperbench/whitebox/evals/latent_attack/attack_class.py

+import einops
+
+# The attacker is parameterized by a low-rank MLP (not used by default)
+class LowRankAdversary(torch.nn.Module):


I see we are only using GDAdversary in the attack, do we want it to be configurable to use these?

Tried to keep as many helpers as needed. Can remove as needed.

sdhossain · 2025-08-28T21:15:08Z

src/safetunebed/whitebox/evals/latent_attack/attack_class.py

+        return self.m(x) + x
+
+
+# Standard projected gradient attack (used by default)


nit: can we have the comment as the doc-string describing the class
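A minimal sketch of what this nit asks for, assuming the comment sits directly above the GDAdversary class mentioned earlier in the review. The class body is omitted here, so this is only the shape of the change, not the ported code.

```python
# Before: the description is a comment above the class, invisible to help().
#
#   # Standard projected gradient attack (used by default)
#   class GDAdversary: ...


# After: the same sentence as the class docstring.
class GDAdversary:
    """Standard projected gradient attack (used by default)."""


description = GDAdversary.__doc__
```

As a docstring, the description shows up in help(GDAdversary) and satisfies docstring lint rules without changing behavior.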

sdhossain · 2025-08-28T21:15:48Z

src/safetunebed/whitebox/evals/latent_attack/attack_class.py

+
+            return x
+
+    def clip_attack(self):


nit: docstring (should probably run pre-commit hooks, which currently force us to doc-string methods)

Done, ran pre-commit hooks

remark: if code is copy-pasted directly from another source with minimal changes, then there's a case to be made that we shouldn't lint their code and should keep it as-is, so that

- it's easier to diff to find the logical changes between our version and theirs
- if the upstream source changes, it's easier to copy those changes into our version
- less work for us lol

if the upstream code hasn't been updated in a long time or we're making a decent number of changes then linting is reasonable

Linting this is a nightmare. I just got hit with 150+ errors because I accidentally ran ruff-check and it won't let me commit until I've resolved the comments. Trying to figure out the best way to approach this, will only stage the files necessary.

Edit: much better, do not run ruff check --fix

Agree in general we don't need to lint copy-pasted files (I've generally flagged such files to be ignored by pyright checks) -> we can do the same for ruff checks. -> we should definitely indicate this in the file header.

I personally liked including doc-strings, just so that it's easy to debug sometimes, and they could help potential users of our toolkit - but they aren't too important (especially if we are porting over a lot of files).

@psyonp - just so you don't have to debug a bunch of linting errors, you can add it to the excludes in our pyproject.toml file. (150+ unfixable errors is a bit odd with our rule-set)

exclude = [ ".venv", "**/__pycache__", "_legacy", "src/safetunebed/whitebox/evals/embedding_attack/softopt.py", "path/to/your/file" ]

main changes addressed

psyonp · 2025-08-29T22:21:28Z

Hey, made the requested changes. @tomtseng feel free to look it over and lmk your thoughts

tomtseng

Took a skim and it looked reasonable to me — will defer to Saad here since he's already taken a serious look at this PR

sdhossain

lgtm - didn't look too closely at the ported over files (only that we should mention this in the doc-strings / file headers)

I think it's good to merge once the tests pass

@psyonp - if the linting is a bit too messy, I can fix it up in a followup cosmetic PR

tests/evals/test_latent_attack_eval.py

tests/attacks/test_latent_attack.py

src/safetunebed/whitebox/evals/latent_attack/latent_attack.py

sdhossain · 2025-08-29T23:20:58Z

src/safetunebed/whitebox/evals/latent_attack/latent_attack.py

+        for row in tqdm(dataset, total=len(dataset)):
+            prompt_text = row["Goal"]
+
+            inputs = tokenizer([prompt_text], return_tensors="pt", padding=True).to(


curious if we could do .cuda() here? I'm not fully sure - just want to make sure we are able to still support multi-GPU (if we use device = "auto")

^ not suggesting a change - just a question, perhaps leave a comment unless you have a definitive answer for this?

sdhossain · 2025-08-29T23:39:27Z

src/safetunebed/whitebox/evals/latent_attack/latent_attack.py

+                device=model.device,
+            )
+
+            parent_path, attr = self.eval_config.attack_layer.rsplit(".", 1)


transformer.h.0.mlp

nit: can we rename parent_path to something like parent_layer (path could imply it's a file path)
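To illustrate what the rsplit line does with a value like transformer.h.0.mlp, here is a small sketch using plain Python objects instead of a real model. The helpers get_by_dotted_path and swap_layer are hypothetical names, not functions from this PR, and the sketch assumes the last path component is an attribute name rather than a list index.

```python
from types import SimpleNamespace


def get_by_dotted_path(root, dotted_path):
    """Walk a dotted path, treating numeric parts as list indices."""
    obj = root
    for part in dotted_path.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj


def swap_layer(root, attack_layer, make_replacement):
    """Replace the layer named by attack_layer and return the original."""
    # "transformer.h.0.mlp" -> parent_layer="transformer.h.0", attr="mlp"
    parent_layer, attr = attack_layer.rsplit(".", 1)
    parent = get_by_dotted_path(root, parent_layer)
    original = getattr(parent, attr)
    setattr(parent, attr, make_replacement(original))
    return original


# Toy object tree standing in for a model; strings stand in for layers.
model = SimpleNamespace(
    transformer=SimpleNamespace(h=[SimpleNamespace(mlp="original-mlp")])
)
original = swap_layer(model, "transformer.h.0.mlp", lambda layer: f"wrapped({layer})")
```

The split separates the object that owns the layer from the attribute name to overwrite, which is why parent_layer reads more naturally than parent_path.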

sdhossain · 2025-08-29T23:41:27Z

src/safetunebed/whitebox/evals/latent_attack/attack_class.py

@@ -0,0 +1,221 @@
+"""Standard imports."""


Could we add a link to the file where most of this code is sourced from?

Sounds good, done

sdhossain · 2025-08-29T23:41:57Z

src/safetunebed/whitebox/evals/latent_attack/hooks.py

@@ -0,0 +1,207 @@
+"""Hooks for the latent attack evaluation."""


same here - can we add a link to where the code for this file is sourced from?

src/safetunebed/whitebox/attacks/latent_attack/latent_attack.py

Co-authored-by: Saad Hossain <68381793+sdhossain@users.noreply.github.com>

tomtseng · 2026-02-07T00:02:07Z

I rebased and resolved merge conflicts with main, but the integration test is not passing — StrongREJECT score is low.
Claude Code thinks this is because there's no optimization loop in the code: the adversary parameters are never updated, so the perturbation never does anything.
This PR is not a priority to merge so no need to address this if Claude Code is correct that there's a major implementation piece that's missing
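For context on what an "optimization loop" would mean here, below is a toy, framework-free sketch of projected gradient descent on a perturbation vector. It is not the PR's code or the upstream implementation, and it does not confirm the diagnosis above. The linear loss, the L2 projection, and all names (run_pgd, project_l2, attack_loss) are assumptions chosen to keep the example self-contained.

```python
import math


def project_l2(delta, epsilon):
    """Scale delta back onto the L2 ball of radius epsilon if it lies outside."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon:
        return list(delta)
    return [d * epsilon / norm for d in delta]


def attack_loss(latent, delta, target_direction):
    """Toy loss: negative alignment of the perturbed latent with a target direction."""
    return -sum((h + d) * t for h, d, t in zip(latent, delta, target_direction))


def loss_gradient(target_direction):
    """Gradient of attack_loss w.r.t. delta (constant for this linear toy loss)."""
    return [-t for t in target_direction]


def run_pgd(latent, target_direction, epsilon, learning_rate, steps):
    """Start from a zero perturbation, then repeat: gradient step, project."""
    delta = [0.0] * len(latent)
    for _ in range(steps):
        grad = loss_gradient(target_direction)
        delta = [d - learning_rate * g for d, g in zip(delta, grad)]
        delta = project_l2(delta, epsilon)
    return delta


latent = [1.0, 0.0]
target = [0.0, 1.0]
no_updates = run_pgd(latent, target, epsilon=0.5, learning_rate=0.1, steps=0)
optimized = run_pgd(latent, target, epsilon=0.5, learning_rate=0.1, steps=20)
```

With steps=0 the perturbation stays at zero and the loss is unchanged, which mirrors the failure mode described above: an adversary whose parameters are never updated cannot affect the model's output.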

psyonp requested a review from tomtseng August 27, 2025 06:21

psyonp added the attack Adds or modifies attacks label Aug 27, 2025

sdhossain previously requested changes Aug 27, 2025

sdhossain self-requested a review August 28, 2025 00:19

sdhossain reviewed Aug 28, 2025

sdhossain requested review from sdhossain and removed request for sdhossain August 28, 2025 21:19

tomtseng reviewed Aug 29, 2025

psyonp requested a review from sdhossain August 29, 2025 23:21

sdhossain approved these changes Aug 29, 2025

psyonp and others added 13 commits February 6, 2026 14:48

Migrate hooks.py

77b1d4d

Change setup

16d8683

Refactor latent_attack

3846231

Refactor as requested with eval

7326717

Add tests

dc83e0d

Made requested changes for nits and docstrings

30e87da

Update tests/evals/test_latent_attack_eval.py

02ef53b

Co-authored-by: Saad Hossain <68381793+sdhossain@users.noreply.github.com>

Update src/safetunebed/whitebox/evals/latent_attack/latent_attack.py

97f00b9

Co-authored-by: Saad Hossain <68381793+sdhossain@users.noreply.github.com>

Update src/safetunebed/whitebox/attacks/latent_attack/latent_attack.py

6f0774d

Co-authored-by: Saad Hossain <68381793+sdhossain@users.noreply.github.com>

Update tests/attacks/test_latent_attack.py

432af6c

Co-authored-by: Saad Hossain <68381793+sdhossain@users.noreply.github.com>

Add link for ported code

2b9dabf

Add link to hooks.py

c4019d8

Name change of parent_path to parent_layer

408b9bd

tomtseng force-pushed the psyonp/latent_attack branch from 574b238 to 408b9bd Compare February 6, 2026 22:49

tomtseng added 3 commits February 6, 2026 14:54

Merge branch 'main' into psyonp/latent_attack

d99cc89

evals/latent_attack: Lint

31bf8a9

latent_attack: Fix type errors

eda2fc1

latent_attack: Fix merge errors

5b9c5e4

latent_attack: Remove redundant test

cc516cf
