attacks + evals: minor refactor for strongreject and embedding (review not important rn)#35

Closed
sdhossain wants to merge 3 commits into main from sh/td_sr_embedding

Conversation

@sdhossain
Collaborator

Changes

Small refactor for the StrongREJECT and embedding attacks, but no logic changed, so probably no need for review.

Testing

ICLR submission.

@sdhossain sdhossain changed the base branch from main to sh/td_templates October 7, 2025 19:33
@tomtseng
Collaborator

Same Q as #33: How does the diff work here? I know these files already exist on main, so why is the diff showing these files as entirely new files rather than showing lines added/removed? I think seeing that diff would make this easier to review

@sdhossain sdhossain changed the base branch from sh/td_templates to main November 2, 2025 15:33
@sdhossain sdhossain changed the base branch from main to sh/td_templates November 2, 2025 15:37
@sdhossain sdhossain changed the base branch from sh/td_templates to main November 4, 2025 19:58
def from_dict(cls, data: dict[str, Any]) -> Self:  # pyright: ignore[reportExplicitAny]
    """Construct config from a dictionary."""

def serialize_data(cls, data: dict[str, Any]) -> dict[str, Any]:  # pyright: ignore[reportExplicitAny]
    """Serialize data from a dictionary such that it can be used to construct nested objects."""
Collaborator

@tomtseng tomtseng Nov 6, 2025

same comment as in https://github.com/sdhossain/SafeTuneBed/pull/33/files#r2496773669, I think this function name/comment doesn't sufficiently describe what this class does.

though this does make me wonder if we even need a proper function comment here or whether we just write """See parent class.""" or something since the behavior doesn't substantially change — not sure what best practices here are!

Comment on lines +65 to +71
q = multiprocessing.Queue()
p = multiprocessing.Process(
target=_instantiate_model_and_infer, args=(self.eval_config, jbb_dataset, q)
)
p.start()
_inferences_dataframe = q.get() # blocks until child puts result
p.join()
Collaborator

What's the purpose of multiprocessing here? I'm not familiar with the library but it looks like this is putting one item in the queue and blocking until it completes — needs a comment explaining why we're doing this rather than just doing the evaluation in the same process

same q for multiprocessing in strongREJECT

Collaborator Author

The purpose of the multiprocessing is just to isolate the run, so there is no chance a memory leak or similar issue can crash subsequent runs.

Sometimes when random bugs happen and memory hangs, even generic methods to clear the cache don't work, and I have to wrap the run in its own process.

Since we're orchestrating sweeps through Python, I used this approach, which is perhaps a bit hacky, and I'm very open to suggestions.

Same for StrongREJECT: memory leaks there often crash the whole sweep.
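For reference, the isolation pattern being described can be sketched like this (a minimal sketch mirroring the quoted snippet; `_worker` and its payload are hypothetical stand-ins for the real model-loading and inference call):

```python
import multiprocessing


def _worker(x, queue):
    # Hypothetical stand-in for "instantiate model and run inference".
    # Any memory this process leaks is reclaimed when it exits.
    queue.put(x * 2)


def run_isolated(x):
    """Run _worker in its own process so a leak can't poison the sweep."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(x, queue))
    proc.start()
    result = queue.get()  # blocks until the child puts its result
    proc.join()
    return result
```

Note the ordering: `queue.get()` happens before `proc.join()`, since joining first can deadlock if the child blocks while flushing a large item to the queue.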

Collaborator

oof ok, in that case I would leave a comment and a TODO but yeah this seems like a pragmatic solution if there's a mysterious memory leak that we can't easily track down


model = model.to(config.device)
user_prefix = "[INST] "
assistant_prefix = " [/INST] "
Collaborator

am curious whether it's appropriate to make an individual evaluation pick the default user & assistant prefix, I haven't really thought about it. (In this case these prefixes look like the llama ones.)
is there a particular motivation for why SoftOpt prefers these default prefixes (e.g., this is what the paper does)?
or maybe the function could take the model_config as an arg and read the prefixes from there?
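The `model_config` suggestion might look roughly like this (a hypothetical sketch; `ModelConfig` and its field names are assumptions, not the real API — the defaults shown mirror the hard-coded Llama-style prefixes from the snippet above):

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    # Hypothetical fields; defaults mirror the hard-coded values above.
    user_prefix: str = "[INST] "
    assistant_prefix: str = " [/INST] "


def build_prompt(model_config: ModelConfig, user_msg: str) -> str:
    # Read the prefixes from the config instead of letting an individual
    # evaluation pick model-specific defaults.
    return f"{model_config.user_prefix}{user_msg}{model_config.assistant_prefix}"
```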

target_ids = tokenizer([target], add_special_tokens=False)["input_ids"]

embedding_layer = model.get_input_embeddings()
first_device = embedding_layer.weight.device # this is where inputs must live
Collaborator

does model.device not always work? and why would the device be inconsistent with config.device, like why did we move away from config.device?

Collaborator Author

I believe it was because it was failing on multi-GPU (as config.device only allowed one device).

Collaborator

oh i see, like if we specify config.device as something like cuda:0 then that's just not correct for multi-gpu since e.g., if we are using two GPUs then one process should have cuda:0 and the other should have cuda:1?

hmm yeah this makes me wonder if we should get rid of config.device and do something different in order to clean this up, I'm not actually sure what best practices here are though
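One possible direction (a sketch, not the codebase's actual API) is to derive the input device from the model itself rather than from `config.device` — which is essentially what the `first_device` line above already does:

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Stand-in for a transformers-style model exposing get_input_embeddings."""

    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)

    def get_input_embeddings(self) -> nn.Embedding:
        return self.emb


def input_device(model) -> torch.device:
    # With sharded multi-GPU loading (e.g. device_map="auto"), parameters
    # span several devices, so a single config.device can't describe the
    # placement; inputs must live where the embedding weights live.
    return model.get_input_embeddings().weight.device
```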

optim_ids = tokenizer(
    config.optim_str_init, return_tensors="pt", add_special_tokens=False
-)["input_ids"].to(config.device)
+)["input_ids"].cuda()
Collaborator

why not .to(first_device)?

Collaborator Author

Handles multi-GPU better.

# USER_PREFIX = "[INST] " # note the trailing space
# ASSISTANT_PREFIX = " [/INST] " # note the leading & trailing spaces
# END_TURN = "</s>" # or: tokenizer.eos_token
multiprocessing.set_start_method("spawn", force=True)
Collaborator

this looks like it has a global effect, could lead to unexpected results if we use multiprocessing elsewhere and forget that we've configured things here, so am wondering if there are alternative non-global options here
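One non-global alternative worth considering (a sketch, untested against this codebase) is `multiprocessing.get_context`, which scopes the start method to a single call site instead of mutating process-wide state:

```python
import multiprocessing


def _child(queue):
    queue.put("ok")


def run_with_spawn():
    # Unlike multiprocessing.set_start_method("spawn", force=True), this
    # context affects only objects created from it; other modules keep
    # whatever start method they expect.
    ctx = multiprocessing.get_context("spawn")
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue,))
    proc.start()
    result = queue.get()
    proc.join()
    return result
```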

Collaborator Author

Hmm, I've put this as a TODO as I'm not fully sure yet.

@sdhossain
Collaborator Author

Replying to comments here as I look to close this and consolidate into the one mega PR. Putting TODOs for things that aren't necessarily resolved.

@sdhossain
Collaborator Author

Closing as we merged in #53

@sdhossain sdhossain closed this Jan 22, 2026
@sdhossain sdhossain deleted the sh/td_sr_embedding branch February 5, 2026 22:36