attack: Added refusal ablation attack by NayeemaNonta · Pull Request #27 · criticalml-uw/TamperBench

NayeemaNonta · 2025-09-02T23:22:23Z

Changes

Testing

pip install -e .

python tests/attacks/test_refusal_ablation_attack.py

tomtseng · 2026-02-06T03:30:23Z

configs/whitebox/attacks/refusal_ablation/grid.yaml

+
+    # Attack parameters 
+    batch_size: 32
+    refusal_strings: ["I", "As"]


should we change this to the list of refusal strings given in Appendix D of the paper?

["I’m sorry", "I am sorry", "I apologize", "As an AI", "As a language model",
"As an assistant", "I cannot", "I can’t", "I am unable to", "I’m unable to", "I
am not able to", "I’m not able to"]

tomtseng · 2026-02-06T03:31:51Z

src/safetunebed/whitebox/attacks/refusal_ablation/models.py

+)
+
+
+def log(message: str, level: str = "info"):


a bunch of these files stub out the logging like this — I wonder if we should just use the logging library? we can log all these things at the "info" or "debug" level

src/tamperbench/whitebox/attacks/refusal_ablation/datasets.py

src/tamperbench/whitebox/attacks/refusal_ablation/refusal_ablation.py

src/safetunebed/whitebox/attacks/refusal_ablation/chat_templates/llama3_instruct.jinja2

tomtseng · 2026-02-06T03:54:12Z

src/tamperbench/whitebox/attacks/refusal_ablation/refusal_ablation.py

+
+        # get logits for the harmless val set
+        format_func = self.harmful_val.get_eval_formatter(self.model.tokenizer)
+        baseline_harmless_logits = self._get_last_position_logits(self.harmless_val, format_func)


is this correct that we get format_func from the harmless dataset but then apply it on the harmless dataset?

Yes, I think there was a bug in the original code which was getting format_func from the harmful and applying it to the harmless but I updated it now.

src/safetunebed/whitebox/attacks/refusal_ablation/attack_utils.py

src/safetunebed/whitebox/attacks/refusal_ablation/datasets.py

src/safetunebed/whitebox/attacks/refusal_ablation/constants.py

src/safetunebed/whitebox/attacks/refusal_ablation/refusal_ablation.py

src/safetunebed/whitebox/attacks/refusal_ablation/attack_utils.py

Co-authored-by: Tom Tseng <tomtseng@users.noreply.github.com>

NayeemaNonta added 30 commits August 20, 2025 21:45

refusal abl attack

6b2baf0

refusal abl config

28a1f93

refusal abl tests

1328b2e

add hydra

d596c2e

fix typo

ba4f79b

fix file import

a1ffeea

fix file import

f2a4a21

add wandb

012d947

rm finetuneattack config

bc39f52

fix file imports

8f8946b

fix file imports

c7c9fb3

fix naming

439bc27

fix naming

faac711

fix naming

7a006eb

test attack pipeline

b445e4f

test attack pipeline

4b1c3c6

test attack pipeline

18f793e

fix typo

d0de757

fix typo

d11a59a

fix typo

b1cbdb2

fix typo

025fb32

add download dir

34da225

add download dir

aa0ffb3

add download dir

3f91738

fix config

720558a

chat templates

ae253a7

fix config

19c9b7b

fix config

5bfec96

fix config

fbff6e0

add datasets

25b6034

NayeemaNonta added 5 commits December 26, 2025 23:10

style update

6e2a8ad

style update

b3fdbb8

style update

f212e6d

style update

fff3510

style update

cfa6345

tomtseng approved these changes Feb 6, 2026

View reviewed changes

sdhossain approved these changes Feb 6, 2026

View reviewed changes

src/safetunebed/whitebox/attacks/refusal_ablation/refusal_ablation.py Outdated Show resolved Hide resolved

src/safetunebed/whitebox/attacks/refusal_ablation/attack_utils.py Outdated Show resolved Hide resolved

NayeemaNonta and others added 23 commits February 17, 2026 16:45

minor fix

3b52d8c

minor fix

80c3181

minor fix

8865354

Update src/safetunebed/whitebox/attacks/refusal_ablation/datasets.py

9c46d2b

Co-authored-by: Tom Tseng <tomtseng@users.noreply.github.com>

minor fix

b5f496b

minor fix

3e2cc93

remove chat templates

a85f816

update

ac50387

Merge branch 'main' into nnonta/refusal_ablation_attack

73228a6

refusal_ablation: Add to attack registry

7b5f79d

ruff errors

e715908

ruff errors

9531137

ruff errors

a2bfc27

ruff errors

a13486b

ruff errors

4adbad2

ruff errors

92e1233

ruff errors

3becdbf

pyright

eec4e96

Merge branch 'main' into nnonta/refusal_ablation_attack

7501ecc

pyright

e6f6f31

pyright

3588efe

pyright

5e1569e

pyright

24914a0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

attack: Added refusal ablation attack#27

attack: Added refusal ablation attack#27
NayeemaNonta wants to merge 106 commits intomainfrom
nnonta/refusal_ablation_attack

NayeemaNonta commented Sep 2, 2025 •

edited

Loading

Uh oh!

tomtseng Feb 6, 2026

Uh oh!

tomtseng Feb 6, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tomtseng Feb 6, 2026

Uh oh!

NayeemaNonta Feb 22, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

NayeemaNonta commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Testing

Uh oh!

tomtseng Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

tomtseng Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tomtseng Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

NayeemaNonta Feb 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NayeemaNonta commented Sep 2, 2025 •

edited

Loading