
attack: Added refusal ablation attack #27

Open
NayeemaNonta wants to merge 106 commits into main from nnonta/refusal_ablation_attack

@NayeemaNonta (Collaborator) commented Sep 2, 2025

Changes

Added the refusal ablation attack.

Testing

pip install -e .
python tests/attacks/test_refusal_ablation_attack.py


# Attack parameters
batch_size: 32
refusal_strings: ["I", "As"]
Collaborator

should we change this to the list of refusal strings given in Appendix D of the paper?

["I’m sorry", "I am sorry", "I apologize", "As an AI", "As a language model",
"As an assistant", "I cannot", "I can’t", "I am unable to", "I’m unable to",
"I am not able to", "I’m not able to"]
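
For reference, here is a minimal sketch of how a longer list like this could be used to flag refusals in generated text. It assumes case-insensitive substring matching on the completion, which may not be how this PR consumes `refusal_strings`; the names `REFUSAL_SUBSTRINGS` and `looks_like_refusal` are hypothetical and do not come from the PR.

```python
# Hypothetical helper, not taken from the PR: flags a completion as a refusal
# if it contains any of the listed substrings (case-insensitive).
REFUSAL_SUBSTRINGS = [
    "I'm sorry", "I am sorry", "I apologize", "As an AI", "As a language model",
    "As an assistant", "I cannot", "I can't", "I am unable to", "I'm unable to",
    "I am not able to", "I'm not able to",
]


def _normalize(text: str) -> str:
    # Map typographic apostrophes to ASCII so "I’m" and "I'm" compare equal.
    return text.replace("\u2019", "'").lower()


def looks_like_refusal(completion: str, substrings=REFUSAL_SUBSTRINGS) -> bool:
    """Return True if any refusal substring occurs in the completion."""
    text = _normalize(completion)
    return any(_normalize(s) in text for s in substrings)


if __name__ == "__main__":
    print(looks_like_refusal("I\u2019m sorry, but I can\u2019t help with that."))
    print(looks_like_refusal("Sure, here is a recipe."))
```

Substring matching is coarse, so short entries can match text that is not a refusal. If the attack only needs the first generated token, the short list may be the more appropriate setting.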

)


def log(message: str, level: str = "info"):
Collaborator

a bunch of these files stub out logging like this. I wonder if we should just use the logging library? we can log all of these at the "info" or "debug" level
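
A rough sketch of what that could look like, keeping the existing `log(message, level)` signature so call sites would not need to change. The logger name `refusal_ablation` is a placeholder assumption, and unknown level names fall back to "info" here, which may or may not match what the current stub does.

```python
import logging

# Placeholder logger name; the project may prefer logging.getLogger(__name__).
logger = logging.getLogger("refusal_ablation")


def log(message: str, level: str = "info") -> None:
    """Forward to the standard logging module instead of a hand-rolled stub."""
    # Look up the numeric level (logging.INFO, logging.DEBUG, ...) by name.
    level_no = getattr(logging, level.upper(), logging.INFO)
    if not isinstance(level_no, int):
        level_no = logging.INFO
    logger.log(level_no, message)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log("computing refusal direction", level="debug")
    log("attack finished")
```

Verbosity can then be controlled in one place with `logging.basicConfig(level=...)` rather than in each file.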


# get logits for the harmless val set
format_func = self.harmful_val.get_eval_formatter(self.model.tokenizer)
baseline_harmless_logits = self._get_last_position_logits(self.harmless_val, format_func)
Collaborator

is this correct that we get format_func from the harmful dataset but then apply it to the harmless dataset?

Collaborator Author

Yes, I think there was a bug in the original code: it was getting format_func from the harmful dataset and applying it to the harmless one. I have updated it now.
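
One way to make this kind of mismatch harder to reintroduce is to derive the formatter inside a helper that takes a single dataset. The sketch below is illustrative only: `ToyDataset` and `format_for_eval` are hypothetical stand-ins, and only the method name `get_eval_formatter` is taken from the diff.

```python
from typing import Callable, List


class ToyDataset:
    """Hypothetical stand-in for the dataset wrapper used in the PR."""

    def __init__(self, prompts: List[str], template: str):
        self.prompts = prompts
        self.template = template

    def get_eval_formatter(self, tokenizer=None) -> Callable[[str], str]:
        # The method in the diff takes the model tokenizer; it is ignored here.
        return lambda prompt: self.template.format(prompt=prompt)


def format_for_eval(dataset: ToyDataset, tokenizer=None) -> List[str]:
    """Build the formatter from the same dataset it is applied to."""
    format_func = dataset.get_eval_formatter(tokenizer)
    return [format_func(p) for p in dataset.prompts]


harmful_val = ToyDataset(["prompt A"], "[harmful] {prompt}")
harmless_val = ToyDataset(["prompt B"], "[harmless] {prompt}")

# The harmless prompts are formatted with the harmless dataset's own formatter.
formatted = format_for_eval(harmless_val)
```

Because the helper receives only one dataset, there is no second object from which the wrong formatter could be taken.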


Labels

attack Adds or modifies attacks


3 participants