
defense: CTRL #48

Merged
mruwnik merged 9 commits into main from ctrl-defense on Feb 24, 2026

Conversation

@mruwnik (Collaborator) commented Dec 16, 2025

Changes

Implements the defense from the CTRL paper.

This defense seems to actually work (this is what the test_ctrl.py script printed out):

--- StrongREJECT Scores (after attack, lower = safer) ---
Baseline (no defense):     0.5965
With CTRL defense:         0.0732
Jailbreak rate reduction:  +0.5233
Relative reduction:        +87.72%

The baseline llama 3.1 instruct was successfully attacked, while a hardened version of it managed to defeat most attack prompts.

The actual defense is quite simple:

  1. find a bunch of neutral texts
  2. find variants of them with the lowest perplexity for the model that is to be hardened
  3. train on the low perplexity variants
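The selection step can be sketched as a small loop; this is a minimal sketch where the `perplexity` and `generate_variants` callables are stand-ins (the real run generates with an LLM and scores with the model being hardened):

```python
# Sketch of the CTRL-style curation step: for each neutral text,
# generate rewrites and keep the one the target model finds least
# surprising (lowest perplexity). Both callables are stand-ins here.

def pick_lowest_perplexity(variants, perplexity):
    """Return the variant with the lowest perplexity score."""
    if not variants:
        raise ValueError("need at least one variant")
    return min(variants, key=perplexity)

def curate(texts, generate_variants, perplexity):
    """Keep the lowest-perplexity variant of each text; the results
    become the fine-tuning set for the hardened model."""
    return [pick_lowest_perplexity(generate_variants(t), perplexity)
            for t in texts]
```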

The main problem was generating and scoring all the variants. Each text needs 5 generated variants, and each of those needs another 4 prompts to score it on the various metrics, which adds up to a lot of generations. The code has a bunch of optimisations to get this down from taking days to an hour or three.
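One reading of those numbers (5 variants per text, 4 metric prompts per variant, and the 1500-sentence Brown subset mentioned below) puts the naive cost at 25 LLM calls per text:

```python
# Back-of-the-envelope call count for the curation run, under the
# assumption of 5 generated variants per text and 4 scoring prompts
# per variant (numbers taken from the description above).
texts = 1500                      # Brown corpus subset size
variants_per_text = 5
scoring_prompts_per_variant = 4

calls_per_text = variants_per_text * (1 + scoring_prompts_per_variant)
total_calls = texts * calls_per_text
print(calls_per_text, total_calls)  # 25 37500
```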

Implementation approach

I first ran the perplexity checkers on a bunch of inputs to see if they made sense. Then I ran a bunch of inputs through the helpfulness scorer. Once the metrics seemed sane, I ran them on a bunch of different prompts to find the one that generated the best variants (otherwise the variants were too different, too verbose, or just strange). This was enough to have confidence in the basic unit of "please find a lower-perplexity version of this sentence, such that it's still at least as helpful as the original".

Once that was done, most of the time was spent getting things to run as efficiently as possible, and making sure that things blowing up in the middle of a run wouldn't mean waiting another couple of hours for everything to be recreated.

I ended up with the curation process working in waves of "generate variants for all these texts" -> "calculate metrics for all these texts" -> "choose the best from all these variants".
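That wave structure can be sketched with per-wave disk caching, so a crash in a later wave resumes from the last finished one (function names here are illustrative, not the PR's actual API):

```python
import json
from pathlib import Path

def run_wave(name, items, fn, cache_dir="cache"):
    """Run one wave of work over all items, caching the results so a
    blow-up in a later wave doesn't force redoing this one."""
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    results = [fn(item) for item in items]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results))
    return results

def curate(texts, generate, score, choose, cache_dir="cache"):
    """The three waves chained: generate -> score -> choose best."""
    variants = run_wave("variants", texts, generate, cache_dir)
    metrics = run_wave("metrics", variants, score, cache_dir)
    return [choose(v, m) for v, m in zip(variants, metrics)]
```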

Generations are done with vLLM, but it didn't seem to want to properly use all the GPUs, so I made a quick'n'dirty worker mechanism for splitting work between them. Please tell me there's a better method...
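For the GPU-splitting question, a common pattern (a sketch, not necessarily what worker.py does) is to shard the work round-robin and launch one subprocess per GPU with `CUDA_VISIBLE_DEVICES` pinned before vLLM initialises; the launcher here is illustrative, only the sharding helper is exercised:

```python
import os
import subprocess
import sys

def split_round_robin(tasks, num_workers):
    """Deal tasks out across workers round-robin."""
    shards = [[] for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        shards[i % num_workers].append(task)
    return shards

def launch_workers(worker_script, tasks, num_gpus):
    """Illustrative launcher: one subprocess per GPU, each seeing only
    its own device, so the vLLM engine inside it uses exactly that GPU."""
    procs = []
    for gpu_id, shard in enumerate(split_round_robin(tasks, num_gpus)):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        procs.append(subprocess.Popen(
            [sys.executable, worker_script, *map(str, shard)], env=env))
    return [p.wait() for p in procs]
```

The other usual option is vLLM's own `tensor_parallel_size`, which runs one engine across several GPUs instead of one engine per GPU.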

Changes relative to the original paper:

  • different revision prompt, based on trying several prompt variations to find the best one
  • used a subset of 1500 sentences from the Brown corpus rather than the whole corpus

Testing

tests/defenses/test_ctrl.py is the test entry point. It first curates the dataset and trains a hardened model (both are cached, so subsequent runs only take a couple of minutes), then runs the LoRA fine-tune attack and StrongREJECT on both the base model and the hardened one.

@mruwnik (Collaborator, Author) left a comment

This still needs to have a final run (or two) to make sure everything is still working, as I used caches a lot. But that takes a while and I wanted to just push what is already done

}


def run_defense(
@mruwnik (Collaborator, Author) commented:

These seem to be repeating - might be worth having a testing utils file for the basic "run defense" -> "run attack" -> "run eval" flow to then compare the same against a base model

@mruwnik mruwnik requested review from esveee, sdhossain and tomtseng and removed request for sdhossain December 16, 2025 15:16
@tomtseng (Collaborator) left a comment

Spot-checked a few details based on what I already knew from reading the paper a couple days ago, didn't carefully read prompts_cache.py or worker.py or test_ctrl.py

Based on reading the paper and the codebase there are some details that weren't clearly explained — so thanks for working through this and getting a good implementation


Improve this answer: {response}

Improved answer:"""
Collaborator commented:

The paper says something like "Given a query and its response, revise the response statements to present an alternative perspective in answering the query." — but sounds like you found that your prompt leads to more coherent responses?

I tried looking at the codebase to see exactly what their prompt is, but it seems to not actually contain this curation process, just a very simplified skeletal version (curation/auto_reg.py) that doesn't match the paper, how annoying... so will defer to you on what works

For instruction-tuned models like llama3.1-8b-instruct, we may want to use a chat-templated format rather than this completion-style format
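For concreteness, the chat-templated form would wrap the same instruction in a messages list and render it with the tokenizer's `apply_chat_template` (a real transformers API) rather than feeding the raw completion string; the message content below just mirrors the prompt quoted above:

```python
def revision_messages(response):
    """Chat-style framing of the revision prompt. For an instruct model
    you'd render this with
    tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    instead of prompting with the bare completion string."""
    return [{
        "role": "user",
        "content": f"Improve this answer: {response}\n\nImproved answer:",
    }]
```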

@mruwnik (Collaborator, Author) commented:

I made a little script thingy that tried ~30 different prompt variations and checked how good the resulting rephrasings were (on average, over 10 calls, I believe?), and this was the simplest of the best ones. Which was a bit surprising - I thought a more complicated one would do better
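That selection harness amounts to something like the following sketch (the `generate` and `judge` callables are stand-ins for the rephrasing model and the quality scorer):

```python
from statistics import mean

def best_prompt(prompt_variants, generate, judge, calls_per_prompt=10):
    """Pick the prompt whose rephrasings score best on average: run each
    candidate prompt several times, judge each output, keep the winner."""
    return max(prompt_variants,
               key=lambda p: mean(judge(generate(p))
                                  for _ in range(calls_per_prompt)))
```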

@tomtseng (Collaborator) commented Jan 22, 2026:

yeah makes sense — might be good to document the script somewhere (e.g., dump it in scripts/user/mruwnik/) as we may want to document known implementation differences in the appendix of the paper (as well as in the codebase itself probably), saying that this prompt seemed to work better than the one suggested in the paper


# Helpfulness dimension prompts based on CTRL paper (Appendix A, Tables 5-8).
# The scoring criteria descriptions are from the paper; the prompt format
# (JSON output request) is our implementation choice for reliable extraction.
Collaborator commented:

I would've expected to use the full prompts listed in Tables 5–8, like giving the full rubric. Then on top of that if we need some examples to get reliable output formatting we can add that separately.

(JSON output request)
This doesn't look like it's requesting a JSON output, is it?

@mruwnik (Collaborator, Author) commented:

whoops - this is a stale comment. Previously I was having trouble getting consistent scores and the best prompt was one that requested a JSON output. But then I decided that I prefer to reproduce the paper as much as possible, so focused on getting the parser to work better

Comment on lines +456 to +457
query = text[: len(text) // 2]
response = text[len(text) // 2 :]
Collaborator commented:

this looks suspicious, like I wouldn't trust this to split up the query and response properly — is this a particular dataset that actually has this format or should we just raise ValueError?
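The suggested guard might look like this (a hypothetical helper; it assumes records carry an explicit delimiter between query and response rather than being halved):

```python
def split_query_response(text, sep="\n\n"):
    """Split a curation record into (query, response). Raise instead of
    silently halving the string when the expected delimiter is missing."""
    query, sep_found, response = text.partition(sep)
    if not sep_found or not response.strip():
        raise ValueError(
            f"record has no {sep!r} query/response delimiter: {text[:80]!r}")
    return query.strip(), response.strip()
```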



Expects output like "4}" since prompt ends with '{"score": '
"""
# Look for a digit at the start (model completing '{"score": N}')
match = re.match(r"\s*(\d)", output_text)
@mruwnik (Collaborator, Author) commented:

this also should be fixed...
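A slightly more defensive version of that parser (a sketch, assuming a 1-5 rubric scale): accept either the bare digit the completion-style prompt yields or a full '{"score": N}' blob, and validate the range:

```python
import re

def parse_score(output_text, lo=1, hi=5):
    """Parse a rubric score from model output. The prompt ends with
    '{"score": ', so the model usually completes with e.g. '4}', but
    some models emit the whole JSON object; handle both, and reject
    anything outside the expected range."""
    match = (re.search(r'"score"\s*:\s*(\d)', output_text)
             or re.match(r"\s*(\d)\b", output_text))
    if not match:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None
```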

@mruwnik (Collaborator, Author) commented Dec 28, 2025

I ran this a couple of times to check that it behaves correctly. The paper suggests 50 epochs of training, which results in a model that rejects most harmful queries but is also a lot less useful than the base model. So I checked a couple of other settings to see if I could get better results:

| Config | Pre-attack safe | Over-refusal | Helpfulness | Post-attack score |
|---|---|---|---|---|
| Base | 90% | 22.5% | 3.6 | 0.584 |
| 5 epochs + 5e-5 | 50% | 12.5% | 3.38 | 0.367 ✅ |
| 10 epochs + 2e-5 | 40% | 0% | 4.88 | 0.521 |

@mruwnik mruwnik mentioned this pull request Dec 29, 2025

Query: {query}
Response: {response}
Score:""",
Collaborator commented:

These prompts still concern me, they seem like they'd give much more variable results than the rubrics actually specified in Appendix A

@sdhossain sdhossain changed the title from "Ctrl defense" to "defense: CTRL defense" Feb 7, 2026
@sdhossain sdhossain changed the title from "defense: CTRL defense" to "defense: CTRL" Feb 7, 2026
@sdhossain sdhossain added the defense Adds or modifies defenses label Feb 7, 2026
@mruwnik mruwnik merged commit e8af073 into main Feb 24, 2026
2 checks passed
@mruwnik mruwnik deleted the ctrl-defense branch February 24, 2026 15:43