
defense: CTRL #48

Merged
mruwnik merged 9 commits into main from ctrl-defense on Feb 24, 2026

Conversation

@mruwnik (Collaborator) commented Dec 16, 2025

Changes

Implements the defense from the CTRL paper.

This defense seems to actually work (this is what the test_ctrl.py script printed out):

--- StrongREJECT Scores (after attack, lower = safer) ---
Baseline (no defense):     0.5965
With CTRL defense:         0.0732
Jailbreak rate reduction:  +0.5233
Relative reduction:        +87.72%

The baseline llama 3.1 instruct was successfully attacked, while a hardened version of it managed to defeat most attack prompts.

The actual defense is quite simple:

  1. find a bunch of neutral texts
  2. find variants of them with the lowest perplexity for the model that is to be hardened
  3. train on the low perplexity variants
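The selection step can be sketched as a small loop; this is a minimal sketch where the `perplexity` and `generate_variants` callables are stand-ins (the real run generates with an LLM and scores with the model being hardened):

```python
# Sketch of the CTRL-style curation step: for each neutral text,
# generate rewrites and keep the one the target model finds least
# surprising (lowest perplexity). Both callables are stand-ins here.

def pick_lowest_perplexity(variants, perplexity):
    """Return the variant with the lowest perplexity score."""
    if not variants:
        raise ValueError("need at least one variant")
    return min(variants, key=perplexity)

def curate(texts, generate_variants, perplexity):
    """Keep the lowest-perplexity variant of each text; the results
    become the fine-tuning set for the hardened model."""
    return [pick_lowest_perplexity(generate_variants(t), perplexity)
            for t in texts]
```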

The main problem was generating and scoring all the variants. Each text needs 5 generated variants, and each of those needs another 4 prompts to score it on the various metrics, which adds up to a lot of generations. The code has a bunch of optimisations to get this down from taking days to an hour or three.
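One reading of those numbers (5 variants per text, 4 metric prompts per variant, and the 1500-sentence Brown subset mentioned below) puts the naive cost at 25 LLM calls per text:

```python
# Back-of-the-envelope call count for the curation run, under the
# assumption of 5 generated variants per text and 4 scoring prompts
# per variant (numbers taken from the description above).
texts = 1500                      # Brown corpus subset size
variants_per_text = 5
scoring_prompts_per_variant = 4

calls_per_text = variants_per_text * (1 + scoring_prompts_per_variant)
total_calls = texts * calls_per_text
print(calls_per_text, total_calls)  # 25 37500
```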

Implementation approach

I first ran the perplexity checkers on a bunch of inputs to see if they made sense. Then I ran a bunch of inputs through the helpfulness scorer. Once the metrics seemed sane, I ran them on a bunch of different prompts to find the one that generated the best variants (otherwise the variants were too different, too verbose, or just strange). This was enough to have confidence in the basic unit of "please find a lower-perplexity version of this sentence, such that it's still at least as helpful as the original".

Once that was done, most of the time was spent getting things to run as efficiently as possible, and making sure that things blowing up in the middle of a run wouldn't mean waiting another couple of hours for everything to be recreated.

I ended up with the curation process working in waves of "generate variants for all these texts" -> "calculate metrics for all these texts" -> "choose the best from all these variants".
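That wave structure can be sketched with per-wave disk caching, so a crash in a later wave resumes from the last finished one (function names here are illustrative, not the PR's actual API):

```python
import json
from pathlib import Path

def run_wave(name, items, fn, cache_dir="cache"):
    """Run one wave of work over all items, caching the results so a
    blow-up in a later wave doesn't force redoing this one."""
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    results = [fn(item) for item in items]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results))
    return results

def curate(texts, generate, score, choose, cache_dir="cache"):
    """The three waves chained: generate -> score -> choose best."""
    variants = run_wave("variants", texts, generate, cache_dir)
    metrics = run_wave("metrics", variants, score, cache_dir)
    return [choose(v, m) for v, m in zip(variants, metrics)]
```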

Generations are done with vLLM, but it didn't seem to want to properly use all the GPUs, so I made a quick'n'dirty worker mechanism for splitting work between them. Please tell me there's a better method...
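For the GPU-splitting question, a common pattern (a sketch, not necessarily what worker.py does) is to shard the work round-robin and launch one subprocess per GPU with `CUDA_VISIBLE_DEVICES` pinned before vLLM initialises; the launcher here is illustrative, only the sharding helper is exercised:

```python
import os
import subprocess
import sys

def split_round_robin(tasks, num_workers):
    """Deal tasks out across workers round-robin."""
    shards = [[] for _ in range(num_workers)]
    for i, task in enumerate(tasks):
        shards[i % num_workers].append(task)
    return shards

def launch_workers(worker_script, tasks, num_gpus):
    """Illustrative launcher: one subprocess per GPU, each seeing only
    its own device, so the vLLM engine inside it uses exactly that GPU."""
    procs = []
    for gpu_id, shard in enumerate(split_round_robin(tasks, num_gpus)):
        env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu_id)}
        procs.append(subprocess.Popen(
            [sys.executable, worker_script, *map(str, shard)], env=env))
    return [p.wait() for p in procs]
```

The other usual option is vLLM's own `tensor_parallel_size`, which runs one engine across several GPUs instead of one engine per GPU.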

Changes relative to the original paper:

  • different revision prompt, based on trying several prompt variations to find the best one
  • used a subset of 1500 sentences from the Brown corpus rather than the whole corpus

Testing

tests/defenses/test_ctrl.py is the test entry point. It first curates the dataset and trains a hardened model (both are cached, so subsequent runs only take a couple of minutes), then runs the LoRA fine-tune attack and StrongREJECT on both the base model and the hardened one.

@mruwnik (Collaborator, Author) left a comment

This still needs to have a final run (or two) to make sure everything is still working, as I used caches a lot. But that takes a while and I wanted to just push what is already done

}


def run_defense(
@mruwnik (Collaborator, Author) commented:

These seem to be repeating - might be worth having a testing utils file for the basic "run defense" -> "run attack" -> "run eval" flow to then compare the same against a base model

@mruwnik mruwnik requested review from esveee, sdhossain and tomtseng and removed request for sdhossain December 16, 2025 15:16
@tomtseng (Collaborator) left a comment

Spot-checked a few details based on what I already knew from reading the paper a couple days ago, didn't carefully read prompts_cache.py or worker.py or test_ctrl.py

Based on reading the paper and the codebase there are some details that weren't clearly explained — so thanks for working through this and getting a good implementation


Improve this answer: {response}

Improved answer:"""
Collaborator commented:

The paper says something like "Given a query and its response, revise the response statements to present an alternative perspective in answering the query." — but sounds like you found that your prompt leads to more coherent responses?

I tried looking at the codebase to see exactly what their prompt is, but it seems to not actually contain this curation process, just a very simplified skeletal version (curation/auto_reg.py) that doesn't match the paper, how annoying... so will defer to you on what works

For instruction-tuned models like llama3.1-8b-instruct, we may want to use a chat-templated format rather than this completion-style format
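For concreteness, the chat-templated form would wrap the same instruction in a messages list and render it with the tokenizer's `apply_chat_template` (a real transformers API) rather than feeding the raw completion string; the message content below just mirrors the prompt quoted above:

```python
def revision_messages(response):
    """Chat-style framing of the revision prompt. For an instruct model
    you'd render this with
    tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    instead of prompting with the bare completion string."""
    return [{
        "role": "user",
        "content": f"Improve this answer: {response}\n\nImproved answer:",
    }]
```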

@mruwnik (Collaborator, Author) commented:

I made a little script thingy that tried ~30 different prompt variations and checked how good the resulting rephrasings were (on average, over 10 calls, I believe?), and this was the simplest of the best ones. Which was a bit surprising - I thought a more complicated one would do better
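That selection harness amounts to something like the following sketch (the `generate` and `judge` callables are stand-ins for the rephrasing model and the quality scorer):

```python
from statistics import mean

def best_prompt(prompt_variants, generate, judge, calls_per_prompt=10):
    """Pick the prompt whose rephrasings score best on average: run each
    candidate prompt several times, judge each output, keep the winner."""
    return max(prompt_variants,
               key=lambda p: mean(judge(generate(p))
                                  for _ in range(calls_per_prompt)))
```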

@tomtseng (Collaborator) commented Jan 22, 2026:

yeah makes sense — might be good to document the script somewhere (e.g., dump it in scripts/user/mruwnik/) as we may want to document known implementation differences in the appendix of the paper (as well as in the codebase itself probably), saying that this prompt seemed to work better than the one suggested in the paper


# Helpfulness dimension prompts based on CTRL paper (Appendix A, Tables 5-8).
# The scoring criteria descriptions are from the paper; the prompt format
# (JSON output request) is our implementation choice for reliable extraction.
Collaborator commented:

I would've expected to use the full prompts listed in Tables 5–8, like giving the full rubric. Then on top of that if we need some examples to get reliable output formatting we can add that separately.

(JSON output request)
This doesn't look like it's requesting a JSON output, is it?

@mruwnik (Collaborator, Author) commented:

whoops - this is a stale comment. Previously I was having trouble getting consistent scores and the best prompt was one that requested a JSON output. But then I decided that I prefer to reproduce the paper as much as possible, so focused on getting the parser to work better

Comment on lines +456 to +457
query = text[: len(text) // 2]
response = text[len(text) // 2 :]
Collaborator commented:

this looks suspicious, like I wouldn't trust this to split up the query and response properly — is this a particular dataset that actually has this format or should we just raise ValueError?
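The suggested guard might look like this (a hypothetical helper; it assumes records carry an explicit delimiter between query and response rather than being halved):

```python
def split_query_response(text, sep="\n\n"):
    """Split a curation record into (query, response). Raise instead of
    silently halving the string when the expected delimiter is missing."""
    query, sep_found, response = text.partition(sep)
    if not sep_found or not response.strip():
        raise ValueError(
            f"record has no {sep!r} query/response delimiter: {text[:80]!r}")
    return query.strip(), response.strip()
```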



Expects output like "4}" since prompt ends with '{"score": '
"""
# Look for a digit at the start (model completing '{"score": N}')
match = re.match(r"\s*(\d)", output_text)
@mruwnik (Collaborator, Author) commented:

this also should be fixed...
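A slightly more defensive version of that parser (a sketch, assuming a 1-5 rubric scale): accept either the bare digit the completion-style prompt yields or a full '{"score": N}' blob, and validate the range:

```python
import re

def parse_score(output_text, lo=1, hi=5):
    """Parse a rubric score from model output. The prompt ends with
    '{"score": ', so the model usually completes with e.g. '4}', but
    some models emit the whole JSON object; handle both, and reject
    anything outside the expected range."""
    match = (re.search(r'"score"\s*:\s*(\d)', output_text)
             or re.match(r"\s*(\d)\b", output_text))
    if not match:
        return None
    score = int(match.group(1))
    return score if lo <= score <= hi else None
```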

@mruwnik (Collaborator, Author) commented Dec 28, 2025

I ran this a couple of times to check that it behaves correctly. The paper suggests 50 epochs of training, which results in a model that rejects most harmful queries but is also a lot less useful than the base model. So I checked a couple of other settings to see if I could get better results:

| Config | Pre-attack safe | Over-refusal | Helpfulness | Post-attack score |
|---|---|---|---|---|
| Base | 90% | 22.5% | 3.6 | 0.584 |
| 5 epochs + 5e-5 | 50% | 12.5% | 3.38 | 0.367 ✅ |
| 10 epochs + 2e-5 | 40% | 0% | 4.88 | 0.521 |

@mruwnik mruwnik mentioned this pull request Dec 29, 2025

Query: {query}
Response: {response}
Score:""",
Collaborator commented:

These prompts still concern me, they seem like they'd give much more variable results than the rubrics actually specified in Appendix A

@sdhossain sdhossain changed the title from "Ctrl defense" to "defense: CTRL defense" Feb 7, 2026
@sdhossain sdhossain changed the title from "defense: CTRL defense" to "defense: CTRL" Feb 7, 2026
@sdhossain sdhossain added the defense Adds or modifies defenses label Feb 7, 2026
@mruwnik mruwnik merged commit e8af073 into main Feb 24, 2026
2 checks passed
@mruwnik mruwnik deleted the ctrl-defense branch February 24, 2026 15:43