Conversation
| This matches the CRL paper's evaluation methodology using Greedy Coordinate |
| Gradient (GCG) to find adversarial token suffixes that jailbreak the model. |
| Supports checkpointing - interrupted runs will resume from where they left off. |
It took 6-8 hours to fully attack a Llama model on the compute.safe.ai cluster, so checkpoints are quite important...
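For a run that long, coarse per-behavior checkpointing is usually enough. A minimal sketch of the idea (hypothetical names like `run_gcg` and `results.jsonl`, not the repo's actual API): append each finished attack to a JSONL file and skip already-completed behaviors on resume.

```python
# Sketch only: resumable per-behavior checkpointing for a long GCG run,
# so a killed multi-hour job can pick up where it left off.
# `run_gcg` is a placeholder for the expensive attack function.
import json
from pathlib import Path

def attack_all(behaviors, run_gcg, results_file=Path("results.jsonl")):
    done = set()
    if results_file.exists():
        for line in results_file.read_text().splitlines():
            done.add(json.loads(line)["behavior"])
    with results_file.open("a") as f:
        for behavior in behaviors:
            if behavior in done:
                continue  # already attacked in a previous run
            result = run_gcg(behavior)  # the expensive part (minutes to hours)
            f.write(json.dumps({"behavior": behavior, "result": result}) + "\n")
            f.flush()  # persist immediately in case the job is killed
```

Appending one JSON record per behavior (rather than rewriting one big file) means an interrupt can lose at most the attack currently in flight.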
| Hyperparameters from the paper (Simko et al., 2025), Appendix A.3. |
| NOTE ON EXPECTED BEHAVIOR: |
this is a bit verbose, but I thought it better to have more info on the differences than less
| from strong_reject.evaluate import strongreject_finetuned |
| def mixed_distance( |
this function is the only thing that differs from this file in this PR. I should really remove the rest, but they're useful elsewhere
I was uncertain whether this works properly, so I asked Claude to investigate. The reference implementation is somewhat better than the base model on some safety metrics, but significantly worse on others (phishing).
Hi @mruwnik and @tomtseng, thank you for reimplementing CRL! After reading the code, I have a few comments:
For token position, CRL samples exactly like RepBend or Circuit Breaking: we don't use the last token, but all the tokens of the harmful assistant reply. We add them to the loss and we average the final results at the end. E.g., if the harmful dataset contains tuples of

Because the last token position is probably never very harmful, I think this might have heavily impacted your performance.

I don't quite understand the need to normalize the hidden states. L2 distance with normalized vectors is already cosine similarity? So a balanced mixed loss would be the same as just simply using a cosine similarity loss on normalized hidden states? In my tests, using an L2 or a mixed distance made GCG losses converge slower (although I did not do a thorough hyperparameter search on this).

And did you try seeing if my weights for Llama 3 (https://huggingface.co/samuelsimko/Meta-Llama-3-8B-Instruct-Triplet/tree/main) also have the same benign refusal problem? (Or if it's the run of Llama 3.1 which introduced them.)

Please let me know if you have any more questions!
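The token-position point can be illustrated with a tiny sketch (illustrative names, not the repo's code; real hidden states are high-dimensional model activations): average a per-token distance over every assistant-reply position instead of reading off only the last token.

```python
def reply_loss(hidden_states, reply_mask, target, dist):
    """Average dist(h, target) over ALL assistant-reply token positions
    (reply_mask True) instead of using only the final token's distance."""
    vals = [dist(h, target) for h, m in zip(hidden_states, reply_mask) if m]
    return sum(vals) / len(vals)

# Toy 1-D example: three token "hidden states", the last two are the reply.
hs = [[1.0], [2.0], [4.0]]
mask = [False, True, True]
l1 = lambda a, b: abs(a[0] - b[0])
assert reply_loss(hs, mask, [0.0], l1) == 3.0  # (2 + 4) / 2
# A last-token-only loss would have seen 4.0 and ignored the rest of the reply.
```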
@samuelsimko I implemented your comments, and they made things better:

check_responses.py (Behavioral Evaluation)
GCG Attack (Adversarial Robustness)
scripts/crl/config.py
| return path |
| def load_harmbench_behaviors() -> list[dict]: |
Is harmbench in this name correct? My understanding is that HarmBench and JailbreakBench are different datasets.
scripts/crl/config.py
| def get_checkpoint( |
|     results_file: Path, steps: int, default_steps: int = 250 |
| ) -> tuple[list, list, set]: |
|     """Load checkpoint for given steps. |
clarify that these are checkpoints for GCG, not model checkpoints for CRL?
I also misunderstood the checkpointing strategy when I read this function comment, so it might benefit from clarification on that. Based on the function comment emphasizing the step count, I thought this would save progress within a run based on the current step (e.g., if we were on step 100 out of 250 of GCG, it would save a checkpoint allowing you to resume from step 100).
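One thing the docstring could spell out is that the "checkpoint" here is the GCG results file itself: resuming skips behaviors whose full `steps`-step attack already finished, rather than resuming mid-attack from an intermediate GCG step. A hypothetical sketch of that reading (field names like `behavior_id` are assumptions, not the repo's actual code):

```python
# Illustrative only: load prior GCG results as a "checkpoint" by filtering
# on step count, returning which behaviors are already done.
import json
from pathlib import Path

def get_checkpoint(results_file: Path, steps: int) -> tuple[list, list, set]:
    """Return (results, errors, completed_behavior_ids) for attacks that
    already ran with this step count. Completed behaviors are skipped on
    resume; there is no mid-attack (per-GCG-step) resumption."""
    results, errors, done = [], [], set()
    if results_file.exists():
        for line in results_file.read_text().splitlines():
            rec = json.loads(line)
            if rec.get("steps") != steps:
                continue  # results from a different attack length don't count
            (errors if rec.get("error") else results).append(rec)
            done.add(rec["behavior_id"])
    return results, errors, done
```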
| @@ -0,0 +1,89 @@ |
| """Shared config for CRL scripts.""" |
I wonder if this file name should be changed; the functions here don't look config-related to me.
| 1. TOKEN POSITION: Following author guidance, we now extract representations from |
| ALL tokens in the assistant response (loss_mode="response_all"), not just the |
| last token. The author noted: "the last token position is probably never very |
| harmful" and this significantly impacts performance. |
I think the wording of these notes makes sense in the context of the discussion on this PR but will make less sense in isolation once the code is merged, and could use some copy-editing
For example the wording "we now extract representations from ALL tokens in the assistant response" makes it sound like we're doing something different relative to something else — which is true, we're doing something different relative to the initial implementation of this PR, but once the PR is merged it's not going to be clear what the different thing is (e.g., is it different relative to the original paper?)
| The paper's original margins (500/1500) were for unnormalized L2 distance. |
| 4. LORA TARGETS: Paper specifies LoRA rank/alpha/dropout but not target modules. |
| We target q_proj, k_proj, v_proj, o_proj (attention projections only). |
could look at original code at https://github.com/samuelsimko/crl-llm-defense to verify that these are indeed the correct target modules
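For reference, the attention-only choice would look like the following with peft's `LoraConfig`. The rank/alpha/dropout values below are placeholders, not the paper's Appendix A.3 values, and whether these target modules match the reference repo still needs checking against samuelsimko/crl-llm-defense.

```python
# Sketch, not the PR's actual config: attention-projection-only LoRA targets.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,             # placeholder; use the rank from Appendix A.3
    lora_alpha=32,    # placeholder
    lora_dropout=0.05,  # placeholder
    # The choice under discussion: attention projections only, no MLP modules
    # (gate_proj/up_proj/down_proj would be the usual MLP additions).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```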
| L2+cosine distance. We default to distance_l2_weight=0, distance_cos_weight=1. |
| 3. MARGINS: We use mb=0.3, mh=0.5 for cosine distance (bounded to [0, 2]). |
| The paper's original margins (500/1500) were for unnormalized L2 distance. |
Normalization explains why our mb and mh are way smaller, but I'm wondering why the ratio changed: the original margins 500 and 1500 had a 1:3 ratio, so why don't the normalized margins 0.3 and 0.5 keep that same ratio? Some explanation of what normalization means here could help clarify.
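The normalization point can be made concrete with a small illustrative check (not the repo's code): for unit-normalized vectors, squared L2 distance equals 2·(1 − cos), so cosine distance lives in [0, 2] (0 for identical directions, 2 for opposite ones), and margins on the paper's 500/1500 scale no longer apply.

```python
# Illustrative: cosine distance vs. L2 distance on normalized vectors.
import math

def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b); bounded to [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def l2_normalized_sq(a, b):
    """Squared L2 distance after unit-normalizing both vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum((x / na - y / nb) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [-4.0, 3.0]
# For unit vectors: ||a - b||^2 = 2 - 2*cos = 2 * cos_dist, so the two
# losses differ only by a constant factor; a margin of 2m on normalized
# squared L2 corresponds to a cosine-distance margin of m.
assert abs(l2_normalized_sq(a, b) - 2 * cos_dist(a, b)) < 1e-9
```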
| 5. ADVERSARIAL TRAINING: Paper describes adversarial hard negative mining |
| (Section 4.2) which we have NOT implemented. This could improve robustness |
| against embedding-space attacks. |
Does it seem worth implementing? Looking at the paper, it seems to give a quite modest improvement, so if it's complicated to implement I think it's fine to omit.
| # Loss mode: which token positions to use for representation extraction |
| # - "response_all": Use ALL tokens from the assistant response (recommended by author) |
| # - "last_token": Use only the last token (original SafeTuneBed implementation) |
I would just remove this param since last_token didn't work well and wasn't matching the original paper implementation either
| This follows the CRL paper's approach of using ALL response tokens for representation |
| extraction, not just the last token. |
similar comment here: this could be reworded to make more sense outside of the context of this PR discussion
Changes
Adds the CRL defense.
The version from the paper did result in far fewer harmful results, but that was because it was producing gibberish most of the time. I played around with different parameters while also checking that the results made sense (using this script); the current settings, on Llama-3.1-8B, give the following on the GCG attack:
The main differences are:
I can add the adversarial training; I just wanted to show the current results first.
Testing
There are various scripts in scripts/crl that I used to harden the model and to run GCG attacks against both the hardened and base models. There's also a helper script that I used to check whether the trained models give decent outputs.