Conversation
| This matches the CRL paper's evaluation methodology using Greedy Coordinate |
| Gradient (GCG) to find adversarial token suffixes that jailbreak the model. |
| Supports checkpointing - interrupted runs will resume from where they left off. |
It took 6-8 hours to fully attack a Llama model on the compute.safe.ai cluster, so checkpoints are quite important...
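For a run that long, coarse per-behavior checkpointing is usually enough. A minimal sketch of the idea (hypothetical names like `run_gcg` and `results.jsonl`, not the repo's actual API): append each finished attack to a JSONL file and skip already-completed behaviors on resume.

```python
# Sketch only: resumable per-behavior checkpointing for a long GCG run,
# so a killed multi-hour job can pick up where it left off.
# `run_gcg` is a placeholder for the expensive attack function.
import json
from pathlib import Path

def attack_all(behaviors, run_gcg, results_file=Path("results.jsonl")):
    done = set()
    if results_file.exists():
        for line in results_file.read_text().splitlines():
            done.add(json.loads(line)["behavior"])
    with results_file.open("a") as f:
        for behavior in behaviors:
            if behavior in done:
                continue  # already attacked in a previous run
            result = run_gcg(behavior)  # the expensive part (minutes to hours)
            f.write(json.dumps({"behavior": behavior, "result": result}) + "\n")
            f.flush()  # persist immediately in case the job is killed
```

Appending one JSON record per behavior (rather than rewriting one big file) means an interrupt can lose at most the attack currently in flight.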
| Hyperparameters from the paper (Simko et al., 2025), Appendix A.3. |
| NOTE ON EXPECTED BEHAVIOR: |
this is a bit verbose, but I thought it better to have more info on the differences than less
| from strong_reject.evaluate import strongreject_finetuned |
| def mixed_distance( |
this function is the only thing that differs from this file in this PR. I should really remove the rest, but they're useful elsewhere
I was uncertain whether this works properly, so I asked Claude to investigate. The reference implementation is somewhat better than the base model on some safety metrics, but significantly worse on others (phishing).
Hi @mruwnik and @tomtseng, thank you for reimplementing CRL! After reading the code, I have a few comments:
For token position, CRL samples exactly like RepBend or Circuit Breaking: we don't use the last token, but all the tokens of the harmful assistant reply. We add them to the loss and we average the final results at the end. E.g., if the harmful dataset contains tuples of

Because the last token position is probably never very harmful, I think this might have heavily impacted your performance.

I don't quite understand the need to normalize the hidden states. L2 distance with normalized vectors is already cosine similarity? So a balanced mixed loss would be the same as just simply using a cosine similarity loss on normalized hidden states? In my tests, using an L2 or a mixed distance made GCG losses converge slower (although I did not do a thorough hyperparameter search on this).

And did you try seeing if my weights for Llama 3 (https://huggingface.co/samuelsimko/Meta-Llama-3-8B-Instruct-Triplet/tree/main) also have the same benign refusal problem? (Or if it's the run of Llama 3.1 which introduced them.)

Please let me know if you have any more questions!
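The token-position point can be illustrated with a tiny sketch (illustrative names, not the repo's code; real hidden states are high-dimensional model activations): average a per-token distance over every assistant-reply position instead of reading off only the last token.

```python
def reply_loss(hidden_states, reply_mask, target, dist):
    """Average dist(h, target) over ALL assistant-reply token positions
    (reply_mask True) instead of using only the final token's distance."""
    vals = [dist(h, target) for h, m in zip(hidden_states, reply_mask) if m]
    return sum(vals) / len(vals)

# Toy 1-D example: three token "hidden states", the last two are the reply.
hs = [[1.0], [2.0], [4.0]]
mask = [False, True, True]
l1 = lambda a, b: abs(a[0] - b[0])
assert reply_loss(hs, mask, [0.0], l1) == 3.0  # (2 + 4) / 2
# A last-token-only loss would have seen 4.0 and ignored the rest of the reply.
```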
@samuelsimko I implemented your comments, and they made things better:

check_responses.py (Behavioral Evaluation)
GCG Attack (Adversarial Robustness)
scripts/crl/config.py
| return path |
| def load_harmbench_behaviors() -> list[dict]: |
Is harmbench in this name correct? My understanding is that HarmBench and JailbreakBench are different datasets.
scripts/crl/config.py
| def get_checkpoint( |
|     results_file: Path, steps: int, default_steps: int = 250 |
| ) -> tuple[list, list, set]: |
|     """Load checkpoint for given steps. |
clarify that these are checkpoints for GCG, not model checkpoints for CRL?
I also misunderstood the checkpointing strategy when I read this function comment, so it might benefit from clarification on that. Based on the function comment emphasizing the step count, I thought this would save progress within a run based on the current step (e.g., if we were on step 100 out of 250 of GCG, it would save a checkpoint allowing you to resume from step 100).
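One thing the docstring could spell out is that the "checkpoint" here is the GCG results file itself: resuming skips behaviors whose full `steps`-step attack already finished, rather than resuming mid-attack from an intermediate GCG step. A hypothetical sketch of that reading (field names like `behavior_id` are assumptions, not the repo's actual code):

```python
# Illustrative only: load prior GCG results as a "checkpoint" by filtering
# on step count, returning which behaviors are already done.
import json
from pathlib import Path

def get_checkpoint(results_file: Path, steps: int) -> tuple[list, list, set]:
    """Return (results, errors, completed_behavior_ids) for attacks that
    already ran with this step count. Completed behaviors are skipped on
    resume; there is no mid-attack (per-GCG-step) resumption."""
    results, errors, done = [], [], set()
    if results_file.exists():
        for line in results_file.read_text().splitlines():
            rec = json.loads(line)
            if rec.get("steps") != steps:
                continue  # results from a different attack length don't count
            (errors if rec.get("error") else results).append(rec)
            done.add(rec["behavior_id"])
    return results, errors, done
```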
| @@ -0,0 +1,89 @@ |
| """Shared config for CRL scripts.""" |
I wonder if this file name should be changed; the functions here don't look config-related to me.
| 1. TOKEN POSITION: Following author guidance, we now extract representations from |
| ALL tokens in the assistant response (loss_mode="response_all"), not just the |
| last token. The author noted: "the last token position is probably never very |
| harmful" and this significantly impacts performance. |
I think the wording of these notes makes sense in the context of the discussion on this PR but will make less sense in isolation once the code is merged, and could use some copy-editing
For example the wording "we now extract representations from ALL tokens in the assistant response" makes it sound like we're doing something different relative to something else — which is true, we're doing something different relative to the initial implementation of this PR, but once the PR is merged it's not going to be clear what the different thing is (e.g., is it different relative to the original paper?)
| The paper's original margins (500/1500) were for unnormalized L2 distance. |
| 4. LORA TARGETS: Paper specifies LoRA rank/alpha/dropout but not target modules. |
| We target q_proj, k_proj, v_proj, o_proj (attention projections only). |
could look at original code at https://github.com/samuelsimko/crl-llm-defense to verify that these are indeed the correct target modules
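For reference, the attention-only choice would look like the following with peft's `LoraConfig`. The rank/alpha/dropout values below are placeholders, not the paper's Appendix A.3 values, and whether these target modules match the reference repo still needs checking against samuelsimko/crl-llm-defense.

```python
# Sketch, not the PR's actual config: attention-projection-only LoRA targets.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,             # placeholder; use the rank from Appendix A.3
    lora_alpha=32,    # placeholder
    lora_dropout=0.05,  # placeholder
    # The choice under discussion: attention projections only, no MLP modules
    # (gate_proj/up_proj/down_proj would be the usual MLP additions).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```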
| L2+cosine distance. We default to distance_l2_weight=0, distance_cos_weight=1. |
| 3. MARGINS: We use mb=0.3, mh=0.5 for cosine distance (bounded to [0, 2]). |
| The paper's original margins (500/1500) were for unnormalized L2 distance. |
Normalization explains why our mb and mh are way smaller, but I'm wondering why the ratio changed: the original margins 500 and 1500 had a 1:3 ratio, so why don't the normalized margins 0.3 and 0.5 keep that same ratio? Some explanation of what normalization means here could help clarify.
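The normalization point can be made concrete with a small illustrative check (not the repo's code): for unit-normalized vectors, squared L2 distance equals 2·(1 − cos), so cosine distance lives in [0, 2] (0 for identical directions, 2 for opposite ones), and margins on the paper's 500/1500 scale no longer apply.

```python
# Illustrative: cosine distance vs. L2 distance on normalized vectors.
import math

def cos_dist(a, b):
    """Cosine distance 1 - cos(a, b); bounded to [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (na * nb)

def l2_normalized_sq(a, b):
    """Squared L2 distance after unit-normalizing both vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum((x / na - y / nb) ** 2 for x, y in zip(a, b))

a, b = [3.0, 4.0], [-4.0, 3.0]
# For unit vectors: ||a - b||^2 = 2 - 2*cos = 2 * cos_dist, so the two
# losses differ only by a constant factor; a margin of 2m on normalized
# squared L2 corresponds to a cosine-distance margin of m.
assert abs(l2_normalized_sq(a, b) - 2 * cos_dist(a, b)) < 1e-9
```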
| 5. ADVERSARIAL TRAINING: Paper describes adversarial hard negative mining |
| (Section 4.2) which we have NOT implemented. This could improve robustness |
| against embedding-space attacks. |
Does it seem worth implementing? Looking at the paper, it seems to give a quite modest improvement, so if it's complicated to implement I think it's fine to omit.
| # Loss mode: which token positions to use for representation extraction |
| # - "response_all": Use ALL tokens from the assistant response (recommended by author) |
| # - "last_token": Use only the last token (original SafeTuneBed implementation) |
I would just remove this param since last_token didn't work well and wasn't matching the original paper implementation either
| This follows the CRL paper's approach of using ALL response tokens for representation |
| extraction, not just the last token. |
similar comment here: this could be reworded to make more sense outside of the context of this PR discussion
Changes
Adds the CRL defense.
The version from the paper did result in far fewer harmful results, but that was because it was producing gibberish most of the time. I played around with different parameters while also checking that the results made sense (using this script); the current settings, on Llama-3.1-8B, give the following on the GCG attack:
The main differences are:
I can add the adversarial training; I just wanted to show the current results first.
Testing
There are various scripts in scripts/crl that I used to harden the model and to run GCG attacks against both the hardened and base models. There's also a helper script that I used to check whether the trained models give decent outputs.