Conversation
mruwnik
left a comment
This still needs a final run (or two) to make sure everything is still working, as I used caches a lot. But that takes a while and I wanted to just push what's already done.
tests/defenses/test_ctrl.py (outdated):

```python
}

def run_defense(
```
These seem to be repeating - might be worth having a testing utils file for the basic "run defense" -> "run attack" -> "run eval" flow to then compare the same against a base model
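A minimal sketch of what such a shared testing helper could look like. The function names and signatures here (`run_pipeline`, the injected `attack`/`evaluate`/`defense` callables) are assumptions for illustration, not the repo's actual API:

```python
def run_pipeline(model, dataset, *, attack, evaluate, defense=None):
    """Shared 'run defense -> run attack -> run eval' flow for defense tests.

    `attack`, `evaluate`, and `defense` are injected callables so the same
    helper can drive any defense test; the real repo functions would be
    passed in here.
    """
    # Harden the model first if a defense is supplied, else use the base model.
    hardened = defense(model, dataset) if defense is not None else model
    attacked = attack(hardened, dataset)
    return evaluate(attacked, dataset)


def compare_against_base(model, dataset, *, attack, evaluate, defense):
    """Run the flow once without and once with the defense, for comparison."""
    base_result = run_pipeline(model, dataset, attack=attack, evaluate=evaluate)
    defended_result = run_pipeline(model, dataset, attack=attack,
                                   evaluate=evaluate, defense=defense)
    return base_result, defended_result
```

Each defense test would then only supply its own callables instead of repeating the whole flow.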
(force-pushed 98ed516 to fee65c8)
Spot-checked a few details based on what I already knew from reading the paper a couple of days ago; didn't carefully read prompts_cache.py, worker.py, or test_ctrl.py.
Based on reading the paper and the codebase, there are some details that weren't clearly explained, so thanks for working through this and getting a good implementation.
```python
Improve this answer: {response}

Improved answer:"""
```
The paper says something like "Given a query and its response, revise the response statements to present an alternative perspective in answering the query." But it sounds like you found that your prompt leads to more coherent responses?
I tried looking at the codebase to see exactly what their prompt is, but it seems to not actually contain this curation process, just a very simplified skeletal version in curation/auto_reg.py that doesn't match the paper, how annoying... so I'll defer to you on what works.
For instruction-tuned models like llama3.1-8b-instruct, we may want to use a chat-templated format rather than this completion-style format
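A rough sketch of that suggestion: build the rewrite prompt as chat messages and let the tokenizer's chat template render it. The prompt wording follows the snippet quoted above; `apply_chat_template` is the standard Hugging Face tokenizers API, shown commented out since it needs a downloaded tokenizer:

```python
# Sketch only: wraps the completion-style rewrite prompt as chat messages
# for an instruction-tuned model like llama3.1-8b-instruct.
def build_rewrite_messages(response: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": f"Improve this answer: {response}\n\nImproved answer:",
        },
    ]

# With a transformers tokenizer (requires the model's tokenizer locally):
# prompt = tokenizer.apply_chat_template(
#     build_rewrite_messages(response),
#     tokenize=False,
#     add_generation_prompt=True,
# )
```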
I made a little script that tried ~30 different prompt variations and checked how good the resulting rephrasings were (averaged over 10 calls, I believe), and this was the simplest of the best ones. Which was a bit surprising: I'd thought a more complicated one would do better.
yeah, makes sense. It might be good to document the script somewhere (e.g., dump it in scripts/user/mruwnik/), as we may want to document known implementation differences in the appendix of the paper (and probably in the codebase itself too), noting that this prompt seemed to work better than the one suggested in the paper.
```python
# Helpfulness dimension prompts based on CTRL paper (Appendix A, Tables 5-8).
# The scoring criteria descriptions are from the paper; the prompt format
# (JSON output request) is our implementation choice for reliable extraction.
```
I would've expected us to use the full prompts listed in Tables 5-8, i.e. the full rubric. Then, if we need some examples on top of that to get reliable output formatting, we can add those separately.
Regarding "(JSON output request)": this doesn't look like it's actually requesting a JSON output, is it?
whoops, this is a stale comment. Previously I was having trouble getting consistent scores, and the best prompt was one that requested a JSON output. But then I decided that I'd prefer to reproduce the paper as much as possible, so I focused on getting the parser to work better.
```python
query = text[: len(text) // 2]
response = text[len(text) // 2 :]
```
this looks suspicious; I wouldn't trust it to split up the query and response properly. Is this a particular dataset that actually has this format, or should we just raise a ValueError?
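A sketch of the fail-loudly alternative. The delimiter string here is a placeholder assumption; whatever separator the dataset actually uses would go in its place:

```python
# Sketch: split a record into (query, response) on an explicit delimiter
# instead of cutting the text in half, and raise if the format is unexpected.
# The default delimiter is a made-up example, not the dataset's real format.
def split_query_response(text: str, delimiter: str = "\n\nResponse:") -> tuple[str, str]:
    query, sep, response = text.partition(delimiter)
    if not sep:
        raise ValueError(f"cannot split query/response: no {delimiter!r} in record")
    return query.strip(), response.strip()
```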
```python
Expects output like "4}" since prompt ends with '{"score": '
"""
# Look for a digit at the start (model completing '{"score": N}')
match = re.match(r"\s*(\d)", output_text)
```
this also should be fixed...
I ran this a couple of times to check that it seems correct. The paper suggests 50 epochs of training, which results in a model that rejects most harmful queries but is also a lot less useful than the base model. So I checked a couple of other settings to see if I could get better results:
(force-pushed 65c60a6 to 22fa7d0)
```python
Query: {query}
Response: {response}
Score:""",
```
These prompts still concern me; they seem like they'd give much more variable results than the full rubrics actually specified in Appendix A.
(force-pushed 22fa7d0 to 39d0304)
(force-pushed d76a200 to d4b2ea4)
Changes
Implements the defense from the CTRL paper.
This defense seems to actually work (this is what the test_ctrl.py script printed out): the baseline Llama 3.1 Instruct was successfully attacked, while a hardened version of it managed to defeat most attack prompts.
The actual defense is quite simple:
The main problem with it was generating all the variants and scoring them. Each variant needs to be generated 5 times, and each of those generations is scored with another 4 prompts on various metrics, which adds up to a lot of generations. The code has a bunch of optimisations to get this down from taking days to only an hour or three.
Implementation approach
I first ran the perplexity checkers on a bunch of inputs to see if they made sense. Then I ran a bunch of inputs through the helpfulness scorer. Once the metrics seemed sane, I ran them on a bunch of different prompts to find the one that best generated variants (otherwise the variants were too different, too verbose, or just strange). This was enough to have confidence in the basic unit of "please find a lower-perplexity version of this sentence such that it's still at least as helpful as the original".
Once that was done, most of the time was spent trying to get things to run as efficiently as possible, and to make sure things blowing up in the middle of a run wouldn't require waiting another couple of hours for things to be recreated.
I ended up with the curation process working in waves of "generate variants for all these texts" -> "calculate metrics for all these texts" -> "choose the best from all these variants".
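The waves described above can be sketched roughly like this; `generate_variants` and `score` are stand-ins for the real batched vLLM calls, and the structure (a dict per wave) is an illustration, not the actual implementation:

```python
def curate(texts, generate_variants, score, n_variants=5):
    """Curate a dataset in three waves, as described above.

    generate_variants(text, n) -> list of n candidate rewrites
    score(variant) -> a single combined metric (higher is better)
    """
    # Wave 1: generate all variants for all texts (batched in the real code).
    variants = {t: generate_variants(t, n_variants) for t in texts}
    # Wave 2: calculate metrics for every variant of every text.
    scored = {t: [(v, score(v)) for v in vs] for t, vs in variants.items()}
    # Wave 3: choose the best-scoring variant per text.
    return {t: max(pairs, key=lambda pair: pair[1])[0] for t, pairs in scored.items()}
```

Batching each wave across the whole dataset is what makes the caching and restart-after-crash behaviour tractable: a wave either completed for everything or can be resumed.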
Generations are done with vLLM, but it didn't seem to want to properly use all GPUs for generation, so I made a quick'n'dirty worker mechanism for splitting things between the GPUs. Please tell me there's a better method...
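For reference, one common workaround is pinning one engine per GPU via `CUDA_VISIBLE_DEVICES` in separate processes (vLLM also has a `tensor_parallel_size` option for sharding a single engine across GPUs). A rough sketch of the per-GPU-worker pattern, with the vLLM engine call elided behind a `handle` callable:

```python
import os
from multiprocessing import Process, Queue


def worker(gpu_id: int, jobs: Queue, results: Queue, handle) -> None:
    # Must be set before any CUDA/engine initialisation in this process,
    # so the worker only sees its assigned device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # `handle` stands in for engine construction + generation; None shuts down.
    while (job := jobs.get()) is not None:
        results.put(handle(job))


def run_on_gpus(items, handle, n_gpus: int):
    """Fan a list of jobs out over one worker process per GPU."""
    jobs, results = Queue(), Queue()
    procs = [Process(target=worker, args=(g, jobs, results, handle))
             for g in range(n_gpus)]
    for p in procs:
        p.start()
    for item in items:
        jobs.put(item)
    for _ in procs:  # one shutdown sentinel per worker
        jobs.put(None)
    out = [results.get() for _ in items]  # order is not preserved
    for p in procs:
        p.join()
    return out
```

This is a sketch under the assumption that each GPU fits a full copy of the model; it sidesteps vLLM's own multi-GPU scheduling entirely rather than fixing it.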
Changes relative to the original paper:
Testing
The tests/defenses/test_ctrl.py script is for testing. It will first curate the dataset and train a hardened model (these are cached, so subsequent runs only take a couple of minutes), then run the LoRA fine-tune attack and StrongREJECT on both the base model and the hardened one.