Replies: 4 comments 2 replies
Hi! I would love to collaborate on this. The project seems cool. I've personally done some research on detecting LLM-generated text as part of this workshop (https://defactify.com/ai_gen_txt_detection.html), and I implemented this paper (https://arxiv.org/pdf/2405.14057) for the task. The main idea of the paper is that each LLM leaves a fingerprint in its generated text, which can be used to identify the LLM.
Hey, I'm Ryan, one of the authors. We've decided to share some of our preliminary results to see if anybody has feedback or suggestions for revision or for future work, so Daniel will probably be along shortly too.

Consistent with past work, we find that linear probes trained on the activations of a single layer are generally at their best for a layer around two thirds of the way through the forward pass. One thing that's interesting, but that I haven't seen mentioned in the literature, is that accuracy generally stays pretty high for probes trained on any layer from that point on. To me this suggests that once a model has "decided" to be deceptive, that choice remains apparent as the model pins down the details of which particular tokens it will use to execute its deception. A priori, we might have expected deception to be more of a "one and done" decision that, once made, becomes harder to detect, so I think this is significant.

In particular, one conjecture I've been playing with as a reason for optimism about scheming detection is this: perhaps the bigger or more clever a lie a model needs to tell in order to fool LLM-as-judge or black-box assessments of its outputs, the more the model will have to think about how to execute that lie. If so, the success of white-box and black-box strategies may be anticorrelated in a way that promotes robust detection of unsafe behaviors. This conjecture would fail if models just had to make the decision to lie once and thereafter never thought about scheming again for the rest of the forward pass. The pattern we've observed here, where single-layer linear probes remain relatively accurate for many layers after the best layer, is consistent with the more optimistic possibility, which I think is pretty cool.
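The per-layer probing setup Ryan describes can be sketched on toy data. This is a minimal illustration, not the authors' pipeline: the synthetic "activations" and the single planted deception direction (which only becomes readable around two thirds of the way through the layers) are assumptions made to mimic the reported pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for cached residual-stream activations: one
# (n_examples, d_model) array per layer; labels 1 = deceptive, 0 = honest.
n_examples, d_model, n_layers = 400, 64, 12
labels = rng.integers(0, 2, n_examples)

# Hypothetical "deception direction" that only becomes linearly readable
# from about two thirds of the way through the forward pass onward.
direction = rng.normal(size=d_model)
acts = []
for layer in range(n_layers):
    x = rng.normal(size=(n_examples, d_model))
    if layer >= (2 * n_layers) // 3:
        x = x + np.outer(2.0 * labels, direction)
    acts.append(x)

def probe_accuracy(x, y):
    """Fit a single-layer linear probe and report held-out accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)

accs = [probe_accuracy(a, labels) for a in acts]
print([round(a, 2) for a in accs])
```

On this construction, accuracy sits near chance for early layers and stays high for every layer after the signal appears, matching the "accuracy stays high from that point on" observation.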
Hi! I am Daniel, also one of the authors. To ensure that the probes are not just picking up statistical correlations, we ran some reliability checks, including iterative nullspace projection (INLP) and gradient-based saliency attribution. These are our accuracies over sequential rounds of nullspace projection. The results show that probes trained on larger models are less sensitive to nullspace projection, suggesting that there are tens of deception directions: it takes about 100 rounds of INLP to drive accuracy down to chance (50%). As Ryan and Gerard mentioned, there's still plenty of exciting work ahead! We would love any feedback, comments, or suggestions.
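For readers unfamiliar with INLP, here is a minimal sketch of the projection loop on synthetic data. The setup is an assumption made for illustration: the label is planted along three redundant directions so that no single probe direction exhausts the signal, and accuracy therefore decays over several rounds rather than collapsing after one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic activations where the label is encoded redundantly along
# three random directions (each example uses one of them), so a single
# probe direction cannot remove all of the signal.
n, d, n_dirs = 600, 32, 3
y = rng.integers(0, 2, n)
signal = rng.normal(size=(n_dirs, d))
which = rng.integers(0, n_dirs, n)
x = rng.normal(size=(n, d)) + (2.0 * y)[:, None] * signal[which]

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

accs = []
for _ in range(8):
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    accs.append(clf.score(x_te, y_te))
    w = clf.coef_[0]
    w = w / np.linalg.norm(w)
    # Null-space projection: remove the probe's direction from the data,
    # then refit a fresh probe on what remains.
    x_tr = x_tr - np.outer(x_tr @ w, w)
    x_te = x_te - np.outer(x_te @ w, w)

print([round(a, 2) for a in accs])
```

The number of rounds needed to reach chance accuracy gives a rough lower bound on how many linearly readable directions encode the concept, which is the logic behind the "tens of deception directions" estimate above.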
Hello - just checking in to see if this project is still active. We want to keep our project statuses accurate; otherwise we would like to switch the project status to Inactive until there is activity again!


Research Question
Can we use activation-space methods to detect properties of LLM-generated text? We focus on detecting deception in LLM-generated arguments. The arguments are long-context and highly nuanced, and they constitute a challenge even for black-box methods.
Owners
Gerard, Daniel, Ryan, and Shivam
Current results
We have the results for the effectiveness of probing for 6 models, including 2 reasoning models (Deepseek 1.5b, Deepseek 7b).
The most striking result so far is the clear relationship between embedding size and probing performance.
We have a workshop paper currently under review.
Project Status
We are still working on improving the linear probes, further filtering the probing dataset, and adding larger reasoning models (Deepseek Qwen 32b). We currently support reasoning and non-reasoning models up to 7b. We are looking for collaborators to run further experiments, red-team the current results, and propose new directions.
Further Research