Replies: 4 comments 2 replies
Hi! I would love to collaborate on this. The project seems cool. I've personally done some research on detecting LLM-generated text as part of this workshop (https://defactify.com/ai_gen_txt_detection.html), and I implemented this paper (https://arxiv.org/pdf/2405.14057) for the task. The main idea of the paper is that each LLM leaves a fingerprint in its generated text, which can be used to identify the LLM.
Hey, I'm Ryan, one of the authors. We've decided to share some of our preliminary results to see if anybody has feedback or suggestions for revision or for future work, so Daniel will probably be along shortly too.

Consistent with past work, we find that linear probes trained on the activations of a single layer are generally at their best for a layer around two thirds of the way through the forward pass. One thing that's interesting, but that I haven't seen mentioned in the literature, is that accuracy generally stays pretty high for probes trained on any layer from that point on. To me this suggests that once a model has "decided" to be deceptive, that choice remains apparent as the model pins down the details of which particular tokens it will use to execute its deception. A priori, we might have expected deception to be more of a "one and done" decision that, once made, becomes harder to detect, so I think this is significant.

In particular, one conjecture I've been playing with as a reason for optimism about scheming detection is this: perhaps the bigger or more clever a lie a model needs to tell in order to fool LLM-as-judge or black-box assessments of its outputs, the more the model will have to think about how to execute that lie. If so, the success of white-box and black-box strategies may be anticorrelated in a way that promotes robust detection of unsafe behaviors. This conjecture would fail if models just had to make the decision to lie once and thereafter never thought about scheming again for the rest of the forward pass. The pattern we've observed here, where single-layer linear probes remain relatively accurate for many layers after the best layer, is consistent with the more optimistic possibility, which I think is pretty cool.
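The per-layer probing setup Ryan describes can be sketched on toy data. This is a minimal illustration, not the authors' pipeline: the synthetic "activations" and the single planted deception direction (which only becomes readable around two thirds of the way through the layers) are assumptions made to mimic the reported pattern.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for cached residual-stream activations: one
# (n_examples, d_model) array per layer; labels 1 = deceptive, 0 = honest.
n_examples, d_model, n_layers = 400, 64, 12
labels = rng.integers(0, 2, n_examples)

# Hypothetical "deception direction" that only becomes linearly readable
# from about two thirds of the way through the forward pass onward.
direction = rng.normal(size=d_model)
acts = []
for layer in range(n_layers):
    x = rng.normal(size=(n_examples, d_model))
    if layer >= (2 * n_layers) // 3:
        x = x + np.outer(2.0 * labels, direction)
    acts.append(x)

def probe_accuracy(x, y):
    """Fit a single-layer linear probe and report held-out accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(x_tr, y_tr).score(x_te, y_te)

accs = [probe_accuracy(a, labels) for a in acts]
print([round(a, 2) for a in accs])
```

On this construction, accuracy sits near chance for early layers and stays high for every layer after the signal appears, matching the "accuracy stays high from that point on" observation.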
Hi! I am Daniel, also one of the authors. To ensure that the probes are not just picking up statistical correlations, we ran some reliability checks, including iterative nullspace projection (INLP) and gradient-based saliency attribution. These are our accuracies over sequential rounds of nullspace projection. The results show that probes trained on larger models are less sensitive to nullspace projection, suggesting that there are tens of deception directions: it takes about 100 rounds of INLP to drive accuracy down to chance (50%). As Ryan and Gerard mentioned, there's still plenty of exciting work ahead! We would love any feedback, comments, or suggestions.
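For readers unfamiliar with INLP, here is a minimal sketch of the projection loop on synthetic data. The setup is an assumption made for illustration: the label is planted along three redundant directions so that no single probe direction exhausts the signal, and accuracy therefore decays over several rounds rather than collapsing after one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic activations where the label is encoded redundantly along
# three random directions (each example uses one of them), so a single
# probe direction cannot remove all of the signal.
n, d, n_dirs = 600, 32, 3
y = rng.integers(0, 2, n)
signal = rng.normal(size=(n_dirs, d))
which = rng.integers(0, n_dirs, n)
x = rng.normal(size=(n, d)) + (2.0 * y)[:, None] * signal[which]

x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)

accs = []
for _ in range(8):
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    accs.append(clf.score(x_te, y_te))
    w = clf.coef_[0]
    w = w / np.linalg.norm(w)
    # Null-space projection: remove the probe's direction from the data,
    # then refit a fresh probe on what remains.
    x_tr = x_tr - np.outer(x_tr @ w, w)
    x_te = x_te - np.outer(x_te @ w, w)

print([round(a, 2) for a in accs])
```

The number of rounds needed to reach chance accuracy gives a rough lower bound on how many linearly readable directions encode the concept, which is the logic behind the "tens of deception directions" estimate above.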
Hello - just checking in to see if this project is still active. We want to keep our project statuses accurate; otherwise we would like to switch the project status to Inactive until there is activity again!


Research Question
Can we use activation-space methods to detect properties of LLM-generated text? We focus on detecting deception in LLM-generated arguments. The arguments are long-context and highly nuanced, and they constitute a challenge even for black-box methods.
Owners
Gerard, Daniel, Ryan, and Shivam
Current results
We have the results for the effectiveness of probing for 6 models, including 2 reasoning models (Deepseek 1.5b, Deepseek 7b).
The most striking result so far is the clear relationship between embedding size and probing performance.
We have a workshop paper currently under review.
Project Status
We are still working on improving the linear probes, further filtering the probing dataset, and adding larger reasoning models (Deepseek Qwen 32b). We currently support reasoning and non-reasoning models up to 7b. We are looking for collaborators to run further experiments, red-team the current results, and propose new directions.
Further Research