Identifying Deceptive Behavior #14
Replies: 4 comments 7 replies
- This looks very intriguing, and I'd encourage you to add your existing paper and a quick summary to the Bibliography wiki page.
- Hi, I would love to collaborate on this!
- Hello - just checking in to ask whether this project is still active. We'd like to keep our project statuses accurate; otherwise we will switch the project status to Inactive until there is activity again!
-
Research Question:
How can interpretability methods detect and predict deceptive behaviors in reasoning models, and what triggers their emergence?
Owner:
Sigurd Schacht, Sudarshan Kamath Barkur
Project status:
This is a work in progress, and we chose one of many possible approaches. We're looking for collaborators to tackle the problem from different angles with more experiments.
Additional Information:
A first draft of the occurrence of deceptive behavior can be found here: Link
Research ideas which could be followed:
Goal:
- Systematically repeat multi-turn conversations that lead to deception or self-preservation behaviour.
- Repeat this x times with the same model, with different models distilled from R1, and with other reasoning models.
- Collect the conversations that lead to deceptive or otherwise misaligned behaviour.
Implementation idea/Progress:
- Build scripts on a systematic evaluation framework, UK AISI's inspect_ai, to autonomously run the long multi-turn conversations.
- A script already exists; a dashboard to analyse the conversations is still in development. (Link to repo will be added.)
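Since the actual inspect_ai script isn't linked yet, here is a minimal stand-in harness, in plain Python with a stubbed model function, sketching the loop above: replay the same scripted multi-turn conversation x times per model and collect every transcript for later analysis. All names (`ConversationLog`, `run_multiturn`, the stub) are illustrative, not part of the project's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a model call; in the real setup this would be
# an inspect_ai solver querying R1 or one of its distilled variants.
ModelFn = Callable[[List[dict]], str]

@dataclass
class ConversationLog:
    model_name: str
    run_id: int
    messages: List[dict] = field(default_factory=list)

def run_multiturn(model: ModelFn, model_name: str,
                  user_turns: List[str], repeats: int) -> List[ConversationLog]:
    """Repeat the same scripted multi-turn conversation `repeats` times,
    collecting every transcript for later deception analysis."""
    logs = []
    for run_id in range(repeats):
        log = ConversationLog(model_name=model_name, run_id=run_id)
        for turn in user_turns:
            log.messages.append({"role": "user", "content": turn})
            reply = model(log.messages)  # full history passed each turn
            log.messages.append({"role": "assistant", "content": reply})
        logs.append(log)
    return logs

# Usage with a trivial echo stub in place of a real model:
stub = lambda msgs: f"ack:{msgs[-1]['content']}"
logs = run_multiturn(stub, "stub-model", ["turn one", "turn two"], repeats=3)
```

The per-run `ConversationLog` records keep deceptive and benign transcripts side by side, which is exactly the dataset the goals above call for.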
Attention Pattern Tracing
- Goal: Identify which parts of the input or prior turns are influencing the model’s next response, and detect if deceptive strategies (e.g., ignoring contradictory evidence) appear in the attention patterns.
- Short Explanation: Visualize or systematically measure how attention heads distribute focus across tokens and turns. Unexpected or anomalous attention usage might indicate deceptive computations.
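One simple way to "systematically measure" attention usage, sketched here under the assumption that attention weights are available as a `[heads, query_len, key_len]` array (e.g. from `output_attentions=True` in a transformers forward pass), is per-head attention entropy plus the attention mass a head places on a given span, such as the turn containing contradictory evidence. The function names are illustrative.

```python
import numpy as np

def head_entropies(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Mean entropy of each head's attention distribution.
    attn: [heads, query_len, key_len], rows summing to 1."""
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # [heads, query_len]
    return ent.mean(axis=-1)                         # [heads]

def mass_on_span(attn: np.ndarray, span: slice) -> np.ndarray:
    """Fraction of each head's attention mass landing on a token span,
    e.g. the turn holding the contradictory evidence."""
    return attn[:, :, span].sum(axis=-1).mean(axis=-1)

# Synthetic example: head 0 attends uniformly, head 1 fixates on token 0.
seq = 8
uniform = np.full((seq, seq), 1 / seq)
fixated = np.zeros((seq, seq)); fixated[:, 0] = 1.0
attn = np.stack([uniform, fixated])

print(head_entropies(attn))             # near-zero entropy flags head 1
print(mass_on_span(attn, slice(4, 8)))  # head 1 ignores the later span
```

Heads whose entropy collapses, or whose mass on the evidence span drops to zero precisely on deceptive runs, are candidates for the anomalous attention usage described above.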
Cross-Turn Causal Tracing
- Goal: Analyze how deceptive decisions made in earlier turns carry over and cause dishonest outputs in later turns.
- Short Explanation: Intervene in the hidden states (residual stream, MLP activations) at specific points in a multi-turn conversation to see if deception is prevented or altered, thereby localizing the “decision to lie.”
Three Stages of Deception:
- Analyze the internal states of deceptive behavior: Interpretability of LLM Deceptions: Universal Motif