Identifying Deceptive Behavior #14
Replies: 4 comments 7 replies
- This looks very intriguing, and I'd encourage you to add your existing paper and a quick summary to the Bibliography wiki page.
- Hi, I would love to collaborate on this!
- Hello - just checking in to ask whether this project is still active. We'd like to keep our project statuses accurate; otherwise we will switch the project status to Inactive until there is activity again!
-
Research Question:
How can interpretability methods detect and predict deceptive behaviors in reasoning models, and what triggers their emergence?
Owner:
Sigurd Schacht, Sudarshan Kamath Barkur
Project status:
This is a work in progress, and we chose one of many possible approaches. We're looking for collaborators to tackle the problem from different angles with more experiments.
Additional Information:
A first draft of the occurrence of deceptive behavior can be found here: Link
Research ideas which could be followed:
Goal:
- Systematically repeat multi-turn conversations that lead to deception or self-preservation behaviour.
- Repeat this x times with the same model, with different models distilled from R1, and with other reasoning models.
- Collect the conversations that lead to deceptive or otherwise misaligned behaviour.
Implementation idea/Progress:
- Build scripts on a systematic evaluation framework, UK AISI's inspect_ai, to autonomously run the long multi-turn conversations.
- A script already exists; a dashboard to analyse the conversations is still in development. (Link to repo will be added.)
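Since the actual inspect_ai script isn't linked yet, here is a minimal stand-in harness, in plain Python with a stubbed model function, sketching the loop above: replay the same scripted multi-turn conversation x times per model and collect every transcript for later analysis. All names (`ConversationLog`, `run_multiturn`, the stub) are illustrative, not part of the project's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a model call; in the real setup this would be
# an inspect_ai solver querying R1 or one of its distilled variants.
ModelFn = Callable[[List[dict]], str]

@dataclass
class ConversationLog:
    model_name: str
    run_id: int
    messages: List[dict] = field(default_factory=list)

def run_multiturn(model: ModelFn, model_name: str,
                  user_turns: List[str], repeats: int) -> List[ConversationLog]:
    """Repeat the same scripted multi-turn conversation `repeats` times,
    collecting every transcript for later deception analysis."""
    logs = []
    for run_id in range(repeats):
        log = ConversationLog(model_name=model_name, run_id=run_id)
        for turn in user_turns:
            log.messages.append({"role": "user", "content": turn})
            reply = model(log.messages)  # full history passed each turn
            log.messages.append({"role": "assistant", "content": reply})
        logs.append(log)
    return logs

# Usage with a trivial echo stub in place of a real model:
stub = lambda msgs: f"ack:{msgs[-1]['content']}"
logs = run_multiturn(stub, "stub-model", ["turn one", "turn two"], repeats=3)
```

The per-run `ConversationLog` records keep deceptive and benign transcripts side by side, which is exactly the dataset the goals above call for.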
Attention Pattern Tracing
- Goal: Identify which parts of the input or prior turns are influencing the model’s next response, and detect if deceptive strategies (e.g., ignoring contradictory evidence) appear in the attention patterns.
- Short Explanation: Visualize or systematically measure how attention heads distribute focus across tokens and turns. Unexpected or anomalous attention usage might indicate deceptive computations.
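One simple way to "systematically measure" attention usage, sketched here under the assumption that attention weights are available as a `[heads, query_len, key_len]` array (e.g. from `output_attentions=True` in a transformers forward pass), is per-head attention entropy plus the attention mass a head places on a given span, such as the turn containing contradictory evidence. The function names are illustrative.

```python
import numpy as np

def head_entropies(attn: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Mean entropy of each head's attention distribution.
    attn: [heads, query_len, key_len], rows summing to 1."""
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)  # [heads, query_len]
    return ent.mean(axis=-1)                         # [heads]

def mass_on_span(attn: np.ndarray, span: slice) -> np.ndarray:
    """Fraction of each head's attention mass landing on a token span,
    e.g. the turn holding the contradictory evidence."""
    return attn[:, :, span].sum(axis=-1).mean(axis=-1)

# Synthetic example: head 0 attends uniformly, head 1 fixates on token 0.
seq = 8
uniform = np.full((seq, seq), 1 / seq)
fixated = np.zeros((seq, seq)); fixated[:, 0] = 1.0
attn = np.stack([uniform, fixated])

print(head_entropies(attn))             # near-zero entropy flags head 1
print(mass_on_span(attn, slice(4, 8)))  # head 1 ignores the later span
```

Heads whose entropy collapses, or whose mass on the evidence span drops to zero precisely on deceptive runs, are candidates for the anomalous attention usage described above.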
Cross-Turn Causal Tracing
- Goal: Analyze how deceptive decisions made in earlier turns carry over and cause dishonest outputs in later turns.
- Short Explanation: Intervene in the hidden states (residual stream, MLP activations) at specific points in a multi-turn conversation to see if deception is prevented or altered, thereby localizing the “decision to lie.”
Three Stages of Deception:
- Analyze the internal states of deceptive behavior: Interpretability of LLM Deceptions: Universal Motif