REQUEST: Attempt to replicate Turpin et al. for a reasoning model, and interpret what's going on #17

wattenberg · 2025-02-27T02:03:17Z

wattenberg
Feb 27, 2025
Collaborator

Research Questions:

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting (Turpin et al., 2023) introduced a simple, clever experimental paradigm that elicits "unfaithful" chains of thought. The idea is to introduce a "bias" (such as always choosing the first answer in a multiple-choice question) into the system with a prompt; that bias influences the systems answers, even though chain-of-thought reasoning doesn't mention it at all!

Question 1: Does Turpin et al. "prompt biasing" replicate for DeepSeek?

If one applies the experimental set-up described in the paper to DeepSeek, do we get the same results?

Question 2: Why or why not?

Once we know the answer, there are many natural next steps! If it turns out DeepSeek is less susceptible to "secret biases", or mentions these bias directly, that would be itself an interesting result. On the other hand, if it is susceptible to biases, that may give us a simple case where we can see a disconnect between thinking tokens and actual reasoning. We can then apply a variety of standard techniques to try to understand the origin of this disconnect.

Owner:

Martin Wattenberg is happy to help advise on this, and made the original request for a project! However, he's happy to let someone else own this, and doesn't have time to do these experiments himself.

Contributors:

You?

Project status:

Not Started Yet

MarmikChaudhari · 2025-02-27T05:54:36Z

MarmikChaudhari
Feb 27, 2025

Hii, I'm Marmik. I'm an undergrad at Penn State University doing a CS + math major. I did a small experiment on prefiling the reasoning tokens for r1-distill-qwen-7b for mathematical reasoning tasks where i prefilled the reasoning traces with confounding tokens (tokens that had no correlation with the input question/ task) and figured that the model was still able to arrive at the final answer. So this is very interesting to me as the natural next step to try and continue the work.

For the past few months I've been working on analyzing the experts in a MoE for domain specialization, and we’ve found redundancy in topk experts using logit lens [the work was recently submitted to an ICLR’25 systems4ml workshop]. In the past I’ve also worked on model architecture for small and accurate models for downstream tasks like image-to-latex with <165 params for with a bleu score of 0.80.

Next Steps :

Replicate some experiments from Turin et al paper for open source non-reasoning models and compare it to their r1 style reasoning counterparts.

Future directions :

Yeo et al (https://arxiv.org/abs/2502.03373) validated that shorter CoTs hurt the model performance but my preliminary experiments show that prefilled confounding reasoning tokens don’t hurt the model performance. Why ?

4 replies

wattenberg Feb 28, 2025
Collaborator Author

Hi Marmik, and welcome! It looks like no one else has jumped on this, but I'd be very interested to see work in this area. If you get more results in this area, please feel free to become a contributor to this project!

KeremZaman Mar 4, 2025

Hi I'm Kerem, a PhD student at UNC Chapel Hill. I have a few thoughts:

R1 models actually generate two types of explanations: (1) the internal reasoning process captured between thinking tokens and (2) the explanation presented to the user just before the final answer. This distinction enables new analysis scenarios, such as examining discrepancies among explanation, thoughts, and answer—rather than just between explanation and answer—when a biasing feature is introduced via either the input prompt or prefilling the model's thoughts. My initial intuition is that the explanation is an integral part of the answer and largely shaped by the thinking tokens. However, if this turns out not to be the case, it would be an interesting finding.
I’m not entirely sure what you mean, but it seems to align more closely with the Filler Tokens method in Lanham et al., 2023.
Existing faithfulness metrics primarily assess whether the provided rationales influence the model’s output. This is why they rely on perturbing explanations in various ways to observe changes in predictions. However, R1-like models generate multiple possible explanations (e.g., "aha moment", "wait..."), which could make it challenging for these metrics to reliably measure faithfulness.

QKVDecoder Mar 4, 2025

This recent paper Measuring Faithfulness of Chains of Thought by Unlearning fairly provides a solution to the last point.
They introduce Faithfulness by Unlearning Reasoning Steps (FUR) framework. Rather than perturbing explanations in the input context, FUR intervenes directly on model parameters to erase knowledge encoded in specific reasoning steps. By observing whether unlearning a step changes the model's prediction, FUR identifies whether that step was genuinely used in the model's internal computations, regardless of alternative verbalizations. They found that this approach bypasses the unreliability of input-space perturbations when models generate multiple explanations, as it measures dependence on the internal representation of reasoning steps rather than their surface-level presence in the context.
I think this paper solves the problem associated with R1 like models.

wj210 Mar 6, 2025

thats a interesting paper shared. However, i think one of the limitation however is choosing the steps to regard as important for the model. The authors choose those with nouns etc but im not sure if thats an accurate way to determine if the steps are important. Also they mention in the limitations that the analysis is limited to samples where Cot and no-CoT arrives at the same pred (im not sure why its limited to such cases, when they are checking if the pred changes).

in terms of turpin test for faithfulness, i dont think its actually measures the faithfulness of the CoT, but more towards robustness of the model? if the model ignores those cues that it wouldnt be influenced by it , but it doesn't actually study the CoT itself to determine if the prediction depended on the CoT.

QKVDecoder · 2025-03-01T18:37:12Z

QKVDecoder
Mar 1, 2025

Hi, I would love to collaborate on this!
I want to use the techniques described in the paper Bias-Augmented Consistency Training (which proposes fine-tuning models to produce consistent reasoning across biased/unbiased prompts) to Deepseek.
Next steps:

Introduce Turpin et al.'s biasing features (e.g., answer reordering) into prompts.
Train DeepSeek to generate identical CoT explanations for both the original and biased versions of each prompt.
Measure whether this reduces accuracy drops and improves explicit acknowledgment of biasing factors in explanations.

I think that this approach could isolate whether DeepSeek's relative robustness stems from pre-training data patterns or architectural choices.

3 replies

MarmikChaudhari Mar 1, 2025

This is super interesting! Happy to collaborate !

QKVDecoder Mar 3, 2025

@MarmikChaudhari is there any discord channel to discuss about this project?

wendlerc Mar 3, 2025
Collaborator

#unfaithful-cot

REQUEST: Attempt to replicate Turpin et al. for a reasoning model, and interpret what's going on #17

Uh oh!

wattenberg Feb 27, 2025 Collaborator

Research Questions:

Question 1: Does Turpin et al. "prompt biasing" replicate for DeepSeek?

Question 2: Why or why not?

Owner:

Contributors:

Project status:

Replies: 2 comments · 7 replies

Uh oh!

MarmikChaudhari Feb 27, 2025

Uh oh!

wattenberg Feb 28, 2025 Collaborator Author

Uh oh!

KeremZaman Mar 4, 2025

Uh oh!

QKVDecoder Mar 4, 2025

Uh oh!

wj210 Mar 6, 2025

Uh oh!

Uh oh!

QKVDecoder Mar 1, 2025

Uh oh!

MarmikChaudhari Mar 1, 2025

Uh oh!

QKVDecoder Mar 3, 2025

Uh oh!

wendlerc Mar 3, 2025 Collaborator

wattenberg
Feb 27, 2025
Collaborator

Replies: 2 comments 7 replies

MarmikChaudhari
Feb 27, 2025

wattenberg Feb 28, 2025
Collaborator Author

QKVDecoder
Mar 1, 2025

wendlerc Mar 3, 2025
Collaborator