
Experiment: Attributing the final output of a reasoning model to its intermediate thinking #19

@yc015

Description

When an R1 model produces an extensive chain-of-thought, how important is each segment of its reasoning to the final answer? Does only the thinking after the "Aha moment" matter, or are the earlier, possibly flawed, steps also crucial? Can we trace the impact of each segment on the model's final output?

Starting from the easiest way of answering this question to the hardest:

  1. If we ablate a word/sentence/reasoning step from the model's generated CoT, how much does the logit of the final choice change?
  2. Can we use an existing gradient-based saliency method to visualize each reasoning token's contribution to the final output?
  3. Alternatively, can we optimize a sparse attention mask applied to the computation of the final choice token?
    • We want to preserve the model's original answer while increasing the sparsity of the attention (so only a few reasoning tokens are attended to when computing the hidden states of the answer tokens).
    • Certainly, information from the masked-out tokens may still flow into the hidden states of the unmasked tokens in the intermediate layers. Even so, it's interesting to see which words are necessary for the model to arrive at its answer.
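The mask idea in step 3 can be prototyped on a toy attention head before touching a real model. The sketch below is an assumption-laden stand-in: instead of gradient-optimizing a relaxed mask with a sparsity penalty, it greedily drops reasoning tokens from the answer query's attention as long as the answer hidden state stays within a tolerance of the original; the shapes, random projections, and `tol` threshold are all made up for illustration.

```python
import numpy as np

# Toy single-head attention: n "reasoning" tokens attendable by the final
# answer query q. K, V, q are random stand-ins for real key/value/query
# projections (hypothetical sizes).
rng = np.random.default_rng(0)
d, n = 8, 12
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=d)

def answer_state(keep):
    """Attention output for the answer query, restricted to kept tokens."""
    scores = np.where(keep, K @ q / np.sqrt(d), -np.inf)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

target = answer_state(np.ones(n, dtype=bool))  # unmasked answer state

# Greedily mask out the token whose removal perturbs the answer state the
# least, stopping once any further removal would exceed the tolerance.
keep, tol = np.ones(n, dtype=bool), 0.25
while keep.sum() > 1:
    best = None
    for i in np.flatnonzero(keep):
        trial = keep.copy()
        trial[i] = False
        err = np.linalg.norm(answer_state(trial) - target)
        if best is None or err < best[0]:
            best = (err, i)
    if best[0] > tol:
        break
    keep[best[1]] = False

print(f"kept {int(keep.sum())} of {n} reasoning tokens")
```

The greedy loop preserves the invariant that the masked answer state never drifts more than `tol` from the original, which is the same constraint a gradient-based relaxation (e.g., a sigmoid-parameterized mask with an L1 penalty) would enforce softly.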

Of course, multiple-choice questions are the easiest to work with, since you only need to track the contribution of different thinking segments to a single token, namely the final choice made by the model.
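In the multiple-choice setting, option 1 reduces to a simple loop. The sketch below is model-agnostic: `score_fn` is a hypothetical stand-in for a forward pass that returns the logit of the model's chosen option given the prompt plus a (possibly ablated) CoT; here it is mocked with a toy scorer so the scaffolding runs standalone.

```python
def ablation_deltas(segments, score_fn):
    """Drop each reasoning segment in turn and record how much the
    final-choice logit changes relative to the full chain-of-thought."""
    base = score_fn(" ".join(segments))
    deltas = []
    for i in range(len(segments)):
        ablated = segments[:i] + segments[i + 1:]
        deltas.append(base - score_fn(" ".join(ablated)))
    return base, deltas

# Toy stand-in scorer: pretends the "Aha" segment carries most of the
# logit. A real score_fn would tokenize the prompt + ablated CoT, run a
# forward pass, and read out the chosen option's logit at the answer
# position.
def toy_score(text):
    return 5.0 * text.count("Aha") + 0.1 * len(text.split())

segments = [
    "Let me try x=2.",
    "That fails.",
    "Aha, x=3 works.",
    "So the answer is B.",
]
base, deltas = ablation_deltas(segments, toy_score)
```

With the toy scorer, the largest delta lands on the "Aha" segment, which is exactly the kind of ranking the real experiment would produce over sentences or reasoning steps.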

I assigned this experiment to @aghyad-deeb, but let me know if you need any help from me!
