
Experiment: Reproduce "Think dot by dot" with R1 models; Elicit the inner monologue of reasoning model #20

@yc015

Description


Pfau et al. (https://openreview.net/pdf?id=NikbrdtYvG) designed a simple experiment demonstrating that the textual content of the CoT may not matter for the model's final output. However, the additional inference passes over the CoT tokens are necessary for the model to derive the correct answer.

Pfau et al. tested this by training a toy LLaMA model on synthetic 3SUM problems.
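For concreteness, here is a minimal sketch of what such a synthetic task could look like. The exact instance format, modulus, sequence length, filler token, and answer cue in Pfau et al.'s dataset differ; everything below is a simplified placeholder:

```python
import random

def has_3sum(nums, mod=10):
    """True iff some triple of distinct positions sums to 0 modulo `mod`."""
    n = len(nums)
    return any(
        (nums[i] + nums[j] + nums[k]) % mod == 0
        for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n)
    )

def make_instance(length=10, mod=10, rng=random):
    """One synthetic instance: a digit sequence plus its binary 3SUM label."""
    nums = [rng.randrange(mod) for _ in range(length)]
    return nums, int(has_3sum(nums, mod))

def to_prompt(nums, n_filler=20, filler="."):
    """Render the instance followed by filler tokens standing in for the CoT."""
    return " ".join(map(str, nums)) + " " + " ".join([filler] * n_filler) + " ANS:"
```

The key property of the task is that the filler tokens carry no task-relevant text, so any accuracy gain from them must come from the extra forward passes rather than the CoT's content.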

  • Can we reproduce the observations made by Pfau et al. using an R1 model?
    • We can use the same 3SUM dataset generated by Pfau et al. Alternatively, we can test this using the MMLU dataset.
  • Assuming we can: can we find a way to elicit the inner monologue/hidden reasoning from the model's intermediate hidden states?
    • For example, given a chain-of-thought with filler tokens, can we use the next token predicted from the middle-layer hidden states (as in LogitLens) as the next word, and let the model continue the generation always using the intermediate-layer predictions? To some extent, you can imagine you are ablating the last few transformer layers of the model.
  • Assuming we can't: this potentially means that the textual reasoning of the R1 model has a stronger connection to the model's behavior (which is ideal!).
    • Can we fool the reasoning model by injecting irrelevant or misleading information into its reasoning (e.g., how does the model's final choice on a multiple-choice question change if we append two sentences of CoT it generated for a different question to the end of its reasoning)? In other words, what is the total effect of the "textual reasoning" (independent variable) on the model's output (dependent variable)?
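The intermediate-layer decoding idea above can be sketched model-agnostically. The two callables are hypothetical adapters you would have to write for your model: with Hugging Face transformers, `run_layers` would collect hidden states via `output_hidden_states=True`, and for a LLaMA-style model `unembed` would apply the final norm followed by the `lm_head` (both are assumptions about the architecture, not a verified recipe):

```python
def decode_from_layer(prompt_ids, run_layers, unembed, exit_layer,
                      max_new_tokens=32, eos_id=None):
    """Greedy generation that always decodes the next token from an
    intermediate layer's hidden state (LogitLens-style), effectively
    ablating the layers above `exit_layer`.

    run_layers(ids) -> list of per-layer hidden states at the last position
    unembed(h)      -> vocabulary logits for hidden state h
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        h_mid = run_layers(ids)[exit_layer]   # stop at the chosen layer
        logits = unembed(h_mid)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)                   # feed the early-exit token back in
        if next_id == eos_id:
            break
    return ids
```

Comparing generations across different `exit_layer` values would show at which depth the "inner monologue" (if any) becomes legible.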
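The CoT-injection intervention in the last bullet could be harnessed like this. The `<think>…</think>` delimiters match R1's output format, but the answer cue and the splicing granularity (two sentences appended at the end) are placeholder assumptions; in a real run `own_cot` would be a prefix of the model's actually generated reasoning:

```python
def splice_foreign_cot(question, own_cot, foreign_tail, answer_cue="Answer:"):
    """Build a prompt whose <think> block ends with sentences lifted from a
    CoT the model generated for a *different* question, then ask for the
    final answer."""
    return f"{question}\n<think>\n{own_cot} {foreign_tail}\n</think>\n{answer_cue}"

def flip_rate(clean_answers, spliced_answers):
    """Fraction of items whose final choice changed after the splice: an
    estimate of the total effect of the textual reasoning on the output."""
    assert len(clean_answers) == len(spliced_answers)
    flips = sum(a != b for a, b in zip(clean_answers, spliced_answers))
    return flips / len(clean_answers)
```

A high flip rate would indicate that the textual reasoning causally drives the answer; a near-zero flip rate would echo Pfau et al.'s filler-token result.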

I assigned this experiment to @akommula, but let me know if you need any help from me!
