I have a data-generation pipeline in mind for instruction tuning within the thought process. It is inspired by recent synthetic-data techniques for instruction generation from the Qwen team (see arXiv), with some modifications. My pipeline performs the following steps:
Although we are currently focusing on structural constraints, as suggested by @wendlerc, ideally we should develop more complex instructions, such as requiring the model's reasoning to follow an inductive structure. We plan to extend this by using an auxiliary LLM to identify such elements in the R1 thought process and incorporate them into the instructions.
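A minimal sketch of what that auxiliary step could look like. The pattern names, cue phrases, and function names below are assumptions for illustration; in the actual pipeline the keyword heuristic would be replaced by a call to the auxiliary LLM:

```python
# Hypothetical reasoning patterns an auxiliary LLM could be asked to detect
# in an R1 thought trace. A keyword heuristic stands in for the LLM call.
PATTERNS = {
    "induction": ["base case", "inductive step"],
    "verification": ["let me check", "double-check", "verify"],
    "backtracking": ["wait,", "actually,", "let me reconsider"],
}

def detect_patterns(thought: str) -> list[str]:
    """Return the names of reasoning patterns found in a thought trace."""
    lowered = thought.lower()
    return [name for name, cues in PATTERNS.items()
            if any(cue in lowered for cue in cues)]

def build_instruction(thought: str) -> str:
    """Turn detected patterns into an instruction for the thought process."""
    found = detect_patterns(thought)
    if not found:
        return "Think step by step."
    tags = ", ".join(f"<{p}>...</{p}>" for p in found)
    return f"While thinking, wrap these parts of your reasoning in tags: {tags}."

example = "Wait, let me double-check the base case and the inductive step."
print(build_instruction(example))
```

The instruction produced this way pairs each training example with constraints that the trace is already known to satisfy, which is the property the finetuning data needs.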
This seems relevant? https://www.theproxycompany.com/pse
Updated the main text.
Added a status report that is maintained by @harshraj172.
Last edited 03/03/25.
Research Questions
Finetuned models are often surprisingly similar to their base models: finetuning seems to enhance existing mechanisms, and in our recent work we showed that finetuning sometimes makes model mechanisms easier to find. In this project, I'd like to find out whether we can leverage this insight to make interpretability of R1 more feasible.
Current Status
https://docs.google.com/document/d/1IDR5tCXx7zhWPuVuOTpY24xp5-xKGhRs8yx_HeK4s-A/edit?tab=t.0#heading=h.2nl8zv2bg5wu
Question 1: Can we tune R1 to exhibit instruction following capabilities within its thought process?
By default, R1 does not obey instructions within its thought process:

However, as the example shows (and as is obvious from using R1), it does follow the instructions in its final response.
Can we perform a small finetuning step, e.g., using LoRA, to teach R1 to also follow instructions inside its thought process?
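As a rough sense of scale for such a step, here is a sketch of an illustrative LoRA configuration and a back-of-the-envelope count of the extra trainable parameters. All values (rank, target modules, dimensions) are assumptions, not settings used in this project; with Hugging Face peft the dict would be passed to LoraConfig(**lora_config):

```python
# Illustrative LoRA hyperparameters (assumed values, not project settings).
lora_config = {
    "r": 16,                    # low-rank dimension
    "lora_alpha": 32,           # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

def lora_param_count(d_in: int, d_out: int, r: int, n_matrices: int) -> int:
    """Extra trainable parameters a LoRA adapter adds: for each targeted
    d_out x d_in matrix, LoRA trains two factors with r*(d_in + d_out) weights."""
    return n_matrices * r * (d_in + d_out)

# e.g. 32 layers x 4 attention projections of (assumed) size 4096x4096 at r=16
print(lora_param_count(4096, 4096, lora_config["r"], 32 * 4))
```

The point of the estimate is that the adapter trains a few tens of millions of parameters rather than the full model, which is what makes the "small finetuning step" cheap to run and easy to ablate.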
Question 2: If so, can we use the resulting model's thought-instruction following to derive mechanistic insights?
Can we, e.g., find a verification circuit, branching points in the CoT, etc., by instructing our tuned R1 model to mark these parts of its computation with XML tags (or some other parsable format)? Once we have our tuned R1, we can simply try this out.
For a given behavior we'd follow steps like:
Chris Wendler is happy to help advise on this, and made the original request for a project! However, he doesn't have time to do these experiments himself.
Contributors
Project lead: Harsh Raj
Advisors: Chris Wendler, Robert West, Ashton Anderson, David Bau
Action Items
Project status: