I have a data-generation pipeline in mind for instruction tuning within the thought process. It is inspired by recent synthetic-data techniques for instruction generation from the Qwen team (see arXiv), with some modifications. My pipeline performs the following steps:
Although we are currently focusing on structural constraints, as suggested by @wendlerc, ideally we should develop more complex instructions, such as requiring the model's reasoning to follow an inductive structure. We plan to extend this by using an auxiliary LLM to identify such elements in the R1 thought process and incorporate them into the instructions.
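A minimal sketch of what that auxiliary step could look like. The pattern names, cue phrases, and function names below are assumptions for illustration; in the actual pipeline the keyword heuristic would be replaced by a call to the auxiliary LLM:

```python
# Hypothetical reasoning patterns an auxiliary LLM could be asked to detect
# in an R1 thought trace. A keyword heuristic stands in for the LLM call.
PATTERNS = {
    "induction": ["base case", "inductive step"],
    "verification": ["let me check", "double-check", "verify"],
    "backtracking": ["wait,", "actually,", "let me reconsider"],
}

def detect_patterns(thought: str) -> list[str]:
    """Return the names of reasoning patterns found in a thought trace."""
    lowered = thought.lower()
    return [name for name, cues in PATTERNS.items()
            if any(cue in lowered for cue in cues)]

def build_instruction(thought: str) -> str:
    """Turn detected patterns into an instruction for the thought process."""
    found = detect_patterns(thought)
    if not found:
        return "Think step by step."
    tags = ", ".join(f"<{p}>...</{p}>" for p in found)
    return f"While thinking, wrap these parts of your reasoning in tags: {tags}."

example = "Wait, let me double-check the base case and the inductive step."
print(build_instruction(example))
```

The instruction produced this way pairs each training example with constraints that the trace is already known to satisfy, which is the property the finetuning data needs.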
This seems relevant? https://www.theproxycompany.com/pse
Updated the main text.
Added a status report that is maintained by @harshraj172.
Last edited 03/03/25.
Research Questions
Finetuned models are often surprisingly similar to their base models: finetuning seems to enhance existing mechanisms, and in our recent work we showed that finetuning sometimes makes model mechanisms easier to find. In this project, I'd like to find out whether we can leverage this insight to make interpretability of R1 more feasible.
Current Status
https://docs.google.com/document/d/1IDR5tCXx7zhWPuVuOTpY24xp5-xKGhRs8yx_HeK4s-A/edit?tab=t.0#heading=h.2nl8zv2bg5wu
Question 1: Can we tune R1 to exhibit instruction following capabilities within its thought process?
By default, R1 does not obey instructions within its thought process:

However, as the example shows (and as is obvious from using R1), it does follow the instructions in its final response.
Can we perform a small finetuning step, e.g., using LoRA, to teach R1 to also follow instructions inside its thought process?
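As a rough sense of scale for such a step, here is a sketch of an illustrative LoRA configuration and a back-of-the-envelope count of the extra trainable parameters. All values (rank, target modules, dimensions) are assumptions, not settings used in this project; with Hugging Face peft the dict would be passed to LoraConfig(**lora_config):

```python
# Illustrative LoRA hyperparameters (assumed values, not project settings).
lora_config = {
    "r": 16,                    # low-rank dimension
    "lora_alpha": 32,           # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

def lora_param_count(d_in: int, d_out: int, r: int, n_matrices: int) -> int:
    """Extra trainable parameters a LoRA adapter adds: for each targeted
    d_out x d_in matrix, LoRA trains two factors with r*(d_in + d_out) weights."""
    return n_matrices * r * (d_in + d_out)

# e.g. 32 layers x 4 attention projections of (assumed) size 4096x4096 at r=16
print(lora_param_count(4096, 4096, lora_config["r"], 32 * 4))
```

The point of the estimate is that the adapter trains a few tens of millions of parameters rather than the full model, which is what makes the "small finetuning step" cheap to run and easy to ablate.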
Question 2: If so, can we use the resulting model's thought-instruction following to derive mechanistic insights?
Can we, e.g., find a verification circuit, branching points in the CoT, etc., by instructing our tuned R1 model to mark these parts of its computation with XML tags (or some other parsable format)? Once we have our tuned R1, we can simply try this out.
For a given behavior we'd follow steps like:
Chris Wendler is happy to help advise on this, and made the original request for a project! However, he doesn't have time to do these experiments himself.
Contributors
Project lead: Harsh Raj
Advisors: Chris Wendler, Robert West, Ashton Anderson, David Bau
Action Items
Project status: