Model Diffing of Reasoning Models #9
Replies: 8 comments 17 replies
-
I'd also be very interested in this. Can I add myself as a collaborator?
-
Nice project! Could you add a brief outline? What models are you looking at? Are you looking for collaborators? I'd also be curious whether others in the community are working on this. I've heard crosscoders for model diffing come up multiple times since DeepSeek-R1 was released. It might make sense to collaborate, e.g. to standardize hyperparameters like width, architecture, etc.; that was very valuable in the SAEBench project.
-
Hi, I've done some preliminary work on this, which I submitted to an ICLR workshop. I'd love to collaborate on this project and explore further extensions!
-
Hey, sorry for the delay! We'll post more details soon and try to get back to everyone next week!
-
I'm curious: how do you choose the token positions at which to do the decomposition? It seems infeasible to consider all of them; e.g., R1-Llama-8B quite often needed ~12K tokens to solve MATH-500 examples when I tested it. It also seems we should be able to do better than sampling positions randomly. Curious to hear your approach to this.
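For illustration of the subsampling problem raised above, here is a minimal sketch (my own, not the project's method; the score-weighting idea is a hypothetical refinement): draw a small set of positions from a long trace, optionally biased by a per-token score such as next-token entropy or loss.

```python
import numpy as np

def sample_positions(seq_len: int, k: int, scores=None, seed: int = 0):
    """Pick k of seq_len token positions for decomposition.

    With scores=None this is plain uniform subsampling; passing per-token
    scores (e.g. next-token entropy or loss, both hypothetical choices)
    biases sampling toward 'interesting' positions instead of purely
    random ones.
    """
    rng = np.random.default_rng(seed)
    p = None
    if scores is not None:
        p = np.asarray(scores, dtype=float)
        p = p / p.sum()  # normalize scores to a probability distribution
    # sample without replacement, return positions in order
    return np.sort(rng.choice(seq_len, size=k, replace=False, p=p))

# e.g. reduce a ~12K-token reasoning trace to 256 positions
positions = sample_positions(12_000, 256)
```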
-
Hey, I'm very curious about this problem. I'll be working on it and on some related experiments, and I have some ideas around this topic to share in the coming days.
-
An update from @jkminder and me: we've become skeptical of crosscoders as a dictionary learning method for model diffing, so we're spending some more time carefully comparing them against proper baselines like vanilla SAEs before switching to reasoning models. We will update you soon.
-
Hey, the paper is out! We focused on chat/base models rather than reasoning models, but developed a better way to train crosscoders. We plan to apply them, along with some other variations, to reasoning models to see if we can gain any extra insight that SAE-based approaches (e.g. https://arxiv.org/abs/2503.18878) don't. Here is the full announcement:

Excited to share our new paper "Robustly identifying concepts introduced during chat fine-tuning using crosscoders" with Julian Minder, supervised by Bilal Chughtai & Neel Nanda, and with the help of Caden Juang! 📄

What do chat LLMs learn during fine-tuning? We investigated this question by building on Anthropic's crosscoder approach, which creates a shared sparse dictionary across base and chat models to identify what's unique to each. Key findings:

Our work advances model diffing methodology while providing concrete insights into how chat fine-tuning modifies language model behavior. The "special sauce" of chat models may be in how they use these structural tokens!

More details & illustrations in our paper thread: https://x.com/Butanium_/status/1909240511572947188 / https://bsky.app/profile/butanium.bsky.social/post/3lmaejautds2u

We'll be presenting at the ICLR SLLM workshop!
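To make the crosscoder idea in the announcement concrete, here is a minimal sketch (class, variable, and dimension names are mine, not the paper's; the loss shown is a plain reconstruction + L1 objective, not the improved training method the paper develops): one shared sparse code reconstructs both models' activations through per-model decoders, and latents whose decoder norm is large for one model but near zero for the other are candidate model-specific features.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Sketch of a crosscoder for model diffing: a shared sparse
    dictionary over the activations of a base and a chat model,
    with separate encoder/decoder weights per model."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # per-model encoder weights feeding one shared latent space
        self.enc_base = nn.Linear(d_model, d_dict, bias=False)
        self.enc_chat = nn.Linear(d_model, d_dict, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        # per-model decoders: one reconstruction for each model
        self.dec_base = nn.Linear(d_dict, d_model)
        self.dec_chat = nn.Linear(d_dict, d_model)

    def forward(self, a_base, a_chat):
        # shared sparse code computed from both models' activations
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        return self.dec_base(f), self.dec_chat(f), f

    def relative_decoder_norms(self):
        # per-latent ratio: ~1 -> chat-specific, ~0 -> base-specific,
        # ~0.5 -> shared between the two models
        nb = self.dec_base.weight.norm(dim=0)  # (d_dict,)
        nc = self.dec_chat.weight.norm(dim=0)
        return nc / (nb + nc + 1e-8)

# toy usage on random activations (shapes only, no real model)
cc = Crosscoder(d_model=16, d_dict=64)
a_base, a_chat = torch.randn(8, 16), torch.randn(8, 16)
rb, rc, f = cc(a_base, a_chat)
loss = ((rb - a_base).pow(2).mean()
        + (rc - a_chat).pow(2).mean()
        + 1e-3 * f.abs().sum(-1).mean())  # sparsity penalty on the code
```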


-
Research Question
How different are the representations of reasoning models? What new features can we find if we compare a base model with that same model after reasoning distillation or RL training?
Owners
Dmitrii Troitskii (@mitroitskii), Clement Dumas (@Butanium), Julian Minder (@jkminder), Neel Nanda (@neelnanda-io)
Project Status
Work in Progress. We are figuring out our collaboration approach at the moment.
Experiments
Discord
https://discord.gg/2FmVhkMZbZ