Model Diffing of Reasoning Models #9
Replies: 8 comments 17 replies
-
I'd also be very interested in this. Can I add myself as a collaborator?
-
Nice project! Could you add a brief outline? What models are you looking at? Are you looking for collaborators? I'd also be curious whether others in the community are working on this. I've heard crosscoders for model diffing come up multiple times since DeepSeek-R1 was released. It might make sense to collaborate, e.g. to standardize hyperparameters like width, architecture, etc.; that was very valuable in the SAEBench project.
-
Hi, I've done some preliminary work on this, which I submitted to an ICLR workshop. I'd love to collaborate on this project and explore further extensions!
-
Hey, sorry for the delay! We'll post more details soon and try to get back to everyone next week!
-
I'm curious: how do you choose the token positions at which to do the decomposition? It seems infeasible to consider all of them; e.g., R1-Llama-8B quite often needed ~12K tokens to solve MATH-500 examples when I tested it. It also seems we should be able to do better than sampling positions randomly. Curious to hear your approach to this.
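For illustration of the subsampling problem raised above, here is a minimal sketch (my own, not the project's method; the score-weighting idea is a hypothetical refinement): draw a small set of positions from a long trace, optionally biased by a per-token score such as next-token entropy or loss.

```python
import numpy as np

def sample_positions(seq_len: int, k: int, scores=None, seed: int = 0):
    """Pick k of seq_len token positions for decomposition.

    With scores=None this is plain uniform subsampling; passing per-token
    scores (e.g. next-token entropy or loss, both hypothetical choices)
    biases sampling toward 'interesting' positions instead of purely
    random ones.
    """
    rng = np.random.default_rng(seed)
    p = None
    if scores is not None:
        p = np.asarray(scores, dtype=float)
        p = p / p.sum()  # normalize scores to a probability distribution
    # sample without replacement, return positions in order
    return np.sort(rng.choice(seq_len, size=k, replace=False, p=p))

# e.g. reduce a ~12K-token reasoning trace to 256 positions
positions = sample_positions(12_000, 256)
```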
-
Hey, I'm very curious about this problem. I'll be working on it and on some related experiments, and I have some ideas around this topic to share in the coming days.
-
An update from @jkminder and me: we've become skeptical of crosscoders as a dictionary learning method for model diffing, so we're spending some more time carefully comparing them against proper baselines like vanilla SAEs before switching to reasoning models. We will update you soon.
-
Hey, the paper is out! We focused on chat/base models rather than reasoning models, but developed a better way to train crosscoders. We plan to apply them, along with some other variations, to reasoning models to see if we can gain any extra insight that SAE-based approaches (e.g. https://arxiv.org/abs/2503.18878) don't. Here is the full announcement:

Excited to share our new paper "Robustly identifying concepts introduced during chat fine-tuning using crosscoders" with Julian Minder, supervised by Bilal Chughtai & Neel Nanda, and with the help of Caden Juang! 📄

What do chat LLMs learn during fine-tuning? We investigated this question by building on Anthropic's crosscoder approach, which creates a shared sparse dictionary across base and chat models to identify what's unique to each. Key findings:

Our work advances model diffing methodology while providing concrete insights into how chat fine-tuning modifies language model behavior. The "special sauce" of chat models may be in how they use these structural tokens!

More details & illustrations in our paper thread: https://x.com/Butanium_/status/1909240511572947188 / https://bsky.app/profile/butanium.bsky.social/post/3lmaejautds2u

We'll be presenting at the ICLR SLLM workshop!
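To make the crosscoder idea in the announcement concrete, here is a minimal sketch (class, variable, and dimension names are mine, not the paper's; the loss shown is a plain reconstruction + L1 objective, not the improved training method the paper develops): one shared sparse code reconstructs both models' activations through per-model decoders, and latents whose decoder norm is large for one model but near zero for the other are candidate model-specific features.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Sketch of a crosscoder for model diffing: a shared sparse
    dictionary over the activations of a base and a chat model,
    with separate encoder/decoder weights per model."""

    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        # per-model encoder weights feeding one shared latent space
        self.enc_base = nn.Linear(d_model, d_dict, bias=False)
        self.enc_chat = nn.Linear(d_model, d_dict, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_dict))
        # per-model decoders: one reconstruction for each model
        self.dec_base = nn.Linear(d_dict, d_model)
        self.dec_chat = nn.Linear(d_dict, d_model)

    def forward(self, a_base, a_chat):
        # shared sparse code computed from both models' activations
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        return self.dec_base(f), self.dec_chat(f), f

    def relative_decoder_norms(self):
        # per-latent ratio: ~1 -> chat-specific, ~0 -> base-specific,
        # ~0.5 -> shared between the two models
        nb = self.dec_base.weight.norm(dim=0)  # (d_dict,)
        nc = self.dec_chat.weight.norm(dim=0)
        return nc / (nb + nc + 1e-8)

# toy usage on random activations (shapes only, no real model)
cc = Crosscoder(d_model=16, d_dict=64)
a_base, a_chat = torch.randn(8, 16), torch.randn(8, 16)
rb, rc, f = cc(a_base, a_chat)
loss = ((rb - a_base).pow(2).mean()
        + (rc - a_chat).pow(2).mean()
        + 1e-3 * f.abs().sum(-1).mean())  # sparsity penalty on the code
```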


-
Research Question
How different are the representations of reasoning models? What new features can we find if we compare a base model with that same model after reasoning distillation or RL training?
Owners
Dmitrii Troitskii (@mitroitskii), Clement Dumas (@Butanium), Julian Minder (@jkminder), Neel Nanda (@neelnanda-io)
Project Status
Work in Progress. We are figuring out our collaboration approach at the moment.
Experiments
Discord
https://discord.gg/2FmVhkMZbZ