[Verification Mechanism] Probing at earlier timesteps #27

In [Steering R1: "Verification" Vectors](https://ajyl.github.io/reasoning/2025/02/16/steering-R1.html), we probed at the "(" timestep, right before the model produces "not" or "this". What happens if we probe even earlier? That is, what is the earliest timestep at which the model knows it has found a solution?

In particular, the model's CoT looks something like "10 + 20 - 5 = 30 - 5 = 25 (this works)". Our [attention head analysis](https://ajyl.github.io/reasoning/2025/03/18/R1-attention.html) suggests that the model does verification after the final number is evaluated (i.e., "25"). What about after the first equation (at the end of "10 + 20 - 5")? What about after the second equation? Can we still decode the outcome with high accuracy at these timesteps?
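To make the candidate timesteps concrete, here is a minimal sketch that locates them in a CoT string like the one above. The helper name `candidate_probe_positions` is hypothetical and not part of the repo. It works on character offsets only; mapping these offsets to token indices depends on the tokenizer and is left out.

```python
def candidate_probe_positions(cot: str) -> dict[str, int]:
    """Return exclusive end offsets of each '='-separated segment before "(",
    plus the offset of "(" itself. Hypothetical helper, for illustration only."""
    paren = cot.index("(")  # the "(" timestep probed in the earlier post
    body = cot[:paren]
    positions: dict[str, int] = {}
    offset = 0
    for i, segment in enumerate(body.split("=")):
        # End of the segment, ignoring trailing whitespace.
        positions[f"segment_{i}_end"] = offset + len(segment.rstrip())
        offset += len(segment) + 1  # +1 skips the "=" separator
    positions["open_paren"] = paren
    return positions


cot = "10 + 20 - 5 = 30 - 5 = 25 (this works)"
positions = candidate_probe_positions(cot)
for name, end in positions.items():
    # Show the prefix the model has produced at each candidate timestep.
    print(f"{name:>14}: {cot[:end]!r}")
```

Here `segment_0_end` corresponds to the end of the first equation, `segment_1_end` to the end of the second, and the last segment to the final evaluated number.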

If so, this would suggest that there are multiple circuits for verification beyond our verification heads.
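One way to run this comparison is to fit a separate linear probe on the hidden states at each candidate timestep and compare held-out accuracy. The sketch below is an assumption about how that could look, not the code in `src/probe.py`. The function names are hypothetical, and the activations are synthetic stand-ins generated with NumPy (one position carries a planted signal, one is pure noise), so the numbers it prints say nothing about the real model.

```python
import numpy as np


def fit_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe trained with plain gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(solution found)
        grad = p - y
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples where the probe's sign matches the label."""
    return float((((X @ w + b) > 0).astype(int) == y).mean())


def accuracy_by_position(acts, labels, train_frac=0.8):
    """acts maps a position name to an (n_examples, d_model) array."""
    split = int(len(labels) * train_frac)
    results = {}
    for name, X in acts.items():
        w, b = fit_linear_probe(X[:split], labels[:split])
        results[name] = probe_accuracy(X[split:], labels[split:], w, b)
    return results


# Synthetic stand-in data (NOT real model activations).
rng = np.random.default_rng(0)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)        # 1 = attempt solves the task
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
acts = {
    # Label is linearly encoded along `direction` at this position.
    "informative": rng.normal(size=(n, d)) + 3.0 * np.outer(2 * labels - 1, direction),
    # No label information at this position.
    "uninformative": rng.normal(size=(n, d)),
}
acc = accuracy_by_position(acts, labels)
print(acc)
```

With real activations, each key of `acts` would be one of the earlier timesteps (end of the first equation, end of the second, final number, "("), and a position whose probe accuracy stays high would be a candidate for where verification information first appears.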

Relevant code: https://github.com/ajyl/verify_circuit/blob/main/src/probe.py
