In Steering R1: "Verification" Vectors, we probed at the "(" timestep, right before the model produces "not" or "this". What happens if we probe even earlier? That is, what is the earliest timestep at which the model knows it has found a solution?
In particular, the model's CoT looks something like "10 + 20 - 5 = 30 - 5 = 25 (this works)".
Our attention head analysis suggests that the model does verification after the final number (i.e., "25") is evaluated. What about after the first equation (at the end of "10 + 20 - 5")? What about after the second equation? Can we still decode, with high accuracy, whether a solution has been found at these timesteps?
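To make the candidate timesteps concrete, here is a minimal sketch that locates them in a CoT string of the form shown above. It works on character offsets only; the function name `candidate_probe_sites` and the site labels are hypothetical, and mapping offsets to token positions (for example via a Hugging Face fast tokenizer's `return_offsets_mapping=True`) is left as an assumption about the setup rather than shown.

```python
import re


def candidate_probe_sites(cot: str) -> list[tuple[str, int]]:
    """Return (label, end_offset) pairs for each candidate probe site.

    end_offset is exclusive, so cot[:end_offset] is the text the model
    has seen when we read its hidden state at that site.
    """
    # Everything before "(" is the chain of "="-separated steps.
    paren = cot.index("(")
    chain = cot[:paren]

    sites = []
    for i, match in enumerate(re.finditer(r"[^=]+", chain)):
        # End of the step, ignoring the whitespace before "=" or "(".
        end = match.start() + len(match.group().rstrip())
        sites.append((f"step_{i}", end))

    # The original probe site: right after the opening parenthesis.
    sites.append(("paren", paren + 1))
    return sites


cot = "10 + 20 - 5 = 30 - 5 = 25 (this works)"
for label, end in candidate_probe_sites(cot):
    print(label, repr(cot[:end]))
```

Here `step_0` ends after the first equation, `step_1` after the second, `step_2` after the final number, and `paren` is the "(" timestep probed previously.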
If so, this would imply that there are multiple circuits for verification, beyond our verification heads.
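One way to make "decode with high accuracy" operational is to fit a separate linear probe at each candidate timestep and compare held-out accuracy. The sketch below is not the probe from the linked repository: it swaps in a simple difference-of-means probe and runs on synthetic Gaussian vectors standing in for hidden states. The helper names, the site labels, and the signal strengths are all hypothetical and say nothing about what the real model encodes.

```python
import random


def fit_mean_diff_probe(xs, ys):
    """Fit a difference-of-means linear probe; returns (direction, threshold)."""
    dim = len(xs[0])
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    mu_pos = [sum(x[d] for x in pos) / len(pos) for d in range(dim)]
    mu_neg = [sum(x[d] for x in neg) / len(neg) for d in range(dim)]
    direction = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Threshold at the projection of the midpoint between the class means.
    threshold = sum(w * (p + n) / 2 for w, p, n in zip(direction, mu_pos, mu_neg))
    return direction, threshold


def accuracy(probe, xs, ys):
    """Fraction of examples whose projection falls on the correct side."""
    direction, threshold = probe
    correct = 0
    for x, y in zip(xs, ys):
        pred = int(sum(w * v for w, v in zip(direction, x)) > threshold)
        correct += pred == y
    return correct / len(ys)


def synthetic_activations(n, dim, signal, rng):
    """ASSUMPTION: stand-in for hidden states. Unit Gaussian noise plus a
    label-dependent shift of size `signal` along dimension 0."""
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y == 1 else -signal
        xs.append(x)
        ys.append(y)
    return xs, ys


def probe_accuracy_by_site(signal_by_site, n_train=400, n_test=200, dim=8, seed=0):
    """Fit one probe per site and report held-out accuracy for each."""
    rng = random.Random(seed)
    results = {}
    for site, signal in signal_by_site.items():
        train_x, train_y = synthetic_activations(n_train, dim, signal, rng)
        test_x, test_y = synthetic_activations(n_test, dim, signal, rng)
        probe = fit_mean_diff_probe(train_x, train_y)
        results[site] = accuracy(probe, test_x, test_y)
    return results


# Arbitrary signal strengths, chosen only to exercise the code.
print(probe_accuracy_by_site({"site_without_signal": 0.0, "site_with_signal": 3.0}))
```

In a real run, the synthetic vectors would be replaced by residual-stream activations collected at each timestep, with labels indicating whether the CoT reached a correct solution. Near-chance accuracy at a site would then suggest the information is not linearly decodable there.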
Relevant code: https://github.com/ajyl/verify_circuit/blob/main/src/probe.py