[Verification Mechanism] Probing at earlier timesteps #27

@ajyl

In Steering R1: "Verification" Vectors, we probed at the "(" timestep, right before the model produces "not" or "this". What happens if we probe even earlier? That is, what is the earliest timestep at which the model knows it has found a solution?

In particular, the model's CoT looks something like "10 + 20 - 5 = 30 - 5 = 25 (this works)".
Our attention head analysis suggests that the model performs verification once the final number (i.e., "25") has been evaluated. What about after the first equation (at the end of "10 + 20 - 5")? What about after the second equation? Can we still decode with high accuracy at these timesteps?
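
To make the candidate timesteps concrete, here is a minimal sketch that locates them in the example CoT. It assumes a toy regex tokenizer rather than the model's real tokenizer, so the indices will not match actual token positions; `tokenize` and `candidate_probe_positions` are hypothetical helper names, not functions from `probe.py`.

```python
import re


def tokenize(cot):
    # Toy tokenizer (assumption): numbers, words, and single symbols.
    # The model's real tokenizer will split the text differently.
    return re.findall(r"\d+|[A-Za-z]+|\S", cot)


def candidate_probe_positions(tokens):
    """Return token indices of the timesteps we might probe at."""
    paren = tokens.index("(")  # the timestep probed so far
    # Every "=" before the "(" closes one expression.
    equals = [i for i, tok in enumerate(tokens[:paren]) if tok == "="]
    positions = {}
    for k, eq_idx in enumerate(equals, start=1):
        # Last token of the k-th expression sits right before its "=".
        positions[f"end_of_expr_{k}"] = eq_idx - 1
    positions["final_number"] = paren - 1
    positions["open_paren"] = paren
    return positions


tokens = tokenize("10 + 20 - 5 = 30 - 5 = 25 (this works)")
positions = candidate_probe_positions(tokens)
for name, idx in positions.items():
    print(f"{name}: index {idx}, token {tokens[idx]!r}")
```

With a real tokenizer, the same idea would apply to token IDs, keyed on whichever tokens encode "=" and "(".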

If so, this would imply that there are multiple circuits for verification beyond our verification heads.
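
One way to run this comparison is to fit a separate linear probe at each candidate timestep and compare held-out accuracy. The sketch below is only an illustration under stated assumptions: it uses NumPy with a hand-rolled logistic regression rather than whatever `probe.py` does, and the activations are synthetic stand-ins (random noise with a label direction planted at the last two positions), not model activations or results. `fit_logistic_probe` and `probe_accuracy_by_position` are hypothetical names.

```python
import numpy as np


def fit_logistic_probe(X, y, steps=300, lr=0.5):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        grad = p - y                            # gradient of the log loss w.r.t. logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy_by_position(hidden, labels, train_frac=0.8):
    """hidden: (n_examples, n_positions, d_model); labels: (n_examples,) in {0, 1}.

    Returns one held-out accuracy per position.
    """
    n_train = int(train_frac * len(labels))
    y_test = labels[n_train:].astype(bool)
    accs = []
    for pos in range(hidden.shape[1]):
        w, b = fit_logistic_probe(hidden[:n_train, pos], labels[:n_train])
        pred = (hidden[n_train:, pos] @ w + b) > 0
        accs.append(float((pred == y_test).mean()))
    return accs


# Synthetic stand-in data (assumption): positions 0..3 play the roles of
# end of 1st expression, end of 2nd expression, final number, and "(".
rng = np.random.default_rng(0)
n, n_pos, d = 1000, 4, 16
labels = rng.integers(0, 2, size=n)
hidden = rng.normal(size=(n, n_pos, d))
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
# Plant a label-dependent direction only at the last two positions.
hidden[:, 2:, :] += 3.0 * (2 * labels - 1)[:, None, None] * direction

accs = probe_accuracy_by_position(hidden, labels)
for pos, acc in enumerate(accs):
    print(f"position {pos}: held-out probe accuracy {acc:.2f}")
```

On real activations, high accuracy at the earlier positions would be the signal of interest here; near-chance accuracy there would be consistent with verification happening only after the final number.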

Relevant code: https://github.com/ajyl/verify_circuit/blob/main/src/probe.py

Status: In Progress