"The Role of Attention for TinyZero Countdown" suggests that attention heads perform verification when the solution is given in the context. Let's better understand these heads.
These heads seem similar to induction heads -- or, more precisely, to "previous token heads" that attend to earlier occurrences of the current token. Can we verify that the verification heads behave as previous token heads in contexts outside of our task?
Use random sentences from Wikipedia (or similar natural text), and check the attention patterns of these heads when a token repeats in the input.
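One way to operationalize this check is to score a head by how much attention each repeated token places on its own earlier occurrences. The sketch below is illustrative, not taken from the repo: the metric, toy attention matrix, and token ids are made up, and in practice `attn` would be one head's real pattern (e.g. from `model(..., output_attentions=True)` in Hugging Face transformers) over tokenized Wikipedia text.

```python
import numpy as np

def prev_occurrence_score(attn: np.ndarray, token_ids: list[int]) -> float:
    """Average attention mass that repeated tokens place on earlier
    occurrences of themselves. attn is one head's (seq, seq) causal
    attention pattern; token_ids is the corresponding token sequence."""
    scores = []
    for q, tok in enumerate(token_ids):
        prev = [k for k in range(q) if token_ids[k] == tok]
        if prev:  # only score query positions whose token appeared before
            scores.append(sum(attn[q, k] for k in prev))
    return float(np.mean(scores)) if scores else 0.0

# Toy example: token 7 repeats at positions 1 and 3; this head puts
# most of position 3's attention mass back on position 1.
tokens = [5, 7, 9, 7]
attn = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.4, 0.3, 0.0],
    [0.1, 0.8, 0.05, 0.05],  # query 3 (token 7) -> key 1 (token 7)
])
print(round(prev_occurrence_score(attn, tokens), 2))  # -> 0.8
```

A head that scores high on arbitrary natural text is behaving like a generic previous-token head; a head that only scores high on Countdown solutions is doing something more task-specific.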
Check the attention patterns of these heads on repeated tokens in our Countdown task that do not correspond to the solution (e.g., the "=" token or other numeric tokens).
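The same idea can be restricted to a single control token. The sketch below is a hypothetical variant (the token id 11 standing in for "=" and the toy attention values are placeholders; the real ids come from the TinyZero tokenizer): if the heads still attend back to earlier "=" occurrences, which carry no solution information, that favors the generic previous-token-head interpretation over verification.

```python
import numpy as np

def control_token_score(attn: np.ndarray, token_ids: list[int],
                        control_id: int) -> float:
    """Attention mass that later occurrences of one control token
    (e.g. the id of "=") place on its earlier occurrences, for one
    head's (seq, seq) attention pattern."""
    pos = [i for i, t in enumerate(token_ids) if t == control_id]
    scores = [sum(attn[q, k] for k in pos if k < q) for q in pos[1:]]
    return float(np.mean(scores)) if scores else 0.0

# Toy Countdown-like sequence: id 11 stands in for "=" at positions 1 and 3.
ids = [3, 11, 4, 11, 6]
attn = np.zeros((5, 5))
attn[3, 1] = 0.7  # the second "=" attends back to the first "="
attn[3, 0] = 0.3
print(control_token_score(attn, ids, control_id=11))  # -> 0.7
```

Comparing this score for "=" against the score for the actual solution tokens, head by head, separates "attends to any repeat" from "attends to the solution."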
Relevant code:
https://github.com/ajyl/verify_circuit/blob/main/notebooks/explore_attention.sync.ipynb