Muon for multi-head attention weights #50

@schmidt-ai

Description

Should the weights for separate attention heads be split apart before applying Muon? I ask because the blog post says that "Muon works better for optimizing transformers if it is applied to their Q, K, V parameters separately", which makes me wonder whether there would be any benefit in going further and splitting out the individual heads, too. A rough sketch of what I mean is below.
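To make the question concrete, here is a minimal sketch (not the repository's actual implementation) of the two options: orthogonalizing the whole Q projection's update at once versus reshaping it into per-head matrices and orthogonalizing each slice separately. It assumes PyTorch and the published Newton-Schulz coefficients; the helper names, head count, and shapes are illustrative assumptions.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def orthogonalize_per_head(grad: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical variant: treat each head's (head_dim, d_model) slice of a
    fused projection as its own matrix and orthogonalize it independently."""
    head_dim = grad.shape[0] // num_heads
    heads = grad.view(num_heads, head_dim, grad.shape[1])
    return torch.stack([newton_schulz(h) for h in heads]).view_as(grad)

# Example: a Q projection gradient for 8 heads of size 64 over d_model = 512.
q_grad = torch.randn(512, 512)
update_whole = newton_schulz(q_grad)                       # current per-matrix behavior
update_per_head = orthogonalize_per_head(q_grad, num_heads=8)  # proposed per-head split
```

Is there any reason to expect the per-head version to help, the way splitting Q, K, and V apart does, or does treating each projection as one matrix capture the benefit already?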
