Muon for multi-head attention weights #50

@schmidt-ai

Description

Should the weights for separate attention heads be split apart before applying Muon? I ask because the blog post says that "Muon works better for optimizing transformers if it is applied to their Q, K, V parameters separately", which makes me wonder whether there would be any benefit in going further and splitting out the individual heads, too. A rough sketch of what I mean is below.
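To make the question concrete, here is a minimal sketch (not the repository's actual implementation) of the two options: orthogonalizing the whole Q projection's update at once versus reshaping it into per-head matrices and orthogonalizing each slice separately. It assumes PyTorch and the published Newton-Schulz coefficients; the helper names, head count, and shapes are illustrative assumptions.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def orthogonalize_per_head(grad: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Hypothetical variant: treat each head's (head_dim, d_model) slice of a
    fused projection as its own matrix and orthogonalize it independently."""
    head_dim = grad.shape[0] // num_heads
    heads = grad.view(num_heads, head_dim, grad.shape[1])
    return torch.stack([newton_schulz(h) for h in heads]).view_as(grad)

# Example: a Q projection gradient for 8 heads of size 64 over d_model = 512.
q_grad = torch.randn(512, 512)
update_whole = newton_schulz(q_grad)                       # current per-matrix behavior
update_per_head = orthogonalize_per_head(q_grad, num_heads=8)  # proposed per-head split
```

Is there any reason to expect the per-head version to help, the way splitting Q, K, and V apart does, or does treating each projection as one matrix capture the benefit already?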
