Does Muon support FSDP Zero3 #45

@lucaswychan

When I use FSDP Zero3 with two devices, an assertion fails in zeropower_via_newtonschulz5:

assert G.ndim >= 2 # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng

It turns out that G has been flattened into a 1D tensor. This is what I logged:

[localhost:0]:update shape in muon_update: torch.Size([1573760])
[localhost:1]:update shape in muon_update: torch.Size([3145728])

Since I am training Qwen3-0.6B, the parameter in question should have shape [3072, 1024]. Instead I get a 1D tensor with 3,072 * 1,024 = 3,145,728 elements. The parameters are also sharded across the two devices, so a half shard would have 3,145,728 / 2 = 1,572,864 elements, which is close to (but not exactly) the 1,573,760 reported on rank 0.
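The size arithmetic above can be checked with a short sketch. The helper name `even_shard_numel` is hypothetical, and the sketch assumes a flat buffer is padded up to a multiple of the world size and split evenly, which may not match how FSDP actually lays out its flat parameters:

```python
import math

def even_shard_numel(shape, world_size):
    """Elements per rank if a tensor of `shape` is flattened and split evenly.

    Hypothetical helper: assumes the flat buffer is padded so that it
    divides evenly across ranks.
    """
    numel = math.prod(shape)
    return math.ceil(numel / world_size)

# Shape taken from the issue text: a [3072, 1024] weight matrix.
full = math.prod((3072, 1024))
per_rank = even_shard_numel((3072, 1024), world_size=2)
print(full, per_rank)
```

Neither number carries the original row and column counts, which is why a 2D check on the flattened tensor cannot pass.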

To conclude, I have two questions:

  1. Does Muon support FSDP Zero3?
  2. Muon's update is based on matrix orthogonalization, so the complete matrix is needed to obtain its spectral norm. However, the original MuonWithAuxAdam does not seem to gather the complete matrix before updating. Does this mean Muon does not support sharded parameters?
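To make the second question concrete, here is a minimal single-process sketch of why a flat shard fails and what gathering would have to restore. It is not the repository's implementation: `newton_schulz5_sketch` is a simplified float32 stand-in for zeropower_via_newtonschulz5, the coefficients are assumed from the public Muon code, and `torch.cat` stands in for a real collective such as `torch.distributed.all_gather_into_tensor`.

```python
import torch

def newton_schulz5_sketch(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Simplified float32 sketch of a quintic Newton-Schulz iteration."""
    assert G.ndim >= 2  # same precondition as the assertion in the issue
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed coefficients
    X = G.float()
    transposed = G.size(-2) > G.size(-1)
    if transposed:
        X = X.mT  # work on the wide orientation
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X

torch.manual_seed(0)
full_shape = (8, 4)  # small stand-in for [3072, 1024]
grad = torch.randn(full_shape)

# Simulate two ranks, each holding half of the flattened tensor.
shards = list(torch.chunk(grad.flatten(), 2))

# A single 1D shard trips the ndim assertion.
try:
    newton_schulz5_sketch(shards[0])
    shard_ok = True
except AssertionError:
    shard_ok = False

# Stand-in for an all-gather: rebuild the flat tensor and restore its 2D shape.
gathered = torch.cat(shards).view(full_shape)
update = newton_schulz5_sketch(gathered)
print(shard_ok, tuple(update.shape))
```

In a real FSDP run, the original 2D shape would have to be recorded before flattening, and each rank would keep only its own slice of the resulting update. Whether Muon does or should do this is exactly what the question asks.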
