Open
Description
When I use FSDP Zero3 with two devices, an assertion fails in zeropower_via_newtonschulz5:
assert G.ndim >= 2 # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng
It turns out that G has been flattened into a 1D parameter. Printing the update shape in muon_update gives:
[localhost:0]:update shape in muon_update: torch.Size([1573760])
[localhost:1]:update shape in muon_update: torch.Size([3145728])
Since I am training Qwen3-0.6B, the parameter should have shape [3072, 1024]. Instead I get a flat tensor with 3,072 * 1,024 = 3,145,728 elements. The parameters are also sharded across the two devices, so one shard should hold about 3,145,728 / 2 = 1,572,864 elements, which is close to the 1,573,760 logged on rank 0.
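The arithmetic above can be checked with a few lines of plain Python. This is only a sanity check on the numbers in the log, not a model of how FSDP lays out its shards. The names full_shape, world_size and logged_numel are made up for this sketch, and the even split is an assumption.

```python
from math import prod

# Expected 2D shape of the weight, as stated in the issue.
full_shape = (3072, 1024)
world_size = 2

# Element counts printed by each rank in muon_update (copied from the log).
logged_numel = {0: 1573760, 1: 3145728}

full_numel = prod(full_shape)          # 3072 * 1024 = 3145728
even_shard = full_numel // world_size  # 1572864, assuming an even split

# Rank 1 logged exactly the element count of the full, flattened matrix.
rank1_is_full = logged_numel[1] == full_numel

# Rank 0 logged slightly more than an even half; the gap is 896 elements.
rank0_excess = logged_numel[0] - even_shard

print(full_numel, even_shard, rank1_is_full, rank0_excess)
```

Rank 1's tensor matches the full matrix flattened to 1D, while rank 0's is 896 elements larger than an exact half. The log alone does not say where those extra elements come from.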
To conclude, I have two questions:
- Does Muon support FSDP Zero3?
- Since Muon is built on matrix orthogonalization, the complete matrix is needed to compute its spectral norm. However, the original MuonWithAuxAdam does not seem to gather the complete matrix when updating. So I would like to know: does Muon not support sharded parameters?
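To illustrate why the second question matters, here is a minimal, dependency-free sketch showing that orthogonalizing row shards separately does not give the same result as orthogonalizing the full matrix. It uses the textbook cubic Newton-Schulz iteration on a 2x2 toy matrix as a stand-in; it is not the iteration inside zeropower_via_newtonschulz5, and every helper name here (matmul, transpose, frobenius_norm, orthogonalize) is hypothetical.

```python
def matmul(a, b):
    # Plain nested-list matrix product.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_norm(a):
    return sum(x * x for row in a for x in row) ** 0.5


def orthogonalize(g, steps=30):
    """Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X (a stand-in, not Muon's version)."""
    norm = frobenius_norm(g)
    # Normalizing by the Frobenius norm keeps all singular values at or below 1.
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


# Toy "weight gradient": symmetric positive definite, so its nearest
# orthogonal matrix is the identity.
G = [[2.0, 1.0], [1.0, 2.0]]

# Orthogonalize the complete matrix.
full = orthogonalize(G)

# Orthogonalize each row shard on its own (as if each rank held one row),
# then stack the results.
sharded = orthogonalize([G[0]]) + orthogonalize([G[1]])

print("full   :", full)
print("sharded:", sharded)
```

On this toy input the full-matrix result approaches the identity, while the per-shard result merely normalizes each row and keeps the off-diagonal entries. That suggests an optimizer working on sharded parameters would have to reassemble the full 2D matrix before the orthogonalization step. Whether and how Muon does this under FSDP is exactly what the issue asks.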