Does Muon support FSDP Zero3? #45

When I use FSDP Zero3 with two devices, an assertion fails in `zeropower_via_newtonschulz5`:

```python
assert G.ndim >= 2 # batched Muon implementation by @scottjmaddox, and put into practice in the record by @YouJiacheng
```

It turns out that `G` has been flattened into a 1D tensor. Printing the shape of the update on each rank gives:

```
[localhost:0]:update shape in muon_update: torch.Size([1573760])
[localhost:1]:update shape in muon_update: torch.Size([3145728])
```

I am training Qwen3-0.6B, so this parameter should have shape `[3072, 1024]`. Instead, I get a 1D tensor with `3,072 * 1,024 = 3,145,728` elements. The parameters are also sharded across the two devices, so an even split would give `3,145,728 / 2 = 1,572,864` elements per device, which is close to the 1,573,760 elements reported on rank 0.
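To make the size arithmetic explicit, here is a small check in plain Python. The observed sizes are copied from the log above; the variable names are only illustrative, and the script does not explain why the observed sizes differ from an even split.

```python
import math

# Expected 2D shape of the weight and number of devices (from the description above).
full_shape = (3072, 1024)
world_size = 2

full_numel = math.prod(full_shape)      # 3145728
even_shard = full_numel // world_size   # 1572864

# Sizes printed by each rank in the log above.
observed = {0: 1573760, 1: 3145728}

for rank, numel in observed.items():
    # Difference between what the rank holds and an even 1/world_size split.
    print(f"rank {rank}: observed={numel}, diff_from_even_shard={numel - even_shard}")
```

Rank 0 holds 896 elements more than an even half, and rank 1 holds as many elements as the whole matrix, so neither tensor is a clean 2D matrix that could be passed to `zeropower_via_newtonschulz5` as is.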

To conclude, I have two questions:
1. Does Muon support FSDP Zero3?
2. Muon's update is based on matrix orthogonalization, so the complete matrix is needed to obtain its spectral norm. However, the original `MuonWithAuxAdam` does not seem to gather the complete matrix before updating. Does this mean that Muon does not support sharded parameters?
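To illustrate what I mean by "gathering a complete matrix", here is a minimal single-process sketch. It is not the repository's code: `newton_schulz_sketch` is a simplified float32 stand-in for `zeropower_via_newtonschulz5` with assumed coefficients, `gather_full_matrix` is a hypothetical helper, and the shards are simulated with `chunk` instead of a real collective (in an actual FSDP run the concatenation would have to be replaced by something like an all-gather across ranks). A small matrix is used so it runs quickly on CPU.

```python
import math
import torch

def newton_schulz_sketch(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Simplified stand-in for an orthogonalization step; requires a 2D+ input."""
    assert G.ndim >= 2, "orthogonalization needs the full matrix, not a flat shard"
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (assumed for this sketch)
    X = G.float()
    transposed = X.size(-2) > X.size(-1)
    if transposed:
        X = X.mT  # iterate on the wide orientation so X @ X.mT is the smaller product
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X

def gather_full_matrix(flat_shards, orig_shape):
    """Hypothetical helper: join per-rank 1D shards, drop padding, restore the shape."""
    numel = math.prod(orig_shape)
    flat = torch.cat(flat_shards)[:numel]
    return flat.reshape(orig_shape)

torch.manual_seed(0)
orig_shape = (12, 8)                      # stands in for [3072, 1024]
W = torch.randn(orig_shape)

# Simulate two ranks, each holding a flat 1D slice of the same parameter.
shards = list(W.flatten().chunk(2))

# Rebuild the 2D matrix first, then orthogonalize it.
full = gather_full_matrix(shards, orig_shape)
update = newton_schulz_sketch(full)
print(shards[0].shape, full.shape, update.shape)
```

Passing `shards[0]` directly to `newton_schulz_sketch` trips the same `ndim >= 2` assertion that I hit, which is why I suspect some gather-and-reshape step is needed before the update when parameters are sharded.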
