sablin39/manifold-muon

Variants of Muon Optimizer

Warning

Only verified with DDPStrategy; FSDP and DeepSpeed are still untested. There is a known issue with deepspeed_stage_2 caused by sharded gradients.

This repo experiments with the Muon optimizer and its variants, including convenient mixing with AdamW or Scion, and updates under constraints such as the Stiefel manifold or the spectral sphere.

Installation

Install the project with pip install -e . from the repository root. Developers can instead run pip install --no-build-isolation -e .[dev] to also get pytest and related tooling, such as tilelang for developing custom kernels.

Implementations & Key References

  • Official implementation: https://github.com/MoonshotAI/Moonlight
  • Core contribution:
    • An extra "update-RMS equalization" step so the per-parameter update RMS lines up across matrix vs. non-matrix params, allowing a more unified LR strategy across groups (Muon vs AdamW).
  • [TODO] Unofficial implementation here
  • Core contribution:
    • QK-Clip for controlling max-logit explosion (reference)
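The two contributions listed above can be illustrated with a little arithmetic. The sketch below is not code from this repo: the helper names are hypothetical, the 0.2 target RMS is the constant reported in the Moonlight paper, and the square-root split of the QK-Clip factor between the query and key weights is an assumption based on the published description for standard multi-head attention.

```python
import math


def orthogonal_update_rms(shape):
    """RMS of an idealized orthogonalized update U V^T of the given shape."""
    rows, cols = shape[-2], shape[-1]
    # ||U V^T||_F^2 = min(rows, cols), spread over rows * cols entries.
    return math.sqrt(min(rows, cols) / (rows * cols))


def muon_update_scale(shape, target_rms=0.2):
    """Scale that brings the orthogonalized update to `target_rms` (assumed AdamW-like)."""
    rows, cols = shape[-2], shape[-1]
    return target_rms * math.sqrt(max(rows, cols))


def qk_clip_factor(max_logit, tau):
    """Per-matrix factor applied to both W_q and W_k when max_logit exceeds tau.

    Assumes max_logit > 0. Scaling each matrix by sqrt(gamma) scales the logits by gamma.
    """
    gamma = min(1.0, tau / max_logit)
    return math.sqrt(gamma)


shape = (1024, 4096)
scaled_rms = muon_update_scale(shape) * orthogonal_update_rms(shape)
factor = qk_clip_factor(max_logit=200.0, tau=100.0)
clipped_logit = 200.0 * factor * factor
```

With these definitions the scaled update RMS no longer depends on the matrix shape, which is what lets a single learning rate serve both the Muon and AdamW groups.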

Muon on Spectral Sphere

$$ \min_{\Phi \in \mathbb{R}^{m \times n}} \operatorname{tr}(G^\top \Phi) \quad \text{s.t.} \quad \|\Phi\|_2 = 1,\ \|W\|_2 = 1,\ \|W - \eta \Phi\|_2 = 1 $$
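For intuition, if the two constraints on W are dropped, the problem reduces to the usual spectral-norm steepest-descent step, whose minimizer is the negated polar factor of G. The NumPy sketch below shows only that reduced case; it is not the spectral-sphere solver used in this repo, and `spectral_steepest_direction` is a hypothetical name.

```python
import numpy as np


def spectral_steepest_direction(grad):
    """Minimizer of tr(G^T Phi) subject to ||Phi||_2 = 1, ignoring the constraints on W."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -(u @ vt)


rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))
Phi = spectral_steepest_direction(G)

objective = np.trace(G.T @ Phi)
nuclear_norm = np.linalg.svd(G, compute_uv=False).sum()
spectral_norm = np.linalg.norm(Phi, 2)
```

The objective equals the negative nuclear norm of G, the lowest value reachable under a unit spectral-norm bound. Keeping W on the spectral sphere after the step is the extra requirement that the full problem adds.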

StiefelMoonlight

$$ \min_{\Phi \in \mathbb{R}^{m \times n}} \underbrace{\operatorname{tr}(G^\top \Phi)}_{\text{linearization of cost}} \quad \text{s.t.} \quad \underbrace{\|\Phi\|_2 = 1}_{\text{spectral constraint}},\ W^\top W = I,\ \underbrace{\Phi^\top W + W^\top \Phi = 0}_{\text{tangent space constraint}}. $$
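The constraints can be checked numerically. The sketch below builds a W with orthonormal columns, projects an arbitrary matrix onto the tangent space at W, and normalizes it to unit spectral norm. It produces a feasible Φ only, not the minimizer, and the helper names are hypothetical rather than part of this repo.

```python
import numpy as np


def sym(a):
    """Symmetric part of a square matrix."""
    return 0.5 * (a + a.T)


def project_to_tangent(W, A):
    """Project A onto the tangent space of the Stiefel manifold at W (requires W^T W = I)."""
    return A - W @ sym(W.T @ A)


rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.standard_normal((8, 3)))  # orthonormal columns
A = rng.standard_normal((8, 3))

Phi = project_to_tangent(W, A)
Phi = Phi / np.linalg.norm(Phi, 2)  # enforce the unit spectral-norm constraint

on_manifold = np.allclose(W.T @ W, np.eye(3))
tangent_residual = np.linalg.norm(Phi.T @ W + W.T @ Phi)
```

The projection works because W^T Φ keeps only the skew-symmetric part of W^T A, so Φ^T W + W^T Φ cancels exactly. The dual-ascent and fixed-point methods search among such feasible directions for the one that minimizes tr(G^T Φ).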

Note

This implementation is not exposed by default because its speed and accuracy are not yet satisfactory. You can still try it out via from manifold_muon.stiefel.stiefel_moonlight import StiefelMoonlight.

Dual-ascent-based method (source code), based on modula's blog.

Fixed-point-based method (source code), based on "Steepest Descent on Manifolds: 3. Muon + Stiefel" (《流形上的最速下降:3. Muon + Stiefel》).

[!] Weight decay is still under development

Example Usage

The ManifoldMoonlight optimizer uses a new parameter grouping scheme that differs from the classical Moonlight and Muon implementations. The currently valid group keys are ["use_muon", "use_adamw", "use_spectral_muon"].

from manifold_muon import ManifoldMoonlight, deduplicate_and_check_missing_params

# `model` is assumed to be an existing torch.nn.Module.
def is_matrix(name, p):
    # Hidden weight matrices only: embeddings and the output head are excluded.
    return p.ndim >= 2 and "embed_tokens" not in name and "lm_head" not in name

def is_qk(name):
    # Query/key projections; a parameter name contains one or the other, never both.
    return "q_proj" in name or "k_proj" in name

named = list(model.named_parameters())
params = {
    "use_muon": [p for n, p in named if is_matrix(n, p) and not is_qk(n)],
    "use_spectral_muon": [p for n, p in named if is_matrix(n, p) and is_qk(n)],
    # Everything else (embeddings, lm_head, biases, norms) goes to AdamW.
    "use_adamw": [p for n, p in named if not is_matrix(n, p)],
}

# We strongly recommend this check for parameters that are missing from, or duplicated across, the groups.
deduplicate_and_check_missing_params(model, params)

optimizer = ManifoldMoonlight(
    grouped_params = params,
    ...
)
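The grouping rule can be sanity-checked without loading a model by working only with parameter names and dimensionalities. The sketch below encodes one reading of the intended split (q/k projection matrices to use_spectral_muon, other hidden matrices to use_muon, everything else to use_adamw). `assign_group` and the example names are hypothetical and not part of manifold_muon.

```python
from collections import Counter


def assign_group(name, ndim):
    """Return the single group key a parameter with this name and ndim belongs to."""
    is_matrix = ndim >= 2 and "embed_tokens" not in name and "lm_head" not in name
    is_qk = "q_proj" in name or "k_proj" in name
    if not is_matrix:
        return "use_adamw"
    return "use_spectral_muon" if is_qk else "use_muon"


# Illustrative parameter names in a common decoder-only naming style (assumed).
example = {
    "model.embed_tokens.weight": 2,
    "model.layers.0.self_attn.q_proj.weight": 2,
    "model.layers.0.self_attn.k_proj.weight": 2,
    "model.layers.0.self_attn.v_proj.weight": 2,
    "model.layers.0.mlp.up_proj.weight": 2,
    "model.layers.0.input_layernorm.weight": 1,
    "lm_head.weight": 2,
}

assignment = {name: assign_group(name, ndim) for name, ndim in example.items()}
counts = Counter(assignment.values())
```

Because each name maps to exactly one key, the three groups are disjoint and cover every parameter by construction, which is the property deduplicate_and_check_missing_params is meant to verify on a real model.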
