Norm-Based Adaptive Moment Estimation with Orthogonalized Momentum #107

Draft
mkhona-nvidia wants to merge 4 commits into NVIDIA-NeMo:main from mkhona-nvidia:mkhona/namo

@mkhona-nvidia
Contributor

Build Namo as another method to normalize Muon updates

Based on "Adam improves Muon" (https://arxiv.org/abs/2602.17080)

Signed-off-by: mikail <mkhona@nvidia.com>
@mkhona-nvidia self-assigned this on Feb 20, 2026
@copy-pr-bot

copy-pr-bot bot commented Feb 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps

greptile-apps bot commented Feb 20, 2026

Greptile Summary

Implements NAMO (scalar adaptive scaling) as a third moment2_method option for AdaptiveMuon, following the approach from "Adam improves Muon" (arXiv:2602.17080). NAMO multiplies the orthogonalized momentum by a single scalar per parameter: the Frobenius norm of the pre-orthogonalization gradient divided by the square root of an EMA of squared raw-gradient Frobenius norms.

Key Changes:

  • Extended moment2_method parameter to accept "namo" alongside existing "adamuon" and "normuon"
  • Added scalar buffer initialization for NAMO (v_t, the EMA of ‖G_t‖²_F)
  • Implemented NAMO scaling logic: α_t = ‖g_t^pre-orth‖_F / (√v_t + ε)
  • Captured gradient norms before and after momentum+Nesterov updates
  • Added comprehensive docstrings with mathematical notation
  • Updated tests to cover NAMO across all test cases
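
Written out, the NAMO scaling described in the list above would look roughly like the following. This is a sketch reconstructed from the summary, not copied from the diff: the EMA coefficient β₂, the learning rate η, the operator Orth(·), and the symbol G̃_t for the pre-orthogonalization gradient (momentum plus optional Nesterov blend) are assumed notation. Bias correction is left out because the summary does not mention it.

```latex
\begin{aligned}
v_t      &= \beta_2\, v_{t-1} + (1 - \beta_2)\, \lVert G_t \rVert_F^2 \\
\alpha_t &= \frac{\lVert \tilde{G}_t \rVert_F}{\sqrt{v_t} + \varepsilon} \\
W_t      &= W_{t-1} - \eta\, \alpha_t\, \operatorname{Orth}(\tilde{G}_t)
\end{aligned}
```

Because v_t and α_t are scalars, the orthogonalized direction is unchanged and only the step length adapts, unlike the elementwise (adamuon) and row/column (normuon) variants.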

Confidence Score: 4/5

  • Safe to merge with minor review of gradient norm capture logic
  • The implementation follows established patterns and includes comprehensive tests and documentation. A small deduction applies because it is ambiguous which gradient tensor is used for the pre_orth_norm calculation
  • Pay close attention to adaptive_muon.py:280 and verify that the gradient norm calculation uses the correct tensor

Important Files Changed

| Filename | Overview |
|---|---|
| emerging_optimizers/orthogonalized_optimizers/adaptive_muon.py | Adds NAMO method for scalar adaptive scaling via Frobenius-norm ratio, extends moment2_method parameter, includes proper documentation |
| tests/test_adaptive_muon.py | Comprehensive test coverage for NAMO including smoke tests and shape validation |
| docs/apidocs/orthogonalized-optimizers.md | Documentation properly updated to include AdaptiveMuon in API docs |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Start step] --> CheckMethod{moment2_method?}
    
    CheckMethod -->|namo| CaptureNorm1[Capture grad_fro_sq = ‖G_t‖²_F<br/>before momentum]
    CheckMethod -->|adamuon/normuon| StandardPath[Standard path]
    
    CaptureNorm1 --> UpdateMomentum[Update momentum buffer<br/>exp_avg.lerp_]
    StandardPath --> UpdateMomentum
    
    UpdateMomentum --> Nesterov{use_nesterov?}
    Nesterov -->|Yes| NesterovGrad[grad = grad.lerp exp_avg]
    Nesterov -->|No| UseExpAvg[grad = exp_avg]
    
    NesterovGrad --> Orthogonalize[orth_grad = orthogonalize]
    UseExpAvg --> Orthogonalize
    
    Orthogonalize --> ApplyMethod{moment2_method?}
    
    ApplyMethod -->|namo| NAMOPath["NAMO: Update v_t with grad_fro_sq<br/>Compute α_t = ‖grad‖_F / (√v_t + ε)<br/>Return orth_grad * α_t"]
    ApplyMethod -->|adamuon| AdamPath["AdamUon: Update elementwise v_t<br/>Return orth_grad / (√v_t + ε)"]
    ApplyMethod -->|normuon| NorPath[NorMuon: Update row/col v_t<br/>Return orth_grad * rsqrt v_t]
    
    NAMOPath --> WeightUpdate[p.add_ update, alpha=-lr]
    AdamPath --> WeightUpdate
    NorPath --> WeightUpdate
    
    WeightUpdate --> End[End step]
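
The NAMO branch of the flowchart can be sketched in a few lines. This is a minimal NumPy illustration, not the PR's PyTorch code: `namo_step`, `orthogonalize`, the `state` dictionary, and all default hyperparameters are hypothetical names and values chosen for the example. The lerp weights (`1 - beta1` for the buffer, `beta1` for the Nesterov blend) are an assumption, since the flowchart does not state them, and an SVD-based polar factor stands in for whatever orthogonalization routine the optimizer actually uses.

```python
import numpy as np


def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Stand-in orthogonalization: the polar factor U @ V^T from a thin SVD."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def namo_step(param, grad, state, lr=0.02, beta1=0.95, beta2=0.99,
              eps=1e-8, use_nesterov=True):
    """One NAMO-style update of a 2D parameter.

    Mutates `state` (keys "exp_avg" and "v") and returns (new_param, alpha).
    """
    # Capture ||G_t||_F^2 of the raw gradient before touching momentum.
    grad_fro_sq = float(np.sum(grad * grad))

    # Momentum buffer update, mirroring exp_avg.lerp_(grad, 1 - beta1) (assumed weight).
    state["exp_avg"] = state["exp_avg"] + (1.0 - beta1) * (grad - state["exp_avg"])

    # Nesterov blends the raw gradient toward the buffer; otherwise use the buffer.
    if use_nesterov:
        pre_orth = grad + beta1 * (state["exp_avg"] - grad)
    else:
        pre_orth = state["exp_avg"]

    orth_grad = orthogonalize(pre_orth)

    # Scalar second moment: EMA of squared Frobenius norms of raw gradients.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad_fro_sq

    # alpha_t = ||pre-orth gradient||_F / (sqrt(v_t) + eps)
    alpha = float(np.linalg.norm(pre_orth)) / (float(np.sqrt(state["v"])) + eps)

    # Equivalent of p.add_(update, alpha=-lr), returned instead of applied in place.
    return param - lr * alpha * orth_grad, alpha


rng = np.random.default_rng(0)
param = rng.standard_normal((4, 3))
grad = rng.standard_normal((4, 3))
state = {"exp_avg": np.zeros_like(grad), "v": 0.0}
new_param, alpha = namo_step(param, grad, state)
```

The sketch computes the norm from the post-momentum, post-Nesterov tensor, which is the reading the flowchart suggests; the review note about adaptive_muon.py:280 concerns exactly this choice, so the actual implementation should be checked rather than inferred from this example.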

Last reviewed commit: 6511afa

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments

@mkhona-nvidia changed the title from "Mkhona/namo" to "Namo" on Feb 20, 2026
@mkhona-nvidia changed the title from "Namo" to "Norm-Based Adaptive Moment Estimation with Orthogonalized Momentum" on Feb 20, 2026
@skyw marked this pull request as draft on February 23, 2026, 16:13