[dev] refactor to support emerging optimizers beyond muon #3618
FDecaYed wants to merge 9 commits into NVIDIA:dev from
Conversation
skyw left a comment:
Looks good overall.
Terminology "plain" can be improved.
        setattr(optimizer, 'tp_group', tp_group)
        result = optimizer
    else:
        fallback_config = copy.copy(config)
Q: Does this need deepcopy? there could be very heavy structure in config.
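For reference, the difference only bites when the config holds nested mutable state; a minimal illustration (the `Config` fields here are hypothetical, not Megatron's actual `OptimizerConfig`):

```python
import copy
from dataclasses import dataclass, field

@dataclass
class Config:
    lr: float = 1e-3
    overrides: dict = field(default_factory=dict)  # nested mutable state

cfg = Config()
shallow = copy.copy(cfg)    # shares the overrides dict with cfg
deep = copy.deepcopy(cfg)   # gets its own independent overrides dict

shallow.overrides["wd"] = 0.1   # also visible through cfg
deep.overrides["mom"] = 0.9     # invisible to cfg

print("wd" in cfg.overrides)    # True  -- shallow copy shared the dict
print("mom" in cfg.overrides)   # False -- deepcopy isolated it
```

So `copy.copy` is safe only if the code never mutates nested members of the fallback config; deepcopy trades memory for that safety.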
| "num_ns_steps": config.muon_num_ns_steps, | ||
| "scale_mode": config.muon_scale_mode, | ||
| "extra_scale_factor": config.muon_extra_scale_factor, | ||
| "mode": config.muon_tp_mode, |
There was a problem hiding this comment.
How about changing this to `tp_mode`? Then the contract is: everything with a `muon_` prefix in the config translates to kwargs by stripping the prefix. The code can also be generalized.
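A minimal sketch of that prefix-stripping contract (helper name and config fields are illustrative, not the actual Megatron code):

```python
from types import SimpleNamespace

def optimizer_kwargs_from_config(config, prefix):
    """Collect every config field starting with `prefix` into a kwargs
    dict, stripping the prefix (e.g. muon_tp_mode -> tp_mode)."""
    return {
        name[len(prefix):]: value
        for name, value in vars(config).items()
        if name.startswith(prefix)
    }

config = SimpleNamespace(
    lr=3e-4,                      # not muon-specific, ignored
    muon_num_ns_steps=5,
    muon_scale_mode="spectral",
    muon_tp_mode="duplicated",
)
kwargs = optimizer_kwargs_from_config(config, "muon_")
# {'num_ns_steps': 5, 'scale_mode': 'spectral', 'tp_mode': 'duplicated'}
```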
SG. We can do it in the next PR, when we bump `emerging_optimizers` and really add support for other optimizers.
  # Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

- """Megatron muon optimizer wrapper to handle tensor-parallel."""
+ """Backward-compatible shim — all code now lives in ``emerging_optimizers``."""
Q: Does everything exposed through muon.py still work? just with a deprecation warning?
I'll test it; haven't run the draft yet.
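For context, one common shape for such a shim is re-exporting relocated names behind a `DeprecationWarning`; a generic sketch (not the actual `muon.py` contents):

```python
import warnings

def _deprecated_alias(new_func, old_name):
    """Wrap a relocated function so the old import path still works,
    emitting a DeprecationWarning on each call."""
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} has moved to emerging_optimizers; update your imports.",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)
    wrapper.__name__ = old_name
    return wrapper

# Example: re-export a relocated helper under its old name.
def _new_impl(x):
    return x * 2

old_helper = _deprecated_alias(_new_impl, "old_helper")
```

With this pattern, existing call sites keep working unchanged while the warning points users at the new module.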
- # Muon optimizer check
- if 'muon' in args.optimizer:
+ # Muon / emerging optimizer check
+ if args.optimizer in ('muon', 'dist_muon'):
Consider creating an emerging-optimizer group for everything with a `muon_`, `soap_`, or other prefix.
Same; we'll change these parts properly in the next step when we add support.
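For illustration, the pluggable-registry dispatch the PR moves toward could look roughly like this (all names hypothetical; `FakeMuon` stands in for the real optimizer class):

```python
class FakeMuon:
    """Stand-in for a real emerging optimizer; only stores what it is given."""
    def __init__(self, params, num_ns_steps):
        self.params = params
        self.num_ns_steps = num_ns_steps

# Hypothetical registry: each emerging optimizer maps to the four pieces
# the PR summary lists (optimizer_cls, config_to_kwargs,
# default_param_overrides, init_state_fn).
EMERGING_OPTIMIZERS = {
    "muon": {
        "optimizer_cls": FakeMuon,
        "config_to_kwargs": lambda cfg: {"num_ns_steps": cfg["muon_num_ns_steps"]},
        "default_param_overrides": {"min_ndim": 2},
        "init_state_fn": lambda opt: None,
    },
}

def build_emerging_optimizer(name, params, cfg):
    """Look up the registry entry and construct the optimizer from it."""
    entry = EMERGING_OPTIMIZERS[name]
    opt = entry["optimizer_cls"](params, **entry["config_to_kwargs"](cfg))
    entry["init_state_fn"](opt)
    return opt

opt = build_emerging_optimizer("muon", params=["w1", "w2"],
                               cfg={"muon_num_ns_steps": 5})
```

Adding a new optimizer (e.g. SOAP) would then just mean registering a new entry instead of adding another branch to the dispatch code.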
Signed-off-by: Hao Wu <skyw@nvidia.com>
Force-pushed from 1ab146d to 6b21428
/ok to test 3ce76a9
@FDecaYed: I think Megatron-Bridge needs some API support changes to cover this PR as well?

Yes, but I'm hoping this would be the last refactor.
Signed-off-by: Hao Wu <skyw@nvidia.com>
Co-authored-by: Hao Wu <skyw@nvidia.com>

🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22701547164

🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22701733585
## What does this PR do?

Refactor optimizer config, distributed-optimizer dispatch, and the Megatron optimizer `get()` in preparation for allowing more optimizers.

Built on top of #3325
Main PR: #3638

## Summary

- Unify optimizer creation so standard (Adam/SGD) and emerging (Muon) optimizers go through a single `get_megatron_optimizer()` entry point: `_get_megatron_emerging_optimizer()` in `__init__.py`. It reuses the same param-grouping and config-override mechanism as standard optimizers, removing the need to manually group parameters into separate optimizers with the freeze/unfreeze hack.
- Merge the `AdamOptimizerConfig`/`SGDOptimizerConfig` subclasses back into one `OptimizerConfig`. Since every param group can override via `config_overrides` anyway, one default config plus overrides is cleaner than passing multiple config objects.
- New `emerging_optimizers.py` with a pluggable registry. Each supported emerging optimizer maps to `{optimizer_cls, config_to_kwargs, default_param_overrides, init_state_fn}`. `TensorParallelMuon` and all Muon helpers moved here from `muon.py` (only a backward-compat shim is left in `muon.py`).
- `dist_muon` deprecated: `muon` + `--use-distributed-optimizer` is resolved at the argument level into a new `use_layer_wise_distributed_optimizer` flag, which replaces `dist_muon`. `use_distributed_optimizer` is reset to False to avoid side effects in the standard distributed-optimizer code path.
- `LayerWiseDistributedOptimizer` now expects plain torch optimizers (wrapping happens internally); cleaner than unwrapping Megatron optimizers.

## Contribution process
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

## Pre-checks
## Code review
The following process is enforced via the CODEOWNERS file for changes into `megatron/core`. For changes outside of `megatron/core`, it is up to the PR author whether or not to tag the Final Reviewer team.

### For MRs into `main` branch
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add the `Expert Review` PR label
(Step 2): Collect the expert reviewers' reviews. Add the `Expert Review` label when your PR is ready for review. Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review. Add the `Final Review` label.
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into `core_r*` release branches, then after this PR has been merged, select `Cherry-pick` to open a new PR into the release branch.

### For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

## Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.