Description
The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for Blackwell. This is a tentative roadmap and subject to change.
🎉 This Roadmap is based on the dev branch; please see the details in its README.
Model Support
- ✅ DeepSeek
- ✅ DeepSeek-V2
- ✅ DeepSeek-V3, including MTP
- ✅ DeepSeek-V3.2 [dev] DeepSeek V3.2 support #2154
- ✅ Qwen
- ✅ Qwen2-57B-A14B
- ✅ Qwen3-235B-A22B
- ✅ (🚀New!) Qwen3-Next
- ✅ Mixtral
Core MoE Functionality
- ✅ Token dropless MoE - Advanced routing without token dropping
- ✅ Top-K Router with flexible K selection
- ✅ Load balancing losses for expert load balancing optimization
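To make the routing and load-balancing bullets above concrete, here is a minimal pure-Python sketch of top-k routing with a Switch-style auxiliary load-balancing loss. It is illustrative only, not Megatron Core's implementation; the helper names (`topk_route`, `aux_load_balancing_loss`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(token_logits, k):
    """Pick the top-k experts for one token; return (indices, renormalized probs)."""
    probs = softmax(token_logits)
    chosen = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    total = sum(probs[e] for e in chosen)
    return chosen, [probs[e] / total for e in chosen]

def aux_load_balancing_loss(all_logits, k):
    """Switch-style aux loss: num_experts * sum_e f_e * P_e, where f_e is the
    fraction of (token, slot) assignments routed to expert e and P_e is the
    mean router probability for expert e."""
    n_tokens, n_experts = len(all_logits), len(all_logits[0])
    frac = [0.0] * n_experts       # f_e
    mean_prob = [0.0] * n_experts  # P_e
    for logits in all_logits:
        probs = softmax(logits)
        for e in range(n_experts):
            mean_prob[e] += probs[e] / n_tokens
        chosen, _ = topk_route(logits, k)
        for e in chosen:
            frac[e] += 1.0 / (n_tokens * k)
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))
```

The loss is minimized when tokens spread uniformly over experts, which is why it is added (scaled by a small coefficient) to the language-modeling loss during training.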
Advanced Parallelism
- ✅ Expert Parallel (EP) with 3D parallelism integration
- ✅ Full parallelism combo: EP + DP + TP + PP + SP support
- ✅ Context Parallel (CP) for long sequence MoE training
- ✅ Parallel Folding - Heterogeneous parallelism mappings for efficient large-scale MoE model training
- ✅ Distributed Optimizer for MoE (ZeRO-1 equivalent)
- ✅ (🚀New!) Megatron FSDP/HSDP with full expert parallel support
Optimizations
- ✅ Memory-efficient token permutation
- ✅ Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
- ✅ GroupedGEMM and Gradient Accumulation Fusion
- ✅ DP/PP/TP/EP Communication Overlapping
- ✅ Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
- ✅ cuDNN fused Attention and FlashAttn integration
- ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
- ✅ (🚀New!) Muon and Layer-wise distributed optimizer
- ✅ (🚀New!) Pipeline-aware fine-grained activation offloading [Dev] feat(moe): Fine-grained activation offloading #1912
- ✅ (🚀New!) Production-ready cudaGraph support for MoE
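The "token permutation" optimization above refers to sorting tokens by their assigned expert so each expert's tokens are contiguous before a GroupedGEMM, then scattering results back afterward. A toy pure-Python sketch of that idea (hypothetical helper names, not the actual Megatron kernels):

```python
def permute_tokens(tokens, expert_ids, num_experts):
    """Sort tokens by assigned expert so each expert's tokens are contiguous
    (the layout GroupedGEMM expects). Returns the permuted tokens, per-expert
    token counts, and the permutation needed to undo the sort."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    permuted = [tokens[i] for i in order]
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    return permuted, counts, order

def unpermute_tokens(permuted, order):
    """Scatter expert outputs back to the original token order."""
    out = [None] * len(permuted)
    for src, dst in enumerate(order):
        out[dst] = permuted[src]
    return out
```

In the real implementation this permutation is fused into CUDA kernels and made memory-efficient by avoiding intermediate copies, but the index bookkeeping is the same.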
Precision Support
- ✅ GroupedGEMM including FP8/MXFP8 support
- ✅ FP8 weights with BF16 optimizer states
- ✅ Full FP8 training support
Optimized Expert Parallel Communication Support
- ✅ DeepEP support for H100 and B200
- ✅ (🚀New!) HybridEP for GB200
Developer Experience
- ✅ MoE Model Zoo with pre-training best practices
- ✅ MCore2HF Converter for ecosystem compatibility in megatron-bridge
- ✅ Distributed Checkpointing Support
- ✅ Runtime Upcycling Support for efficient model scaling
- ✅ Layer-wise logging for detailed monitoring
Next Release Roadmap (MCore v0.17)
Performance & Kernel Optimizations
- Split-K Indexer Kernels - Avoid materializing [seqlen_q, seqlen_k] tensor with split-K kernels [WIP Feat] Split-K Indexer Kernels #2869 (draft)
- Absorbed MLA - MLA computation optimization for DSA Add absorbed-mla & fused dsa #3044
- HybridEP preprocess optimization
Long Context & Context Parallel
- Hybrid CP Part 2 - Enhanced hybrid data x context parallelism [Dev] feat: Dynamic CP (part 2) #2000
- THD Format E2E Support - End-to-end THD format support [Dev] Add E2E support for THD format #2924 (draft)
Model & Architecture
- Manifold Hyper Connection (mHC) - [dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection(mHC). #2943
- GDN THD Support - Packed sequence support for gated delta net [dev] feat(moe): Support packed sequence for gated delta net (GDN) #2644
- GDN Refinement - Refine gated delta net implementation [dev] perf(moe): Refine gated delta net implementation #3040
Advanced Functionality
- Router replay support for RL training (in progress) feat: add routing replay for Mcore #2693
- Megatron FSDP performance optimization for MoE training
CUDA Graph Enhancements
- MoE ECHO - Elastic Cloning for sync-free, full CUDA-graph dropless MoE training MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368 (draft)
- Paged Stashing - Dynamic tensor support for dropless MoE with CUDA graph Paged Stashing #2690 (draft)
- CUDA Graph + Offloading - Support CUDA Graph capture with offloading modules Support CUDA Graph capture offloading modules #2437 (draft)
- Optimizer CUDA Graph - Enable CUDA graph for ADAM optimizer Enable Optimizer CUDA graph for ADAM optimizer #2931 (draft)
Ongoing Long-term Features
- E2E Performance optimization for DeepSeek-V3, Qwen-3 and other fine-grained MoEs
- Sync-Free and Full-Iter cudaGraph MoE Training
- MoE ECHO for dropless MoE load balancing MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning #2368
- Paged Stashing for dynamic tensor support Paged Stashing #2690
- CPU Overhead Optimizations for Blackwell Performance
- MLA Optimizations
- Absorbed MLA Add absorbed-mla & fused dsa #3044
- MLA CP 2.0 - MLA CP Enhancement for Longer Sequence Training
- THD and Long Context
- THD Format E2E Support - End-to-end THD format support [Dev] Add E2E support for THD format #2924
- Dynamic Context Parallel for Imbalanced Long-Sequence Training
- Megatron FSDP Performance Optimization for MoE Training
- Kernel fusions and optimizations for MoE models from TE MoE training optimization TransformerEngine#2438
- New Architecture Support
- Manifold Hyper Connection (mHC) [dev] feat(mHC): Add basic pytorch implementation of manifold hyper connection(mHC). #2943
v0.16 Update Highlights
Performance & Memory
- 🚀 Fused Linear and Cross Entropy - Fuse lm_head and CE to avoid materializing intermediate logits, reducing memory [Dev] Feature: linear cross entropy fusion #2256
- 🚀 Optimizer State Offloading - Offload optimizer states and master weights to CPU for significant GPU memory savings [Dev] [Reapply] Optimizer State and Master Weight Offloading #2987
- 🚀 MTP Standalone Stages - Support placing MTP layers into standalone pipeline stages for better VPP balance [Dev] feat(moe): Support placing MTP layers into standalone stages #1916
- 🚀 DeepSeek V3.2 Support - Performance optimizations for DeepSeek V3.2 [dev] DeepSeek V3.2 support #2154
- DeepSeek V3 Pre-training Performance Guide (~960 TFLOPS/GPU on 256 GB200s) [Dev] A Guide to Reproduce DeepSeek-V3 Pre-training Performance on GB200 #1996
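The fused linear + cross-entropy item above avoids materializing the full `[seq, vocab]` logits tensor by streaming over vocabulary chunks with a running log-sum-exp. A toy pure-Python sketch of the forward pass for a single token (hypothetical function name; the real fusion is a CUDA kernel and also fuses the backward):

```python
import math

def chunked_cross_entropy(hidden, weight, target, chunk=2):
    """Cross entropy over a large vocab without materializing the full logits
    row: process `chunk` vocab entries at a time, keeping only a running
    max and a running sum of exponentials (streaming log-sum-exp)."""
    running_max, running_sum = -math.inf, 0.0
    target_logit = None
    for start in range(0, len(weight), chunk):
        # Logits for this vocab chunk only: a (chunk,)-sized intermediate.
        logits = [sum(h * w for h, w in zip(hidden, row))
                  for row in weight[start:start + chunk]]
        for j, z in enumerate(logits):
            if start + j == target:
                target_logit = z
            if z > running_max:  # rescale the running sum to the new max
                running_sum = running_sum * math.exp(running_max - z) + 1.0
                running_max = z
            else:
                running_sum += math.exp(z - running_max)
    # loss = logsumexp(logits) - logits[target]
    return running_max + math.log(running_sum) - target_logit
```

Peak memory for the logits drops from O(vocab) to O(chunk) per token, which is where the reported savings for large-vocab models come from.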
CUDA Graph
- 🚀 cuda_graph_scope Refactoring [Dev] feat(MoE): Refactor cuda_graph_scope - part2 #2353 [Dev] TE cudagraph recompute #2694
- 🚀 Partial CUDA Graph for EP Overlap - Release CPU pressure within selected scope for EP A2A overlap [Dev](Reapply) Partial CUDA Graph support for EP Overlap #2810
- TE cudagraph input memory optimization - Reuse static input memory buffer among microbatches [Dev] Optimize TE cudagraph input memory #2391
- CudaGraph compatibility with FP8 params (tensorwise & blockwise) [DEV] Make CUDA graph compatible with FP8 params (tensorwise & blockwise). #2087
- NVFP4 MOE CUDA Graph support with 128 zero padding [DEV][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support #2654
Model & Parallelism
- 🚀 Qwen3-Next Enhancements - QK layernorm weight decay support and Gated Delta Net CP for long context [dev] feat(moe): Support apply wd to qk layernorm for Qwen3-Next #2825 [Dev] Feat(moe): Gated delta net context parallel (CP) #2614
- 🚀 Hybrid Data x Context Parallelism - New parallelism strategy combining DP and CP [Dev] Hybrid Data x Context Parallelism Feature #2054
- 🚀 Router Replay - Deterministic routing mechanism for debugging and RL training feat: add routing replay for Mcore #2693
- 🚀 Fake Distributed Process Group - Skip all distributed comm ops with `--fake-process-group` for profiling [DEV] Add support of fake distributed process group #2254
- Remove padding token in MoE routing loss - Improve aux loss correctness and efficiency [Dev] Remove calculation of padding token in moe routing loss #2121
- Context parallel support for eager attention implementation [Community][Dev] feat(moe): Adding context parallel support to eager attention implementation #1859
- Packed sequence support in MTP module [Dev] Support packed seq in MTP #2043
Fine-grained Activation Offloading Enhancement
- 🚀 OOP Refactoring - Object-oriented redesign for fine-grained activation offloading [Dev]feat(moe): code refactor for fine grained activation offloading #2905
- Fix accuracy mismatch when offloading and recomputing same module [Dev] fix(offloading): Accuracy mismatch when offloading and recomputing same module #2123
- Bug fix for fine-grained activation offloading in evaluate() [Dev] [fix] Bug fix for fine-grained activation offloading in evaluate() #3041
Megatron-FSDP
- 🚀 FP8 Params Support - MXFP8/Blockwise FP8 params for Megatron-FSDP [dev] Reapply fsdp mxfp8 #2828
- 🚀 HSDP Support - Hybrid Sharded Data Parallel with EP submesh registration Fix HSDP Registering Device Mesh #2388
- Megatron-FSDP user guide documentation [Dev] docs(megatron-fsdp): add Megatron-FSDP user guide #2397
Communication
- 🚀 Hybrid-EP Upgrade - Latest Hybrid-EP with kernel optimizations for EP64 and NVL8+IB [Dev] Use the latest Hybrid-EP #2424
- HybridEP memory overhead reduction for 1F1B A2A overlap [Dev] fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap #2201
Optimizer
- 🚀 LayerWise DistOpt - LayerWiseDistributedOptimizer with torch_dist checkpoint format and Muon support [Dev] Support LayerWiseDistributedOptimizer with torch_dist checkpoint format #1928 [DEV] Update emerging optimizers #2261
Critical Bug Fixes
- Megatron-FSDP hang fix - Resolve hang caused by non-deterministic reduce-scatter [Dev] fix(megatron-fsdp): Resolve hang caused by non-deterministic reduce-scatter #2252
- EP Overlap correctness - Fix missing final layernorm in EP overlap [Dev] Fix ep overlap missing final layernorm #2691
- Hybrid-EP hotfix - Fix bug of hybrid-ep backend in flex-dispatcher [DEV] [HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher #2287
- CUDA RNG Tracker - Fix RNG tracker to use expert-parallel-rng correctly [Dev] Fix CUDA RNG Tracker #2640
Call for Community Contributions
- Model implementations - Additional MoE model variants
- Performance testing - Performance tests across different platforms and workloads
- Documentation and tutorials - Best practices and optimization guides
- Bug fixes
This roadmap reflects the collective efforts of NVIDIA and our collaborators.
Credits: MCore MoE Team and @sbhavani
Labels: roadmap, moe, call-for-contribution