
[ROADMAP][Updated on Jan 26] Megatron Core MoE Roadmap #1729

@yanring

Description


The focus for Megatron Core MoE is to provide comprehensive support for the latest MoE architectures, advanced parallelism strategies, and performance optimizations for NVIDIA Blackwell. This roadmap is tentative and subject to change.

🎉 This roadmap is based on the dev branch; see its README for details.


Model Support

  • DeepSeek
  • Qwen
    • ✅ Qwen2-57B-A14B
    • ✅ Qwen3-235B-A22B
    • (🚀New!) Qwen3-Next
  • Mixtral

Core MoE Functionality

  • Token dropless MoE - Advanced routing without token dropping
  • Top-K Router with flexible K selection
  • Load balancing losses for expert load balancing optimization
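The two pieces above fit together: a top-K router picks experts per token, and an auxiliary loss discourages expert collapse. A minimal NumPy sketch follows; this is an illustration of the general technique (Switch-Transformer-style aux loss), not Megatron Core's actual implementation, and all function names here are hypothetical:

```python
import numpy as np

def topk_route(logits, k):
    """Pick the top-k experts per token and renormalize their probabilities."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax over experts
    topk_idx = np.argsort(-probs, axis=-1)[:, :k]          # [tokens, k] expert ids
    topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
    topk_w /= topk_w.sum(axis=-1, keepdims=True)           # renormalize the k gate weights
    return topk_idx, topk_w, probs

def load_balancing_loss(probs, topk_idx, num_experts):
    """Switch-style aux loss: num_experts * sum_i f_i * P_i.

    f_i = fraction of routed assignments sent to expert i,
    P_i = mean router probability for expert i; minimized when both are uniform."""
    f = np.bincount(topk_idx.ravel(), minlength=num_experts) / topk_idx.size
    P = probs.mean(axis=0)
    return num_experts * float(np.dot(f, P))

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4))   # 8 tokens, 4 experts
idx, w, probs = topk_route(logits, k=2)
loss = load_balancing_loss(probs, idx, num_experts=4)
```

In "dropless" mode, every token selected this way is actually processed (experts accept variable-length batches) rather than being dropped when an expert's capacity is exceeded.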

Advanced Parallelism

  • Expert Parallel (EP) with 3D parallelism integration
  • Full parallelism combo: EP + DP + TP + PP + SP support
  • Context Parallel (CP) for long sequence MoE training
  • Parallel Folding Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training
  • Distributed Optimizer for MoE (ZeRO-1 equivalent)
  • (🚀New!) Megatron FSDP/HSDP with full expert parallel support

Optimizations

  • Memory-efficient token permutation
  • Fine-grained Recomputations (mla, moe, mlp, moe_act, norm)
  • GroupedGEMM and Gradient Accumulation Fusion
  • DP/PP/TP/EP Communication Overlapping
  • Advanced fusions for router, permutation, MLA RoPE, FP8 casting, etc.
  • cuDNN fused attention and FlashAttention integration
  • ✅ (🚀New!) 1F1B EP A2A Overlap - Hiding Expert Parallel Communication with 1F1B Pipeline Schedule
  • (🚀New!) Muon and Layer-wise distributed optimizer
  • (🚀New!) Pipeline-aware fine-grained activation offloading ([Dev] feat(moe): Fine-grained activation offloading, #1912)
  • (🚀New!) Production-ready cudaGraph support for MoE
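Token permutation is the step that groups each token's activations by destination expert before the grouped GEMM, then restores the original order afterward. A toy NumPy sketch of the round trip (not Megatron's kernels; the memory saving comes from materializing only an integer permutation rather than extra activation copies):

```python
import numpy as np

def permute(tokens, expert_idx):
    """Group token activations by assigned expert via a stable argsort.

    Only the int index array `order` is stored for the reverse pass."""
    order = np.argsort(expert_idx, kind="stable")   # [num_tokens]
    return tokens[order], order

def unpermute(expert_out, order):
    """Scatter expert outputs back to the original token order."""
    out = np.empty_like(expert_out)
    out[order] = expert_out
    return out

tokens = np.arange(12, dtype=np.float32).reshape(6, 2)   # 6 tokens, hidden=2
expert_idx = np.array([2, 0, 1, 0, 2, 1])                # routing decision per token
grouped, order = permute(tokens, expert_idx)             # expert-contiguous layout
restored = unpermute(grouped * 1.0, order)               # identity "expert" round trip
```

Once tokens are expert-contiguous, all experts' feed-forward layers can run as a single grouped GEMM instead of one small GEMM per expert.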

Precision Support

  • GroupedGEMM including FP8/MXFP8 support
  • FP8 weights with BF16 optimizer states
  • FP8 training full support
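The "FP8 weights with BF16 optimizer states" item follows the master-weights pattern: the optimizer updates a higher-precision copy, and a per-tensor-scaled FP8 cast is produced only for the GEMMs. A toy sketch of the pattern (float32 stands in for BF16, and a coarse uniform grid stands in for FP8 E4M3; `fake_fp8_cast` is hypothetical, not a Transformer Engine API):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fake_fp8_cast(x, levels=256):
    """Quantize to a per-tensor-scaled uniform grid as a stand-in for FP8."""
    scale = np.abs(x).max() / E4M3_MAX      # per-tensor scaling factor
    step = 2 * E4M3_MAX / (levels - 1)
    q = np.round((x / scale) / step) * step
    return (q * scale).astype(np.float32), scale

master = np.array([0.3, -1.2, 2.5], dtype=np.float32)   # high-precision master weights
grad = np.array([0.1, -0.05, 0.2], dtype=np.float32)
master -= 0.01 * grad                                   # optimizer step in high precision
fp8_w, scale = fake_fp8_cast(master)                    # low-precision copy for GEMMs
```

Keeping the master copy in the optimizer bounds the rounding error to one quantization step per forward cast, instead of letting it accumulate across updates.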

Optimized Expert Parallel Communication Support

  • DeepEP support for H100 and B200
  • (🚀New!) HybridEP for GB200

Developer Experience

  • MoE Model Zoo with pre-training best practices
  • MCore2HF Converter for ecosystem compatibility in megatron-bridge
  • Distributed Checkpointing Support
  • Runtime Upcycling Support for efficient model scaling
  • Layer-wise logging for detailed monitoring

Next Release Roadmap (MCore v0.17)

  • Performance & Kernel Optimizations
  • Long Context & Context Parallel
  • Model & Architecture
  • Advanced Functionality
  • CUDA Graph Enhancements
  • Ongoing Long-term Features


v0.16 Update Highlights

  • Performance & Memory
  • CUDA Graph
  • Model & Parallelism
  • Fine-grained Activation Offloading Enhancement
  • Megatron-FSDP
  • Communication
  • Optimizer
  • Critical Bug Fixes


Call for Community Contributions

  • Model implementations - Additional MoE model variants
  • Performance testing - Performance tests across different platforms and workloads
  • Documentation and tutorials - Best practices and optimization guides
  • Bug fixes

This roadmap reflects the collective efforts of NVIDIA and our collaborators.

Credits: MCore MoE Team and @sbhavani

Labels: roadmap, moe, call-for-contribution
