
Megatron MoE Model Zoo

Production-ready training recipes for state-of-the-art MoE models — DeepSeek-V3, Qwen3, and Mixtral — built on the 🚀 Megatron-Core dev branch.

✅ Performance-tuned configs for H100, B200, and GB200 clusters
✅ Model-specific best practices for training MoE models
✅ One-command launch with sensible defaults
✅ Dry-run mode to validate arguments before submitting jobs

Best Practices

Ready-to-run scripts with optimized configurations:

| Model | Hardware | Scripts |
|---|---|---|
| DeepSeek-V3 | H100, B200, GB200 | best_practice/DeepSeekV3/ |
| Qwen3 | H100 | best_practice/Qwen3/ |
| Mixtral | H100 | best_practice/Mixtral/ |

See best_practice/ for detailed guides.

Quick Start

Prerequisites

Install yq for YAML processing (one-time setup):

mkdir -p ~/.local/bin && wget -qO ~/.local/bin/yq https://github.com/mikefarah/yq/releases/download/v4.27.5/yq_linux_amd64 && chmod +x ~/.local/bin/yq
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc
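Verify the install; the pinned binary should report v4.27.5:

yq --version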

Environment Variables

| Variable | Description | Example |
|---|---|---|
| MEGATRON_PATH | Path to Megatron-LM | /path/to/Megatron-LM |
| CONTAINER_IMAGE | Container image path | /path/to/image.sqsh |
| CLUSTER | Cluster name; loads cluster-specific settings such as data paths | EOS, CW |
| WANDB_API_KEY | (Optional) WandB API key | From wandb.ai/authorize |
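A minimal setup sketch with placeholder values (substitute your own paths; CLUSTER must match one of the repo's cluster configs):

export MEGATRON_PATH=/path/to/Megatron-LM
export CONTAINER_IMAGE=/path/to/image.sqsh
export CLUSTER=EOS
export WANDB_API_KEY=your_api_key_here   # optional; from wandb.ai/authorize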

Container

Dockerfile: dockers/Dockerfile (also available: B200.Dockerfile, GB200.Dockerfile)
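A hedged build example (the megatron-moe:dev tag is arbitrary; swap in B200.Dockerfile or GB200.Dockerfile for those platforms):

docker build -f dockers/Dockerfile -t megatron-moe:dev .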

Performance Benchmarking

Supported Models

Mixtral-8x2B, Mixtral-8x7B, Mixtral-8x22B, DeepSeek-V2, DeepSeek-V2-Lite, DeepSeek-V3, Qwen2-57B-A14B, Qwen3-235B-A22B, Qwen3-30B-A3B, Qwen3-Next-80B-A3B

Launch

Basic launch:

MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh

With custom or overridden parameters:

MODEL=DeepSeek-V3 TP=2 PP=8 EP=64 VPP=1 RUN_TIME=00:60:00 NNODES=64 \
  bash sbatch_benchmarking.sh --recompute-granularity selective --recompute-modules mla_up_proj layernorm

💡 Tip: Dry Run Mode — Preview the generated SLURM script and training command without submitting to the cluster:

DRY_RUN=1 MODEL=DeepSeek-V3 bash ./sbatch_benchmarking.sh

Doing a dry run first is highly recommended to catch configuration errors before the job reaches the queue.

Configuration

Runtime configs (runtime_configs/benchmarking/runtime.conf; a sketch follows the list):

  • Parallelism: TP, PP, EP, CP, VPP, PP_FIRST, PP_LAST, LAYERS_PER_VP
  • Batch sizes: MBS, GBS
  • Training: NNODES, RUN_TIME, NUM_LAYERS, SEQ_LEN
  • MoE: MOE_TOKEN_DISPATCHER, MOE_GROUPED_GEMM

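A hedged sketch of a runtime.conf using the variables above; the values are purely illustrative, not tuned recommendations:

TP=2
PP=8
EP=64
VPP=1
MBS=1
GBS=8192
NNODES=64
SEQ_LEN=4096
MOE_TOKEN_DISPATCHER=alltoall
MOE_GROUPED_GEMM=true
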
Cluster configs (cluster_configs/benchmarking/template.conf; a sketch follows the list):

  • Slurm: ACCOUNT, PARTITION, RUN_NAME, CONTAINER_MOUNTS
  • Paths: OUTPUT_PATH, DATA_PATH, TOKENIZER_MODEL, LOAD_PATH
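A matching sketch for the cluster template, with placeholder account, partition, and paths:

ACCOUNT=my_account
PARTITION=batch
RUN_NAME=deepseek-v3-benchmark
CONTAINER_MOUNTS=/lustre:/lustre
OUTPUT_PATH=/path/to/output
DATA_PATH=/path/to/data
TOKENIZER_MODEL=/path/to/tokenizer.model
LOAD_PATH=/path/to/checkpoint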

Job Monitoring

watch -n 1 squeue -u $USER
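For more detail than the default columns, a standard squeue format string works (optional):

watch -n 1 'squeue -u $USER -o "%.18i %.9P %.30j %.8T %.10M %.6D"'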

Checkpoint Conversion

For HF↔MCore conversion, consider MBridge or Megatron-Bridge.

DeepSeek-V3

1. Download and convert to BF16:

git lfs install && git clone https://huggingface.co/deepseek-ai/DeepSeek-V3
cd DeepSeek-V3   # fp8_cast_bf16.py ships in the repo's inference/ directory
python inference/fp8_cast_bf16.py --input-fp8-hf-path /input/fp8/path --output-bf16-hf-path /output/bf16/path

2. Convert to Megatron legacy checkpoint:

MODEL=DeepSeek-V3 bash ./ckpt_convert_scripts/DeepSeek-V3/convert_deepseek_v3.sh

3. Convert to distributed checkpoint:

MODEL=DeepSeek-V3 TP=1 PP=4 EP=64 VPP=1 PP_FIRST=16 PP_LAST=13 NNODES=32 LOAD_PATH=/path/to/legacy/ckpt \
  bash ./sbatch_benchmarking.sh --ckpt-convert-save /path/to/dist/ckpt --ckpt-convert-format torch_dist --no-save-optim

Storage: legacy checkpoint ~3.4 TB, distributed checkpoint ~1.4 TB
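To sanity-check a finished conversion against these figures, compare on-disk sizes (the paths are the ones used above):

du -sh /path/to/legacy/ckpt /path/to/dist/ckpt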

References

  • Design Docs - Implementation details for MTP, VPP, EP overlapping, etc.
