Description
[DistTrain] Disaggregated training system
This issue tracks the implementation of the [DistTrain] disaggregated training system for multi-modality model training.
We plan to set up an end-to-end example for training Qwen3-VL-235B-A22B-Instruct with MDP.
Note:
Both MDP and DistTrain allow the encoder and the LLM backbone to have different DP/TP/PP sizes.
The main difference between MDP and DistTrain is that MDP collocates the encoder and the LLM backbone on the same GPUs, while DistTrain places them on separate GPUs.
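The distinction above can be sketched with a small resource-accounting example. This is an illustrative snippet, not Megatron-LM's actual API: `ModuleParallelConfig`, `total_gpus`, and the assumption that MDP's collocated setup runs both modules on the backbone's GPU pool are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class ModuleParallelConfig:
    """Hypothetical per-module parallelism config (not Megatron-LM's real API)."""
    dp: int  # data-parallel size
    tp: int  # tensor-parallel size
    pp: int  # pipeline-parallel size

    @property
    def gpus(self) -> int:
        # A module's GPU footprint is the product of its parallel sizes.
        return self.dp * self.tp * self.pp

def total_gpus(scheme: str,
               encoder: ModuleParallelConfig,
               backbone: ModuleParallelConfig) -> int:
    """Illustrative GPU accounting for the two placement schemes."""
    if scheme == "disttrain":
        # Disaggregated: encoder and backbone occupy disjoint GPU sets.
        return encoder.gpus + backbone.gpus
    if scheme == "mdp":
        # Collocated: assume both modules time-share the backbone's GPU pool.
        return backbone.gpus
    raise ValueError(f"unknown scheme: {scheme}")

# Encoder and backbone may use different DP/TP/PP sizes under either scheme.
encoder = ModuleParallelConfig(dp=8, tp=1, pp=1)   # 8 GPUs
backbone = ModuleParallelConfig(dp=2, tp=4, pp=2)  # 16 GPUs

print(total_gpus("disttrain", encoder, backbone))  # 24: separate pools
print(total_gpus("mdp", encoder, backbone))        # 16: shared pool
```

The sketch only shows why the same per-module parallel sizes lead to different cluster footprints; the real systems additionally differ in how activations are exchanged between the encoder and backbone ranks.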
References
- End-to-end demo based on the old Megatron version (before M4): Add DistTrain, Allow Encoder to Have Different DP Size NVIDIA/Megatron-LM#1605
- A simple toy demo based on the new Megatron version (after M4): [draft]DistTrain demo for mutil module training NVIDIA/Megatron-LM#1995
Dependencies