Description
[DistTrain] Disaggregated training system
This issue tracks the implementation of the [DistTrain] disaggregated training system for multi-modality model training.
We plan to set up an end-to-end example for training Qwen3-VL-235B-A22B-Instruct with MDP.
Note:
Both MDP and DistTrain allow the encoder and the LLM backbone to have different DP/TP/PP sizes.
The main difference between MDP and DistTrain is that MDP collocates the encoder and the LLM backbone on the same GPUs, while DistTrain places them on separate GPUs.
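The distinction above can be sketched with a small resource-accounting example. This is an illustrative snippet, not Megatron-LM's actual API: `ModuleParallelConfig`, `total_gpus`, and the assumption that MDP's collocated setup runs both modules on the backbone's GPU pool are all hypothetical simplifications.

```python
from dataclasses import dataclass

@dataclass
class ModuleParallelConfig:
    """Hypothetical per-module parallelism config (not Megatron-LM's real API)."""
    dp: int  # data-parallel size
    tp: int  # tensor-parallel size
    pp: int  # pipeline-parallel size

    @property
    def gpus(self) -> int:
        # A module's GPU footprint is the product of its parallel sizes.
        return self.dp * self.tp * self.pp

def total_gpus(scheme: str,
               encoder: ModuleParallelConfig,
               backbone: ModuleParallelConfig) -> int:
    """Illustrative GPU accounting for the two placement schemes."""
    if scheme == "disttrain":
        # Disaggregated: encoder and backbone occupy disjoint GPU sets.
        return encoder.gpus + backbone.gpus
    if scheme == "mdp":
        # Collocated: assume both modules time-share the backbone's GPU pool.
        return backbone.gpus
    raise ValueError(f"unknown scheme: {scheme}")

# Encoder and backbone may use different DP/TP/PP sizes under either scheme.
encoder = ModuleParallelConfig(dp=8, tp=1, pp=1)   # 8 GPUs
backbone = ModuleParallelConfig(dp=2, tp=4, pp=2)  # 16 GPUs

print(total_gpus("disttrain", encoder, backbone))  # 24: separate pools
print(total_gpus("mdp", encoder, backbone))        # 16: shared pool
```

The sketch only shows why the same per-module parallel sizes lead to different cluster footprints; the real systems additionally differ in how activations are exchanged between the encoder and backbone ranks.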
References
- End-to-end demo based on the old Megatron version (before M4): Add DistTrain, Allow Encoder to Have Different DP Size NVIDIA/Megatron-LM#1605
- A simple toy demo based on the new Megatron version (after M4): [draft]DistTrain demo for mutil module training NVIDIA/Megatron-LM#1995
Dependencies