Commit 963cb15

[megatron] support multimodal MTP (#8390)
1 parent 0f5fb00 commit 963cb15

12 files changed

Lines changed: 31 additions & 12 deletions

docs/source/BestPractices/Qwen3_5-Best-Practice.md

Lines changed: 1 addition & 1 deletion
@@ -309,7 +309,7 @@ swift infer \
 
 Tips for training Qwen3.5 with Megatron-SWIFT:
 - Full parameter training: refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh)
-- Regarding MTP training: ms-swift does not yet support multimodal MTP training. If you train on pure text data only, set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the check
+- Regarding MTP training: "mcore-bridge>=1.1.0" supports multimodal MTP training (for now this requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)); please install the corresponding version
 - TP restriction lifted: using "megatron-core>=0.16" lifts the `num_query_groups` restriction on TP.
 - By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; earlier versions default to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation, which does not support packing or TP for GDN.
 - padding_free/packing support: packing can speed up training. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh)
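
The updated tip ties multimodal MTP training to "mcore-bridge>=1.1.0". A minimal sketch of gating on the installed version might look like the following. The helper names `release_tuple`, `supports_multimodal_mtp`, and `installed_mcore_bridge_supports_mtp` are hypothetical and not part of ms-swift or mcore-bridge. The comparison looks only at the leading numeric components, so a dev or pre-release suffix (which a main-branch install may carry) is ignored rather than ordered the way PEP 440 would order it.

```python
import re
from importlib import metadata

# Minimum version taken from the doc change: "mcore-bridge>=1.1.0".
MIN_MULTIMODAL_MTP = (1, 1, 0)


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '1.1.0.dev0' -> (1, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def supports_multimodal_mtp(version: str) -> bool:
    """Hypothetical helper: True if the release components are at least 1.1.0."""
    parts = release_tuple(version)
    # Pad with zeros so that '1.1' compares equal to '1.1.0'.
    padded = parts + (0,) * (len(MIN_MULTIMODAL_MTP) - len(parts))
    return padded >= MIN_MULTIMODAL_MTP


def installed_mcore_bridge_supports_mtp() -> bool:
    """Check the installed distribution; False if mcore-bridge is absent."""
    try:
        return supports_multimodal_mtp(metadata.version("mcore-bridge"))
    except metadata.PackageNotFoundError:
        return False
```

A launcher script could call `installed_mcore_bridge_supports_mtp()` before enabling MTP on multimodal data and print an install hint when it returns `False`.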

docs/source/GetStarted/SWIFT-installation.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,9 @@
 ```shell
 # Recommended
 pip install 'ms-swift' -U
-# For evaluation
+# Additionally install the Megatron dependencies
+pip install 'ms-swift[megatron]' -U
+# Additionally install the evaluation dependencies
 pip install 'ms-swift[eval]' -U
 # Full capabilities
 pip install 'ms-swift[all]' -U

docs/source/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
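@@ -206,6 +206,7 @@
 **MTP parameters**
 - mtp_num_layers: Number of multi-token prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to predict D additional tokens in turn. Default is None. (Requires "megatron-core>=0.14")
 - Note: the value of mtp_num_layers is not read automatically from config.json and must be set manually. You can use the `num_nextn_predict_layers` field in config.json as a reference. When mcore-bridge is used, MTP weights are loaded from the safetensors files first; if they cannot be found, they are randomly initialized. (To use blockwise fp8 + mtp, use mcore>=0.15)
+- Multimodal MTP support: requires "mcore-bridge>=1.1.0".
 - mtp_loss_scaling_factor: Scaling factor for the multi-token prediction (MTP) loss. The MTP losses are averaged over all depths and then multiplied by this factor to give the overall MTP loss, which serves as an additional training objective. Default is 0.1.
 
 **Tuner parameters**: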

docs/source/Megatron-SWIFT/Quick-start.md

Lines changed: 8 additions & 3 deletions
@@ -32,8 +32,13 @@ git clone https://github.com/NVIDIA/apex
 cd apex
 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
 
-# mcore-bridge megatron-core
-pip install "megatron-core==0.16.*" mcore-bridge -U
+# mcore-bridge
+pip install mcore-bridge -U
+# Install the main branch
+# pip install git+https://github.com/modelscope/mcore-bridge.git
+
+# megatron-core
+pip install "megatron-core==0.16.*" -U
 
 # For multi-node training, additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path
 # This ensures the dataset cache is shared, which speeds up preprocessing.
@@ -67,7 +72,7 @@ modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu2
 | transformer-engine | >=2.3 | 2.12.0 | |
 | apex | | 0.1 | |
 | megatron-core | >=0.12,<0.17 | 0.16 | |
-| mcore-bridge | >=1.0.1 | | |
+| mcore-bridge | >=1.0.2 | | |
 | flash-attn | | 2.8.3/3.0.0b1 | |
 | transformers | >=4.33 | 4.57.6/5.2.0 | |
 | modelscope | >=1.23 | | |

docs/source_en/BestPractices/Qwen3_5-Best-Practice.md

Lines changed: 1 addition & 1 deletion
@@ -307,7 +307,7 @@ swift infer \
 Tips for training Qwen3.5 with Megatron-SWIFT:
 
 - Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
-- Regarding MTP training: ms-swift currently does not support multimodal MTP training. If you are only training on pure text data, please set the `SKIP_MULTIMODAL_MTP_VALIDATION=1` environment variable to skip the validation check.
+- Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)). Please install the corresponding version.
 - TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
 - By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; previous versions defaulted to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation. Note that the transformers implementation does not support packing and GDN's TP.
 - Support for padding_free/packing: Packing can improve training speed. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).

docs/source_en/GetStarted/SWIFT-installation.md

Lines changed: 3 additions & 1 deletion
@@ -7,7 +7,9 @@ You can install it using pip:
 ```shell
 # recommend
 pip install 'ms-swift' -U
-# For evaluation usage
+# Install additional Megatron dependencies
+pip install 'ms-swift[megatron]' -U
+# Install additional evaluation dependencies
 pip install 'ms-swift[eval]' -U
 # Full capabilities
 pip install 'ms-swift[all]' -U

docs/source_en/Megatron-SWIFT/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions
@@ -218,6 +218,7 @@ For guidance on selecting parallelization strategies, please refer to the [Train
 **MTP Parameters**
 - mtp_num_layers: Number of Multi-Token Prediction (MTP) layers. MTP extends the prediction scope at each position to multiple future tokens. This MTP implementation uses D sequential modules to sequentially predict D additional tokens. Default is None. (requires "megatron-core>=0.14")
 - Note: The value of mtp_num_layers will not be automatically retrieved from config.json and must be set manually. You can refer to the `num_nextn_predict_layers` field in config.json to fill in this value. When using mcore-bridge, MTP weights will be loaded from safetensors files first. If not found, random initialization will be performed. (To use blockwise fp8 + mtp, please use mcore>=0.15)
+- Multimodal MTP support: Requires installing "mcore-bridge>=1.1.0".
 - mtp_loss_scaling_factor: Scaling factor of Multi-Token Prediction (MTP) loss. We compute the average of MTP losses across all depths, then multiply it by this scaling factor to obtain the overall MTP loss, which serves as an additional training objective. Default is 0.1.
 
 **Tuner Parameters**:
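
The `mtp_loss_scaling_factor` description in this hunk can be made concrete with a small sketch: average the per-depth MTP losses, scale the average, and add the result to the main loss. The function name `combine_mtp_loss` and the plain-float inputs are illustrative assumptions, not Megatron's implementation, and adding the scaled term to the main language-modeling loss is one reading of "additional training objective".

```python
from typing import Sequence


def combine_mtp_loss(main_loss: float,
                     mtp_losses: Sequence[float],
                     mtp_loss_scaling_factor: float = 0.1) -> float:
    """Add the scaled mean of per-depth MTP losses to the main loss.

    mtp_losses holds one value per MTP depth, i.e. mtp_num_layers entries.
    """
    if not mtp_losses:
        # No MTP layers configured: the objective is just the main loss.
        return main_loss
    mean_mtp = sum(mtp_losses) / len(mtp_losses)
    return main_loss + mtp_loss_scaling_factor * mean_mtp


# Two MTP depths with losses 1.0 and 3.0 average to 2.0; with the default
# factor of 0.1 the MTP term adds 0.2 to a main loss of 2.0.
total = combine_mtp_loss(2.0, [1.0, 3.0])
```

Because the per-depth losses are averaged before scaling, the size of the MTP term under this reading does not grow with `mtp_num_layers`.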

docs/source_en/Megatron-SWIFT/Quick-start.md

Lines changed: 8 additions & 3 deletions
@@ -31,8 +31,13 @@ git clone https://github.com/NVIDIA/apex
 cd apex
 pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
 
-# mcore-bridge megatron-core
-pip install "megatron-core==0.16.*" mcore-bridge -U
+# mcore-bridge
+pip install mcore-bridge -U
+# Install from main branch
+# pip install git+https://github.com/modelscope/mcore-bridge.git
+
+# megatron-core
+pip install "megatron-core==0.16.*" -U
 
 # If you are using multi-node training, please additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
 # This will ensure that the dataset cache is shared, thereby speeding up preprocessing.
@@ -67,7 +72,7 @@ Recommended Operating Environment:
 | transformer-engine | >=2.3 | 2.12.0 | |
 | apex | | 0.1 | |
 | megatron-core | >=0.12,<0.17 | 0.16 | |
-| mcore-bridge | >=1.0.1 | | |
+| mcore-bridge | >=1.0.2 | | |
 | flash-attn | | 2.8.3/3.0.0b1 | |
 | transformers | >=4.33 | 4.57.6/5.2.0 | |
 | modelscope | >=1.23 | | |
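
The install command pins `megatron-core==0.16.*`, while the table allows `>=0.12,<0.17`. The sketch below illustrates how those two constraints relate for plain numeric versions. `parse`, `matches_wildcard`, and `in_range` are hypothetical helpers written for this illustration; they handle only dotted integers, not the full PEP 440 grammar that pip implements.

```python
def parse(version: str) -> tuple:
    """Split a dotted numeric version such as '0.16.1' into (0, 16, 1)."""
    return tuple(int(part) for part in version.split("."))


def _pad(a: tuple, b: tuple) -> tuple:
    """Zero-pad both tuples to the same length so '0.12' equals '0.12.0'."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def matches_wildcard(version: str, pattern: str) -> bool:
    """Emulate a prefix match such as '==0.16.*' on numeric versions."""
    if not pattern.endswith(".*"):
        raise ValueError("pattern must end with '.*'")
    prefix = parse(pattern[:-2])
    candidate, _ = _pad(parse(version), prefix)
    return candidate[:len(prefix)] == prefix


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, as in '>=0.12,<0.17'."""
    v_lo, lo = _pad(parse(version), parse(lower))
    v_hi, hi = _pad(parse(version), parse(upper))
    return v_lo >= lo and v_hi < hi
```

Under these simplified rules, any version accepted by `matches_wildcard(v, "0.16.*")` also satisfies `in_range(v, "0.12", "0.17")`, so the pinned install command stays inside the supported range listed in the table.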

examples/models/qwen3_5/packing.sh

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
 IMAGE_MAX_TOKEN_NUM=1024 \
 VIDEO_MAX_TOKEN_NUM=128 \
 FPS_MAX_FRAMES=12 \
-SKIP_MULTIMODAL_MTP_VALIDATION=1 \
 megatron sft \
 --model Qwen/Qwen3.5-35B-A3B \
 --save_safetensors true \

requirements/megatron.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+mcore-bridge>=1.0.2
+megatron-core>=0.12
+peft>=0.15
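
The new `requirements/megatron.txt` lists minimum versions only. As a rough illustration of how such lines could be checked against an environment, the sketch below parses `name>=version` lines and reports the ones that are not met. `parse_requirements` and `unmet` are hypothetical names, only the `>=` operator is handled, and the installed versions are passed in as a plain dict rather than queried from pip.

```python
import re

# Contents of requirements/megatron.txt as added in this commit.
REQUIREMENTS = """\
mcore-bridge>=1.0.2
megatron-core>=0.12
peft>=0.15
"""

_LINE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*>=\s*(\d+(?:\.\d+)*)\s*$")


def _release(version: str) -> tuple:
    """Leading numeric components of a version, e.g. '0.16.0' -> (0, 16, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def parse_requirements(text: str) -> dict:
    """Map each package name to its minimum version tuple."""
    minimums = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unsupported requirement line: {line!r}")
        minimums[match.group(1)] = _release(match.group(2))
    return minimums


def unmet(minimums: dict, installed: dict) -> list:
    """Return names that are missing or older than the required minimum."""
    failing = []
    for name, minimum in minimums.items():
        have = installed.get(name)
        if have is None:
            failing.append(name)
            continue
        current = _release(have)
        width = max(len(current), len(minimum))
        # Zero-pad so that '0.15' and '0.15.0' compare as equal.
        current += (0,) * (width - len(current))
        required = minimum + (0,) * (width - len(minimum))
        if current < required:
            failing.append(name)
    return failing
```

For example, an environment with mcore-bridge 1.0.1 would be reported as unmet, which matches the minimum raised from 1.0.1 to 1.0.2 in the Quick-start table.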
