Commit 59914e1

[megatron] support qwen3.5 CP (#9022)
1 parent 963cb15 commit 59914e1

2 files changed

Lines changed: 2 additions & 0 deletions

docs/source/BestPractices/Qwen3_5-Best-Practice.md

Lines changed: 1 addition & 0 deletions
@@ -311,6 +311,7 @@ Tips for training Qwen3.5 with Megatron-SWIFT:
 - Full-parameter training: refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh)
 - MTP training: "mcore-bridge>=1.1.0" supports multimodal MTP training (for now, the [main branch](https://github.com/modelscope/mcore-bridge/pull/14) must be installed); please install the corresponding version.
 - TP restriction lifted: using "megatron-core>=0.16" removes the `num_query_groups` restriction on TP.
+- CP support: "mcore-bridge>=1.1.0" supports CP training for GDN (for now, the [main branch](https://github.com/modelscope/mcore-bridge/pull/16) must be installed); the megatron-core dev branch must also be installed.
 - By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; earlier versions default to the transformers implementation). Setting the environment variable `USE_MCORE_GDN=0` switches to the transformers implementation, which does not support packing or TP for GDN.
 - padding_free/packing support: packing can speed up training. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh)
 - apply_wd_to_qk_layernorm: apply weight decay to the qk layernorm. Defaults to False.

docs/source_en/BestPractices/Qwen3_5-Best-Practice.md

Lines changed: 1 addition & 0 deletions
@@ -309,6 +309,7 @@ Tips for training Qwen3.5 with Megatron-SWIFT:
 - Full parameter training: Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/mcore_full.sh).
 - Regarding MTP training: `mcore-bridge>=1.1.0` supports multimodal MTP training (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/14)). Please install the corresponding version.
 - TP Limitation Removed: Using `megatron-core>=0.16` removes the `num_query_groups` limitation on TP.
+- CP support: "mcore-bridge>=1.1.0" supports CP training for GDN (currently requires installing the [main branch](https://github.com/modelscope/mcore-bridge/pull/16)). Additionally, the megatron-core dev branch needs to be installed.
 - By default, `GatedDeltaNet` uses the Megatron implementation, which requires "megatron-core>=0.16" (ms-swift>=4.1.0; previous versions defaulted to the transformers implementation). Set the environment variable `USE_MCORE_GDN=0` to switch to the transformers implementation. Note that the transformers implementation does not support packing and GDN's TP.
 - Support for padding_free/packing: Packing can improve training speed. Refer to [this example](https://github.com/modelscope/ms-swift/tree/main/examples/models/qwen3_5/packing.sh).
 - apply_wd_to_qk_layernorm: Apply weight decay to qk layernorm. Default is False.
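
The version requirements in these tips can be checked before launching a run. The following is a minimal sketch, not part of ms-swift: the helper names (`parse_version`, `meets_minimum`, `check_environment`, `gdn_backend`) are hypothetical. It assumes the pip distribution names match the strings quoted above (`megatron-core`, `mcore-bridge`), and that any value of `USE_MCORE_GDN` other than `0` keeps the Megatron implementation. Pre-release suffixes such as `rc1` or `.dev0` are ignored, so a dev-branch install is compared by its numeric release only.

```python
import os
import re
from importlib import metadata

# Minimum versions quoted in the tips above (distribution names are assumed).
REQUIREMENTS = {"megatron-core": "0.16", "mcore-bridge": "1.1.0"}


def parse_version(text):
    """Return the leading numeric release, e.g. '0.16.0rc1' -> (0, 16, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no numeric release in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """Compare numeric releases, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(get_version=metadata.version):
    """Map each required package to (installed_version_or_None, is_ok)."""
    report = {}
    for name, minimum in REQUIREMENTS.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


def gdn_backend(environ=os.environ):
    """Assumed reading of USE_MCORE_GDN: only '0' selects transformers."""
    return "transformers" if environ.get("USE_MCORE_GDN") == "0" else "megatron"


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        print(f"{name}: installed={installed} ok={ok}")
    print("GatedDeltaNet backend:", gdn_backend())
```

A version check of this kind cannot confirm that a package was installed from the main or dev branch, so for CP training it is only a first sanity check.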
