
Conversation


@tina-wen tina-wen commented Nov 26, 2025

Feature description

Enable the GMM NZ training optimization in the xtuner framework: in FSDP2, the copy_out step is fused with the NZ-format conversion, and GMM computation uses the NPU-affine NZ weight format to obtain a performance gain.

Changes

  1. Adapt the SliceNz fused operator: obtain the corresponding CANN operator's aclnn interface and do a structured PTA adaptation against mainline 2.6.0 and later, exposing the torch_npu.npu_special_slice interface
  2. Modify the original copy_out implementation in PTA: patch the PTA source to apply SliceNz to GMM weights
  3. Enable 512 alignment: shard only each layer's MLP module, so that GMM NZ performance does not degrade
  4. Adapt the GMM operator's forward and backward in PTA: torch_npu.grouped_matmul accepts NZ-format weight input, but the backward pass has to be implemented manually
  5. Eliminate redundant transposes: change weight initialization and loading in the xtuner framework to transpose in_feature and out_feature up front
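Item 4 above amounts to implementing the standard per-group matmul gradients by hand. Below is a minimal numpy sketch of just that math; the names are hypothetical, and the real patch wraps the NPU operator (presumably via a torch.autograd.Function) and additionally has to handle the NZ weight layout:

```python
import numpy as np

def grouped_matmul_fwd(x, w, group_sizes):
    # x: (total_tokens, in_dim); w: (num_experts, in_dim, out_dim).
    # Each contiguous token group is multiplied by its own expert weight.
    outs, start = [], 0
    for e, n in enumerate(group_sizes):
        outs.append(x[start:start + n] @ w[e])
        start += n
    return np.concatenate(outs, axis=0)

def grouped_matmul_bwd(x, w, group_sizes, grad_out):
    # Standard matmul gradients, applied independently per expert group:
    #   grad_x = grad_out @ w^T,   grad_w = x^T @ grad_out
    grad_x, grad_w, start = np.empty_like(x), np.zeros_like(w), 0
    for e, n in enumerate(group_sizes):
        g = grad_out[start:start + n]
        grad_x[start:start + n] = g @ w[e].T
        grad_w[e] = x[start:start + n].T @ g
        start += n
    return grad_x, grad_w
```

This only illustrates the gradient bookkeeping; in the actual adaptation the weights sit in NZ layout, so the backward additionally needs layout-consistent transposes.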

Enabling the feature

See the test_qwen3_235b_npu.sh script: turn on the GROUPED_MATMUL_NZ_TRANSPOSE switch and the 512-alignment switch LINEAR_ONLY_SHARD.
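A hypothetical excerpt in the style of test_qwen3_235b_npu.sh; the switch names come from this PR's description, and "1" as the on-value is an assumption:

```shell
# Enable the fused copy_out + NZ conversion for GMM weights
export GROUPED_MATMUL_NZ_TRANSPOSE=1
# 512 alignment: shard only the MLP (linear) modules of each layer
export LINEAR_ONLY_SHARD=1
```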

Testing

  • Cluster scale: Ascend 910A3, 32 nodes / 512 dies
  • Model scenario: SFT of the Qwen3 235B MoE model
  • Key hyperparameters: sequence length 32k, EP disabled

Results

| Test item | GPU baseline | Before optimization | After optimization | Gain |
| --- | --- | --- | --- | --- |
| Throughput (tokens/s per card) | 2100+ | 2050 | 2140 | 2% |


acat-rw commented Nov 26, 2025

The run package should not be committed directly to the repo.


acat-rw commented Nov 26, 2025

This should be merged as feature-level PRs; also, the PR title needs to carry the key information.

@tina-wen tina-wen changed the title from "Feature wtg" to "Adapt the PTA framework for the training GMM NZ feature" Nov 28, 2025
@tina-wen
Author

The run package should not be committed directly to the repo.

The latest commit has removed the run package.

@tina-wen
Author

This should be merged as feature-level PRs; also, the PR title needs to carry the key information.

The dcp.save optimization has been split out into a separate PR: #2

@tina-wen tina-wen changed the base branch from main to ascend November 28, 2025 09:04
@tina-wen tina-wen changed the title from "Adapt the PTA framework for the training GMM NZ feature" to "Training-side adaptation of the GMM NZ feature" Dec 9, 2025
source ${CANN_DIR}/set_env.sh

# Install the GMM K-axis-split patch for PTA 2.6.0
pip install /path/to/torch_npu-custom.whl --force-reinstall


The Q3 torch_npu already supports K-axis splitting; suggest changing the install path, or removing this.

Author


This PTA package carries the adaptation of the custom SliceNz fused operator. Since PTA mainline may not accept this operator, the note here indicates that users must install the custom package in their own training environment in order to call the interface.

dim: int,
out: List[torch.Tensor],
) -> None:
if len(all_gather_input_split_sizes) > 1 and out[-1].shape[0] * out[-1].shape[1] >= 128 * 4096*1536:


Is this hardcoded because of FSDP's sharding axis for the MoE layout? This code can be reused by other models, so please express the condition in a human-readable way.

Author

@tina-wen tina-wen Dec 9, 2025


In the latest commit the condition is expressed as num_experts * hidden_size * moe_intermediate_size, which is used to decide whether the weights currently being copied out will later be used in GMM computation.
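Expressed that way, the check becomes self-documenting. A hypothetical sketch of the predicate (function and parameter names assumed, not taken from the actual patch):

```python
def is_gmm_expert_weight(numel, num_experts, hidden_size, moe_intermediate_size):
    # True when the copied-out buffer is at least as large as the stacked
    # expert weights that grouped matmul consumes later; this replaces the
    # magic constant 128 * 4096 * 1536 from the original diff.
    return numel >= num_experts * hidden_size * moe_intermediate_size
```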

@@ -0,0 +1 @@
from torch_npu import npu_special_slice


Is this file necessary?

Author


Yes. The patch to PTA's internal copy_out implementation needs to call the npu_special_slice interface, which is itself also adapted inside PTA. Since PTA cannot call into itself there, the import is done externally.

hf_keys_start = int(fsdp_start / hf_key_size)
hf_keys_end = math.ceil(fsdp_end / hf_key_size)

if int(os.getenv("GROUPMM_NZ_TRANSPOSE","0")) == 1 and len(hf_keys) == 128*2 and torch.distributed.get_world_size() == 512: # gate & up case; the down case must be excluded


Can this be decided directly from the contents of hf_keys? The current form is prone to false matches. Also, how is a world size greater than 512 handled?

Author


hf_keys is a list of module names, covering the gate, up, and down projection weights of each layer's experts. The latest commit takes this branch only when len(hf_keys) equals 2 * num_experts (i.e. the gate & up case) and world_size >= num_experts.
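That condition can be sketched as a small predicate; names here are illustrative, not the actual patch:

```python
def takes_gate_up_branch(hf_keys, world_size, num_experts, nz_transpose_on):
    # gate & up case: hf_keys holds one gate and one up projection name per
    # expert, so its length is 2 * num_experts; the down-projection case
    # (length == num_experts) falls through. The branch also requires at
    # least one rank per expert (world_size >= num_experts).
    return (nz_transpose_on
            and len(hf_keys) == 2 * num_experts
            and world_size >= num_experts)
```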

out[-1].resize_(num_exp,out_dim1,in_dim1)
out[-2].resize_(num_exp,out_dim2,in_dim2)

npu_special_slice(all_gather_output, dim, weight_1_start, total_size, out[-1])


Is the NZ conversion done inside this call? The name "special slice" is too generic.

out[-2].resize_(num_expert,hidden_size,moe_intermediate_size*2)

# GMM weight slicing and NZ conversion use the fused operator
npu_special_slice(all_gather_output, dim, up_start, total_size, out[-1])


Should the naming here be more human-readable?
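For readers unfamiliar with the format under discussion: NZ is Ascend's blocked ("fractal") weight layout, in which a 2D matrix is stored as small tiles in a hardware-friendly order. A rough numpy illustration of such re-tiling follows; the block size of 16 and the tile ordering are simplifications for illustration, not the precise Ascend specification:

```python
import numpy as np

def to_blocked(mat, b=16):
    # Re-tile an (M, N) matrix (M and N both multiples of b) into tiles of
    # shape (b, b), stored as (N // b, M // b, b, b): column-block-major,
    # roughly mirroring the NZ "fractal" convention.
    m, n = mat.shape
    assert m % b == 0 and n % b == 0
    return mat.reshape(m // b, b, n // b, b).transpose(2, 0, 1, 3)
```

The point of the fused operator is that slicing the all-gather output and producing this blocked layout happen in one kernel instead of two passes over the weights.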

@acat-rw acat-rw merged commit 13bb9ca into Eco-Sphere:ascend Dec 11, 2025