
Conversation


@tina-wen tina-wen commented Nov 26, 2025

Feature description

Enable the GMM NZ training optimization in the xtuner framework: in FSDP2, the copy_out step is fused with the NZ-format conversion, and GMM computation uses the NPU-affine NZ weight format to obtain a performance gain.

Changes

  1. Adapt the SliceNz fused operator: obtain the corresponding CANN operator's aclnn interface and do a structured PTA adaptation against mainline 2.6.0 and later, exposing the torch_npu.npu_special_slice interface
  2. Modify the original copy_out implementation in PTA: patch the PTA source to apply SliceNz to GMM weights
  3. Enable 512 alignment: shard only each layer's MLP module, so that GMM NZ performance does not degrade
  4. Adapt the GMM operator's forward and backward in PTA: torch_npu.grouped_matmul accepts NZ-format weight input, but the backward pass has to be implemented manually
  5. Eliminate redundant transposes: change weight initialization and loading in the xtuner framework to transpose in_feature and out_feature up front
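Item 4 above amounts to implementing the standard per-group matmul gradients by hand. Below is a minimal numpy sketch of just that math; the names are hypothetical, and the real patch wraps the NPU operator (presumably via a torch.autograd.Function) and additionally has to handle the NZ weight layout:

```python
import numpy as np

def grouped_matmul_fwd(x, w, group_sizes):
    # x: (total_tokens, in_dim); w: (num_experts, in_dim, out_dim).
    # Each contiguous token group is multiplied by its own expert weight.
    outs, start = [], 0
    for e, n in enumerate(group_sizes):
        outs.append(x[start:start + n] @ w[e])
        start += n
    return np.concatenate(outs, axis=0)

def grouped_matmul_bwd(x, w, group_sizes, grad_out):
    # Standard matmul gradients, applied independently per expert group:
    #   grad_x = grad_out @ w^T,   grad_w = x^T @ grad_out
    grad_x, grad_w, start = np.empty_like(x), np.zeros_like(w), 0
    for e, n in enumerate(group_sizes):
        g = grad_out[start:start + n]
        grad_x[start:start + n] = g @ w[e].T
        grad_w[e] = x[start:start + n].T @ g
        start += n
    return grad_x, grad_w
```

This only illustrates the gradient bookkeeping; in the actual adaptation the weights sit in NZ layout, so the backward additionally needs layout-consistent transposes.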

Enabling the feature

See the test_qwen3_235b_npu.sh script: turn on the GROUPED_MATMUL_NZ_TRANSPOSE switch and the 512-alignment switch LINEAR_ONLY_SHARD.
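A hypothetical excerpt in the style of test_qwen3_235b_npu.sh; the switch names come from this PR's description, and "1" as the on-value is an assumption:

```shell
# Enable the fused copy_out + NZ conversion for GMM weights
export GROUPED_MATMUL_NZ_TRANSPOSE=1
# 512 alignment: shard only the MLP (linear) modules of each layer
export LINEAR_ONLY_SHARD=1
```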

Testing

  • Cluster scale: Ascend 910A3, 32 nodes / 512 dies
  • Model scenario: SFT of the Qwen3 235B MoE model
  • Key hyperparameters: sequence length 32k, EP disabled

Results

| Test item | GPU baseline | Before optimization | After optimization | Gain |
| --- | --- | --- | --- | --- |
| Throughput (tokens/s per card) | 2100+ | 2050 | 2140 | 2% |


acat-rw commented Nov 26, 2025

The run package should not be committed directly to the repo.


acat-rw commented Nov 26, 2025

This should be merged as feature-level PRs; also, the PR title needs to carry the key information.

@tina-wen tina-wen changed the title from "Feature wtg" to "Adapt the PTA framework for the training GMM NZ feature" Nov 28, 2025
@tina-wen
Author

The run package should not be committed directly to the repo.

The latest commit has removed the run package.

@tina-wen
Author

This should be merged as feature-level PRs; also, the PR title needs to carry the key information.

The dcp.save optimization has been split out into a separate PR: #2

@tina-wen tina-wen changed the base branch from main to ascend November 28, 2025 09:04
@tina-wen tina-wen changed the title from "Adapt the PTA framework for the training GMM NZ feature" to "Training-side adaptation of the GMM NZ feature" Dec 9, 2025
source ${CANN_DIR}/set_env.sh

# Install the GMM K-axis-split patch for PTA 2.6.0
pip install /path/to/torch_npu-custom.whl --force-reinstall


The Q3 torch_npu already supports K-axis splitting; suggest changing the install path, or removing this.

Author


This PTA package carries the adaptation of the custom SliceNz fused operator. Since PTA mainline may not accept this operator, the note here indicates that users must install the custom package in their own training environment in order to call the interface.

dim: int,
out: List[torch.Tensor],
) -> None:
if len(all_gather_input_split_sizes) > 1 and out[-1].shape[0] * out[-1].shape[1] >= 128 * 4096*1536:


Is this hardcoded because of FSDP's sharding axis for the MoE layout? This code can be reused by other models, so please express the condition in a human-readable way.

Author

@tina-wen tina-wen Dec 9, 2025


In the latest commit the condition is expressed as num_experts * hidden_size * moe_intermediate_size, which is used to decide whether the weights currently being copied out will later be used in GMM computation.
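Expressed that way, the check becomes self-documenting. A hypothetical sketch of the predicate (function and parameter names assumed, not taken from the actual patch):

```python
def is_gmm_expert_weight(numel, num_experts, hidden_size, moe_intermediate_size):
    # True when the copied-out buffer is at least as large as the stacked
    # expert weights that grouped matmul consumes later; this replaces the
    # magic constant 128 * 4096 * 1536 from the original diff.
    return numel >= num_experts * hidden_size * moe_intermediate_size
```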

@@ -0,0 +1 @@
from torch_npu import npu_special_slice


Is this file necessary?

Author


Yes. The patch to PTA's internal copy_out implementation needs to call the npu_special_slice interface, which is itself also adapted inside PTA. Since PTA cannot call into itself there, the import is done externally.

hf_keys_start = int(fsdp_start / hf_key_size)
hf_keys_end = math.ceil(fsdp_end / hf_key_size)

if int(os.getenv("GROUPMM_NZ_TRANSPOSE","0")) == 1 and len(hf_keys) == 128*2 and torch.distributed.get_world_size() == 512: # gate & up case; the down case must be excluded


Can this be decided directly from the contents of hf_keys? The current form is prone to false matches. Also, how is a world size greater than 512 handled?

Author


hf_keys is a list of module names, covering the gate, up, and down projection weights of each layer's experts. The latest commit takes this branch only when len(hf_keys) equals 2 * num_experts (i.e. the gate & up case) and world_size >= num_experts.
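That condition can be sketched as a small predicate; names here are illustrative, not the actual patch:

```python
def takes_gate_up_branch(hf_keys, world_size, num_experts, nz_transpose_on):
    # gate & up case: hf_keys holds one gate and one up projection name per
    # expert, so its length is 2 * num_experts; the down-projection case
    # (length == num_experts) falls through. The branch also requires at
    # least one rank per expert (world_size >= num_experts).
    return (nz_transpose_on
            and len(hf_keys) == 2 * num_experts
            and world_size >= num_experts)
```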

out[-1].resize_(num_exp,out_dim1,in_dim1)
out[-2].resize_(num_exp,out_dim2,in_dim2)

npu_special_slice(all_gather_output, dim, weight_1_start, total_size, out[-1])


Is the NZ conversion done inside this call? The name "special slice" is too generic.

out[-2].resize_(num_expert,hidden_size,moe_intermediate_size*2)

# GMM weight slicing and NZ conversion use the fused operator
npu_special_slice(all_gather_output, dim, up_start, total_size, out[-1])


Should the naming here be more human-readable?
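For readers unfamiliar with the format under discussion: NZ is Ascend's blocked ("fractal") weight layout, in which a 2D matrix is stored as small tiles in a hardware-friendly order. A rough numpy illustration of such re-tiling follows; the block size of 16 and the tile ordering are simplifications for illustration, not the precise Ascend specification:

```python
import numpy as np

def to_blocked(mat, b=16):
    # Re-tile an (M, N) matrix (M and N both multiples of b) into tiles of
    # shape (b, b), stored as (N // b, M // b, b, b): column-block-major,
    # roughly mirroring the NZ "fractal" convention.
    m, n = mat.shape
    assert m % b == 0 and n % b == 0
    return mat.reshape(m // b, b, n // b, b).transpose(2, 0, 1, 3)
```

The point of the fused operator is that slicing the all-gather output and producing this blocked layout happen in one kernel instead of two passes over the weights.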

@acat-rw acat-rw merged commit 13bb9ca into Eco-Sphere:ascend Dec 11, 2025