
Conversation

tina-wen commented Nov 28, 2025

Problem description

Training on the 910A3 currently runs on PyTorch 2.7.1, and every distributed dcp.save has to process the full meta_data. meta_data is a deeply nested dictionary; serializing and deserializing this format on the ARM CPU is slow and cannot be overlapped with the main training process, which significantly hurts training efficiency.
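For reference, a minimal sketch of the unpatched save path (`model`, `optimizer`, and `checkpoint_dir` are placeholders for whatever xtuner actually assembles; the exact call site may differ):

```python
import torch.distributed.checkpoint as dcp

# Sharded training state as assembled by the framework (placeholder names).
state_dict = {
    "model": model.state_dict(),          # weights, ~450 GB total in the test below
    "optimizer": optimizer.state_dict(),  # optimizer state, ~921 GB total
}

# With only a checkpoint path, dcp builds its storage writer and planner
# internally on every call, so the full nested meta_data is re-serialized
# (and re-gathered across ranks) on every save.
dcp.save(state_dict, checkpoint_id=checkpoint_dir)
```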

Solution

  • Patch the xtuner framework (without modifying PyTorch or PTA source) to add incremental saving of meta_data. The changes are:
  1. Incremental save logic
  • Only the second and later checkpoint saves are modified: they persist only the parts of meta_data that changed relative to the first save, which removes most of the (de)serialization time
  2. xtuner framework patch
  • Add a patch_for_dcp_finish switch to the training hyperparameter config so users can turn the feature on or off
  • Change the dcp.save arguments from the weight/optimizer checkpoint path checkpoint_dir to storage_writer and planner instances (see the sketch after this list)
  3. Compatibility
  • The change does not affect reading dcp checkpoints
  • Resuming training from an incrementally saved checkpoint and running 10 steps shows no precision issues in loss or gradients
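A minimal sketch of the patched call path and of the incremental idea, assuming the patch keeps long-lived writer/planner objects across saves. `nested_diff` and the comment about substituting an incremental planner are illustrative, not the PR's actual class or function names; only `FileSystemWriter`, `DefaultSavePlanner`, and the `dcp.save` keyword arguments are real PyTorch APIs.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import DefaultSavePlanner, FileSystemWriter


def nested_diff(prev: dict, curr: dict) -> dict:
    """Return only the entries of `curr` that differ from `prev`.

    Illustrates the incremental idea on plain nested dicts; the actual patch
    applies this kind of delta to dcp's checkpoint meta_data.
    """
    delta = {}
    for key, value in curr.items():
        if key not in prev:
            delta[key] = value
        elif isinstance(value, dict) and isinstance(prev[key], dict):
            sub = nested_diff(prev[key], value)
            if sub:
                delta[key] = sub
        elif value != prev[key]:
            delta[key] = value
    return delta


# Instead of handing dcp.save a checkpoint_dir and letting it rebuild everything,
# the patched call site passes explicit storage_writer and planner instances.
writer = FileSystemWriter(checkpoint_dir)   # same checkpoint_dir as before
planner = DefaultSavePlanner()              # the patch would substitute an incremental planner here

dcp.save(state_dict, storage_writer=writer, planner=planner)
```

The point of the call-site change is that the caller now owns the writer and planner, so it can cache the meta_data produced on the first save and feed only the delta into later saves; that is where the (de)serialization savings come from.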

Test results

On Ascend 910A3, distributed saving of a Qwen3 235B model across 512 cards:

  • Weight save (450 GB): end-to-end time drops from 132.6 s to 20.13 s, roughly an 85% reduction
  • Optimizer-state save (921 GB): end-to-end time drops from 302.44 s to 34.46 s, i.e. to about 11.4% of the original (roughly an 89% reduction)
  • Fine-grained breakdown: CPU-side inter-process communication (all_gather_object, scatter_object, and broadcast_object) drops from hundreds of seconds to milliseconds

tina-wen mentioned this pull request Nov 28, 2025
tina-wen changed the base branch from main to ascend on Nov 28, 2025 09:04
