
Conversation

tina-wen commented Nov 28, 2025

Problem description

Training on the 910A3 currently runs on PyTorch 2.7.1, and every distributed dcp.save has to process the full meta_data. meta_data is a deeply nested dictionary; serializing and deserializing this format on the ARM CPU is slow and cannot be overlapped with the main training process, which significantly hurts training efficiency.
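For reference, a minimal sketch of the unpatched save path (`model`, `optimizer`, and `checkpoint_dir` are placeholders for whatever xtuner actually assembles; the exact call site may differ):

```python
import torch.distributed.checkpoint as dcp

# Sharded training state as assembled by the framework (placeholder names).
state_dict = {
    "model": model.state_dict(),          # weights, ~450 GB total in the test below
    "optimizer": optimizer.state_dict(),  # optimizer state, ~921 GB total
}

# With only a checkpoint path, dcp builds its storage writer and planner
# internally on every call, so the full nested meta_data is re-serialized
# (and re-gathered across ranks) on every save.
dcp.save(state_dict, checkpoint_id=checkpoint_dir)
```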

Solution

  • Patch the xtuner framework (without modifying PyTorch or PTA source) to add incremental saving of meta_data. The changes are:
  1. Incremental save logic
  • Only the second and later checkpoint saves are modified: they persist only the parts of meta_data that changed relative to the first save, which removes most of the (de)serialization time
  2. xtuner framework patch
  • Add a patch_for_dcp_finish switch to the training hyperparameter config so users can turn the feature on or off
  • Change the dcp.save arguments from the weight/optimizer checkpoint path checkpoint_dir to storage_writer and planner instances (see the sketch after this list)
  3. Compatibility
  • The change does not affect reading dcp checkpoints
  • Resuming training from an incrementally saved checkpoint and running 10 steps shows no precision issues in loss or gradients
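A minimal sketch of the patched call path and of the incremental idea, assuming the patch keeps long-lived writer/planner objects across saves. `nested_diff` and the comment about substituting an incremental planner are illustrative, not the PR's actual class or function names; only `FileSystemWriter`, `DefaultSavePlanner`, and the `dcp.save` keyword arguments are real PyTorch APIs.

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import DefaultSavePlanner, FileSystemWriter


def nested_diff(prev: dict, curr: dict) -> dict:
    """Return only the entries of `curr` that differ from `prev`.

    Illustrates the incremental idea on plain nested dicts; the actual patch
    applies this kind of delta to dcp's checkpoint meta_data.
    """
    delta = {}
    for key, value in curr.items():
        if key not in prev:
            delta[key] = value
        elif isinstance(value, dict) and isinstance(prev[key], dict):
            sub = nested_diff(prev[key], value)
            if sub:
                delta[key] = sub
        elif value != prev[key]:
            delta[key] = value
    return delta


# Instead of handing dcp.save a checkpoint_dir and letting it rebuild everything,
# the patched call site passes explicit storage_writer and planner instances.
writer = FileSystemWriter(checkpoint_dir)   # same checkpoint_dir as before
planner = DefaultSavePlanner()              # the patch would substitute an incremental planner here

dcp.save(state_dict, storage_writer=writer, planner=planner)
```

The point of the call-site change is that the caller now owns the writer and planner, so it can cache the meta_data produced on the first save and feed only the delta into later saves; that is where the (de)serialization savings come from.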

Test results

On Ascend 910A3, distributed saving of a Qwen3 235B model across 512 cards:

  • Weight save (450 GB): end-to-end time drops from 132.6 s to 20.13 s, roughly an 85% reduction
  • Optimizer-state save (921 GB): end-to-end time drops from 302.44 s to 34.46 s, i.e. to about 11.4% of the original (roughly an 89% reduction)
  • Fine-grained breakdown: CPU-side inter-process communication (all_gather_object, scatter_object, and broadcast_object) drops from hundreds of seconds to milliseconds

tina-wen mentioned this pull request Nov 28, 2025
tina-wen changed the base branch from main to ascend on Nov 28, 2025 09:04
