fix loading dcp OOM by zjjott · Pull Request #1747 · NVIDIA/Megatron-LM

zjjott · 2025-08-14T03:56:53Z

BestJuly · 2025-08-14T04:24:26Z

megatron/core/transformer/mlp.py

-                gc.collect()
-                torch.cuda.empty_cache()
-                return merged_sub_state_dict
+            merged_sub_state_dict = torch.cat([t.cpu() for t in sub_state_dict])


This part forces the state_dict to be loaded on the CPU. Is that because, without it, the merge occupies additional GPU memory and the OOM error is then reported somewhere else?

Yes. Megatron's checkpoint.py resumes the model first and then the optimizer. The model resume step should save memory; otherwise, resuming the optimizer will OOM.
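For readers following the thread, here is a minimal sketch of the idea behind the added line: each shard is moved to the CPU before concatenation, so the merged copy is allocated in host memory instead of on the GPU. The helper name `merge_shards_on_cpu` and the toy shards are hypothetical and only for illustration. This is not the actual code in megatron/core/transformer/mlp.py, and any memory saving on a real run depends on the GPU shards being released afterwards.

```python
import torch


def merge_shards_on_cpu(shards):
    """Concatenate tensor shards along dim 0 in host memory.

    Each shard is moved to the CPU first, so the merged result is
    allocated in host RAM rather than on the GPU.
    """
    return torch.cat([t.cpu() for t in shards])


# Hypothetical shards; in a real run these would come from the checkpoint loader.
device = "cuda" if torch.cuda.is_available() else "cpu"
shards = [torch.full((2, 3), float(i), device=device) for i in range(4)]

merged = merge_shards_on_cpu(shards)
print(merged.shape, merged.device)
```

The merged tensor always ends up on the CPU, whichever device the shards started on, which is the behavior the review question above is about.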

slyviacassell · 2025-10-01T07:23:13Z

Dear all, is there any update on this bug?

Phlip79 · 2026-03-04T23:01:49Z

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

fix loading dcp OOM

bd6a660

BestJuly self-requested a review August 14, 2025 04:00

BestJuly reviewed Aug 14, 2025

sbhavani added the bug ("Something isn't working") label Sep 9, 2025

ko3n1g requested review from a team as code owners February 18, 2026 09:18

Phlip79 marked this pull request as draft March 4, 2026 23:01

github-actions bot added the community-request label Mar 4, 2026

zjjott wants to merge 1 commit into NVIDIA:main from zjjott:main

zjjott Aug 15, 2025

5 participants
