fix loading dcp OOM #1747

Draft
zjjott wants to merge 1 commit into NVIDIA:main from zjjott:main

Conversation

@zjjott zjjott commented Aug 14, 2025

@BestJuly BestJuly self-requested a review August 14, 2025 04:00
gc.collect()
torch.cuda.empty_cache()
return merged_sub_state_dict
merged_sub_state_dict = torch.cat([t.cpu() for t in sub_state_dict])
Contributor

This part forces the state_dict to be loaded on the CPU. Is that because, without it, loading would occupy additional memory and the OOM error would then be reported somewhere else?

Author

Yes. Megatron's checkpoint.py resumes the model first and then the optimizer, so resuming the model has to save memory; otherwise resuming the optimizer will OOM.
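
For readers following the thread, here is a minimal sketch of the idea in the reviewed line, not the PR's actual code: each shard is copied to host memory before concatenation, so the merged tensor is allocated on the CPU instead of the GPU. The function name `merge_shards_on_cpu` and the toy shard shapes are hypothetical and chosen only for illustration.

```python
import gc
from typing import Sequence

import torch


def merge_shards_on_cpu(shards: Sequence[torch.Tensor]) -> torch.Tensor:
    """Concatenate shards along dim 0, placing the result in host memory.

    Hypothetical helper for illustration; not a Megatron function.
    """
    # Tensor.cpu() returns the tensor itself if it is already on the CPU,
    # otherwise a copy in host memory, so torch.cat allocates its output there.
    merged = torch.cat([t.cpu() for t in shards])
    # Collect unreachable Python objects, then ask the CUDA caching allocator
    # to release cached blocks that are no longer in use.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return merged


# Toy shards: four 2x3 tensors, on the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
shards = [torch.full((2, 3), float(i), device=device) for i in range(4)]
merged = merge_shards_on_cpu(shards)
```

Note that the device memory held by the input shards is only returned once the caller drops its own references to them; `empty_cache()` cannot free tensors that are still alive.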

@sbhavani sbhavani added the bug label Sep 9, 2025
@slyviacassell

Dear all, is there any update on this bug?

@ko3n1g ko3n1g requested review from a team as code owners February 18, 2026 09:18
Member

Phlip79 commented Mar 4, 2026

We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged.

Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as “Ready for review”. Read more about the new process at submit.md.

@Phlip79 Phlip79 marked this pull request as draft March 4, 2026 23:01

Labels

bug Something isn't working community-request

5 participants