Conversation
```python
merged_sub_state_dict = torch.cat([t.cpu() for t in sub_state_dict])
gc.collect()
torch.cuda.empty_cache()
return merged_sub_state_dict
```
This part forces the state_dict to be loaded on the CPU. Is it because, without it, the tensors would occupy additional GPU memory and the OOM error would be reported somewhere else?
Yes. Megatron's checkpoint.py resumes the model first, then the optimizer. We should make the model resume step save memory; otherwise resuming the optimizer will OOM.
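To illustrate the idea, here is a minimal sketch of the pattern in the diff above: move each shard to CPU before concatenating, then release cached GPU memory so the subsequent optimizer resume has headroom. The function name `merge_shards_on_cpu` is a hypothetical wrapper for illustration, not the actual helper in the PR.

```python
import gc

import torch


def merge_shards_on_cpu(sub_state_dict):
    """Hypothetical helper: concatenate sharded tensors on the CPU.

    Moving each shard to CPU before torch.cat keeps the merged copy out
    of GPU memory, which matters because Megatron resumes the model
    before the optimizer and needs GPU headroom for the latter.
    """
    merged = torch.cat([t.cpu() for t in sub_state_dict])
    # Drop dangling references and return cached blocks to the CUDA
    # allocator so the optimizer resume does not OOM.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return merged
```

The `empty_cache()` call only releases memory PyTorch has cached but is not using; the real saving comes from building the merged tensor on the CPU in the first place.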
Dear all, is there any update on this bug?
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as drafts. If you wish to get your PR merged, mark it as "Ready for review". Read more about the new process in submit.md.
#1746