
Conversation

@ShoufaChen

Related to #41.

When pretraining with MDETR_CPU_REDUCE=1, the GPU memory usage before and after torch.load is as follows:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2390563      C   ...da3/envs/mdetr/bin/python     9333MiB |
|    1   N/A  N/A   2390564      C   ...da3/envs/mdetr/bin/python     9031MiB |
|    2   N/A  N/A   2390565      C   ...da3/envs/mdetr/bin/python     8651MiB |
|    3   N/A  N/A   2390566      C   ...da3/envs/mdetr/bin/python     9857MiB |
+-----------------------------------------------------------------------------+

and

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python    13587MiB |
|    0   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    0   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    0   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python    13301MiB |
|    1   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    1   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    2   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python    11397MiB |
|    2   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337080      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337081      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337082      C   ...da3/envs/mdetr/bin/python     1103MiB |
|    3   N/A  N/A   2337083      C   ...da3/envs/mdetr/bin/python    13251MiB |
+-----------------------------------------------------------------------------+

Using map_location=device solves this issue.
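
For reference, a minimal sketch of the fix (the checkpoint path and the LOCAL_RANK handling below are illustrative, not copied from the MDETR code):

```python
import os
import torch

# Each DDP rank should pin itself to its own GPU; LOCAL_RANK is set by the
# torch.distributed launchers.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)

# Without map_location, torch.load restores each tensor to the device it was
# saved from, so a rank can end up opening a CUDA context (roughly the ~1103MiB
# entries above) on GPUs that belong to other ranks:
# checkpoint = torch.load("checkpoint.pth")

# Remapping everything onto this rank's own device (or onto the CPU first)
# keeps the allocation on a single GPU per process:
checkpoint = torch.load("checkpoint.pth", map_location=device)
# or: checkpoint = torch.load("checkpoint.pth", map_location="cpu")
```

Loading to "cpu" and then moving the model with .to(device) is the most conservative option, since it never touches any GPU other than the one the rank explicitly selects.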

@linhuixiao

I ran into the same bug. How was it resolved? Thank you!

@linhuixiao

@ShoufaChen According to the author's reply in #65, this issue may be hard to solve; we can only keep the batch size and the number of GPUs small.
