
[BUG] Memory consumption grows progressively during training, resulting in OOM error #665

@radekosmulski


Bug description

When training with the Dataset on the CPU and (as best as I can tell) everything else held constant, GPU memory consumption doesn't fluctuate at all: it stays exactly the same throughout training, down to a single MB.

If we move the Dataset to the GPU, GPU memory consumption does fluctuate during training. Between consecutive readings from nvidia-smi -l (I prefer this to dmon) it can go a little up or down, but the trend is upward -- over time it steadily increases until training fails with an OOM error.
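For reference, the only intended difference between the two runs is where the Dataset lives. Here is a minimal sketch of the two configurations being compared -- the cpu flag on merlin.io.Dataset and the file path are my shorthand for illustration, not the exact code from the notebooks:

import merlin.io

# Host-memory (pandas-backed) Dataset -- GPU memory stays flat during training
train_cpu = merlin.io.Dataset("train.parquet", cpu=True)

# GPU-memory (cuDF-backed) Dataset -- GPU memory creeps upward during training
train_gpu = merlin.io.Dataset("train.parquet", cpu=False)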

An example of what the readings look like when training with the Dataset on the GPU:

Fri Aug 19 01:38:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0   147W / 160W |  12434MiB / 16160MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   41C    P0   131W / 160W |  12734MiB / 16160MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   42C    P0   183W / 160W |  13334MiB / 16160MiB |     46%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
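To capture the trend without eyeballing nvidia-smi, the same device-level counter can also be logged from inside the training loop. A minimal sketch of such a logger, assuming pynvml is available in the container (this is not part of the challenge notebooks):

import pynvml
import tensorflow as tf

class GpuMemoryLogger(tf.keras.callbacks.Callback):
    """Log device-level memory usage (the same counter nvidia-smi reports) after every epoch."""

    def __init__(self, device_index=0):
        super().__init__()
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def on_epoch_end(self, epoch, logs=None):
        used = pynvml.nvmlDeviceGetMemoryInfo(self.handle).used
        print(f"epoch {epoch}: {used / 2**20:.0f} MiB used on the device")

# usage: model.fit(..., callbacks=[GpuMemoryLogger()])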

Steps/Code to reproduce bug

To reproduce, please download the code for the evalRS challenge. This is the notebook to run. Data will be downloaded automatically for you. That notebook will train with a Dataset on the CPU.

Unfortunately, I cannot attach an .ipynb file to this issue, and the modifications needed to run the training with the Dataset on the GPU occur in several places. Both notebooks are engineered to run on a 16GB GPU, where the memory increase is most visible. Please download the modified notebook from this location.

Expected behavior

Memory consumption doesn't increase over time during training.

Environment details

merlin-tensorflow image, 22.07
