Bug description
When training with the Dataset on the CPU and (as best as I can tell) everything else held constant, GPU memory consumption doesn't fluctuate during training: it stays exactly the same throughout, down to a single MB.
If we move the Dataset to the GPU, GPU memory consumption fluctuates during training. From one reading of nvidia-smi -l (which I prefer to dmon) to the next it can go a little up or down, but the trend is positive: over time it steadily increases until it results in an OOM.
An example of what the readings look like when training with the Dataset on the GPU:
Fri Aug 19 01:38:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 40C P0 147W / 160W | 12434MiB / 16160MiB | 73% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 41C P0 131W / 160W | 12734MiB / 16160MiB | 73% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 42C P0 183W / 160W | 13334MiB / 16160MiB | 46% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
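For reference, this is roughly how I double-check the same numbers from inside the training process instead of from nvidia-smi. It is only a sketch: it assumes pynvml is available in the container and that GPU 0 is the training device; GPUMemoryLogger, model, and train_ds are illustrative names, not parts of the notebooks.

# Sketch only: log NVML's "used" memory for GPU 0 at the end of each epoch.
import pynvml
import tensorflow as tf

pynvml.nvmlInit()
_handle = pynvml.nvmlDeviceGetHandleByIndex(0)

class GPUMemoryLogger(tf.keras.callbacks.Callback):
    """Print the GPU memory reported by NVML after every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        info = pynvml.nvmlDeviceGetMemoryInfo(_handle)
        print(f"epoch {epoch}: {info.used / 1024**2:.0f} MiB used on GPU 0")

# Usage (model and train_ds come from the training notebook):
# model.fit(train_ds, epochs=10, callbacks=[GPUMemoryLogger()])

The readings this produces track the nvidia-smi output above, since both come from NVML.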
Steps/Code to reproduce bug
To reproduce, please download the code for the evalRS challenge; this is the notebook to run, and the data will be downloaded automatically for you. That notebook trains with the Dataset on the CPU.
Unfortunately, I cannot attach an .ipynb file to this issue, and the modifications needed to run the training with the Dataset on the GPU occur in several places, so please download the modified notebook from this location. Both notebooks are engineered to run on a 16 GB GPU, where the memory increase is easiest to see.
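For context, the change is essentially switching the backing memory of the merlin.io.Dataset objects. Below is a minimal sketch, not the actual notebook code: the parquet path is hypothetical, and the real notebooks touch several other cells as well.

# Sketch of the CPU- vs. GPU-backed Dataset setup; "data/train/*.parquet" is a placeholder path.
import merlin.io

# CPU-backed Dataset (original notebook): data stays in host memory.
train_cpu = merlin.io.Dataset("data/train/*.parquet", engine="parquet", cpu=True)

# GPU-backed Dataset (modified notebook): data lives in GPU memory.
# This is the configuration where the steady memory growth shows up.
train_gpu = merlin.io.Dataset("data/train/*.parquet", engine="parquet", cpu=False)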
Expected behavior
GPU memory consumption should not increase over time during training.
Environment details
merlin-tensorflow image, 22.07