
[BUG] Memory consumption grows progressively during training, resulting in OOM error #665

@radekosmulski


Bug description

When training with the Dataset on the CPU and (as best as I can tell) everything else held constant, GPU memory consumption doesn't fluctuate at all: it stays exactly the same throughout training, down to a single MB.

If we move the Dataset to the GPU, GPU memory consumption does fluctuate during training. Between consecutive readings from nvidia-smi -l (I prefer this to dmon) it can go a little up or down, but the trend is upward -- over time it steadily increases until training fails with an OOM error.
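For reference, the only intended difference between the two runs is where the Dataset lives. Here is a minimal sketch of the two configurations being compared -- the cpu flag on merlin.io.Dataset and the file path are my shorthand for illustration, not the exact code from the notebooks:

import merlin.io

# Host-memory (pandas-backed) Dataset -- GPU memory stays flat during training
train_cpu = merlin.io.Dataset("train.parquet", cpu=True)

# GPU-memory (cuDF-backed) Dataset -- GPU memory creeps upward during training
train_gpu = merlin.io.Dataset("train.parquet", cpu=False)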

An example of what the readings look like when training with the Dataset on the GPU:

Fri Aug 19 01:38:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   40C    P0   147W / 160W |  12434MiB / 16160MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:51 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   41C    P0   131W / 160W |  12734MiB / 16160MiB |     73%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Fri Aug 19 01:38:56 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   42C    P0   183W / 160W |  13334MiB / 16160MiB |     46%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
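To capture the trend without eyeballing nvidia-smi, the same device-level counter can also be logged from inside the training loop. A minimal sketch of such a logger, assuming pynvml is available in the container (this is not part of the challenge notebooks):

import pynvml
import tensorflow as tf

class GpuMemoryLogger(tf.keras.callbacks.Callback):
    """Log device-level memory usage (the same counter nvidia-smi reports) after every epoch."""

    def __init__(self, device_index=0):
        super().__init__()
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def on_epoch_end(self, epoch, logs=None):
        used = pynvml.nvmlDeviceGetMemoryInfo(self.handle).used
        print(f"epoch {epoch}: {used / 2**20:.0f} MiB used on the device")

# usage: model.fit(..., callbacks=[GpuMemoryLogger()])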

Steps/Code to reproduce bug

To reproduce, please download the code for the evalRS challenge. This is the notebook to run. Data will be downloaded automatically for you. That notebook will train with a Dataset on the CPU.

Unfortunately, I cannot attach an .ipynb file to this issue, and the modifications needed to run the training with the Dataset on the GPU occur in several places. Both notebooks are engineered to run on a 16GB GPU, where the memory increase is most visible. Please download the modified notebook from this location.

Expected behavior

Memory consumption doesn't increase over time during training.

Environment details

merlin-tensorflow image, 22.07
