Hi all,
I am currently experimenting with your provided code. Your plot showing memory usage for the different batch sizes and max_length values matches our training setup very well. However, when monitoring memory usage, two things stand out:
- Memory does not seem to be freed after training.
- Memory seems to accumulate during validation.
I could not find a solution for 1.
For 2., setting eval_accumulation_steps seems to work, since it transfers the model outputs to the CPU; a sketch of how I set it is below.
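For reference, here is a minimal sketch of how I set it, assuming a standard Trainer setup; the output directory, batch sizes, and the commented-out model/dataset names are just placeholders from my side, not part of your code:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./results",            # placeholder output directory
    per_device_train_batch_size=8,     # example values from my runs
    per_device_eval_batch_size=8,
    eval_accumulation_steps=1,         # move eval predictions to CPU after every step
)

# Plugged into the usual Trainer setup (model and datasets come from our own pipeline):
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=eval_dataset,
# )
# trainer.train()
# trainer.evaluate()
```

With eval_accumulation_steps=1 the outputs are offloaded after every prediction step; a larger value keeps more on the GPU but needs fewer device-to-host copies.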
Do you have any ideas, especially for 1.?
Keep up the great work.
Best wishes,
Frederik