Hello,
First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!
I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process unexpectedly hangs: no new updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps during the first run and around 32k steps in the second. My Conda environment uses Python 3.10, and I ran the experiments on 4 A5000 GPUs.
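For what it's worth, when multi-GPU training with PyTorch hangs silently like this, a common first diagnostic step is to enable NCCL's debug logging and error surfacing before launching, so a stalled collective fails loudly instead of blocking forever. This is a generic sketch using standard PyTorch/NCCL environment variables, not anything specific to this repository:

```python
import os

# Hypothetical diagnostic setup for silent multi-GPU hangs; these are standard
# PyTorch/NCCL environment variables, set before the training processes start.
os.environ["NCCL_DEBUG"] = "INFO"            # verbose NCCL logs to locate the stalled collective
os.environ["NCCL_BLOCKING_WAIT"] = "1"       # make collectives error out on timeout instead of hanging
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # propagate async NCCL errors to the Python process
```

With these set, a deadlocked all-reduce typically surfaces as a timeout exception in the logs rather than an indefinite hang, which makes it much easier to see which rank stalled.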
As a workaround, whenever the training process halts, I currently resume from the latest checkpoint using the resume flag in the training script.
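In case it's useful to others hitting the same hang, the workaround can be automated by picking the newest checkpoint programmatically. This is only a sketch: it assumes checkpoints are named like `checkpoint_<step>.pt`, which may not match this repository's actual naming scheme:

```python
import glob
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint file with the highest step number, or None.

    Assumes files named 'checkpoint_<step>.pt'; adjust the pattern to the
    repository's real checkpoint naming convention.
    """
    def step(path):
        m = re.search(r"checkpoint_(\d+)\.pt$", os.path.basename(path))
        return int(m.group(1)) if m else -1

    candidates = [p for p in glob.glob(os.path.join(ckpt_dir, "checkpoint_*.pt"))
                  if step(p) >= 0]
    return max(candidates, key=step) if candidates else None
```

The returned path can then be passed to the resume flag when relaunching after a hang.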
I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?
Any insights or suggestions you can provide would be greatly appreciated.
Thank you!