Hello,
First of all, thank you very much for this work and your efforts! The repository and guidelines are succinct and pretty effective!
I've encountered a recurring issue while training the large parallel model on the Flickr dataset. The training process unexpectedly hangs: no new updates appear in the terminal or in the wandb logs. This occurred at approximately 2.7k steps during the first run and around 32k steps in the second. My Conda environment uses Python 3.10, and I ran the experiments on 4 A5000 GPUs.
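For what it's worth, when multi-GPU training with PyTorch hangs silently like this, a common first diagnostic step is to enable NCCL's debug logging and error surfacing before launching, so a stalled collective fails loudly instead of blocking forever. This is a generic sketch using standard PyTorch/NCCL environment variables, not anything specific to this repository:

```python
import os

# Hypothetical diagnostic setup for silent multi-GPU hangs; these are standard
# PyTorch/NCCL environment variables, set before the training processes start.
os.environ["NCCL_DEBUG"] = "INFO"            # verbose NCCL logs to locate the stalled collective
os.environ["NCCL_BLOCKING_WAIT"] = "1"       # make collectives error out on timeout instead of hanging
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"  # propagate async NCCL errors to the Python process
```

With these set, a deadlocked all-reduce typically surfaces as a timeout exception in the logs rather than an indefinite hang, which makes it much easier to see which rank stalled.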
As a workaround, whenever the training process halts, I currently resume from the latest checkpoint using the resume flag in the training script.
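In case it's useful to others hitting the same hang, the workaround can be automated by picking the newest checkpoint programmatically. This is only a sketch: it assumes checkpoints are named like `checkpoint_<step>.pt`, which may not match this repository's actual naming scheme:

```python
import glob
import os
import re

def latest_checkpoint(ckpt_dir):
    """Return the checkpoint file with the highest step number, or None.

    Assumes files named 'checkpoint_<step>.pt'; adjust the pattern to the
    repository's real checkpoint naming convention.
    """
    def step(path):
        m = re.search(r"checkpoint_(\d+)\.pt$", os.path.basename(path))
        return int(m.group(1)) if m else -1

    candidates = [p for p in glob.glob(os.path.join(ckpt_dir, "checkpoint_*.pt"))
                  if step(p) >= 0]
    return max(candidates, key=step) if candidates else None
```

The returned path can then be passed to the resume flag when relaunching after a hang.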
I am curious if this is a known issue. Are there components in the code that might cause such behavior, particularly with my setup? Additionally, is resuming training a recommended approach, or are there other flags/settings I should consider?
Any insights or suggestions you can provide would be greatly appreciated.
Thank you!