Open
Description
Hi, I have some questions about training.
When I train the AO model on 2 A100 80G GPUs with Batch_size=8 on the LRS3 and LRS2 datasets together, I get out-of-memory errors, and one epoch takes 5 hours.
When I use only the LRS2 dataset with Batch_size=16 and num_workers=4, I also get an out-of-memory error. In this case, one epoch takes 20 minutes.
When I use only the LRS2 dataset with Batch_size=8 and num_workers=4, one epoch takes 40 minutes, but loss=nan appears after several epochs.
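To narrow down the loss=nan case, it may help to record the loss at every step and find the first step where it stops being finite, so the batch that triggers it can be inspected. Below is a minimal sketch of that check using only the Python standard library. The function name `first_non_finite` and the example values are hypothetical; nothing here comes from this repository's training code or from real logs.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Hypothetical per-step loss values for illustration, not real training logs.
example_losses = [2.31, 1.87, float("nan"), 1.52]
bad_step = first_non_finite(example_losses)
```

In an actual training loop, the same check would be applied to the scalar loss of each step before the optimizer update, which is an assumption about how the loop is structured.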
These problems are preventing me from training normally. Could you please share the details of how you trained this model, including the training time? Thank you very much!!