I have successfully trained StyleGAN2-ADA and StyleGAN3 on my device before. However, distributed training for SOAT fails with a CUDA out-of-memory error. I modified the code slightly (nothing that touches the training code), then submitted the training job to the server with Slurm and confirmed that the model was successfully distributed across the GPUs. Before the first epoch completes, the job aborts.
The information below is my training environment:
CPU: Intel Xeon 6348
GPU: 8x NVIDIA A100 40 GB PCIe
Script: python -m torch.distributed.launch --nproc_per_node=8 train.py --dataset=[My Dataset (grayscale, 1024x1024, converted to RGB when loading; see the sketch below)] --batch=X --size=1024 --iter=40000
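For reference, the grayscale-to-RGB conversion I do when loading images looks roughly like this (a minimal sketch, not my exact dataset class; the file paths and image size are placeholders):

```python
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class GrayscaleAsRGB(Dataset):
    """Loads single-channel images and replicates the channel to get 3-channel RGB."""

    def __init__(self, paths, size=1024):
        self.paths = paths
        self.transform = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),                       # (3, H, W) after convert("RGB")
            transforms.Normalize([0.5] * 3, [0.5] * 3),  # scale to [-1, 1]
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # convert("RGB") copies the single gray channel into R, G, and B.
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img)
```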
BTW, I tried batch sizes of 64, 32, and 16, and all of them abort. When I train SOAT on a single GPU with a batch size of 8, it succeeds.
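In case it helps with diagnosis, I can also log per-rank GPU memory around the training step with something like the following (a sketch; the step counter and device are placeholders, this is not the actual SOAT training loop):

```python
import torch
import torch.distributed as dist

def log_gpu_memory(step, device):
    # Peak memory allocated by tensors on this rank since the last reset.
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"[rank {rank}] step {step}: peak allocated {peak_gib:.2f} GiB")
    torch.cuda.reset_peak_memory_stats(device)
```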
Looking forward to your reply and any possible solution.