
CUDA Out Of Memory in Distributed Training #14

Description

@youliangh
I have successfully trained StyleGAN2-ADA and StyleGAN3 on my device before. However, distributed training for SOAT fails with a CUDA out-of-memory error. I modified the code slightly, without touching any of the training code, then used Slurm to submit my training job to the server and checked that the model had been distributed to the different GPUs successfully. Before the first epoch completes, the job aborts.
My training environment is listed below:
    CPU: Intel Xeon 6348
    GPU: NVIDIA A100 40G PCIe*8
    Script: python -m torch.distributed.launch --nproc_per_node=8 train.py --dataset=[my dataset (grayscale, 1024x1024, converted to RGB when loading)] --batch=X --size=1024 --iter=40000
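
For context, here is a minimal sketch of the per-process setup that python -m torch.distributed.launch expects (train.py in the SOAT repo presumably does something equivalent; this snippet is only illustrative, not the actual training code):

    import argparse
    import torch
    import torch.distributed as dist

    # torch.distributed.launch starts --nproc_per_node processes and passes
    # --local_rank to each one; every process must pin itself to its own GPU
    # before the model is built, otherwise all replicas pile onto GPU 0.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")

    # model = build_model().to(args.local_rank)   # hypothetical model construction
    # model = torch.nn.parallel.DistributedDataParallel(
    #     model, device_ids=[args.local_rank]
    # )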
By the way, I tried setting the batch size to 64, 32, and 16; all of them abort. When I train SOAT on a single GPU with a batch size of 8, it succeeds.
Looking forward to your reply and any possible solutions.
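
For reference, a minimal sketch of how one might log per-GPU memory during training to compare the single-GPU and multi-GPU runs (log_gpu_memory is an illustrative helper I made up, not part of the SOAT code):

    import torch
    import torch.distributed as dist

    def log_gpu_memory(step, device=None):
        # Report current, peak, and reserved allocations on this rank's GPU (GiB).
        device = torch.cuda.current_device() if device is None else device
        allocated = torch.cuda.memory_allocated(device) / 1024 ** 3
        peak = torch.cuda.max_memory_allocated(device) / 1024 ** 3
        reserved = torch.cuda.memory_reserved(device) / 1024 ** 3
        rank = dist.get_rank() if dist.is_initialized() else 0
        print(f"[rank {rank}] step {step}: allocated={allocated:.2f} GiB, "
              f"peak={peak:.2f} GiB, reserved={reserved:.2f} GiB")

    # Hypothetical placement inside the training loop:
    # if step % 50 == 0:
    #     log_gpu_memory(step)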
