Issues with training on a single node #101

@andics

Hi all,

I spent today trying to figure out why the code exits with the error message below when run with 8 GPUs on one node. This is the command I ran:

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py \
    --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads \
    --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth \
    --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 \
    --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir
```

```
TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 1.

Resource usage summary:

CPU time :                                   1294.38 sec.
Max Memory :                                 66465 MB
Average Memory :                             3062.34 MB
Total Requested Memory :                     256000.00 MB
Delta Memory :                               189535.00 MB
Max Swap :                                   -
Max Processes :                              35
Max Threads :                                2482
Run time :                                   482 sec.
Turnaround time :                            526 sec.

The output (if any) is above this job summary.
```

This is the exit output of the LSF cluster I am running on. The thread limit was set very high, so the limit itself was not the problem. The issue seems to be with the DataLoader: the code creates far more threads than necessary (note the Max Threads count of 2482 above), which trips the limit. To fix this, add a `--num_workers 0` argument, as shown below. Hope that helps someone!
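
For reference, this is the same command with the workaround applied (assuming `main.py` accepts a `--num_workers` flag, as the fix above implies; with `--num_workers 0`, each of the 8 GPU processes loads batches in its main process instead of spawning DataLoader worker processes and their threads):

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py \
    --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads \
    --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth \
    --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 \
    --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir \
    --num_workers 0
```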
