Issues with training on a single node #101

@andics

Hi all,

I spent today trying to figure out why the code exits with the error message below when run with 8 GPUs on one node. This is the command I ran:

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py \
    --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads \
    --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth \
    --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 \
    --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir
```

```
TERM_THREADLIMIT: job killed after reaching LSF thread limit.
Exited with exit code 1.

Resource usage summary:

CPU time :                                   1294.38 sec.
Max Memory :                                 66465 MB
Average Memory :                             3062.34 MB
Total Requested Memory :                     256000.00 MB
Delta Memory :                               189535.00 MB
Max Swap :                                   -
Max Processes :                              35
Max Threads :                                2482
Run time :                                   482 sec.
Turnaround time :                            526 sec.

The output (if any) is above this job summary.
```

This is the exit output of the LSF cluster I am running on. The thread limit was set very high, so the limit itself was not the problem. The issue seems to be with the DataLoader: the code creates far more threads than necessary (note the Max Threads count of 2482 above), which trips the limit. To fix this, add a `--num_workers 0` argument, as shown below. Hope that helps someone!
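
For reference, this is the same command with the workaround applied (assuming `main.py` accepts a `--num_workers` flag, as the fix above implies; with `--num_workers 0`, each of the 8 GPU processes loads batches in its main process instead of spawning DataLoader worker processes and their threads):

```
python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=1312 --use_env /home/main.py \
    --dataset_config configs/gqa.json --ema --epochs 10 --do_qa --split_qa_heads \
    --resume https://zenodo.org/record/4721981/files/gqa_resnet101_checkpoint.pth \
    --batch_size 32 --no_aux_loss --no_contrastive_align_loss --qa_loss_coef 25 \
    --lr 1.75e-5 --lr_backbone 3.5e-6 --text_encoder_lr 1.75e-5 --output-dir /home/dir \
    --num_workers 0
```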
