I am using Slurm with two nodes, each with one GPU. However, launching with `mpiexec` does not work: I get `CUDA error: invalid device ordinal`. I have also tried `srun` to make sure resources are allocated to each node correctly, but I get the same error.
```
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    main()
  File "train.py", line 126, in main
    H, logprint = set_up_hyperparams()
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 115, in set_up_hyperparams
    setup_mpi(H)
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 78, in setup_mpi
    torch.cuda.set_device(H.local_rank)
  File "/home/mehranag/stuff/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
```
I am able to run on a single node without MPI, though. Does anyone have any ideas on how to solve this?
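For reference, a minimal sketch of one way I could imagine deriving a per-node local rank that is valid on every node. This is an assumption about the cause, not the repo's actual `setup_mpi` logic: if `H.local_rank` ends up equal to the global MPI rank, then rank 1 would ask for GPU ordinal 1 on a node that only has ordinal 0, which matches the error above.

```python
# Hypothetical sketch, not the repo's code: derive the per-node local rank
# portably with mpi4py, so torch.cuda.set_device() always receives an
# ordinal that actually exists on that node.
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD

# Split the world communicator by shared-memory node: all ranks on the
# same physical node land in the same sub-communicator.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()  # 0 on each one-GPU node, never 1

torch.cuda.set_device(local_rank)
print(f"global rank {comm.Get_rank()} -> local rank {local_rank} "
      f"of {torch.cuda.device_count()} visible GPU(s)")
```

This works the same whether the job is launched with `mpiexec` or `srun`, since it does not rely on launcher-specific environment variables. If the print shows a local rank that exceeds `torch.cuda.device_count() - 1` on some node, that would confirm the mismatch.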