I am using Slurm with two nodes, each with one GPU. However, launching with `mpiexec` does not work: I get `CUDA error: invalid device ordinal`. I have also tried `srun` to make sure resources are allocated to each node correctly, but I get the same error.
```
Traceback (most recent call last):
  File "train.py", line 136, in <module>
    main()
  File "train.py", line 126, in main
    H, logprint = set_up_hyperparams()
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 115, in set_up_hyperparams
    setup_mpi(H)
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 78, in setup_mpi
    torch.cuda.set_device(H.local_rank)
  File "/home/mehranag/stuff/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
```
I am able to run on a single node without MPI, though. Does anyone have any ideas on how to solve this?
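For reference, a minimal sketch of one way I could imagine deriving a per-node local rank that is valid on every node. This is an assumption about the cause, not the repo's actual `setup_mpi` logic: if `H.local_rank` ends up equal to the global MPI rank, then rank 1 would ask for GPU ordinal 1 on a node that only has ordinal 0, which matches the error above.

```python
# Hypothetical sketch, not the repo's code: derive the per-node local rank
# portably with mpi4py, so torch.cuda.set_device() always receives an
# ordinal that actually exists on that node.
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD

# Split the world communicator by shared-memory node: all ranks on the
# same physical node land in the same sub-communicator.
local_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
local_rank = local_comm.Get_rank()  # 0 on each one-GPU node, never 1

torch.cuda.set_device(local_rank)
print(f"global rank {comm.Get_rank()} -> local rank {local_rank} "
      f"of {torch.cuda.device_count()} visible GPU(s)")
```

This works the same whether the job is launched with `mpiexec` or `srun`, since it does not rely on launcher-specific environment variables. If the print shows a local rank that exceeds `torch.cuda.device_count() - 1` on some node, that would confirm the mismatch.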