CUDA error: invalid device ordinal #15

@mehranagh20

Description

I am using SLURM with two nodes, each with one GPU. However, the mpiexec command fails with CUDA error: invalid device ordinal. I have also tried srun to verify that resources are allocated to each node correctly, but I get the same error.

Traceback (most recent call last):
  File "train.py", line 136, in <module>
    main()
  File "train.py", line 126, in main
    H, logprint = set_up_hyperparams()
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 115, in set_up_hyperparams
    setup_mpi(H)
  File "/lustre06/project/6054857/mehranag/orig/train_helpers.py", line 78, in setup_mpi
    torch.cuda.set_device(H.local_rank)
  File "/home/mehranag/stuff/venv/lib/python3.8/site-packages/torch/cuda/__init__.py", line 261, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal

I am able to run on a single node without MPI, though.
Does anyone have an idea how to solve this?
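Not part of the original report, but the traceback points at torch.cuda.set_device(H.local_rank), which suggests a likely cause: if local_rank is derived from the global MPI rank, then on a two-node allocation with one GPU per node the process on the second node gets rank 1 while only device 0 exists on that node, producing exactly this error. A minimal sketch of a fix, mapping the global rank to a per-node device index (function and variable names are illustrative, not the repository's actual API):

```python
import os


def resolve_local_rank(global_rank: int, gpus_per_node: int) -> int:
    """Map a global MPI rank to a per-node GPU index.

    With one GPU per node, every process must use device 0; passing the
    global rank straight to torch.cuda.set_device() raises
    'CUDA error: invalid device ordinal' on any rank >= gpus_per_node.
    """
    return global_rank % gpus_per_node


# Hypothetical usage inside setup_mpi (torch/mpi4py calls assumed):
#   local_rank = resolve_local_rank(MPI.COMM_WORLD.Get_rank(),
#                                   torch.cuda.device_count())
#   torch.cuda.set_device(local_rank)
#
# Open MPI also exports a per-node rank directly, which avoids the
# modulo arithmetic when available:
local_rank_env = os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK")
```

With one GPU per node, resolve_local_rank maps every rank to device 0 (e.g. rank 1 on node 2 becomes device 0), which matches what the single-node run already does implicitly.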
