Multinode launch ERROR on exit #58

@gajagajago

Multinode training runs to completion, but the processes abort on exit with the errors below.
This is a problem because it prevents NSYS profiling from finishing cleanly.

[rank5]:[E1127 20:23:47.240435043 ProcessGroupNCCL.cpp:1060] [PG 26 Rank 5] Exception thrown when waitng for future ProcessGroup abort: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789116784/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9eeed23f86 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9eeecd2d10 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9eeedfff08 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(signed char) + 0x4d (0x7f9eeee001ad in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::string, std::shared_ptr<c10d::NCCLComm>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::string>) + 0x40 (0x7f9ef000ca20 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::abort(std::optional<std::string>) + 0xc0 (0x7f9ef000ce70 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x11bbfae (0x7f9ef000cfae in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x101b5ab (0x7f9eefe6c5ab in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x114df (0x7f9f4823a4df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: <unknown function> + 0x11b3b76 (0x7f9ef0004b76 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xdbbf4 (0x7f9f3fae0bf4 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #11: <unknown function> + 0x8609 (0x7f9f48231609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x43 (0x7f9f47ffc353 in /lib/x86_64-linux-gnu/libc.so.6)

[v01:447879] *** Process received signal ***
[v01:447879] Signal: Aborted (6)
[v01:447879] Signal code:  (-6)
[v01:447879] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f9f4823d420]
[v01:447879] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f9f47f2000b]
[v01:447879] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f9f47eff859]
[v01:447879] [ 3] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb135a)[0x7f9f3fab635a]
[v01:447879] [ 4] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb03b9)[0x7f9f3fab53b9]
[v01:447879] [ 5] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(__gxx_personality_v0+0x87)[0x7f9f3fab5ae7]
[v01:447879] [ 6] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libgcc_s.so.1(+0x111e4)[0x7f9f3f9fc1e4]
[v01:447879] [ 7] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libgcc_s.so.1(_Unwind_Resume+0x12e)[0x7f9f3f9fcc1e]
[v01:447879] [ 8] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xe3cbda)[0x7f9eefc8dbda]
[v01:447879] [ 9] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCLD1Ev+0x291)[0x7f9ef0009d11]
[v01:447879] [10] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCLD0Ev+0x9)[0x7f9ef000a1a9]
[v01:447879] [11] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x5942135)[0x7f9f36e25135]
[v01:447879] [12] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xd6cc38)[0x7f9f3f487c38]
[v01:447879] [13] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xd6cc9c)[0x7f9f3f487c9c]
[v01:447879] [14] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x4ad753)[0x7f9f3ebc8753]
[v01:447879] [15] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x4ae6d1)[0x7f9f3ebc96d1]
[v01:447879] [16] python[0x4d38ef]
[v01:447879] [17] python[0x4d391e]
[v01:447879] [18] python[0x4f9476]
[v01:447879] [19] python[0x5849b4]
[v01:447879] [20] python[0x5a35b6]
[v01:447879] [21] python[0x4ccbe4]
[v01:447879] [22] python(_PyGC_CollectNoFail+0x2b)[0x5a7a6b]
[v01:447879] [23] python(Py_FinalizeEx+0x71)[0x5a6751]
[v01:447879] [24] python(Py_RunMain+0x112)[0x5a21f2]
[v01:447879] [25] python(Py_BytesMain+0x39)[0x57a799]
[v01:447879] [26] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9f47f01083]
[v01:447879] [27] python[0x57a64d]
[v01:447879] *** End of error message ***
[rank4]:[E1127 20:23:47.511746143 ProcessGroupNCCL.cpp:1060] [PG 26 Rank 4] Exception thrown when waitng for future ProcessGroup abort: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
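
The signal trace above passes through `Py_FinalizeEx`, `_PyGC_CollectNoFail` and the `ProcessGroupNCCL` destructor, which suggests the process group is still alive when the interpreter shuts down. One thing that may be worth trying (an assumption, not a confirmed cause or fix) is tearing the process group down explicitly at the end of the training script. The sketch below uses a hypothetical helper name, `shutdown_process_group`, and takes the `torch.distributed` module as an argument so it does not depend on how the script imports it:

```python
def shutdown_process_group(dist):
    """Destroy the default process group if it is still initialized.

    `dist` is expected to be the `torch.distributed` module.
    Returns True if a process group was destroyed, False otherwise.
    This helper is illustrative and not part of this repository.
    """
    if dist.is_available() and dist.is_initialized():
        # Release NCCL communicators now, while the CUDA context is
        # still valid, instead of leaving it to interpreter finalization.
        dist.destroy_process_group()
        return True
    return False


# Hypothetical usage at the very end of the training entry point:
#
#     import torch.distributed as dist
#     try:
#         main()
#     finally:
#         shutdown_process_group(dist)
```

If the abort still occurs after an explicit teardown, the `invalid device ordinal` message may instead point at a mismatch between the device index a rank uses and the GPUs visible to that process, which would need to be checked separately.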
