Open
Description
Multinode training runs to completion, but at exit the processes fail with the errors below.
This is a problem because it prevents NSYS (Nsight Systems) profiling from finishing cleanly.
[rank5]:[E1127 20:23:47.240435043 ProcessGroupNCCL.cpp:1060] [PG 26 Rank 5] Exception thrown when waitng for future ProcessGroup abort: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1724789116784/work/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f9eeed23f86 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f9eeecd2d10 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9eeedfff08 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::cuda::ExchangeDevice(signed char) + 0x4d (0x7f9eeee001ad in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10d::ProcessGroupNCCL::abortCommsFromMap(std::unordered_map<std::string, std::shared_ptr<c10d::NCCLComm>, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::shared_ptr<c10d::NCCLComm> > > >&, std::optional<std::string>) + 0x40 (0x7f9ef000ca20 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::abort(std::optional<std::string>) + 0xc0 (0x7f9ef000ce70 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0x11bbfae (0x7f9ef000cfae in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x101b5ab (0x7f9eefe6c5ab in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x114df (0x7f9f4823a4df in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: <unknown function> + 0x11b3b76 (0x7f9ef0004b76 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xdbbf4 (0x7f9f3fae0bf4 in /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #11: <unknown function> + 0x8609 (0x7f9f48231609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x43 (0x7f9f47ffc353 in /lib/x86_64-linux-gnu/libc.so.6)
[v01:447879] *** Process received signal ***
[v01:447879] Signal: Aborted (6)
[v01:447879] Signal code: (-6)
[v01:447879] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f9f4823d420]
[v01:447879] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f9f47f2000b]
[v01:447879] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f9f47eff859]
[v01:447879] [ 3] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb135a)[0x7f9f3fab635a]
[v01:447879] [ 4] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(+0xb03b9)[0x7f9f3fab53b9]
[v01:447879] [ 5] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6(__gxx_personality_v0+0x87)[0x7f9f3fab5ae7]
[v01:447879] [ 6] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libgcc_s.so.1(+0x111e4)[0x7f9f3f9fc1e4]
[v01:447879] [ 7] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/../../../.././libgcc_s.so.1(_Unwind_Resume+0x12e)[0x7f9f3f9fcc1e]
[v01:447879] [ 8] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(+0xe3cbda)[0x7f9eefc8dbda]
[v01:447879] [ 9] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCLD1Ev+0x291)[0x7f9ef0009d11]
[v01:447879] [10] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so(_ZN4c10d16ProcessGroupNCCLD0Ev+0x9)[0x7f9ef000a1a9]
[v01:447879] [11] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(+0x5942135)[0x7f9f36e25135]
[v01:447879] [12] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xd6cc38)[0x7f9f3f487c38]
[v01:447879] [13] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0xd6cc9c)[0x7f9f3f487c9c]
[v01:447879] [14] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x4ad753)[0x7f9f3ebc8753]
[v01:447879] [15] /home/s6/junyeol/miniconda3/envs/pytorch-2.4-cuda-12.4-python-3.8/lib/python3.8/site-packages/torch/lib/libtorch_python.so(+0x4ae6d1)[0x7f9f3ebc96d1]
[v01:447879] [16] python[0x4d38ef]
[v01:447879] [17] python[0x4d391e]
[v01:447879] [18] python[0x4f9476]
[v01:447879] [19] python[0x5849b4]
[v01:447879] [20] python[0x5a35b6]
[v01:447879] [21] python[0x4ccbe4]
[v01:447879] [22] python(_PyGC_CollectNoFail+0x2b)[0x5a7a6b]
[v01:447879] [23] python(Py_FinalizeEx+0x71)[0x5a6751]
[v01:447879] [24] python(Py_RunMain+0x112)[0x5a21f2]
[v01:447879] [25] python(Py_BytesMain+0x39)[0x57a799]
[v01:447879] [26] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f9f47f01083]
[v01:447879] [27] python[0x57a64d]
[v01:447879] *** End of error message ***
[rank4]:[E1127 20:23:47.511746143 ProcessGroupNCCL.cpp:1060] [PG 26 Rank 4] Exception thrown when waitng for future ProcessGroup abort: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
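
In the trace above, the abort is raised from the `ProcessGroupNCCL` destructor (`_ZN4c10d16ProcessGroupNCCLD1Ev`), which is reached through `Py_FinalizeEx` and `_PyGC_CollectNoFail`, i.e. during interpreter shutdown. One mitigation that may be worth trying is to tear the process group down explicitly before the interpreter exits, so the NCCL communicators are not left for the destructor. The following is a minimal sketch under that assumption; it has not been confirmed to resolve the `invalid device ordinal` error in this setup, and the helper name `destroy_process_group_if_initialized` and the `main` wrapper are hypothetical, not part of the original training script.

```python
def destroy_process_group_if_initialized() -> bool:
    """Explicitly destroy the default process group.

    Returns True if a process group was destroyed, False otherwise.
    (Hypothetical helper, not taken from the original training script.)
    """
    try:
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not installed in this environment; nothing to clean up.
        return False

    if dist.is_available() and dist.is_initialized():
        # Release the communicators while the CUDA context is still alive,
        # instead of leaving it to the destructor at interpreter shutdown.
        dist.destroy_process_group()
        return True
    return False


def main() -> None:
    try:
        pass  # placeholder: the existing training loop would run here
    finally:
        # Runs on normal completion and on exceptions alike.
        destroy_process_group_if_initialized()


if __name__ == "__main__":
    main()
```

Placing the call in a `finally` block keeps the teardown on the main thread and before Python's finalization, which is the point where the trace shows the failure.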