Description
39.1 M Trainable params
0 Non-trainable params
39.1 M Total params
156.533 Total estimated model params size (MB)
249 Modules in train mode
1 Modules in eval mode
Summoning checkpoint.
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 48, in _call_and_handle_interrupt
[rank1]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
[rank1]: return function(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 598, in _fit_impl
[rank1]: self._run(model, ckpt_path=ckpt_path)
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1004, in _run
[rank1]: self._checkpoint_connector.resume_end()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 228, in resume_end
[rank1]: torch.cuda.empty_cache()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/cuda/memory.py", line 222, in empty_cache
[rank1]: torch._C._cuda_emptyCache()
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank1]: During handling of the above exception, another exception occurred:
[rank1]: Traceback (most recent call last):
[rank1]: File "/root/autodl-tmp/emdrm2/EMRDM-main/main.py", line 1036, in
[rank1]: raise err
[rank1]: File "/root/autodl-tmp/emdrm2/EMRDM-main/main.py", line 1011, in
[rank1]: trainer.fit(model, data, ckpt_path=ckpt_resume_path)
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 560, in fit
[rank1]: call._call_and_handle_interrupt(
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/call.py", line 70, in _call_and_handle_interrupt
[rank1]: trainer._teardown()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 1034, in _teardown
[rank1]: self.strategy.teardown()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/strategies/ddp.py", line 422, in teardown
[rank1]: super().teardown()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/strategies/parallel.py", line 134, in teardown
[rank1]: super().teardown()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/pytorch_lightning/strategies/strategy.py", line 536, in teardown
[rank1]: self.lightning_module.cpu()
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/lightning_fabric/utilities/device_dtype_mixin.py", line 82, in cpu
[rank1]: return super().cpu()
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1133, in cpu
[rank1]: return self._apply(lambda t: t.cpu())
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
[rank1]: module._apply(fn)
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 915, in _apply
[rank1]: module._apply(fn)
[rank1]: [Previous line repeated 1 more time]
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 942, in _apply
[rank1]: param_applied = fn(param)
[rank1]: ^^^^^^^^^
[rank1]: File "/root/miniconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1133, in
[rank1]: return self._apply(lambda t: t.cpu())
[rank1]: ^^^^^^^
[rank1]: RuntimeError: CUDA error: an illegal memory access was encountered
[rank1]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank1]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[rank1]:[E927 19:35:40.892955591 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f199bb1e5e8 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f199bab34a2 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f1a0cba5422 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f199c88b456 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7f199c89b6f0 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7f199c89d282 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f199c89ee8d in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f1aaed68bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7f1ab17bdac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7f1ab184ea04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[rank0]:[E927 19:35:40.893303905 ProcessGroupNCCL.cpp:1896] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fda8a11e5e8 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fda8a0b34a2 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fdafb1a5422 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fda8ae8b456 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fda8ae9b6f0 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fda8ae9d282 in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fda8ae9ee8d in /root/miniconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fdb9d33abf4 in /root/miniconda3/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fdb9fd91ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fdb9fe22a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Process finished with exit code 134
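
The error text itself suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so the illegal memory access is reported at the kernel that actually triggered it, rather than surfacing later at torch.cuda.empty_cache() inside resume_end(). Below is a minimal debugging sketch based on that suggestion; it assumes the lines are placed at the very top of main.py, before torch or pytorch_lightning are imported, so the CUDA runtime picks up the variables. The extra NCCL/distributed logging variables are my own optional additions, not taken from the log above.

```python
# Debugging sketch (assumption: placed at the top of main.py, before any
# torch / pytorch_lightning import, so the CUDA runtime sees these settings).
import os

# Make kernel launches synchronous so the illegal-memory-access error is
# raised at the offending call with an accurate stack trace.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Optional extras for reproducing the DDP/NCCL crash with more context.
os.environ["NCCL_DEBUG"] = "WARN"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

import torch  # imported only after the environment variables are set
```

With this in place, rerunning the same resume command (trainer.fit(model, data, ckpt_path=ckpt_resume_path)) should move the RuntimeError from empty_cache()/teardown to the operation that actually corrupts memory, which would make the report easier to pin down.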