
Cuda Graph Capture WA for HIP Runtime #147

Open
sammysun0711 wants to merge 5 commits into zejunchen-zejun:dev/perf from sammysun0711:add_hipgraph_capture_wa

Conversation


@sammysun0711 sammysun0711 commented Jan 9, 2026

Motivation

This PR aims to add TORCH_NCCL_BLOCKING_WAIT=1 as a temporary workaround for a HIP runtime issue during CUDA graph capture when --enable-aiter-allreduce-fusion is enabled.

Capturing batches (bs=2 avail_mem=40.48 GB):  67%|██████████████████████████████               | 4/6 [00:03<00:01,  1.07it/s]

[AITER] /opt/aiter/csrc/include/aiter_hip_common.h:95 fail to call hipModuleLaunchKernel(kernel_func, kargs.gdx, kargs.gdy, kargs.gdz, kargs.bdx, kargs.bdy, kargs.bdz, 0, kargs.stream, nullptr, (void**)&config) ---> [HIP error](operation failed due to a previous error during capture)
[rank0]:[E109 13:42:35.820222821 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted on an event last recorded in a capturing stream
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Exception raised from c10_hip_check_implementation at /opt/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f41c8f196ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x52eb1 (0x7f41f5552eb1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_hip.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7f41f5552b52 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f41f80016de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f41f80118e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f41f8014f3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7f41f80171fb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xdc253 (0x7f41c7218253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f420b094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f420b126850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

For more details, please refer to ROCm/hip#3876. We can remove this workaround once the fix is released with the next ROCm release.

Modification

  • Update the Qwen3-Omni doc with the CUDA graph capture workaround and a new config for pa_persistent_fwd
  • Update the CI config with the CUDA graph capture workaround
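Concretely, the workaround is just an environment variable exported before launching the server. A minimal shell sketch (the server launch command itself is omitted, since the exact flags depend on the deployment; the mechanism description below is a plausible reading of the error trace, not confirmed by this PR):

```shell
# Temporary workaround for the HIP runtime issue (ROCm/hip#3876):
# TORCH_NCCL_BLOCKING_WAIT=1 makes PyTorch's NCCL process group wait on
# collectives with a blocking host-side wait rather than polling GPU events,
# which avoids the watchdog touching events on a stream that is mid
# CUDA-graph capture (the failure seen in the trace above).
export TORCH_NCCL_BLOCKING_WAIT=1

# Sanity check that the variable is visible to child processes.
echo "TORCH_NCCL_BLOCKING_WAIT=${TORCH_NCCL_BLOCKING_WAIT}"
```

The variable must be set in the environment of the process that initializes the NCCL process group, i.e. before the server starts, not after.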

Accuracy Tests

Benchmarking and Profiling

Checklist

