
Cuda Graph Capture WA for HIP Runtime #147

Open
sammysun0711 wants to merge 5 commits into zejunchen-zejun:dev/perf from sammysun0711:add_hipgraph_capture_wa

Conversation


@sammysun0711 sammysun0711 commented Jan 9, 2026

Motivation

This PR aims to add TORCH_NCCL_BLOCKING_WAIT=1 as a temporary workaround for a HIP runtime issue during CUDA graph capture when --enable-aiter-allreduce-fusion is enabled.

Capturing batches (bs=2 avail_mem=40.48 GB):  67%|██████████████████████████████               | 4/6 [00:03<00:01,  1.07it/s]

[AITER] /opt/aiter/csrc/include/aiter_hip_common.h:95 fail to call hipModuleLaunchKernel(kernel_func, kargs.gdx, kargs.gdy, kargs.gdz, kargs.bdx, kargs.bdy, kargs.bdz, 0, kargs.stream, nullptr, (void**)&config) ---> [HIP error](operation failed due to a previous error during capture)
[rank0]:[E109 13:42:35.820222821 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: HIP error: operation not permitted on an event last recorded in a capturing stream
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Exception raised from c10_hip_check_implementation at /opt/pytorch/c10/hip/HIPException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x9c (0x7f41c8f196ac in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x52eb1 (0x7f41f5552eb1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_hip.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x1b2 (0x7f41f5552b52 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_hip.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x5e (0x7f41f80016de in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x90 (0x7f41f80118e0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x9de (0x7f41f8014f3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #6: c10d::ProcessGroupNCCL::Watchdog::run() + 0xeb (0x7f41f80171fb in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_hip.so)
frame #7: <unknown function> + 0xdc253 (0x7f41c7218253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f420b094ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f420b126850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'

For more details, please refer to ROCm/hip#3876. We can remove this workaround once the fix is released with the next ROCm release.

Modification

  • Update the Qwen3-Omni doc with the CUDA graph capture workaround and a new config for pa_persistent_fwd
  • Update the CI config with the CUDA graph capture workaround
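Concretely, the workaround is just an environment variable exported before launching the server. A minimal shell sketch (the server launch command itself is omitted, since the exact flags depend on the deployment; the mechanism description below is a plausible reading of the error trace, not confirmed by this PR):

```shell
# Temporary workaround for the HIP runtime issue (ROCm/hip#3876):
# TORCH_NCCL_BLOCKING_WAIT=1 makes PyTorch's NCCL process group wait on
# collectives with a blocking host-side wait rather than polling GPU events,
# which avoids the watchdog touching events on a stream that is mid
# CUDA-graph capture (the failure seen in the trace above).
export TORCH_NCCL_BLOCKING_WAIT=1

# Sanity check that the variable is visible to child processes.
echo "TORCH_NCCL_BLOCKING_WAIT=${TORCH_NCCL_BLOCKING_WAIT}"
```

The variable must be set in the environment of the process that initializes the NCCL process group, i.e. before the server starts, not after.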

Accuracy Tests

Benchmarking and Profiling

Checklist

