gpu_decode_step error: CUDA_ERROR_ILLEGAL_ADDRESS - Still happening on RTX3090 running 0.1.66-rc1 #20

@sandbak

Description

Referring to the same issue that was already reported: I understood this was expected to be fixed in the latest release candidates. Is that right?

I'm still getting the same error. The model is QCN.

▸ Warmup (prefill + decode, no HCS)
2026-04-09 10:55:16,543 krasis.server INFO ── Warmup (prefill + decode, no HCS) ──
Triggering lazy CUDA allocations (torch.compile, FlashInfer, cuBLAS)
2026-04-09 10:55:16,543 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-04-09 10:55:16,543 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-04-09 10:55:16,543 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM before-prefill-warmup] cuda:0: alloc=6317 MB, reserved=8058 MB, used=9635 MB, free=14940 MB
2026-04-09 10:55:16,544 krasis.server INFO VRAM_SNAP [before-prefill-warmup] cuda:0: alloc=6317 MB, reserved=8058 MB, used=9635 MB, free=14940 MB, total=24575 MB
2026-04-09 10:55:16,544 krasis.server INFO Warming up prefill (50K tokens, GPU kernels + CUDA caches)...
2026-04-09 10:55:17,980 krasis.model INFO DMA pipelining ENABLED (1 managers, 48 groups)
2026-04-09 10:55:18,109 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 830.5 MB in 0.13s (GPU total: 7452.8 MB)
2026-04-09 10:56:44,173 krasis.model INFO server_prefill: 37 tokens in 29.21s (1 tok/s), decode_mode=gpu
[VRAM after-prefill-warmup-before-cleanup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB
2026-04-09 10:56:44,177 krasis.server INFO VRAM_SNAP [after-prefill-warmup-before-cleanup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB, total=24575 MB
2026-04-09 10:56:44,177 krasis.server INFO Prefill warmup: 37 tokens processed
[VRAM after-prefill-warmup-after-cleanup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB
2026-04-09 10:56:44,178 krasis.server INFO VRAM_SNAP [after-prefill-warmup-after-cleanup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB, total=24575 MB
2026-04-09 10:56:44,178 krasis.server INFO Prefill warmup complete (87.6s, 37 tokens)
2026-04-09 10:56:44,178 krasis.server INFO Warming up GPU decode (1 steps)...
[VRAM before-decode-warmup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB
2026-04-09 10:56:44,179 krasis.server INFO VRAM_SNAP [before-decode-warmup] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB, total=24575 MB
2026-04-09 10:56:44,186 krasis.model INFO DMA pipelining ENABLED (1 managers, 48 groups)
2026-04-09 10:56:44,384 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 830.5 MB in 0.20s (GPU total: 7452.8 MB)
2026-04-09 10:56:47,836 krasis.model INFO server_prefill: 9 tokens in 3.65s (2 tok/s), decode_mode=gpu
[VRAM decode-warmup-after-prefill] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB
2026-04-09 10:56:47,844 krasis.server INFO VRAM_SNAP [decode-warmup-after-prefill] cuda:0: alloc=6390 MB, reserved=6874 MB, used=8457 MB, free=16118 MB, total=24575 MB
2026-04-09 10:56:48,109 krasis.server CRITICAL Uncaught exception
Traceback (most recent call last):
File "/home/xxx/.krasis/venv/lib/python3.10/site-packages/krasis/server.py", line 500, in _warmup_decode
gpu_store.gpu_generate_batch(
RuntimeError: gpu_decode_step error: moe_forward[3]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/xxx/.krasis/venv/lib/python3.10/site-packages/krasis/server.py", line 1856, in <module>
main()
File "/home/xxx/.krasis/venv/lib/python3.10/site-packages/krasis/server.py", line 1006, in main
_warmup_decode(_model, num_steps=1)
File "/home/xxx/.krasis/venv/lib/python3.10/site-packages/krasis/server.py", line 521, in _warmup_decode
raise RuntimeError(
RuntimeError: Decode warmup failed: gpu_decode_step error: moe_forward[3]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.

thread '<unnamed>' (3397) panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread '<unnamed>' (3397) panicked at /root/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x7990891ecd63 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
1: 0x799088d6e6fa - core::fmt::write::hed7b5c73d82ecb7c
2: 0x7990891c39b6 - std::io::Write::write_fmt::h6f0185aecf0ed75f
3: 0x7990891ce66a - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
4: 0x7990891ce498 - std::panicking::default_hook::hf0ea8939246f43a9
5: 0x7990891ce95b - std::panicking::panic_with_hook::hb4bd9ac1123582a0
6: 0x7990891ce728 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
7: 0x7990891ca429 - std::sys::backtrace::rust_end_short_backtrace::hb72197fa777c1785
8: 0x7990891b784d - rustc[4425a7e20b4c8619]::rust_begin_unwind
9: 0x799088d7859c - core::panicking::panic_fmt::ha59b517dd231f4da
10: 0x799088d776a2 - core::result::unwrap_failed::hf2d1f30a3ac850fc
11: 0x799088e43cc0 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::hca11ffb819cead4f
12: 0x799088e36f42 - core::ptr::drop_in_place<alloc::vec::Vec<core::option::Option<krasis::gpu_decode::HcsCacheEntry>>>::hcd637a1dfa9f19ec
13: 0x799088e400a1 - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeGraph>::h1eb1c25d8e6659db
14: 0x799088e41365 - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeStore>::h4217f77f8eaa126b
15: 0x799088e109fd - <pyo3::pycell::impl_::PyClassObject as pyo3::pycell::impl_::PyClassObjectLayout>::tp_dealloc::hda2991d3db34f052
16: 0x799088e608f0 - pyo3::impl_::trampoline::trampoline_unraisable::h817dea4b9eaf9783
17: 0x799088e63d10 - pyo3::impl_::pyclass::tp_dealloc::h88df8bb0c5a3eba7
18: 0x56303259c873 - <unknown>
19: 0x5630325d5280 - <unknown>
20: 0x56303258e51f - <unknown>
21: 0x563032640ded - _PyModule_ClearDict
22: 0x5630326b6dda - <unknown>
23: 0x5630326b39e8 - Py_FinalizeEx
24: 0x5630326a6ab3 - Py_RunMain
25: 0x563032680c4d - Py_BytesMain
26: 0x799089a29d90 - <unknown>
27: 0x799089a29e40 - __libc_start_main
28: 0x563032680b45 - _start
29: 0x0 - <unknown>

thread '<unnamed>' (3397) panicked at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/panicking.rs:233:5:
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
Aborted (core dumped)
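The error only surfaces at `route stream sync`, and CUDA kernel launches are asynchronous, so the access that actually faulted may come from any earlier launch, not necessarily the one being synchronized. One way to narrow it down is to re-run with `RUST_BACKTRACE=1` (as the panic note suggests) and `CUDA_LAUNCH_BLOCKING=1` (the standard CUDA switch for synchronous kernel launches). A minimal Python sketch of such a re-run is below. It assumes the server is started as `python -m krasis.server`, which is only inferred from the `runpy` frames in the traceback, and the helper names `debug_env`, `server_cmd` and `run_with_diagnostics` are hypothetical.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of `base` (default: os.environ) with diagnostics enabled."""
    env = dict(os.environ if base is None else base)
    # Suggested by the panic note in the log: print the full Rust backtrace.
    env["RUST_BACKTRACE"] = "1"
    # Standard CUDA variable: launch kernels synchronously so an illegal
    # access is reported at the faulting launch, not at a later stream sync.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def server_cmd(server_args):
    """Build the server command line (hypothetical helper)."""
    # Assumption: the server runs as `python -m krasis.server`, as the runpy
    # frames in the traceback suggest; server_args are passed through as-is.
    return [sys.executable, "-m", "krasis.server", *server_args]


def run_with_diagnostics(server_args):
    """Re-run the server with diagnostics enabled and return its exit code."""
    result = subprocess.run(server_cmd(server_args), env=debug_env())
    # A negative code means the process was killed by a signal; on Linux,
    # -6 is SIGABRT, which matches "Aborted (core dumped)" in the log.
    return result.returncode
```

Call `run_with_diagnostics` with the same arguments normally passed to the server. Synchronous launches make prefill and decode noticeably slower, so this is only meant for reproducing the crash.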

Metadata

Assignees

No one assigned

Labels

No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
