GPU (cuda:0, the device selected in the config): RTX 3090
Driver version: 590.48.01
CUDA toolkit: Cuda compilation tools, release 13.1, V13.1.115
OS: Ubuntu 24.04.4 LTS
Krasis: installed from source
Startup log:
RUST_BACKTRACE=full python3 -m krasis.server --config ./test-config.conf
Loaded config from ./test-config.conf: {'model_path': '/home/genie/.krasis/models/DeepSeek-V2-Lite', 'num_gpus': 1, 'layer_group_size': 2, 'kv_cache_mb': 1000, 'kv_dtype': 'bf16', 'gpu_expert_bits': 4, 'cpu_expert_bits': 4, 'attention_quant': 'bf16', 'shared_expert_quant': 'int8', 'dense_mlp_quant': 'int8', 'lm_head_quant': 'int8', 'krasis_threads': 16, 'host': '0.0.0.0', 'port': 8012, 'gpu_prefill_threshold': 300, 'gguf_path': '', 'vram_safety_margin': 600, 'force_load': False, 'enable_thinking': True}
Archived previous log → logs/krasis_20260324_221112.log
2026-03-24 22:15:06,557 krasis.server INFO Logging to /home/genie/krasis/krasis.log
2026-03-24 22:15:06,557 krasis.server INFO === Config file: ./test-config.conf ===
2026-03-24 22:15:06,557 krasis.server INFO # Krasis saved configuration — 2026-03-24T22:14:44.438002
2026-03-24 22:15:06,557 krasis.server INFO # Re-generated by krasis launcher on each launch
2026-03-24 22:15:06,557 krasis.server INFO MODEL_PATH="/home/genie/.krasis/models/DeepSeek-V2-Lite"
2026-03-24 22:15:06,557 krasis.server INFO CFG_SELECTED_GPUS="0"
2026-03-24 22:15:06,557 krasis.server INFO CFG_PP_PARTITION="27"
2026-03-24 22:15:06,557 krasis.server INFO CFG_LAYER_GROUP_SIZE="2"
2026-03-24 22:15:06,557 krasis.server INFO CFG_KV_CACHE_MB="1000"
2026-03-24 22:15:06,557 krasis.server INFO CFG_KV_DTYPE="bf16"
2026-03-24 22:15:06,557 krasis.server INFO CFG_GPU_EXPERT_BITS="4"
2026-03-24 22:15:06,557 krasis.server INFO CFG_CPU_EXPERT_BITS="4"
2026-03-24 22:15:06,557 krasis.server INFO CFG_ATTENTION_QUANT="bf16"
2026-03-24 22:15:06,557 krasis.server INFO CFG_SHARED_EXPERT_QUANT="int8"
2026-03-24 22:15:06,557 krasis.server INFO CFG_DENSE_MLP_QUANT="int8"
2026-03-24 22:15:06,557 krasis.server INFO CFG_LM_HEAD_QUANT="int8"
2026-03-24 22:15:06,557 krasis.server INFO CFG_KRASIS_THREADS="16"
2026-03-24 22:15:06,557 krasis.server INFO CFG_HOST="0.0.0.0"
2026-03-24 22:15:06,557 krasis.server INFO CFG_PORT="8012"
2026-03-24 22:15:06,557 krasis.server INFO CFG_GPU_PREFILL_THRESHOLD="300"
2026-03-24 22:15:06,557 krasis.server INFO CFG_GGUF_PATH=""
2026-03-24 22:15:06,557 krasis.server INFO CFG_VRAM_SAFETY_MARGIN="600"
2026-03-24 22:15:06,557 krasis.server INFO CFG_FORCE_LOAD=""
2026-03-24 22:15:06,557 krasis.server INFO CFG_ENABLE_THINKING="1"
2026-03-24 22:15:06,557 krasis.server INFO === Resolved arguments ===
2026-03-24 22:15:06,557 krasis.server INFO attention_quant = 'bf16'
2026-03-24 22:15:06,558 krasis.server INFO benchmark = False
2026-03-24 22:15:06,558 krasis.server INFO benchmark_only = False
2026-03-24 22:15:06,558 krasis.server INFO build_cache = False
2026-03-24 22:15:06,558 krasis.server INFO config = None
2026-03-24 22:15:06,558 krasis.server INFO cpu_expert_bits = 4
2026-03-24 22:15:06,558 krasis.server INFO dense_mlp_quant = 'int8'
2026-03-24 22:15:06,558 krasis.server INFO draft_context = 512
2026-03-24 22:15:06,558 krasis.server INFO draft_k = 3
2026-03-24 22:15:06,558 krasis.server INFO draft_model = None
2026-03-24 22:15:06,558 krasis.server INFO enable_thinking = True
2026-03-24 22:15:06,558 krasis.server INFO force_load = False
2026-03-24 22:15:06,558 krasis.server INFO force_rebuild_cache = False
2026-03-24 22:15:06,558 krasis.server INFO gguf_path = ''
2026-03-24 22:15:06,558 krasis.server INFO gpu_decode = True
2026-03-24 22:15:06,558 krasis.server INFO gpu_expert_bits = 4
2026-03-24 22:15:06,558 krasis.server INFO gpu_prefill_threshold = 300
2026-03-24 22:15:06,558 krasis.server INFO hcs = True
2026-03-24 22:15:06,558 krasis.server INFO heatmap_path = None
2026-03-24 22:15:06,558 krasis.server INFO host = '0.0.0.0'
2026-03-24 22:15:06,558 krasis.server INFO krasis_threads = 16
2026-03-24 22:15:06,558 krasis.server INFO kv_cache_mb = 1000
2026-03-24 22:15:06,558 krasis.server INFO kv_dtype = 'bf16'
2026-03-24 22:15:06,558 krasis.server INFO layer_group_size = 2
2026-03-24 22:15:06,558 krasis.server INFO lm_head_quant = 'int8'
2026-03-24 22:15:06,558 krasis.server INFO model_path = '/home/genie/.krasis/models/DeepSeek-V2-Lite'
2026-03-24 22:15:06,558 krasis.server INFO multi_gpu_hcs = False
2026-03-24 22:15:06,558 krasis.server INFO no_stream_attention = False
2026-03-24 22:15:06,558 krasis.server INFO note = None
2026-03-24 22:15:06,558 krasis.server INFO num_gpus = 1
2026-03-24 22:15:06,558 krasis.server INFO perplexity = False
2026-03-24 22:15:06,558 krasis.server INFO port = 8012
2026-03-24 22:15:06,558 krasis.server INFO session_enabled = False
2026-03-24 22:15:06,558 krasis.server INFO shared_expert_quant = 'int8'
2026-03-24 22:15:06,558 krasis.server INFO stream_attention = False
2026-03-24 22:15:06,558 krasis.server INFO stress_test = False
2026-03-24 22:15:06,558 krasis.server INFO temperature = 0.6
2026-03-24 22:15:06,558 krasis.server INFO test_endpoints = False
2026-03-24 22:15:06,558 krasis.server INFO timing = False
2026-03-24 22:15:06,558 krasis.server INFO vram_report = False
2026-03-24 22:15:06,558 krasis.server INFO vram_safety_margin = 600
▸ Krasis — DeepSeek-V2-Lite
2026-03-24 22:15:06,559 krasis.server INFO ── Krasis — DeepSeek-V2-Lite ──
Decode: GPU | HCS: on | GPUs: 1
Experts: GPU INT4 | Attention: bf16 | KV: bf16
Layer groups: 2 | KV cache: 1000 MB | Threads: 16
GPU-only mode: CPU expert weights and CPU decoder skipped
2026-03-24 22:15:06,559 krasis.server INFO HCS strategy: PP=1, 1 GPUs available
2026-03-24 22:15:06,560 krasis.model INFO KrasisModel: 27 layers, PP=[27], 1 GPUs, attn=flashinfer
▸ Loading model weights
2026-03-24 22:15:06,561 krasis.server INFO ── Loading model weights ──
[VRAM before-load] cuda:0: alloc=0 MB, reserved=0 MB, used=273 MB, free=23853 MB, total=24126 MB
2026-03-24 22:15:06,656 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:0: alloc=0 MB, reserved=0 MB, driver_used=273 MB, free=23853 MB, total=24126 MB
[VRAM before-load] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:06,755 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:06,755 krasis.model INFO RAM watchdog started: will exit if < 5.0% free
▸ Loading GPU weights
2026-03-24 22:15:06,756 krasis.model INFO Phase 1: Loading GPU weights (streaming INT8)...
2026-03-24 22:15:06,757 krasis.model INFO Resident attention: all 27 layers permanently on GPU0, 1 GPUs for EP
2026-03-24 22:15:06,757 krasis.model INFO Loading full base model to cuda:0...
2026-03-24 22:15:06,757 krasis.weight_loader INFO Loading embedding: model.embed_tokens.weight
2026-03-24 22:15:06,967 krasis.weight_loader INFO Layer 0 loaded in 0.2s (GPU alloc: 492 MB, moe=False, type=full_attention)
2026-03-24 22:15:06,967 krasis.attention INFO FlashInfer workspace: 128 MB on cuda:0
2026-03-24 22:15:07,002 krasis.weight_loader INFO Layer 1 loaded in 0.0s (GPU alloc: 671 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,039 krasis.weight_loader INFO Layer 2 loaded in 0.0s (GPU alloc: 725 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,072 krasis.weight_loader INFO Layer 3 loaded in 0.0s (GPU alloc: 778 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,105 krasis.weight_loader INFO Layer 4 loaded in 0.0s (GPU alloc: 832 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,138 krasis.weight_loader INFO Layer 5 loaded in 0.0s (GPU alloc: 885 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,172 krasis.weight_loader INFO Layer 6 loaded in 0.0s (GPU alloc: 939 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,206 krasis.weight_loader INFO Layer 7 loaded in 0.0s (GPU alloc: 991 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,239 krasis.weight_loader INFO Layer 8 loaded in 0.0s (GPU alloc: 1045 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,272 krasis.weight_loader INFO Layer 9 loaded in 0.0s (GPU alloc: 1099 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,305 krasis.weight_loader INFO Layer 10 loaded in 0.0s (GPU alloc: 1153 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,339 krasis.weight_loader INFO Layer 11 loaded in 0.0s (GPU alloc: 1206 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,372 krasis.weight_loader INFO Layer 12 loaded in 0.0s (GPU alloc: 1259 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,405 krasis.weight_loader INFO Layer 13 loaded in 0.0s (GPU alloc: 1312 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,439 krasis.weight_loader INFO Layer 14 loaded in 0.0s (GPU alloc: 1366 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,473 krasis.weight_loader INFO Layer 15 loaded in 0.0s (GPU alloc: 1420 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,506 krasis.weight_loader INFO Layer 16 loaded in 0.0s (GPU alloc: 1473 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,539 krasis.weight_loader INFO Layer 17 loaded in 0.0s (GPU alloc: 1527 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,572 krasis.weight_loader INFO Layer 18 loaded in 0.0s (GPU alloc: 1579 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,605 krasis.weight_loader INFO Layer 19 loaded in 0.0s (GPU alloc: 1633 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,638 krasis.weight_loader INFO Layer 20 loaded in 0.0s (GPU alloc: 1687 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,671 krasis.weight_loader INFO Layer 21 loaded in 0.0s (GPU alloc: 1741 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,705 krasis.weight_loader INFO Layer 22 loaded in 0.0s (GPU alloc: 1794 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,738 krasis.weight_loader INFO Layer 23 loaded in 0.0s (GPU alloc: 1847 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,771 krasis.weight_loader INFO Layer 24 loaded in 0.0s (GPU alloc: 1900 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,804 krasis.weight_loader INFO Layer 25 loaded in 0.0s (GPU alloc: 1954 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,837 krasis.weight_loader INFO Layer 26 loaded in 0.0s (GPU alloc: 2008 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,840 krasis.weight_loader INFO Loading final norm: model.norm.weight
2026-03-24 22:15:07,840 krasis.weight_loader INFO Loading LM head: lm_head.weight (precision=int8)
2026-03-24 22:15:08,248 krasis.model INFO VRAM[after-layers-loaded] GPU0: free=21535 MB, alloc=2217 MB, reserved=2276 MB, non-pytorch=315 MB
2026-03-24 22:15:08,249 krasis.model INFO GPU0: 2218 MB allocated
2026-03-24 22:15:08,249 krasis.model INFO GPU weights loaded in 1.7s
[VRAM after-phase1-gpu-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2591 MB, free=21535 MB, total=24126 MB
2026-03-24 22:15:08,249 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
[VRAM after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:08,249 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:08,250 krasis.model INFO Attention resident on GPU: 1144 MB, GPU free: 21535 MB
Attention resident on GPU (1144 MB), 21535 MB free
▸ Loading GPU expert weights from cache
2026-03-24 22:15:08,250 krasis.model INFO Phase 2: Loading GPU expert weights (INT4)...
Loading GPU Marlin cache: 26 layers...
GPU Marlin cache loaded: 7.7 GB in 2s
2026-03-24 22:15:10,434 krasis.model INFO Krasis engine: 26 MoE layers, 64 experts, hidden=2048
2026-03-24 22:15:10,438 krasis.model INFO Routing weights sent to Rust engine (26 MoE layers)
2026-03-24 22:15:10,438 krasis.model INFO Expert weights loaded in 2.2s
Expert weights loaded in 2s.
[VRAM after-phase2-expert-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2591 MB, free=21535 MB, total=24126 MB
2026-03-24 22:15:10,438 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
[VRAM after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:10,438 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
▸ Initializing GPU prefill managers
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO GpuPrefillManager(rank 0/1): expert_slice=[0, 64), local_count=64
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO GPU prefill group_size=128
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO Auto chunk_size: 64 experts (4.5 MB each, 8740.7 MB budget of 22581.3 MB free, 291.8 MB reserved for intermediates)
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO Layer-grouped mode: 2 layers/group, 13 groups, ~570.9 MB per group, 26 total MoE layers
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO GpuPrefillManager(engine): experts=64, hidden=2048, intermediate=1408, chunk_size=64, num_chunks=1, shared=2, scale=1.000, num_bits=4, prefill_mode=layer_grouped, layer_group_size=2
2026-03-24 22:15:10,439 krasis.model INFO GPU prefill manager created for cuda:0 (rank 0/1)
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-03-24 22:15:10,439 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO Pre-allocating double-buffered pinned DMA buffers: 2x 285.5 MB (w13p=184.5, w13s=5.8, w2p=92.3, w2s=2.9)
2026-03-24 22:15:10,592 krasis.gpu_prefill INFO Double-buffered pinned DMA allocated: 2x 285.5 MB in 0.2s
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 10/26 layers (2.9 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 20/26 layers (5.7 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 26/26 layers (7.4 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views built: 26 layers, 7.7 GB (zero-copy, no extra RAM), 0.000s
2026-03-24 22:15:10,598 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM after-phase3-prefill-managers] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2613 MB, free=21513 MB, total=24126 MB
2026-03-24 22:15:10,599 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2613 MB, free=21513 MB, total=24126 MB
[VRAM after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,599 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,599 krasis.kv_cache INFO KV cache: 1000 MB → 2106 pages (33.7K tokens)
2026-03-24 22:15:10,601 krasis.kv_cache INFO KV cache allocated: 27 layers × 2106 pages × 16 tokens = 1000 MB (mla, mla-split)
[VRAM after-kv-cache-init] cuda:0: alloc=3218 MB, reserved=3278 MB, used=3615 MB, free=20511 MB, total=24126 MB
2026-03-24 22:15:10,602 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
[VRAM after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,602 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,724 krasis.tokenizer INFO Tokenizer loaded: vocab=100000, eos=100001, bos=100000
[VRAM after-full-load] cuda:0: alloc=3218 MB, reserved=3278 MB, used=3615 MB, free=20511 MB, total=24126 MB
2026-03-24 22:15:10,725 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
[VRAM after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,725 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,725 krasis.model INFO Model fully loaded in 4.2s
▸ CUDA runtime warmup
2026-03-24 22:15:10,725 krasis.server INFO ── CUDA runtime warmup ──
2026-03-24 22:15:10,726 krasis.model INFO Warming up CUDA runtime on all devices: ['cuda:0']
2026-03-24 22:15:12,391 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2026-03-24 22:15:12,391 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
2026-03-24 22:15:12,429 krasis.linear_attention INFO Compiled linear attention chunk step (default/Inductor mode)
2026-03-24 22:15:12,935 krasis.linear_attention INFO Linear attention torch.compile warmup complete on cuda:0
2026-03-24 22:15:13,076 krasis.model INFO CUDA runtime warmup on cuda:0 complete: 31 MB consumed (21508 MB free before → 21476 MB free after)
cuBLAS + Triton kernel compilation done
▸ Setting up GPU decode store
2026-03-24 22:15:13,076 krasis.server INFO ── Setting up GPU decode store ──
2026-03-24 22:15:13,980 krasis.model INFO MLA-only KV cache: max_seq=33696 (2106 pages × 16)
2026-03-24 22:15:13,980 krasis.model INFO GPU decode store configured: 27 layers, store_addr=1350798976
GPU decode store ready (addr=0x50838e80)
VRAM monitor started (tracking warmup)
cuda:0: 0 MB total
▸ Warmup (prefill + decode, no HCS)
2026-03-24 22:15:13,981 krasis.server INFO ── Warmup (prefill + decode, no HCS) ──
Triggering lazy CUDA allocations (torch.compile, FlashInfer, cuBLAS)
2026-03-24 22:15:13,981 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-03-24 22:15:13,981 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-03-24 22:15:13,981 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM before-prefill-warmup] cuda:0: alloc=3768 MB, reserved=3956 MB, used=4623 MB, free=19503 MB
2026-03-24 22:15:13,981 krasis.server INFO VRAM_SNAP [before-prefill-warmup] cuda:0: alloc=3768 MB, reserved=3956 MB, used=4623 MB, free=19503 MB, total=24126 MB
[VRAM before-prefill-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:13,983 krasis.server INFO VRAM_SNAP [before-prefill-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:13,983 krasis.server INFO Warming up prefill (50K tokens, GPU kernels + CUDA caches)...
2026-03-24 22:15:13,987 krasis.model INFO DMA pipelining ENABLED (1 managers, 26 groups)
2026-03-24 22:15:14,013 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 294.4 MB in 0.03s (GPU total: 4246.6 MB)
2026-03-24 22:15:14,923 krasis.model INFO server_prefill: 37 tokens in 0.93s (40 tok/s), decode_mode=gpu
[VRAM after-prefill-warmup-before-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-before-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM after-prefill-warmup-before-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-before-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,923 krasis.server INFO Prefill warmup: 37 tokens processed
[VRAM after-prefill-warmup-after-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-after-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM after-prefill-warmup-after-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [after-prefill-warmup-after-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,924 krasis.server INFO Prefill warmup complete (0.9s, 37 tokens)
2026-03-24 22:15:14,924 krasis.server INFO Warming up GPU decode (1 steps)...
[VRAM before-decode-warmup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [before-decode-warmup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM before-decode-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [before-decode-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,924 krasis.model INFO DMA pipelining ENABLED (1 managers, 26 groups)
2026-03-24 22:15:14,949 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 294.4 MB in 0.02s (GPU total: 4246.6 MB)
2026-03-24 22:15:15,720 krasis.model INFO server_prefill: 8 tokens in 0.80s (10 tok/s), decode_mode=gpu
[VRAM decode-warmup-after-prefill] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:15,720 krasis.server INFO VRAM_SNAP [decode-warmup-after-prefill] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM decode-warmup-after-prefill] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:15,720 krasis.server INFO VRAM_SNAP [decode-warmup-after-prefill] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:15,822 krasis.server CRITICAL Uncaught exception
Traceback (most recent call last):
  File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
    gpu_store.gpu_generate_batch(
RuntimeError: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/genie/krasis/python/krasis/server.py", line 2057, in <module>
    main()
  File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
    _warmup_decode(_model, num_steps=1)
  File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
    raise RuntimeError(
RuntimeError: Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] Traceback (most recent call last):
Traceback (most recent call last):
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] gpu_store.gpu_generate_batch(
gpu_store.gpu_generate_batch(
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] RuntimeError: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
RuntimeError: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] The above exception was the direct cause of the following exception:
The above exception was the direct cause of the following exception:
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] Traceback (most recent call last):
Traceback (most recent call last):
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 198, in _run_module_as_main
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "<frozen runpy>", line 88, in _run_code
File "<frozen runpy>", line 88, in _run_code
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 2057, in <module>
File "/home/genie/krasis/python/krasis/server.py", line 2057, in <module>
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] main()
main()
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] _warmup_decode(_model, num_steps=1)
_warmup_decode(_model, num_steps=1)
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] raise RuntimeError(
raise RuntimeError(
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] RuntimeError: Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
RuntimeError: Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
thread '<unnamed>' (28992) panicked at /home/genie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x7117f42552b3 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
1: 0x7117f3dbac0a - core::fmt::write::hed7b5c73d82ecb7c
2: 0x7117f422bf06 - std::io::Write::write_fmt::h6f0185aecf0ed75f
3: 0x7117f4236bba - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
4: 0x7117f42369e8 - std::panicking::default_hook::hf0ea8939246f43a9
5: 0x7117f4236eab - std::panicking::panic_with_hook::hb4bd9ac1123582a0
6: 0x7117f4236c78 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
7: 0x7117f4232979 - std::sys::backtrace::rust_end_short_backtrace::hb72197fa777c1785
8: 0x7117f421fd9d - rustc[4425a7e20b4c8619]::rust_begin_unwind
9: 0x7117f3dc4aac - core::panicking::panic_fmt::ha59b517dd231f4da
10: 0x7117f3dc3bb2 - core::result::unwrap_failed::hf2d1f30a3ac850fc
11: 0x7117f3e92110 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h9b4be98944a295e5
12: 0x7117f3e84cce - core::ptr::drop_in_place<alloc::vec::Vec<core::option::Option<krasis::gpu_decode::HcsCacheEntry>>>::h546e5990b6df0384
13: 0x7117f3e8e351 - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeGraph>::hc951e853e0c7ec37
14: 0x7117f3e8f6ba - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeStore>::h059c6d844c3ab860
15: 0x7117f3e5d6dd - <pyo3::pycell::impl_::PyClassObject<T> as pyo3::pycell::impl_::PyClassObjectLayout<T>>::tp_dealloc::hc3347620756b38a1
16: 0x7117f3eb212d - pyo3::impl_::trampoline::trampoline_unraisable::h810a1319a7141020
17: 0x7117f3eb5510 - pyo3::impl_::pyclass::tp_dealloc::h58cce6a81cb2a4df
18: 0x575eae - <unknown>
19: 0x575bfc - <unknown>
20: 0x59efd5 - <unknown>
21: 0x573376 - <unknown>
22: 0x583404 - _PyModule_Clear
23: 0x6b1999 - <unknown>
24: 0x6b0e1d - Py_FinalizeEx
25: 0x6bc7d1 - Py_RunMain
26: 0x6bc3ed - Py_BytesMain
27: 0x7117f502a1ca - __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
28: 0x7117f502a28b - __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
29: 0x6576c5 - _start
30: 0x0 - <unknown>
thread '<unnamed>' (28992) panicked at /home/genie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called `Result::unwrap()` on an `Err` value: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")
stack backtrace:
0: 0x7117f42552b3 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
1: 0x7117f3dbac0a - core::fmt::write::hed7b5c73d82ecb7c
2: 0x7117f422bf06 - std::io::Write::write_fmt::h6f0185aecf0ed75f
3: 0x7117f4236bba - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
4: 0x7117f42369e8 - std::panicking::default_hook::hf0ea8939246f43a9
5: 0x7117f4236eab - std::panicking::panic_with_hook::hb4bd9ac1123582a0
6: 0x7117f4236c78 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
7: 0x7117f4232979 - std::sys::backtrace::rust_end_short_backtrace::hb72197fa777c1785
8: 0x7117f421fd9d - rustc[4425a7e20b4c8619]::rust_begin_unwind
9: 0x7117f3dc4aac - core::panicking::panic_fmt::ha59b517dd231f4da
10: 0x7117f3dc3bb2 - core::result::unwrap_failed::hf2d1f30a3ac850fc
11: 0x7117f3e92110 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h9b4be98944a295e5
12: 0x7117f3e84d52 - core::ptr::drop_in_place<alloc::vec::Vec<core::option::Option<krasis::gpu_decode::HcsCacheEntry>>>::h546e5990b6df0384
13: 0x7117f3e8e351 - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeGraph>::hc951e853e0c7ec37
14: 0x7117f3e8f6ba - core::ptr::drop_in_place<krasis::gpu_decode::GpuDecodeStore>::h059c6d844c3ab860
15: 0x7117f3e5d6dd - <pyo3::pycell::impl_::PyClassObject<T> as pyo3::pycell::impl_::PyClassObjectLayout<T>>::tp_dealloc::hc3347620756b38a1
16: 0x7117f3eb212d - pyo3::impl_::trampoline::trampoline_unraisable::h810a1319a7141020
17: 0x7117f3eb5510 - pyo3::impl_::pyclass::tp_dealloc::h58cce6a81cb2a4df
18: 0x575eae - <unknown>
19: 0x575bfc - <unknown>
20: 0x59efd5 - <unknown>
21: 0x573376 - <unknown>
22: 0x583404 - _PyModule_Clear
23: 0x6b1999 - <unknown>
24: 0x6b0e1d - Py_FinalizeEx
25: 0x6bc7d1 - Py_RunMain
26: 0x6bc3ed - Py_BytesMain
27: 0x7117f502a1ca - __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
28: 0x7117f502a28b - __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
29: 0x6576c5 - _start
30: 0x0 - <unknown>
thread '<unnamed>' (28992) panicked at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/panicking.rs:233:5:
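
For anyone triaging this, the VRAM_CHECKPOINT lines in the log above can be reduced to a per-stage delta table to see where memory on a device actually moves before the crash. The following is a minimal sketch, not part of Krasis: the helper name `vram_deltas` and the regex are my own, and they assume the checkpoint line format stays exactly as printed in this log.

```python
import re

# Hypothetical helper (not a Krasis API). The pattern mirrors the
# VRAM_CHECKPOINT lines exactly as they appear in the log above.
CHECKPOINT_RE = re.compile(
    r"VRAM_CHECKPOINT \[(?P<stage>[^\]]+)\] (?P<device>cuda:\d+): "
    r"alloc=(?P<alloc>\d+) MB, reserved=(?P<reserved>\d+) MB, "
    r"driver_used=(?P<used>\d+) MB, free=(?P<free>\d+) MB"
)


def vram_deltas(log_text, device="cuda:0"):
    """Return (stage, driver_used_mb, delta_mb) per checkpoint on `device`."""
    rows = []
    previous = None
    for match in CHECKPOINT_RE.finditer(log_text):
        if match["device"] != device:
            continue
        used = int(match["used"])
        # The first checkpoint has no predecessor, so its delta is 0.
        delta = 0 if previous is None else used - previous
        rows.append((match["stage"], used, delta))
        previous = used
    return rows


# Checkpoint lines copied from the log above (timestamps trimmed).
SAMPLE = """
krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:0: alloc=0 MB, reserved=0 MB, driver_used=273 MB, free=23853 MB, total=24126 MB
krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2613 MB, free=21513 MB, total=24126 MB
krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
"""

if __name__ == "__main__":
    for stage, used, delta in vram_deltas(SAMPLE):
        print(f"{stage:32s} used={used:5d} MB  delta={delta:+d} MB")
```

On the checkpoints in this log, driver-reported usage on cuda:0 grows by 2318 MB during phase 1, is unchanged across phase 2, grows by 22 MB for the prefill managers, and by 1002 MB for the KV cache. To run it against the full log, pass the contents of krasis.log to `vram_deltas` instead of `SAMPLE`.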
2026-03-24 22:15:06,558 krasis.server INFO draft_model = None
2026-03-24 22:15:06,558 krasis.server INFO enable_thinking = True
2026-03-24 22:15:06,558 krasis.server INFO force_load = False
2026-03-24 22:15:06,558 krasis.server INFO force_rebuild_cache = False
2026-03-24 22:15:06,558 krasis.server INFO gguf_path = ''
2026-03-24 22:15:06,558 krasis.server INFO gpu_decode = True
2026-03-24 22:15:06,558 krasis.server INFO gpu_expert_bits = 4
2026-03-24 22:15:06,558 krasis.server INFO gpu_prefill_threshold = 300
2026-03-24 22:15:06,558 krasis.server INFO hcs = True
2026-03-24 22:15:06,558 krasis.server INFO heatmap_path = None
2026-03-24 22:15:06,558 krasis.server INFO host = '0.0.0.0'
2026-03-24 22:15:06,558 krasis.server INFO krasis_threads = 16
2026-03-24 22:15:06,558 krasis.server INFO kv_cache_mb = 1000
2026-03-24 22:15:06,558 krasis.server INFO kv_dtype = 'bf16'
2026-03-24 22:15:06,558 krasis.server INFO layer_group_size = 2
2026-03-24 22:15:06,558 krasis.server INFO lm_head_quant = 'int8'
2026-03-24 22:15:06,558 krasis.server INFO model_path = '/home/genie/.krasis/models/DeepSeek-V2-Lite'
2026-03-24 22:15:06,558 krasis.server INFO multi_gpu_hcs = False
2026-03-24 22:15:06,558 krasis.server INFO no_stream_attention = False
2026-03-24 22:15:06,558 krasis.server INFO note = None
2026-03-24 22:15:06,558 krasis.server INFO num_gpus = 1
2026-03-24 22:15:06,558 krasis.server INFO perplexity = False
2026-03-24 22:15:06,558 krasis.server INFO port = 8012
2026-03-24 22:15:06,558 krasis.server INFO session_enabled = False
2026-03-24 22:15:06,558 krasis.server INFO shared_expert_quant = 'int8'
2026-03-24 22:15:06,558 krasis.server INFO stream_attention = False
2026-03-24 22:15:06,558 krasis.server INFO stress_test = False
2026-03-24 22:15:06,558 krasis.server INFO temperature = 0.6
2026-03-24 22:15:06,558 krasis.server INFO test_endpoints = False
2026-03-24 22:15:06,558 krasis.server INFO timing = False
2026-03-24 22:15:06,558 krasis.server INFO vram_report = False
2026-03-24 22:15:06,558 krasis.server INFO vram_safety_margin = 600
▸ Krasis — DeepSeek-V2-Lite
2026-03-24 22:15:06,559 krasis.server INFO ── Krasis — DeepSeek-V2-Lite ──
Decode: GPU | HCS: on | GPUs: 1
Experts: GPU INT4 | Attention: bf16 | KV: bf16
Layer groups: 2 | KV cache: 1000 MB | Threads: 16
GPU-only mode: CPU expert weights and CPU decoder skipped
2026-03-24 22:15:06,559 krasis.server INFO HCS strategy: PP=1, 1 GPUs available
2026-03-24 22:15:06,560 krasis.model INFO KrasisModel: 27 layers, PP=[27], 1 GPUs, attn=flashinfer
▸ Loading model weights
2026-03-24 22:15:06,561 krasis.server INFO ── Loading model weights ──
[VRAM before-load] cuda:0: alloc=0 MB, reserved=0 MB, used=273 MB, free=23853 MB, total=24126 MB
2026-03-24 22:15:06,656 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:0: alloc=0 MB, reserved=0 MB, driver_used=273 MB, free=23853 MB, total=24126 MB
[VRAM before-load] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:06,755 krasis.model INFO VRAM_CHECKPOINT [before-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:06,755 krasis.model INFO RAM watchdog started: will exit if < 5.0% free
▸ Loading GPU weights
2026-03-24 22:15:06,756 krasis.model INFO Phase 1: Loading GPU weights (streaming INT8)...
2026-03-24 22:15:06,757 krasis.model INFO Resident attention: all 27 layers permanently on GPU0, 1 GPUs for EP
2026-03-24 22:15:06,757 krasis.model INFO Loading full base model to cuda:0...
2026-03-24 22:15:06,757 krasis.weight_loader INFO Loading embedding: model.embed_tokens.weight
2026-03-24 22:15:06,967 krasis.weight_loader INFO Layer 0 loaded in 0.2s (GPU alloc: 492 MB, moe=False, type=full_attention)
2026-03-24 22:15:06,967 krasis.attention INFO FlashInfer workspace: 128 MB on cuda:0
2026-03-24 22:15:07,002 krasis.weight_loader INFO Layer 1 loaded in 0.0s (GPU alloc: 671 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,039 krasis.weight_loader INFO Layer 2 loaded in 0.0s (GPU alloc: 725 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,072 krasis.weight_loader INFO Layer 3 loaded in 0.0s (GPU alloc: 778 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,105 krasis.weight_loader INFO Layer 4 loaded in 0.0s (GPU alloc: 832 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,138 krasis.weight_loader INFO Layer 5 loaded in 0.0s (GPU alloc: 885 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,172 krasis.weight_loader INFO Layer 6 loaded in 0.0s (GPU alloc: 939 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,206 krasis.weight_loader INFO Layer 7 loaded in 0.0s (GPU alloc: 991 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,239 krasis.weight_loader INFO Layer 8 loaded in 0.0s (GPU alloc: 1045 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,272 krasis.weight_loader INFO Layer 9 loaded in 0.0s (GPU alloc: 1099 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,305 krasis.weight_loader INFO Layer 10 loaded in 0.0s (GPU alloc: 1153 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,339 krasis.weight_loader INFO Layer 11 loaded in 0.0s (GPU alloc: 1206 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,372 krasis.weight_loader INFO Layer 12 loaded in 0.0s (GPU alloc: 1259 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,405 krasis.weight_loader INFO Layer 13 loaded in 0.0s (GPU alloc: 1312 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,439 krasis.weight_loader INFO Layer 14 loaded in 0.0s (GPU alloc: 1366 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,473 krasis.weight_loader INFO Layer 15 loaded in 0.0s (GPU alloc: 1420 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,506 krasis.weight_loader INFO Layer 16 loaded in 0.0s (GPU alloc: 1473 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,539 krasis.weight_loader INFO Layer 17 loaded in 0.0s (GPU alloc: 1527 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,572 krasis.weight_loader INFO Layer 18 loaded in 0.0s (GPU alloc: 1579 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,605 krasis.weight_loader INFO Layer 19 loaded in 0.0s (GPU alloc: 1633 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,638 krasis.weight_loader INFO Layer 20 loaded in 0.0s (GPU alloc: 1687 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,671 krasis.weight_loader INFO Layer 21 loaded in 0.0s (GPU alloc: 1741 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,705 krasis.weight_loader INFO Layer 22 loaded in 0.0s (GPU alloc: 1794 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,738 krasis.weight_loader INFO Layer 23 loaded in 0.0s (GPU alloc: 1847 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,771 krasis.weight_loader INFO Layer 24 loaded in 0.0s (GPU alloc: 1900 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,804 krasis.weight_loader INFO Layer 25 loaded in 0.0s (GPU alloc: 1954 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,837 krasis.weight_loader INFO Layer 26 loaded in 0.0s (GPU alloc: 2008 MB, moe=True, type=full_attention)
2026-03-24 22:15:07,840 krasis.weight_loader INFO Loading final norm: model.norm.weight
2026-03-24 22:15:07,840 krasis.weight_loader INFO Loading LM head: lm_head.weight (precision=int8)
2026-03-24 22:15:08,248 krasis.model INFO VRAM[after-layers-loaded] GPU0: free=21535 MB, alloc=2217 MB, reserved=2276 MB, non-pytorch=315 MB
2026-03-24 22:15:08,249 krasis.model INFO GPU0: 2218 MB allocated
2026-03-24 22:15:08,249 krasis.model INFO GPU weights loaded in 1.7s
[VRAM after-phase1-gpu-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2591 MB, free=21535 MB, total=24126 MB
2026-03-24 22:15:08,249 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
[VRAM after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:08,249 krasis.model INFO VRAM_CHECKPOINT [after-phase1-gpu-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:08,250 krasis.model INFO Attention resident on GPU: 1144 MB, GPU free: 21535 MB
Attention resident on GPU (1144 MB), 21535 MB free
▸ Loading GPU expert weights from cache
2026-03-24 22:15:08,250 krasis.model INFO Phase 2: Loading GPU expert weights (INT4)...
Loading GPU Marlin cache: 26 layers...
GPU Marlin cache loaded: 7.7 GB in 2s
2026-03-24 22:15:10,434 krasis.model INFO Krasis engine: 26 MoE layers, 64 experts, hidden=2048
2026-03-24 22:15:10,438 krasis.model INFO Routing weights sent to Rust engine (26 MoE layers)
2026-03-24 22:15:10,438 krasis.model INFO Expert weights loaded in 2.2s
Expert weights loaded in 2s.
[VRAM after-phase2-expert-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2591 MB, free=21535 MB, total=24126 MB
2026-03-24 22:15:10,438 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2591 MB, free=21535 MB, total=24126 MB
[VRAM after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, used=588 MB, free=15365 MB, total=15953 MB
2026-03-24 22:15:10,438 krasis.model INFO VRAM_CHECKPOINT [after-phase2-expert-weights] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=588 MB, free=15365 MB, total=15953 MB
▸ Initializing GPU prefill managers
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO GpuPrefillManager(rank 0/1): expert_slice=[0, 64), local_count=64
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO GPU prefill group_size=128
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO Auto chunk_size: 64 experts (4.5 MB each, 8740.7 MB budget of 22581.3 MB free, 291.8 MB reserved for intermediates)
2026-03-24 22:15:10,438 krasis.gpu_prefill INFO Layer-grouped mode: 2 layers/group, 13 groups, ~570.9 MB per group, 26 total MoE layers
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO GpuPrefillManager(engine): experts=64, hidden=2048, intermediate=1408, chunk_size=64, num_chunks=1, shared=2, scale=1.000, num_bits=4, prefill_mode=layer_grouped, layer_group_size=2
2026-03-24 22:15:10,439 krasis.model INFO GPU prefill manager created for cuda:0 (rank 0/1)
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-03-24 22:15:10,439 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-03-24 22:15:10,439 krasis.gpu_prefill INFO Pre-allocating double-buffered pinned DMA buffers: 2x 285.5 MB (w13p=184.5, w13s=5.8, w2p=92.3, w2s=2.9)
2026-03-24 22:15:10,592 krasis.gpu_prefill INFO Double-buffered pinned DMA allocated: 2x 285.5 MB in 0.2s
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 10/26 layers (2.9 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 20/26 layers (5.7 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views: 26/26 layers (7.4 GB)
2026-03-24 22:15:10,598 krasis.gpu_prefill INFO Prefill direct views built: 26 layers, 7.7 GB (zero-copy, no extra RAM), 0.000s
2026-03-24 22:15:10,598 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM after-phase3-prefill-managers] cuda:0: alloc=2217 MB, reserved=2276 MB, used=2613 MB, free=21513 MB, total=24126 MB
2026-03-24 22:15:10,599 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:0: alloc=2217 MB, reserved=2276 MB, driver_used=2613 MB, free=21513 MB, total=24126 MB
[VRAM after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,599 krasis.model INFO VRAM_CHECKPOINT [after-phase3-prefill-managers] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,599 krasis.kv_cache INFO KV cache: 1000 MB → 2106 pages (33.7K tokens)
2026-03-24 22:15:10,601 krasis.kv_cache INFO KV cache allocated: 27 layers × 2106 pages × 16 tokens = 1000 MB (mla, mla-split)
[VRAM after-kv-cache-init] cuda:0: alloc=3218 MB, reserved=3278 MB, used=3615 MB, free=20511 MB, total=24126 MB
2026-03-24 22:15:10,602 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
[VRAM after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,602 krasis.model INFO VRAM_CHECKPOINT [after-kv-cache-init] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,724 krasis.tokenizer INFO Tokenizer loaded: vocab=100000, eos=100001, bos=100000
[VRAM after-full-load] cuda:0: alloc=3218 MB, reserved=3278 MB, used=3615 MB, free=20511 MB, total=24126 MB
2026-03-24 22:15:10,725 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:0: alloc=3218 MB, reserved=3278 MB, driver_used=3615 MB, free=20511 MB, total=24126 MB
[VRAM after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,725 krasis.model INFO VRAM_CHECKPOINT [after-full-load] cuda:1: alloc=0 MB, reserved=0 MB, driver_used=590 MB, free=15363 MB, total=15953 MB
2026-03-24 22:15:10,725 krasis.model INFO Model fully loaded in 4.2s
▸ CUDA runtime warmup
2026-03-24 22:15:10,725 krasis.server INFO ── CUDA runtime warmup ──
2026-03-24 22:15:10,726 krasis.model INFO Warming up CUDA runtime on all devices: ['cuda:0']
2026-03-24 22:15:12,391 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
2026-03-24 22:15:12,391 krasis.server ERROR [stderr] :1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
2026-03-24 22:15:12,429 krasis.linear_attention INFO Compiled linear attention chunk step (default/Inductor mode)
2026-03-24 22:15:12,935 krasis.linear_attention INFO Linear attention torch.compile warmup complete on cuda:0
2026-03-24 22:15:13,076 krasis.model INFO CUDA runtime warmup on cuda:0 complete: 31 MB consumed (21508 MB free before → 21476 MB free after)
cuBLAS + Triton kernel compilation done
▸ Setting up GPU decode store
2026-03-24 22:15:13,076 krasis.server INFO ── Setting up GPU decode store ──
2026-03-24 22:15:13,980 krasis.model INFO MLA-only KV cache: max_seq=33696 (2106 pages × 16)
2026-03-24 22:15:13,980 krasis.model INFO GPU decode store configured: 27 layers, store_addr=1350798976
GPU decode store ready (addr=0x50838e80)
VRAM monitor started (tracking warmup)
cuda:0: 0 MB total
▸ Warmup (prefill + decode, no HCS)
2026-03-24 22:15:13,981 krasis.server INFO ── Warmup (prefill + decode, no HCS) ──
Triggering lazy CUDA allocations (torch.compile, FlashInfer, cuBLAS)
2026-03-24 22:15:13,981 krasis.gpu_prefill INFO Engine path: Marlin-native DMA copy (zero conversion, zero RAM cache)
2026-03-24 22:15:13,981 krasis.model INFO Building prefill pinned buffers for cuda:0...
2026-03-24 22:15:13,981 krasis.model INFO GPU prefill: 1 managers, threshold=1 tokens
[VRAM before-prefill-warmup] cuda:0: alloc=3768 MB, reserved=3956 MB, used=4623 MB, free=19503 MB
2026-03-24 22:15:13,981 krasis.server INFO VRAM_SNAP [before-prefill-warmup] cuda:0: alloc=3768 MB, reserved=3956 MB, used=4623 MB, free=19503 MB, total=24126 MB
[VRAM before-prefill-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:13,983 krasis.server INFO VRAM_SNAP [before-prefill-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:13,983 krasis.server INFO Warming up prefill (50K tokens, GPU kernels + CUDA caches)...
2026-03-24 22:15:13,987 krasis.model INFO DMA pipelining ENABLED (1 managers, 26 groups)
2026-03-24 22:15:14,013 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 294.4 MB in 0.03s (GPU total: 4246.6 MB)
2026-03-24 22:15:14,923 krasis.model INFO server_prefill: 37 tokens in 0.93s (40 tok/s), decode_mode=gpu
[VRAM after-prefill-warmup-before-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-before-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM after-prefill-warmup-before-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-before-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,923 krasis.server INFO Prefill warmup: 37 tokens processed
[VRAM after-prefill-warmup-after-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,923 krasis.server INFO VRAM_SNAP [after-prefill-warmup-after-cleanup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM after-prefill-warmup-after-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [after-prefill-warmup-after-cleanup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,924 krasis.server INFO Prefill warmup complete (0.9s, 37 tokens)
2026-03-24 22:15:14,924 krasis.server INFO Warming up GPU decode (1 steps)...
[VRAM before-decode-warmup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [before-decode-warmup] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM before-decode-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:14,924 krasis.server INFO VRAM_SNAP [before-decode-warmup] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:14,924 krasis.model INFO DMA pipelining ENABLED (1 managers, 26 groups)
2026-03-24 22:15:14,949 krasis.gpu_prefill INFO Layer group loaded: 1 MoE layers = 294.4 MB in 0.02s (GPU total: 4246.6 MB)
2026-03-24 22:15:15,720 krasis.model INFO server_prefill: 8 tokens in 0.80s (10 tok/s), decode_mode=gpu
[VRAM decode-warmup-after-prefill] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB
2026-03-24 22:15:15,720 krasis.server INFO VRAM_SNAP [decode-warmup-after-prefill] cuda:0: alloc=3768 MB, reserved=3872 MB, used=4545 MB, free=19581 MB, total=24126 MB
[VRAM decode-warmup-after-prefill] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB
2026-03-24 22:15:15,720 krasis.server INFO VRAM_SNAP [decode-warmup-after-prefill] cuda:1: alloc=0 MB, reserved=0 MB, used=604 MB, free=15349 MB, total=15953 MB
2026-03-24 22:15:15,822 krasis.server CRITICAL Uncaught exception
Traceback (most recent call last):
File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
gpu_store.gpu_generate_batch(
RuntimeError: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/genie/krasis/python/krasis/server.py", line 2057, in
main()
File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
_warmup_decode(_model, num_steps=1)
File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
raise RuntimeError(
RuntimeError: Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] Traceback (most recent call last):
Traceback (most recent call last):
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
File "/home/genie/krasis/python/krasis/server.py", line 566, in _warmup_decode
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] gpu_store.gpu_generate_batch(
gpu_store.gpu_generate_batch(
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] RuntimeError
RuntimeError2026-03-24 22:15:15,823 krasis.server ERROR [stderr] :
: 2026-03-24 22:15:15,823 krasis.server ERROR [stderr] gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] The above exception was the direct cause of the following exception:
The above exception was the direct cause of the following exception:
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] Traceback (most recent call last):
Traceback (most recent call last):
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "", line 198, in _run_module_as_main
File "", line 198, in _run_module_as_main
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "", line 88, in _run_code
File "", line 88, in _run_code
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 2057, in
File "/home/genie/krasis/python/krasis/server.py", line 2057, in
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] main()
main()
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
File "/home/genie/krasis/python/krasis/server.py", line 1084, in main
2026-03-24 22:15:15,823 krasis.server ERROR [stderr] _warmup_decode(_model, num_steps=1)
_warmup_decode(_model, num_steps=1)
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
File "/home/genie/krasis/python/krasis/server.py", line 587, in _warmup_decode
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] raise RuntimeError(
raise RuntimeError(
2026-03-24 22:15:15,824 krasis.server ERROR [stderr] RuntimeError
RuntimeError2026-03-24 22:15:15,824 krasis.server ERROR [stderr] :
: 2026-03-24 22:15:15,824 krasis.server ERROR [stderr] Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
Decode warmup failed: gpu_decode_step error: moe_forward[1]: RuntimeError: route stream sync: CUDA_ERROR_ILLEGAL_ADDRESS
This means decode is broken and the server cannot generate tokens. Fix the underlying issue before starting.
thread '' (28992) panicked at /home/genie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called
Result::unwrap()on anErrvalue: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")stack backtrace:
0: 0x7117f42552b3 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
1: 0x7117f3dbac0a - core::fmt::write::hed7b5c73d82ecb7c
2: 0x7117f422bf06 - std::io::Write::write_fmt::h6f0185aecf0ed75f
3: 0x7117f4236bba - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
4: 0x7117f42369e8 - std::panicking::default_hook::hf0ea8939246f43a9
5: 0x7117f4236eab - std::panicking::panic_with_hook::hb4bd9ac1123582a0
6: 0x7117f4236c78 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
7: 0x7117f4232979 - std::sys::backtrace::rust_end_short_backtrace::hb72197fa777c1785
8: 0x7117f421fd9d - rustc[4425a7e20b4c8619]::rust_begin_unwind
9: 0x7117f3dc4aac - core::panicking::panic_fmt::ha59b517dd231f4da
10: 0x7117f3dc3bb2 - core::result::unwrap_failed::hf2d1f30a3ac850fc
11: 0x7117f3e92110 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h9b4be98944a295e5
12: 0x7117f3e84cce - core::ptr::drop_in_place<alloc::vec::Vec<core::option::Optionkrasis::gpu_decode::HcsCacheEntry>>::h546e5990b6df0384
13: 0x7117f3e8e351 - core::ptr::drop_in_placekrasis::gpu_decode::GpuDecodeGraph::hc951e853e0c7ec37
14: 0x7117f3e8f6ba - core::ptr::drop_in_placekrasis::gpu_decode::GpuDecodeStore::h059c6d844c3ab860
15: 0x7117f3e5d6dd - <pyo3::pycell::impl::PyClassObject as pyo3::pycell::impl::PyClassObjectLayout>::tp_dealloc::hc3347620756b38a1
16: 0x7117f3eb212d - pyo3::impl::trampoline::trampoline_unraisable::h810a1319a7141020
17: 0x7117f3eb5510 - pyo3::impl::pyclass::tp_dealloc::h58cce6a81cb2a4df
18: 0x575eae -
19: 0x575bfc -
20: 0x59efd5 -
21: 0x573376 -
22: 0x583404 - _PyModule_Clear
23: 0x6b1999 -
24: 0x6b0e1d - Py_FinalizeEx
25: 0x6bc7d1 - Py_RunMain
26: 0x6bc3ed - Py_BytesMain
27: 0x7117f502a1ca - __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
28: 0x7117f502a28b - __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
29: 0x6576c5 - _start
30: 0x0 -
thread '' (28992) panicked at /home/genie/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/cudarc-0.12.1/src/driver/safe/core.rs:252:76:
called
Result::unwrap()on anErrvalue: DriverError(CUDA_ERROR_ILLEGAL_ADDRESS, "an illegal memory access was encountered")stack backtrace:
0: 0x7117f42552b3 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h93773fc827e3113d
1: 0x7117f3dbac0a - core::fmt::write::hed7b5c73d82ecb7c
2: 0x7117f422bf06 - std::io::Write::write_fmt::h6f0185aecf0ed75f
3: 0x7117f4236bba - std::panicking::default_hook::{{closure}}::h2be84df4f189ae36
4: 0x7117f42369e8 - std::panicking::default_hook::hf0ea8939246f43a9
5: 0x7117f4236eab - std::panicking::panic_with_hook::hb4bd9ac1123582a0
6: 0x7117f4236c78 - std::panicking::panic_handler::{{closure}}::hde00dd15f5637fe2
7: 0x7117f4232979 - std::sys::backtrace::rust_end_short_backtrace::hb72197fa777c1785
8: 0x7117f421fd9d - rustc[4425a7e20b4c8619]::rust_begin_unwind
9: 0x7117f3dc4aac - core::panicking::panic_fmt::ha59b517dd231f4da
10: 0x7117f3dc3bb2 - core::result::unwrap_failed::hf2d1f30a3ac850fc
11: 0x7117f3e92110 - core::ptr::drop_in_place<cudarc::driver::safe::core::CudaSlice>::h9b4be98944a295e5
12: 0x7117f3e84d52 - core::ptr::drop_in_place<alloc::vec::Vec<core::option::Optionkrasis::gpu_decode::HcsCacheEntry>>::h546e5990b6df0384
13: 0x7117f3e8e351 - core::ptr::drop_in_placekrasis::gpu_decode::GpuDecodeGraph::hc951e853e0c7ec37
14: 0x7117f3e8f6ba - core::ptr::drop_in_placekrasis::gpu_decode::GpuDecodeStore::h059c6d844c3ab860
15: 0x7117f3e5d6dd - <pyo3::pycell::impl::PyClassObject as pyo3::pycell::impl::PyClassObjectLayout>::tp_dealloc::hc3347620756b38a1
16: 0x7117f3eb212d - pyo3::impl::trampoline::trampoline_unraisable::h810a1319a7141020
17: 0x7117f3eb5510 - pyo3::impl::pyclass::tp_dealloc::h58cce6a81cb2a4df
18: 0x575eae -
19: 0x575bfc -
20: 0x59efd5 -
21: 0x573376 -
22: 0x583404 - _PyModule_Clear
23: 0x6b1999 -
24: 0x6b0e1d - Py_FinalizeEx
25: 0x6bc7d1 - Py_RunMain
26: 0x6bc3ed - Py_BytesMain
27: 0x7117f502a1ca - __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
28: 0x7117f502a28b - __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
29: 0x6576c5 - _start
30: 0x0 -
thread '' (28992) panicked at /rustc/4a4ef493e3a1488c6e321570238084b38948f6db/library/core/src/panicking.rs:233:5: