AttributeError: 'kt_kernel_ext.CPUInfer' object has no attribute 'submit_with_cuda_stream' #1760

Description

@sunnsi

Reminder

  • I have read the above rules and searched the existing issues.

System Info

(ktransformers) ➜  ik_llama.cpp git:(main) kt doctor                                                                     

KTransformers Environment Diagnostics

┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Check                ┃ Status   ┃ Value                                                                                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Python Version       │ OK       │ 3.11.11                                                                                │
│ CUDA Availability    │ OK       │ 12.6                                                                                   │
│ GPU Detection        │ OK       │ Found 1 GPU: NVIDIA RTX A6000                                                          │
│ CPU                  │ OK       │ INTEL(R) XEON(R) GOLD 6530 (64 cores / 64 threads)                                     │
│ CPU Instruction Sets │ OK       │ AMX-INT8, AMX-BF16, AMX-TILE, AVX512BF16, AVX512F, AVX512BW, AVX512VL, AVX2 (+14 more) │
│ NUMA Topology        │ OK       │ 4 nodes                                                                                │
│ System Memory        │ OK       │ 617.3GB available / 628.6GB total                                                      │
│ Disk Space           │ OK       │ 2301.6GB free on /data/llm-models                                                      │
│ SGLang Source        │ OK       │ Source (GitHub: kvcache-ai/sglang, branch: main)                                       │
│ SGLang kt-kernel     │ OK       │ SGLang kt-kernel support verified                                                      │
│ Environment Managers │ OK       │ conda 24.9.2, venv builtin, docker 28.3.0                                              │
└──────────────────────┴──────────┴────────────────────────────────────────────────────────────────────────────────────────┘

Reproduction

 kt run /data/llm-models/MiniMaxAI/MiniMax-M2.1 --cpu-threads 64 --numa-nodes 4

Output and error:

[2025-12-27 14:37:30] The available memory for KV cache is 2.00 GB.
[2025-12-27 14:37:30] max_total_tokens=100000 is larger than the profiled value 8469. Use the profiled value instead.                                   
[2025-12-27 14:37:30] KV Cache is allocated. #tokens: 8469, K size: 1.00 GB, V size: 1.00 GB                                                            
[2025-12-27 14:37:30] Memory pool end. avail mem=8.14 GB                                                                                                
[2025-12-27 14:37:31] Capture cuda graph begin. This can take up to several minutes. avail mem=7.63 GB                                                  
[2025-12-27 14:37:31] Capture cuda graph bs [1, 2, 4, 8, 12, 16]                                                                                        
Capturing batches (bs=16 avail_mem=7.61 GB):   0%|                                                                             | 0/6 [00:16<?, ?it/s]   
[2025-12-27 14:37:48] Scheduler hit an exception: Traceback (most recent call last):                                                                    
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/managers/scheduler.py", line 2833, in run_scheduler_process                                    
    scheduler = Scheduler(                                                                                                                              
                ^^^^^^^^^^                                                                                                                              
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/managers/scheduler.py", line 314, in __init__                                                  
    self.init_model_worker()                                                                                                                            
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/managers/scheduler.py", line 452, in init_model_worker                                         
    self.tp_worker = TpModelWorker(                                                                                                                     
                     ^^^^^^^^^^^^^^                                                                                                                     
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/managers/tp_worker.py", line 253, in __init__                                                  
    self._model_runner = ModelRunner(                                                                                                                   
                         ^^^^^^^^^^^^                                                                                                                   
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/model_runner.py", line 385, in __init__                                         
    self.initialize(min_per_gpu_memory)                                                                                                                 
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/model_runner.py", line 584, in initialize                                       
    self.init_device_graphs()                                                                                                                           
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/model_runner.py", line 2670, in init_device_graphs                              
    self.graph_runner = graph_runners[self.device](self)                                                                                                
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 358, in __init__                                    
    self.capture()                                                                                                                                      
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 513, in capture                                     
    _capture_one_stream()                                                                                                                               
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 500, in _capture_one_stream                         
    ) = self.capture_one_batch_size(bs, forward, stream_idx)                                                                                            
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                            
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 719, in capture_one_batch_size                      
    run_once()                                                                                                                                          
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 706, in run_once                                    
    logits_output_or_pp_proxy_tensors = forward(                                                                                                        
                                        ^^^^^^^^                                                                                                        
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context              
    return func(*args, **kwargs)                                                                                                                        
           ^^^^^^^^^^^^^^^^^^^^^                                                                                                                        
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/models/minimax_m2.py", line 795, in forward                                                    
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)                                                                       
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                       
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl           
    return self._call_impl(*args, **kwargs)                                                                                                             
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                             
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl                   
    return forward_call(*args, **kwargs)                                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                      
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/models/minimax_m2.py", line 710, in forward                                                    
    hidden_states, residual = layer(                                                                                                                    
                              ^^^^^^                                                                                                                    
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl           
    return self._call_impl(*args, **kwargs)                                                                                                             
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                             
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl                   
    return forward_call(*args, **kwargs)                                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/models/minimax_m2.py", line 559, in forward                                                    
    hidden_states = self.block_sparse_moe(hidden_states, forward_batch)                                                                                 
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                 
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl           
    return self._call_impl(*args, **kwargs)                                                                                                             
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                             
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl                   
    return forward_call(*args, **kwargs)                                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/models/minimax_m2.py", line 200, in forward                                                    
    return self.forward_normal(hidden_states)                                                                                                           
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                           
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/models/minimax_m2.py", line 210, in forward_normal                                             
    final_hidden_states = self.experts(hidden_states, topk_output)                                                                                      
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                      
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl           
    return self._call_impl(*args, **kwargs)                                                                                                             
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                             
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl                   
    return forward_call(*args, **kwargs)                                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 908, in forward                                    
    return self.forward_impl(hidden_states, topk_output)                                                                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 927, in forward_impl                               
    combine_input = self.run_moe_core(                                                                                                                  
                    ^^^^^^^^^^^^^^^^^^                                                                                                                  
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 948, in run_moe_core                               
    return self.quant_method.apply(                                                                                                                     
           ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                     
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 965, in apply                                               
    self.submit(layer, dispatch_output)                                                                                                                 
  File "/home/mgi527a/Softwares/sglang/python/sglang/srt/layers/moe/kt_ep_wrapper.py", line 895, in submit                                              
    self.wrapper.submit_forward(                                                                                                                        
  File "/home/mgi527a/anaconda3/envs/ktransformers/lib/python3.11/site-packages/kt_kernel/experts_base.py", line 277, in submit_forward                 
    self.cpu_infer.submit_with_cuda_stream(                                                                                                             
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                              
AttributeError: 'kt_kernel_ext.CPUInfer' object has no attribute 'submit_with_cuda_stream'                                                              
                                                                                                                                                        
[2025-12-27 14:37:48] Received sigquit from a child process. It usually means the child failed.
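
The traceback shows the Python wrapper (`kt_kernel/experts_base.py`) calling a `submit_with_cuda_stream` method that the compiled `kt_kernel_ext` extension does not expose, which usually indicates the installed `kt_kernel` Python package and the native extension come from different versions. A minimal sketch for checking what the installed extension actually provides (this assumes `kt_kernel_ext` is importable under the module name shown in the AttributeError; the real import path may differ depending on the build):

```python
# Diagnostic sketch: list the public methods of the compiled CPUInfer class.
# Assumption: kt_kernel_ext imports under the name shown in the AttributeError;
# adjust the import if the extension is nested inside the kt_kernel package.
import kt_kernel_ext

methods = [m for m in dir(kt_kernel_ext.CPUInfer) if not m.startswith("_")]
print("CPUInfer public methods:", methods)
print("has submit_with_cuda_stream:", "submit_with_cuda_stream" in methods)
```

If the method is missing, rebuilding or reinstalling kt-kernel so that the Python layer and the native extension match should clear the error.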

Others

No response
