[BUG] CUDA error during inference #41

@gerayking

Description

The following error occurs intermittently during multi-turn conversations:

INFO:     21.64.147.125:36540 - "POST /voice/chat HTTP/1.1" 200 OK
/pytorch/aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Exception in thread Thread-42 (run_generate):
Traceback (most recent call last):
  File "/data/miniconda3/envs/mimo/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/data/miniconda3/envs/mimo/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/mnt/cephfs/MiMo-Audio-A/src/mimo_audio/mimo_audio.py", line 1202, in run_generate
    self.model.generate(
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/cephfs/MiMo-Audio-A/src/mimo_audio/modeling_mimo_audio.py", line 1001, in generate
    return self.slm_sample(
           ^^^^^^^^^^^^^^^^
  File "/mnt/cephfs/MiMo-Audio-A/src/mimo_audio/modeling_mimo_audio.py", line 1088, in slm_sample
    next_speech_tokens = self.local_forward(
                         ^^^^^^^^^^^^^^^^^^^
  File "/mnt/cephfs/MiMo-Audio-A/src/mimo_audio/modeling_mimo_audio.py", line 739, in local_forward
    output: BaseModelOutputWithPast = self.local_transformer(
                                      ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 579, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 276, in forward
    hidden_states = self.mlp(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 57, in forward
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/modules/activation.py", line 432, in forward
    return F.silu(input, inplace=self.inplace)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/mimo/lib/python3.12/site-packages/torch/nn/functional.py", line 2380, in silu
    return torch._C._nn.silu(input)
           ^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
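
A note on reading this trace: the assertion text (`probability tensor contains either inf, nan or element < 0`) is the input check that `torch.multinomial` performs, and CUDA kernel errors are reported asynchronously, so the Python frame that ends in `F.silu` is not necessarily where the bad values appeared. Re-running with the environment variable `CUDA_LAUNCH_BLOCKING=1` usually makes the traceback point at the real call site. The sketch below is one possible way to turn the device-side assert into a readable Python error. It assumes the sampling step applies softmax to logits and then calls `torch.multinomial`, which this report does not confirm. `check_finite` and `checked_sample` are hypothetical helper names, not part of MiMo-Audio.

```python
import torch


def check_finite(name: str, t: torch.Tensor) -> None:
    """Raise a readable Python error if `t` contains NaN or Inf."""
    if not torch.isfinite(t).all():
        n_nan = int(torch.isnan(t).sum())
        n_inf = int(torch.isinf(t).sum())
        raise ValueError(
            f"{name}: {n_nan} NaN and {n_inf} Inf values, dtype={t.dtype}"
        )


def checked_sample(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical helper: validate logits, then sample one token id per row.

    Assumes `logits` has shape (batch, vocab) and `temperature` > 0.
    """
    check_finite("logits", logits)
    # Compute softmax in float32; half precision overflows more easily.
    probs = torch.softmax(logits.float() / temperature, dim=-1)
    check_finite("probs", probs)
    return torch.multinomial(probs, num_samples=1)


if __name__ == "__main__":
    good = torch.tensor([[0.1, 0.2, 0.7]])
    print(checked_sample(good).shape)

    bad = torch.tensor([[0.1, float("nan"), 0.7]])
    try:
        checked_sample(bad)
    except ValueError as err:
        print(err)
```

If the check fires on the logits rather than on the probabilities, the NaN or Inf was produced upstream of sampling (for example in the local transformer's forward pass), which narrows down where to look next.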
