
[Bug]: mllm-qwen-npu failed to run on OnePlus Ace5 pro #575

@wutiaojian000


Prerequisites

  • I have searched the existing issues and confirmed this is not a duplicate.
  • I am using the latest version of the MLLM framework.

Bug Description

The environment is similar to #574, including:
Ubuntu: 22.04.5
QNN SDK: 2.41.0.251128
Hexagon NPU Runtime: 6.4.0.1
I also tried to run mllm-qwen-npu with the Qwen1.5-1.8B-Chat model on my OnePlus Ace5 pro, but it failed.

Steps to Reproduce

  1. I followed the steps in [v2] Missing demo scripts for Android QNN backend execution (was available in v1) #560; the only difference is that the NPU I used is v79.
  2. The result:
LD_LIBRARY_PATH=/data/local/tmp/build-android-arm64-v8a-qnn/bin/ ./mllm-qwen-npu
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:33 Mixed inference mode: NPU prefill + CPU decode
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:58 CPU decode model loaded from: /data/local/tmp/zhanghao/models/qwen1.5-1.8b-chat-rot_q4_0.mllm
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:62 Loading QNN model for prefill...
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNUtils.cpp:23 QNN Backend Lib: libQnnHtp.so
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:306 Registered Op Package: libQnnLLaMAPackage_CPU.so and interface provider: LLaMAPackageInterfaceProvider
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:306 Registered Op Package: libQnnLLaMAPackage_HTP.so and interface provider: LLaMAPackageInterfaceProvider
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:47 QNN Backend Build Id: v2.41.0.251128145156_191518
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:49 QNN backend supports tensor sparsity
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:52 QNN backend supports dynamic dimensions
[INFO] /home/zcm/mllm/mllm/backends/base/PluginSystem.cpp:89 Register customized op: DequantizeAdd:4097 -> QNN
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:72 Created shared StaticCache with 24 layers
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:77 QNN prefill model loaded from: /data/local/tmp/zhanghao/models/qwen1.5-1.8b-chat-rot-qnn.mllm
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:87 Configured 24 QNN KVCache layers to use shared StaticCache
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:97 Configured 24 CPU KVCache layers to use shared StaticCache
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:110 Input tokens: 194 tokens
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:179 Starting QNN prefill...
enter npu_model->trace(past, {})
leave npu_model->trace(past, {})
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:203 ************************enter graphBuildPM.run()
linalg.CPU.RMSNormOp [name="model.layers.0.input_layernorm"](%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1578:tensor<[1, 194, 2048], Float32, QNN>)[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNRMSNormOp.cpp:48 Failed to cast to QNNRMSNormOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: RMSNorm in graph 'model.layers.0_1'
[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNLinearOp.cpp:135 Failed to cast to QNNLinearOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: Linear in graph 'model.layers.0_2'
linalg.CPU.RMSNormOp [name="model.layers.1.input_layernorm"](%1644:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true, is_graph_output:true]) -> (%1645:tensor<[1, 194, 2048], Float32, QNN>)[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNRMSNormOp.cpp:48 Failed to cast to QNNRMSNormOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: RMSNorm in graph 'model.layers.1_1'
[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNLinearOp.cpp:135 Failed to cast to QNNLinearOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: Linear in graph 'model.layers.1_2'
...
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:205 ************************out graphBuildPM.run()
[WARN] /home/zcm/mllm/mllm/backends/cpu/kernels/common/ggml/vec_dot_type.hpp:181 Unsupported DataType Int8
[WARN] /home/zcm/mllm/mllm/backends/cpu/kernels/common/ggml/vec_dot_type.hpp:181 Unsupported DataType Int8
Segmentation fault

I read the source code and found that when a Module registers its Layers, it also registers the BaseOp of each Layer:

template<typename T, typename... Args>
  auto reg(const std::string& name, Args&&... args) {
    // Register a module
    if constexpr (std::is_base_of_v<Module, T>) {
      auto ret = T((impl_->getAbsoluteName() == "" ? name
                                                   : impl_->getAbsoluteName() + (name == "" ? "" : ".") +  // avoid double dot
                                                         name),
                   std::forward<Args>(args)...);
      impl_->regChildNode(ret.impl());
      ret.impl()->setName(name);
      return ret;
    }

    // Register to thisThread table
    if constexpr (std::is_base_of_v<Layer, T>) {
      auto ret = T(std::forward<Args>(args)...);
      impl_->regChildNode(ret.impl());
      ret.impl()->setAbsoluteName((impl_->getAbsoluteName() == ""
                                       ? name
                                       : impl_->getAbsoluteName() + (name == "" ? "" : ".") +  // avoid double dot
                                             name));
      ret.impl()->setName(name);

      auto& ctx = Context::instance();
      // Create Op
      BaseOp::ptr_t _op = nullptr;
      
      _op = ctx.getBackend(ret.impl()->getDevice())->createOp(ret.opType(), ret.refOptions());
      _op->setName(ret.impl()->getAbsoluteName());

      // Register Op
      ret.impl()->setInstancedOp(_op);

      return ret;
    }
  }

However, the default value of ret.impl()->getDevice() is always kCPU, which means the Op is created by CPUBackend, so the instanced Op is a CPU op rather than a QNN op. I'm not sure whether this is correct; maybe I'm missing some details. I dumped the subgraph, which looks like this (a minimal sketch of the resulting cast failure follows after the dump):

graph.SubGraphOp @model.layers.0_1 <QNN> {
    (%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true]) {
        linalg.CPU.RMSNormOp [name="model.layers.0.input_layernorm"](%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1578:tensor<[1, 194, 2048], Float32, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.0"](%1578:tensor<[1, 194, 2048], Float32, QNN>) -> (%1579:tensor<[1, 194, 1, 2048], Float32, QNN>)
        linalg.QNN.CastTypeOp [name="model.layers.0_1.CastType.0"](%1579:tensor<[1, 194, 1, 2048], Float32, QNN>) -> (%1580:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.q_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1582:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.k_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1583:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.v_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1584:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.1"](%1582:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1585:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.2"](%1583:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1587:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.3"](%1584:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1589:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.q_proj.dequantize"](%1585:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1591:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.k_proj.dequantize"](%1587:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1592:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.v_proj.dequantize"](%1589:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1593:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.0"](%1591:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.1"](%1592:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.2"](%1593:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        cf.ReturnOp (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true]) -> ()
    }
}
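That would explain the cast errors in the log above: if the op was instanced by the CPU backend, the downcast in QNNGraphBuildPass returns null and the node cannot be added. Below is a minimal, self-contained sketch of that failure mode; the class names are simplified stand-ins for the real MLLM types, just to illustrate the behavior:

#include <iostream>
#include <memory>

// Simplified stand-ins for the real MLLM op classes (illustration only).
struct BaseOp { virtual ~BaseOp() = default; };
struct CPURMSNormOp : BaseOp {};   // what createOp() returns when the device is kCPU
struct QNNRMSNormOp : BaseOp {};   // what the QNN graph-build pass expects

int main() {
  // reg() instanced the op via the CPU backend because getDevice() defaulted to kCPU
  std::shared_ptr<BaseOp> op = std::make_shared<CPURMSNormOp>();
  // the QNN graph-build pass then downcasts it and gets nullptr
  auto qnn_op = std::dynamic_pointer_cast<QNNRMSNormOp>(op);
  if (!qnn_op) { std::cout << "Failed to cast to QNNRMSNormOp\n"; }
  return 0;
}

If that is really what happens, the Layer's device would need to be kQNN before reg() calls createOp(), but I may be missing where that switch is supposed to happen.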

Is it an issue with the above QNN/Hexagon versions? Or could there be other possible causes?
Thanks for your help.

Expected Behavior

Works correctly.

Operating System

Android

Device

OnePlus Ace5 pro

MLLM Framework Version

current version

Model Information

No response

Additional Context

No response
