
[Bug]: mllm-qwen-npu failed to run on OnePlus Ace5 pro #575

@wutiaojian000


Prerequisites

  • I have searched the existing issues and confirmed this is not a duplicate.
  • I am using the latest version of the MLLM framework.

Bug Description

The environment is similar to #574, including:
Ubuntu: 22.04.5
QNN SDK: 2.41.0.251128
Hexagon NPU Runtime: 6.4.0.1
I also tried to run mllm-qwen-npu with the Qwen1.5-1.8B-Chat model on my OnePlus Ace5 pro, but it failed.

Steps to Reproduce

  1. I followed the steps in [v2] Missing demo scripts for Android QNN backend execution (was available in v1) #560; the only difference is that the NPU I used is v79.
  2. The result:
LD_LIBRARY_PATH=/data/local/tmp/build-android-arm64-v8a-qnn/bin/ ./mllm-qwen-npu
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:33 Mixed inference mode: NPU prefill + CPU decode
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:58 CPU decode model loaded from: /data/local/tmp/zhanghao/models/qwen1.5-1.8b-chat-rot_q4_0.mllm
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:62 Loading QNN model for prefill...
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNUtils.cpp:23 QNN Backend Lib: libQnnHtp.so
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:306 Registered Op Package: libQnnLLaMAPackage_CPU.so and interface provider: LLaMAPackageInterfaceProvider
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:306 Registered Op Package: libQnnLLaMAPackage_HTP.so and interface provider: LLaMAPackageInterfaceProvider
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:47 QNN Backend Build Id: v2.41.0.251128145156_191518
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:49 QNN backend supports tensor sparsity
[INFO] /home/zcm/mllm/mllm/backends/qnn/QNNBackend.cpp:52 QNN backend supports dynamic dimensions
[INFO] /home/zcm/mllm/mllm/backends/base/PluginSystem.cpp:89 Register customized op: DequantizeAdd:4097 -> QNN
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:72 Created shared StaticCache with 24 layers
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:77 QNN prefill model loaded from: /data/local/tmp/zhanghao/models/qwen1.5-1.8b-chat-rot-qnn.mllm
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:87 Configured 24 QNN KVCache layers to use shared StaticCache
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:97 Configured 24 CPU KVCache layers to use shared StaticCache
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:110 Input tokens: 194 tokens
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:179 Starting QNN prefill...
enter npu_model->trace(past, {})
leave npu_model->trace(past, {})
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:203 ************************enter graphBuildPM.run()
linalg.CPU.RMSNormOp [name="model.layers.0.input_layernorm"](%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1578:tensor<[1, 194, 2048], Float32, QNN>)[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNRMSNormOp.cpp:48 Failed to cast to QNNRMSNormOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: RMSNorm in graph 'model.layers.0_1'
[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNLinearOp.cpp:135 Failed to cast to QNNLinearOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: Linear in graph 'model.layers.0_2'
linalg.CPU.RMSNormOp [name="model.layers.1.input_layernorm"](%1644:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true, is_graph_output:true]) -> (%1645:tensor<[1, 194, 2048], Float32, QNN>)[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNRMSNormOp.cpp:48 Failed to cast to QNNRMSNormOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: RMSNorm in graph 'model.layers.1_1'
[ERROR] /home/zcm/mllm/mllm/backends/qnn/op/QNNLinearOp.cpp:135 Failed to cast to QNNLinearOp
[ERROR] /home/zcm/mllm/mllm/backends/qnn/passes/QNNGraphBuildPass.cpp:164 Failed to add node for op type: Linear in graph 'model.layers.1_2'
...
[INFO] /home/zcm/mllm/examples/qwen_npu/main.cpp:205 ************************out graphBuildPM.run()
[WARN] /home/zcm/mllm/mllm/backends/cpu/kernels/common/ggml/vec_dot_type.hpp:181 Unsupported DataType Int8
[WARN] /home/zcm/mllm/mllm/backends/cpu/kernels/common/ggml/vec_dot_type.hpp:181 Unsupported DataType Int8
Segmentation fault

I read the source code and found that when a Module registers its Layers, it also registers the BaseOp of each Layer:

template<typename T, typename... Args>
  auto reg(const std::string& name, Args&&... args) {
    // Register a module
    if constexpr (std::is_base_of_v<Module, T>) {
      auto ret = T((impl_->getAbsoluteName() == "" ? name
                                                   : impl_->getAbsoluteName() + (name == "" ? "" : ".") +  // avoid double dot
                                                         name),
                   std::forward<Args>(args)...);
      impl_->regChildNode(ret.impl());
      ret.impl()->setName(name);
      return ret;
    }

    // Register to thisThread table
    if constexpr (std::is_base_of_v<Layer, T>) {
      auto ret = T(std::forward<Args>(args)...);
      impl_->regChildNode(ret.impl());
      ret.impl()->setAbsoluteName((impl_->getAbsoluteName() == ""
                                       ? name
                                       : impl_->getAbsoluteName() + (name == "" ? "" : ".") +  // avoid double dot
                                             name));
      ret.impl()->setName(name);

      auto& ctx = Context::instance();
      // Create Op
      BaseOp::ptr_t _op = nullptr;
      
      _op = ctx.getBackend(ret.impl()->getDevice())->createOp(ret.opType(), ret.refOptions());
      _op->setName(ret.impl()->getAbsoluteName());

      // Register Op
      ret.impl()->setInstancedOp(_op);

      return ret;
    }
  }

However, the default value of ret.impl()->getDevice() is always kCPU, which means the Op is created by CPUBackend, so the instanced Op is a CPU op rather than a QNN op. I'm not sure whether this is correct; maybe I'm missing some details. I dumped the subgraph, which looks like this (a minimal sketch of the resulting cast failure follows after the dump):

graph.SubGraphOp @model.layers.0_1 <QNN> {
    (%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true]) {
        linalg.CPU.RMSNormOp [name="model.layers.0.input_layernorm"](%1577:tensor<[1, 194, 2048], Float32, QNN>[is_graph_input:true]) -> (%1578:tensor<[1, 194, 2048], Float32, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.0"](%1578:tensor<[1, 194, 2048], Float32, QNN>) -> (%1579:tensor<[1, 194, 1, 2048], Float32, QNN>)
        linalg.QNN.CastTypeOp [name="model.layers.0_1.CastType.0"](%1579:tensor<[1, 194, 1, 2048], Float32, QNN>) -> (%1580:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.q_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1582:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.k_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1583:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.CPU.LinearOp [name="model.layers.0.self_attn.v_proj"](%1580:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1584:tensor<[1, 194, 1, 2048], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.1"](%1582:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1585:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.2"](%1583:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1587:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.ViewOp [name="model.layers.0_1.View.3"](%1584:tensor<[1, 194, 1, 2048], Int16, QNN>) -> (%1589:tensor<[1, 194, 16, 128], Int16, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.q_proj.dequantize"](%1585:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1591:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.k_proj.dequantize"](%1587:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1592:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.DequantizeAdd [name="model.layers.0.self_attn.v_proj.dequantize"](%1589:tensor<[1, 194, 16, 128], Int16, QNN>) -> (%1593:tensor<[1, 194, 16, 128], Float32, QNN>)
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.0"](%1591:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.1"](%1592:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        linalg.QNN.TransposeOp [name="model.layers.0_1.Transpose.2"](%1593:tensor<[1, 194, 16, 128], Float32, QNN>) -> (%1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true])
        cf.ReturnOp (%1594:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1595:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true], %1596:tensor<[1, 16, 194, 128], Float32, QNN>[is_graph_output:true]) -> ()
    }
}
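That would explain the cast errors in the log above: if the op was instanced by the CPU backend, the downcast in QNNGraphBuildPass returns null and the node cannot be added. Below is a minimal, self-contained sketch of that failure mode; the class names are simplified stand-ins for the real MLLM types, just to illustrate the behavior:

#include <iostream>
#include <memory>

// Simplified stand-ins for the real MLLM op classes (illustration only).
struct BaseOp { virtual ~BaseOp() = default; };
struct CPURMSNormOp : BaseOp {};   // what createOp() returns when the device is kCPU
struct QNNRMSNormOp : BaseOp {};   // what the QNN graph-build pass expects

int main() {
  // reg() instanced the op via the CPU backend because getDevice() defaulted to kCPU
  std::shared_ptr<BaseOp> op = std::make_shared<CPURMSNormOp>();
  // the QNN graph-build pass then downcasts it and gets nullptr
  auto qnn_op = std::dynamic_pointer_cast<QNNRMSNormOp>(op);
  if (!qnn_op) { std::cout << "Failed to cast to QNNRMSNormOp\n"; }
  return 0;
}

If that is really what happens, the Layer's device would need to be kQNN before reg() calls createOp(), but I may be missing where that switch is supposed to happen.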

Is it an issue with the above QNN/Hexagon versions? Or could there be other possible causes?
Thanks for your help.

Expected Behavior

Works correctly.

Operating System

Android

Device

OnePlus Ace5 pro

MLLM Framework Version

current version

Model Information

No response

Additional Context

No response
