The paper “Fast On-Device LLM Inference with NPUs” reports experiments not only with Qwen1.5-1.8B but also with Gemma-2B, Phi-2.7B, LLaMA2-Chat-7B, and Mistral-7B.
However, if you look at the model card here:
https://github.com/UbiquitousLearning/mllm?tab=readme-ov-file#supported-models
it lists only a small set of models as supported for Hexagon NPU inference: Qwen1.5 models from 0.5B to 1.5B, plus PhoneLM-1.5 and Qwen2-VL.
In addition, the mllm-based paper
“Accelerating Mobile Language Models via Speculative Decoding and NPU-Coordinated Execution” states that LLaMA-3.2-3B was also used in its experiments.
Is it possible to run models not explicitly listed in the model card (such as LLaMA-3.2-3B or LLaMA2-Chat-7B) on the NPU?
If so, is there any manual or documentation available?
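For reference, before asking I compared the architectures of a listed and an unlisted model. This is only a diagnostic sketch using Hugging Face's `AutoConfig`; it assumes (my assumption, not something the model card says) that the NPU path mainly depends on the transformer shapes, and it does not touch mllm's own conversion tooling at all:

```python
# Hypothetical compatibility check: compare a candidate model's architecture
# against one of the Qwen1.5 checkpoints the model card lists as NPU-supported.
# This only checks structural similarity; it says nothing about what the
# Hexagon/QNN backend actually accepts -- which is exactly what I'm asking about.
from transformers import AutoConfig

FIELDS = [
    "model_type", "hidden_size", "num_hidden_layers",
    "num_attention_heads", "num_key_value_heads",
    "intermediate_size", "vocab_size",
]

def summarize(repo_id: str) -> dict:
    # Note: meta-llama repos are gated on Hugging Face and may require auth.
    cfg = AutoConfig.from_pretrained(repo_id)
    return {f: getattr(cfg, f, None) for f in FIELDS}

supported = summarize("Qwen/Qwen1.5-1.8B-Chat")   # listed as NPU-supported
candidate = summarize("meta-llama/Llama-3.2-3B")  # not listed in the model card

for field in FIELDS:
    print(f"{field:22} supported={supported[field]!r:12} candidate={candidate[field]!r}")
```

If architectural similarity is not the limiting factor (e.g., if the constraint is instead in the QNN op support or the per-model NPU graphs), a pointer to the relevant docs would be much appreciated.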