
m96-chan/NxPU

    _   __     ____  __  __
   / | / /  __/ __ \/ / / /
  /  |/ / |/_/ /_/ / / / /
 / /|  />  </ ____/ /_/ /
/_/ |_/_/|_/_/    \____/

WGSL → NPU transpiler for multi-vendor neural processing units

Write ML kernels once in WGSL. Deploy to any NPU.


Architecture · Changelog · Contributing


Why NxPU?

NPU hardware is fragmented — every vendor ships a different SDK, model format, and toolchain. NxPU solves this by providing a single compilation pipeline from WGSL compute shaders to native NPU formats.

  • One language, many targets — Write WGSL once, emit ONNX, TFLite, CoreML, StableHLO, or vendor-specific formats
  • Pattern recognition — Automatically classifies compute kernels into MatMul, Conv2D, Attention, and 10+ other ML operations
  • Optimization passes — Constant folding, FMA fusion, dead code elimination, common subexpression elimination, and quantization
  • Vendor-aware validation — Operator support matrices for 8 NPU vendors with native/emulated/unsupported classification
  • Pluggable backends — Add new NPU targets by implementing a single trait
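The "pluggable backends" point can be sketched as a single Rust trait. All names below (`Backend`, `Module`, `EmitError`, `DumpBackend`) are illustrative stand-ins, not NxPU's actual API:

```rust
/// Stand-in for NxPU's optimized SSA IR module (illustrative only).
struct Module {
    entry_point: String,
}

/// Illustrative error type for failed emission.
#[derive(Debug)]
struct EmitError(String);

/// The single-trait idea: a new NPU target supplies a name
/// and a way to lower the IR module to its byte format.
trait Backend {
    fn name(&self) -> &'static str;
    fn emit(&self, module: &Module) -> Result<Vec<u8>, EmitError>;
}

/// Toy backend that "emits" the entry-point name as text.
struct DumpBackend;

impl Backend for DumpBackend {
    fn name(&self) -> &'static str {
        "ir-dump"
    }
    fn emit(&self, module: &Module) -> Result<Vec<u8>, EmitError> {
        Ok(format!("entry: {}", module.entry_point).into_bytes())
    }
}

fn main() {
    let module = Module { entry_point: "main".to_string() };
    let backend = DumpBackend;
    let bytes = backend.emit(&module).expect("emit failed");
    println!("{} -> {}", backend.name(), String::from_utf8(bytes).unwrap());
}
```

A real backend would serialize to protobuf or FlatBuffers instead of text, but the shape (name plus emit) is the whole integration surface.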

Quick Start

cargo install --path crates/nxpu-cli
# Transpile WGSL → ONNX
nxpu examples/vecadd.wgsl --target onnx -o vecadd.onnx

# Transpile → TFLite with int8 quantization
nxpu examples/matmul.wgsl --target tflite --precision int8 -o matmul.tflite

# Dump the intermediate representation
nxpu examples/relu.wgsl --target ir-dump

# List all available backends
nxpu --list-targets

Example

Input (examples/vecadd.wgsl):
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let idx = gid.x;
  c[idx] = a[idx] + b[idx];
}
Output:

$ nxpu examples/vecadd.wgsl --target onnx -o vecadd.onnx
Info: entry point 'main': classified as Add

The output vecadd.onnx can be loaded directly into any ONNX runtime.

Supported Backends

Target      Aliases        Format
onnx        -              .onnx (protobuf)
tflite      litert         .tflite (FlatBuffers)
coreml      apple-ane      .mlmodel
stablehlo   xla            .mlir (text)
ir-dump     ir             Text (stdout)
intel-npu   openvino       OpenVINO IR .xml + .onnx
amd-xdna    amd-npu        ONNX + XDNA metadata
arm-ethos   ethos-u        TFLite + optional Vela
samsung     exynos         ONNX + ONE toolchain hints
qualcomm    hexagon-npu    ONNX + QNN SDK hints
mediatek    neuropilot     TFLite + NeuroPilot hints
rockchip    rknn           ONNX + RKNN Toolkit hints
ceva        neupro         ONNX + CDNN compiler hints

Each vendor backend includes an operator support matrix that validates patterns against the target NPU's capabilities, emitting warnings for emulated or unsupported operations.
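The validation step described above can be sketched as a simple lookup. The enums and the example matrix below are illustrative (the Ethos-U classifications shown are assumptions), not NxPU's real support-matrix types:

```rust
/// Recognized kernel patterns (subset, for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Pattern {
    MatMul,
    Conv2D,
    Attention,
}

/// How a target NPU handles a given pattern.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Support {
    Native,
    Emulated,
    Unsupported,
}

/// Toy support matrix for a hypothetical Ethos-U-like target:
/// MatMul and Conv2D run natively, attention must be emulated.
fn ethos_u_support(pattern: Pattern) -> Support {
    match pattern {
        Pattern::MatMul | Pattern::Conv2D => Support::Native,
        Pattern::Attention => Support::Emulated,
    }
}

/// Walk the classified patterns and warn on anything non-native.
fn validate(patterns: &[Pattern]) {
    for &p in patterns {
        match ethos_u_support(p) {
            Support::Native => {}
            Support::Emulated => eprintln!("warning: {:?} is emulated on this target", p),
            Support::Unsupported => eprintln!("error: {:?} is unsupported on this target", p),
        }
    }
}

fn main() {
    validate(&[Pattern::MatMul, Pattern::Attention]);
}
```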

Vendor backend details
Vendor     NPU Hardware                          Native Precision              Output Format                                  SDK Toolchain
Intel      Meteor Lake / Arrow Lake NPU          F16                           OpenVINO IR v11 (.xml + .bin) + ONNX fallback  OpenVINO (ov::Core::read_model)
AMD        Ryzen AI XDNA                         Int8, F16                     ONNX with XDNA metadata props                  Vitis AI EP / ONNX Runtime
Arm        Ethos-U55 (128 MAC) / U65 (512 MAC)   Int8 (U55), Int8+Int16 (U65)  TFLite + optional Vela compilation             ethos-u-vela compiler
Samsung    Exynos NPU                            F16, Int8                     ONNX                                           ONE toolchain (one-import-onnx, one-codegen)
Qualcomm   Hexagon NPU                           Int8, F16                     ONNX                                           QNN SDK (qnn-onnx-converter)
MediaTek   Dimensity APU                         Int8, F16                     TFLite                                         NeuroPilot SDK (ncc-tflite)
Rockchip   RK3588 NPU (3 TOPS)                   Int8, F16                     ONNX                                           RKNN Toolkit 2 (Python API)
CEVA       NeuPro-S                              Int8                          ONNX                                           CDNN compiler (cdnn_cli)

Recognized ML Patterns

NxPU analyzes WGSL compute kernels and classifies them into ML operations:

Category          Operations
Linear Algebra    MatMul, element-wise Add / Sub / Mul
Convolution       Conv2D
Pooling           MaxPool
Activation        ReLU, Tanh, Sigmoid
Normalization     BatchNorm
Reduction         ReduceSum
Tensor Ops        Transpose, Reshape, Concat, Split
Attention         Scaled dot-product attention

Architecture

                         ┌─────────────────────────────────────────────┐
                         │             Optimization Passes             │
                         │  ┌───────┐ ┌───────┐ ┌───────┐ ┌────────┐  │
                         │  │ Const │ │  FMA  │ │ DCE / │ │Quantize│  │
                         │  │ Fold  │ │Fusion │ │  CSE  │ │        │  │
                         │  └───────┘ └───────┘ └───────┘ └────────┘  │
                         └──────────────────┬──────────────────────────┘
                                            │
  ┌──────────┐     ┌──────────┐     ┌───────┴──┐     ┌──────────────┐
  │   WGSL   │────>│  Parser  │────>│  SSA IR  │────>│   Backend    │
  │  Source  │     │  (naga)  │     │          │     │   Emitter    │
  └──────────┘     └──────────┘     └──────────┘     └──────┬───────┘
                                                            │
                                          ┌─────────────────┼─────────────────┐
                                          │                 │                 │
                                     ┌────┴───┐       ┌─────┴────┐      ┌────┴────┐
                                     │  ONNX  │       │  TFLite  │      │ CoreML  │
                                     │  HLO   │       │          │      │ Vendors │
                                     └────────┘       └──────────┘      └─────────┘
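The pipeline above can be sketched as plain function composition: parse, optimize, then hand the module to a backend emitter. Every name here is an illustrative stand-in for the real crate APIs, and the "pass" shown is a toy:

```rust
/// Stand-in for the SSA IR module produced by the parser.
struct Module {
    ops: Vec<String>,
}

/// Parse stage: pretend-lower WGSL source to a flat op list.
fn parse(_wgsl: &str) -> Module {
    Module {
        ops: vec!["load".into(), "add".into(), "mul 1.0".into(), "store".into()],
    }
}

/// Optimization stage: a toy pass that drops identity multiplies.
fn optimize(mut m: Module) -> Module {
    m.ops.retain(|op| op != "mul 1.0");
    m
}

/// Backend stage: serialize the module in the target's text format.
fn emit(m: &Module, target: &str) -> String {
    format!("[{}] {}", target, m.ops.join(" -> "))
}

fn main() {
    let module = optimize(parse("c[i] = (a[i] + b[i]) * 1.0;"));
    println!("{}", emit(&module, "ir-dump"));
}
```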

Project Structure

crates/
├── nxpu-parser/              WGSL parsing via naga, lowering to NxPU IR
├── nxpu-ir/                  Arena-based SSA intermediate representation
├── nxpu-opt/                 Optimization passes (const fold, FMA, DCE, quantize)
├── nxpu-analysis/            Pattern classification and fusion
├── nxpu-backend-core/        Backend trait, plugin registry, IR dump
├── nxpu-backend-onnx/        ONNX protobuf emitter
├── nxpu-backend-tflite/      TFLite FlatBuffers emitter
├── nxpu-backend-coreml/      CoreML emitter
├── nxpu-backend-stablehlo/   StableHLO MLIR emitter
├── nxpu-backend-*/           Vendor-specific backends (8 vendors)
├── nxpu-cli/                 Command-line interface
└── nxpu-e2e-tests/           End-to-end numerical correctness tests
examples/                     WGSL sample kernels (14 examples)
docs/                         Architecture and contributor guides

CLI Reference

nxpu [OPTIONS] <INPUT>

Arguments:
  <INPUT>                  Input WGSL file

Options:
  -t, --target <TARGET>    Target backend [default: ir-dump]
  -o, --output <OUTPUT>    Output file path (default: stdout)
      --opt-level <N>      Optimization level: 0, 1, or 2 [default: 1]
      --precision <MODE>   Precision: keep, f16, bf16, int8, auto [default: auto]
      --emit-ir            Dump IR to stderr before backend compilation
      --dry-run            Validate and optimize without output
      --list-targets       List available backends and exit
  -h, --help               Print help
  -V, --version            Print version
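As one illustration of what `--precision int8` implies, a common scheme is symmetric per-tensor quantization, which maps each f32 value to an int8 through a single scale factor. Whether NxPU uses exactly this scheme is an assumption; the sketch below just shows the arithmetic:

```rust
/// Symmetric per-tensor int8 quantization: scale = max|x| / 127.
/// (Assumed scheme for illustration; NxPU's actual pass may differ.)
fn quantize_int8(xs: &[f32]) -> (Vec<i8>, f32) {
    let max = xs.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let q = xs
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Recover approximate f32 values from the quantized tensor.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let (q, scale) = quantize_int8(&[1.0, -0.25, 0.1]);
    println!("{:?} scale={}", q, scale);
    println!("{:?}", dequantize(&q, scale));
}
```

The round trip is lossy, which is why the vendor table above pairs int8 targets with validation warnings rather than silent conversion.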

Building from Source

Prerequisites: Rust 1.87+ (edition 2024)

cargo build            # Build all crates
cargo test             # Run all tests
cargo clippy           # Lint
cargo fmt --check      # Check formatting

To build with specific backends only:

cargo build -p nxpu-cli --no-default-features --features backend-onnx,backend-tflite
Available feature flags

backend-onnx · backend-tflite · backend-coreml · backend-stablehlo · backend-samsung · backend-mediatek · backend-intel · backend-amd · backend-qualcomm · backend-arm-ethos · backend-ceva · backend-rockchip

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines and docs/adding-a-backend.md for backend implementation guides.

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.
