diff --git a/.github/workflows/tauri-build-debug.yml b/.github/workflows/tauri-build-debug.yml
index 8c59c38..904f45c 100644
--- a/.github/workflows/tauri-build-debug.yml
+++ b/.github/workflows/tauri-build-debug.yml
@@ -110,7 +110,7 @@ jobs:
       # ---------------- Tauri -----------------
       - name: Build Tauri Project
-        uses: tauri-apps/tauri-action@9ce1dcc1a78395184050946b71457a6c242beab6
+        uses: tauri-apps/tauri-action@e3ec38d49ea445df6d61ebaf015a85b1846b63f3
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           NDK_HOME: ${{ steps.setup-ndk.outputs.ndk-path }}
diff --git a/.github/workflows/tauri-build-release.yml b/.github/workflows/tauri-build-release.yml
index c0c72d7..58653cb 100644
--- a/.github/workflows/tauri-build-release.yml
+++ b/.github/workflows/tauri-build-release.yml
@@ -127,7 +127,7 @@ jobs:
       # ---------------- Tauri -----------------
       - name: Build Tauri Project
-        uses: tauri-apps/tauri-action@ca517bcbe58fd7012408d7ddfaeff950428bdeb1
+        uses: tauri-apps/tauri-action@e3ec38d49ea445df6d61ebaf015a85b1846b63f3
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
           NDK_HOME: ${{ steps.setup-ndk.outputs.ndk-path }}
diff --git a/EVALUATION_REPORT.md b/EVALUATION_REPORT.md
new file mode 100644
index 0000000..8e9f10c
--- /dev/null
+++ b/EVALUATION_REPORT.md
@@ -0,0 +1,118 @@

# Local LLM Integration: Technical Evaluation & Recommendation

## Executive Summary

After evaluating the leading runtime options against your constraints (cross-platform parity, offline capability, hardware acceleration, Tauri/Rust integration), two clear candidates emerge.

* **Primary Recommendation**: **llama.cpp** (via Rust bindings). It offers the best balance of single-file model UX (GGUF), mature hardware acceleration (Metal/Vulkan/CUDA), and ease of integration.
* **Backup Strategy**: **ONNX Runtime (GenAI)**. It provides superior NPU access (NNAPI) but comes with higher complexity in model management and runtime distribution.

---

## 1. Comparative Analysis

| Feature | **llama.cpp** | **ONNX Runtime (ORT)** | **MLC LLM (TVM)** | **Ratchet (WebGPU)** |
| :--- | :--- | :--- | :--- | :--- |
| **Desktop Support** | ✅ (Win/Mac/Lin) | ✅ (All) | ✅ (All) | ✅ (All) |
| **Mobile Support** | ✅ (iOS/Android) | ✅ (iOS/Android) | ✅ (iOS/Android) | ⚠️ (Android WebGPU varies) |
| **Backend (Apple)** | Metal (Native) | CoreML / Metal | Metal | WebGPU (Metal) |
| **Backend (Android)** | Vulkan / CPU | NNAPI / QNN / CPU | Vulkan / OpenCL | WebGPU (Vulkan) |
| **NPU Support** | ⚠️ Partial (via dedicated backends) | ✅ Strong (NNAPI/CoreML) | ⚠️ Compilation required | ❌ No direct NPU |
| **Model Format** | **GGUF** (Single File) | ONNX (Folder/Files) | Compiled Libs + Weights | GGUF / SafeTensors |
| **Arbitrary Models** | ✅ Excellent (Drag & Drop) | ⚠️ Good (Needs conversion) | ❌ Poor (Needs compilation) | ✅ Good |
| **Runtime Size** | ~2-5 MB (Static) | ~20-50 MB (Dynamic) | ~5-10 MB | ~5 MB |
| **Rust Integration** | ✅ `llama-cpp-2` | ✅ `ort` crate | ⚠️ Mostly C++ | ✅ Native Rust |
| **Tauri Mobile** | ✅ Static Linking | ✅ Static/Dynamic | ⚠️ Complex | ✅ Native |

### ❌ Disqualified Options

* **Candle (HuggingFace)**: While promising and pure Rust, its **Android GPU/NPU support** lags significantly behind `llama.cpp` and `ORT`. It does not meet the "No Feature Disparity" constraint for mobile performance.
* **ExecuTorch**: Too experimental and heavyweight for a generic "model viewer" app; it requires model-specific preprocessing.

---
## 2. Technical Analysis of Top Candidates

### Option A: llama.cpp (Primary Recommendation)

The *de facto* standard for local LLMs. It uses custom kernels (Metal, CUDA, Vulkan) rather than relying on OS-level APIs like CoreML/NNAPI, ensuring consistent behavior.

* **Strengths**:
    * **GGUF Format**: The industry standard for portable models. Users can download a single file from HuggingFace and run it. No multi-file folders or config hell.
    * **Apple Silicon Parity**: Its Metal backend is exceptionally optimized, often beating CoreML in flexibility.
    * **Android Parity**: Uses Vulkan for GPU acceleration. While it does not use the NPU (NNAPI) by default, modern Android GPUs (Adreno/Mali) are often faster than NPUs for LLMs anyway.
    * **Tauri Integration**: Can be statically linked into the Tauri binary, satisfying iOS App Store distribution rules.

* **Weaknesses**:
    * **NPU Access**: Does not deeply integrate with Android NNAPI or Windows NPUs (yet). It relies on raw compute (CPU/GPU).
    * **Granularity**: Manual overrides are usually "Layers to GPU" (0-100%). You cannot easily say "Run Layer 1 on NPU, Layer 2 on GPU" without deeper code changes.

### Option B: ONNX Runtime (Backup)

The corporate/standard approach. Uses execution providers (EPs) to delegate to hardware.

* **Strengths**:
    * **Hardware Access**: Best-in-class support for generic NPUs (Android NNAPI, Qualcomm QNN, Apple CoreML).
    * **Granular Control**: You can explicitly select which Execution Provider to use for a session.

* **Weaknesses**:
    * **Model UX**: ONNX models are multi-file (weights + graph). There is no user-friendly GGUF -> ONNX conversion path, so users must download pre-converted ONNX models.
    * **Binary Size**: The `onnxruntime` shared library is massive. Statically linking it on mobile is possible but bloats the app size significantly.
    * **Complexity**: Configuring EPs for cross-platform parity is difficult (e.g., handling unsupported operators on NNAPI).

### Option C: Ratchet (Emerging / Wildcard)

A Rust-native, WebGPU-first runtime.

* **Strengths**: True "Write Once, Run Everywhere" via `wgpu`. Native Rust (no C++ FFI headaches). Supports GGUF.
* **Weaknesses**: Android WebGPU support is still maturing. Performance is generally 80-90% of native Metal/CUDA.
* **Verdict**: Keep it on the radar, but `llama.cpp` is the safer production choice today.

---

## 3. Implementation Strategy for Tauri

To meet your requirements (Offline, Hardware Override, App Store Compliance), here is the recommended architecture:

### 1. The Core: `llama.cpp` via FFI
Use the **`llama-cpp-2`** Rust bindings. This allows you to interact with the C++ engine safely from Rust.

* **Linking**:
    * **iOS/Android**: Enable **Static Linking** features in the crate. This builds `libllama.a` and bundles it into your main app binary. This is **App Store compliant**.
    * **Desktop**: You can bundle the dynamic lib (`llama.dll`/`.so`/`.dylib`) or statically link. Static linking is preferred for "single binary" distribution.

### 2. Hardware Acceleration Logic
You must implement a "Device Manager" in Rust that maps user preferences to `llama.cpp` parameters; a minimal sketch of this mapping follows the list below.

* **Auto (Default)**:
    * Detect OS.
    * If macOS: Enable `Metal`.
    * If Windows/Linux + NVIDIA: Enable `CUDA`.
    * If Android: Enable `Vulkan`.
    * Fallback: CPU (with threading optimized for performance cores).
* **Manual Override**:
    * Expose a setting: "Inference Backend".
    * Options: `CPU`, `Metal` (Mac/iOS), `Vulkan` (Android/Win/Lin), `CUDA` (Desktop).
    * *Note*: `llama.cpp` handles the low-level device enumeration; you control how much work is offloaded via the `n_gpu_layers` parameter.
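The sketch below illustrates this mapping in plain Rust. The `Backend` enum, `DevicePlan` struct, `resolve_backend` function, and `has_nvidia_gpu` flag are hypothetical names for illustration; only the resulting layer count is ultimately passed to `llama.cpp`.

```rust
/// User-facing "Inference Backend" setting (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Auto,
    Cpu,
    Metal,
    Vulkan,
    Cuda,
}

/// Resolved plan handed to the inference layer.
pub struct DevicePlan {
    pub backend: Backend,
    /// Layers to offload to the GPU backend; 0 keeps everything on the CPU.
    pub n_gpu_layers: u32,
}

/// Map the user's preference (plus basic platform detection) to a plan.
pub fn resolve_backend(user_choice: Backend, has_nvidia_gpu: bool) -> DevicePlan {
    // A manual override always wins over auto-detection.
    let backend = if user_choice != Backend::Auto {
        user_choice
    } else if cfg!(any(target_os = "macos", target_os = "ios")) {
        Backend::Metal
    } else if cfg!(target_os = "android") {
        Backend::Vulkan
    } else if has_nvidia_gpu {
        Backend::Cuda
    } else {
        Backend::Cpu
    };

    // llama.cpp only needs a layer count; a large value means "offload all layers".
    let n_gpu_layers = match backend {
        Backend::Cpu => 0,
        _ => 999,
    };

    DevicePlan { backend, n_gpu_layers }
}
```

How `has_nvidia_gpu` gets populated (for example, probing for a CUDA-capable device at startup) is left to the app; the key design point is that a manual override always takes precedence over auto-detection.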
### 3. Model Management
* **Download**: Use Rust's `reqwest` to download GGUF files into `app_data_dir`.
* **Storage**: Store models in a persistent local directory.
* **Loading**: Pass the file path to the bindings' model loader (e.g. `LlamaModel::load_from_file` in `llama-cpp-2`).

---

## 4. Final Recommendation

**Go with `llama.cpp`.**

**Why?**
1. **Parity**: GGUF works everywhere. You don't need to explain to users why "Model X works on PC but not Mobile".
2. **UX**: Single-file models are superior for end users compared to ONNX folders.
3. **Tauri/Mobile**: Static linking support is mature, making iOS App Store approval painless.
4. **Performance**: The Metal (iOS) and Vulkan (Android) backends are production-ready.

**Next Steps:**
1. Add `llama-cpp-2` to your `Cargo.toml`.
2. Enable the crate's `vulkan` feature for Android builds and `metal` for iOS builds.
3. Write a simple Rust test that loads a dummy GGUF and runs one forward pass (see the sketch below).
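A minimal smoke test for step 3 might look like the sketch below. It assumes the `llama-cpp-2` crate; the type and method names (`LlamaBackend::init`, `LlamaModel::load_from_file`, `LlamaBatch`, and friends) follow that crate's examples and may differ between versions, the model path is a placeholder, and sampling of the next token is omitted because the crate's sampling API has changed over time.

```rust
// Minimal smoke test: load a GGUF model and run a single forward pass.
// Assumes the `llama-cpp-2` crate; adjust names to the version you pin.
use llama_cpp_2::context::params::LlamaContextParams;
use llama_cpp_2::llama_backend::LlamaBackend;
use llama_cpp_2::llama_batch::LlamaBatch;
use llama_cpp_2::model::params::LlamaModelParams;
use llama_cpp_2::model::{AddBos, LlamaModel};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // One-time backend initialisation (picks up Metal/Vulkan/CUDA if compiled in).
    let backend = LlamaBackend::init()?;

    // Load the single-file GGUF model from disk (placeholder path).
    let model_params = LlamaModelParams::default();
    let model = LlamaModel::load_from_file(&backend, "models/dummy.gguf", &model_params)?;

    // Create an inference context and tokenize a short prompt.
    let mut ctx = model.new_context(&backend, LlamaContextParams::default())?;
    let tokens = model.str_to_token("Hello", AddBos::Always)?;

    // Feed the prompt as one batch and run a single decode step.
    let mut batch = LlamaBatch::new(512, 1);
    let last = tokens.len() as i32 - 1;
    for (i, token) in (0_i32..).zip(tokens) {
        batch.add(token, i, &[0], i == last)?;
    }
    ctx.decode(&mut batch)?;

    println!("decode OK: model loaded and one forward pass completed");
    Ok(())
}
```

Running this against a small quantized GGUF on each target (macOS/iOS with `metal`, Android with `vulkan`) is enough to validate that the static linking and backend features are wired up correctly.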