
WebGPU Integration (continued) #201

Open
reeselevine wants to merge 21 commits into ngxson:master from reeselevine:webgpu-integration

Conversation


@reeselevine reeselevine commented Jan 15, 2026

See #198 for original discussion

Summary by CodeRabbit

  • New Features

    • WebGPU hardware acceleration with a prefer-WebGPU option and visible "WebGPU" runtime indicator
    • Real-time performance metrics display (prefill/decode tok/s) and a Reset metrics action
    • Public perf APIs to fetch/reset performance data from the client
  • Chores

    • Automated deploy workflow for examples to GitHub Pages
    • Updated dev tooling, TypeScript types, and Emscripten SDK versions; dependency bumps



coderabbitai bot commented Jan 15, 2026

Note

Reviews paused

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.
📝 Walkthrough

Adds WebGPU selectable backend support and perf-context APIs, bumps GLUE protocol to v2, updates WASM/build tooling and TypeScript types, exposes WebGPU and perf status in the frontend, and wires new perf actions through the native worker bridge.

Changes

Cohort / File(s) Summary
Build & Deployment
./.github/workflows/deploy-examples-main.yml, CMakeLists.txt, scripts/build_wasm.sh, scripts/docker-compose.yml
New GH Actions workflow for deploying examples; added CMake options GGML_WEBGPU, GGML_WEBGPU_JSPI, LLAMA_WASM_MEM64; bumped EMSDK image tag; consolidated shared EMCC flags.
C++ Protocol & Runtime
cpp/glue.hpp, cpp/actions.hpp, cpp/wllama.cpp, cpp/test_glue.cpp
GLUE_VERSION -> 2; load_req adds use_webgpu and no_perf; new perf messages (pctx_req/pctx_res, prst_req/prst_res); added device field to app_t; backend selection (WebGPU/CPU) and two new actions action_perf_context, action_perf_reset.
TypeScript Protocol
src/glue/messages.ts
GLUE_VERSION -> 2; load_req gains use_webgpu and no_perf; added perf message prototypes and corresponding TypeScript interfaces; expanded GlueMsg union.
Core TypeScript / API
src/wllama.ts
New config flags preferWebGPU? and noPerf?; runtime useWebGPU tracking and usingWebGPU() accessor; getPerfContext() and resetPerfContext() APIs; load payloads include WebGPU/no-perf settings; threading logic adjusted when WebGPU active.
Worker / WASM Bridge
src/workers-code/llama-cpp.js
callWrapper gains isAsync support; cwrap wrappers updated to support async calls for wllama_start and wllama_action; wrapper signatures and invocation flow adjusted.
Frontend UI & Context
examples/main/src/components/ChatScreen.tsx, examples/main/src/components/ModelScreen.tsx, examples/main/src/utils/types.ts, examples/main/src/utils/wllama.context.tsx
Added perf state and controls in ChatScreen (fetch/reset perf); ModelScreen shows WebGPU status; RuntimeInfo adds usingWebGPU; context initializes preferWebGPU: true and updates usingWebGPU.
TypeScript Build & Dependencies
tsconfig.build.json, examples/main/tsconfig.app.json, package.json, examples/main/package.json
Added @webgpu/types to tsconfig types; added @webgpu/types devDependency; bumped @huggingface/jinja in examples.
Frontend Build Files
examples/main/package.json, examples/main/tsconfig.app.json
Type definitions and dev deps updated to support WebGPU types.
Submodule
llama.cpp
Subproject commit pointer updated.

Sequence Diagram(s)

sequenceDiagram
    participant Client as Browser Client
    participant Wllama as Wllama (TS)
    participant Worker as Worker/WASM
    participant Backend as GGML Backend

    Client->>Wllama: loadModel(preferWebGPU: true)
    Wllama->>Wllama: Check for WebGPU adapter
    alt adapter found
        Wllama->>Wllama: set useWebGPU = true
    else no adapter
        Wllama->>Wllama: set useWebGPU = false (warn)
    end
    Wllama->>Worker: wllama_start(loadReq with use_webgpu, no_perf, n_gpu_layers)
    Worker->>Backend: initialize (select WebGPU or CPU)
    Backend-->>Worker: initialized
    Worker-->>Wllama: ready
    Wllama-->>Client: model loaded (usingWebGPU status)

    Client->>Wllama: getPerfContext()
    Wllama->>Worker: wllama_action(pctx_req)
    Worker->>Backend: collect perf metrics
    Backend-->>Worker: perf data
    Worker-->>Wllama: pctx_res (metrics)
    Wllama-->>Client: PerfContextData

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped in code with WebGPU delight,

Tucked perf metrics snug and tight.
Protocol bumped, the wasm took flight,
Threads bowed out when GPU shone bright.
A rabbit cheers — the demo's light! 🎉

🚥 Pre-merge checks: ✅ 1 passed, ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 30.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Title Check (❓ Inconclusive): The title 'WebGPU Integration (continued)' is only partially related to the changeset. While WebGPU integration is a significant part of the changes, the title is vague and overly broad: it doesn't specify what aspect of WebGPU integration is being implemented, and it obscures other important changes like performance monitoring, the GitHub Actions workflow, CMake configuration, and async handling improvements. Consider a more specific title, such as 'Add WebGPU backend selection, performance monitoring, and async support' or 'Implement WebGPU backend with perf context APIs'.

✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.



@reeselevine reeselevine mentioned this pull request Jan 15, 2026
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/glue.hpp (2)

449-452: Missing break statement causes unintended fallthrough.

The GLUE_DTYPE_RAW case is missing a break statement, causing execution to fall through to GLUE_DTYPE_ARRAY_BOOL. This will corrupt deserialization of raw fields.

🐛 Fix missing break statement
     case GLUE_DTYPE_RAW:
       ((glue_raw *)field)->parse(input);
+      break;
     case GLUE_DTYPE_ARRAY_BOOL:

464-466: Missing break statement at end of switch.

The GLUE_DTYPE_ARRAY_RAW case is missing a break statement. While this is the last case and won't cause functional issues currently, it's inconsistent with other cases and could cause problems if more cases are added later.

🐛 Fix missing break statement
     case GLUE_DTYPE_ARRAY_RAW:
       ((glue_arr_raw *)field)->parse(input);
+      break;
     }
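The same fallthrough hazard exists in JavaScript switch statements, so the bug is easy to reproduce outside the C++ glue code. A minimal sketch (hypothetical handler names, not the actual glue.hpp logic):

```javascript
// Without `break`, a matched case falls through and also runs the
// next case's handler, which is exactly the deserialization bug above.
function parseField(dtype) {
  const handlersRun = [];
  switch (dtype) {
    case 'RAW':
      handlersRun.push('parseRaw');
    // missing `break`: execution falls through into ARRAY_BOOL
    case 'ARRAY_BOOL':
      handlersRun.push('parseArrayBool');
      break;
  }
  return handlersRun;
}

console.log(parseField('RAW')); // ['parseRaw', 'parseArrayBool']: both handlers run
```

Adding `break` after the 'RAW' handler restores the intended one-handler-per-case behavior.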
🤖 Fix all issues with AI agents
In @.github/workflows/deploy-examples-main.yml:
- Around line 1-36: Replace the invalid pinned SHA for the GitHub Pages deploy
action in the deploy step (the uses: JamesIves/github-pages-deploy-action@...)
with a valid tag or release (e.g., JamesIves/github-pages-deploy-action@v4.6.1)
to ensure the action can be resolved; also add a top-level concurrency block
(e.g., concurrency: { group: 'deploy-pages', cancel-in-progress: true }) to the
workflow to prevent overlapping runs when the workflow is triggered multiple
times.

In `@CMakeLists.txt`:
- Around line 4-6: The GGML_WEBGPU_JSPI option currently defaults to ON which
breaks compatibility for many browsers; change the CMake option declaration for
GGML_WEBGPU_JSPI to default to OFF (i.e., set its default value to OFF) and
update the surrounding CMake comment or project documentation to explicitly
state JSPI is experimental and list the browser requirements/flags (e.g.,
Chrome/Edge experimental flags, Firefox 139 experimental support, Safari/WebKit
not supported) so consumers know how to enable it intentionally.
🧹 Nitpick comments (4)
scripts/docker-compose.yml (1)

22-22: Comprehensive EMCC flags for JSPI and WebGPU support.

The consolidated flags correctly enable JSPI (JavaScript Promise Integration), SIMD, modern exception handling, and expose necessary runtime methods including HEAPU8 for direct memory access. The JSPI_EXPORTS configuration (wllama_start, wllama_action) correctly targets the async-capable functions.

Note: JSPI is experimental/provisional in Emscripten 4.0.20. Runtime behavior depends on browser support—Chrome/V8 ships it behind origin trials, and some versions require explicit feature flags to enable. Ensure your deployment targets compatible browser versions.

The single-line flag definition is difficult to read and maintain. Consider breaking it into a multi-line format for clarity:

♻️ Suggested readability improvement
-        export SHARED_EMCC_FLAGS="--no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions -sEXPORT_ALL=1 -sEXPORT_ES6=0 -sMODULARIZE=0 -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 -sFORCE_FILESYSTEM=1 -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument -sJSPI -sJSPI_EXPORTS=['wllama_start','wllama_action']"
+        export SHARED_EMCC_FLAGS="\
+          --no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions \
+          -sEXPORT_ALL=1 -sEXPORT_ES6=0 -sMODULARIZE=0 \
+          -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 \
+          -sFORCE_FILESYSTEM=1 \
+          -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug \
+          -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 \
+          -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument \
+          -sJSPI -sJSPI_EXPORTS=['wllama_start','wllama_action']"
examples/main/src/components/ChatScreen.tsx (1)

51-61: Consider guarding refreshPerf with loadedModel check.

The resetPerf function guards against !loadedModel, but refreshPerf does not. If refreshPerf is called when no model is loaded (e.g., during edge-case race conditions), getWllamaInstance().getPerfContext() might throw an error that's not specific to the missing model state.

♻️ Suggested improvement
 const refreshPerf = async () => {
+   if (!loadedModel) return;
   setPerfBusy(true);
   setPerfError(null);
   try {
     setPerfData(await getWllamaInstance().getPerfContext());
   } catch (e) {
     setPerfError((e as any)?.message ?? 'Failed to fetch perf data');
   } finally {
     setPerfBusy(false);
   }
 };
src/wllama.ts (2)

1382-1393: Type inconsistency between return type and actual value.

getPerfContext() is declared to return PerfContextData, but it actually returns GlueMsgPerfContextRes which includes the _name: "pctx_res" field. While this works at runtime (extra fields are allowed), it's a type inconsistency.

Looking at the PerfContextData interface (lines 134-143), it matches GlueMsgPerfContextRes except for _name. Consider either:

  1. Stripping _name from the return value, or
  2. Extending PerfContextData from GlueMsgPerfContextRes
♻️ Option 1: Strip _name from return
 async getPerfContext(): Promise<PerfContextData> {
   this.checkModelLoaded();
-   return await this.proxy.wllamaAction<GlueMsgPerfContextRes>(
+   const result = await this.proxy.wllamaAction<GlueMsgPerfContextRes>(
     'perf_context',
     {
       _name: 'pctx_req',
     }
   );
+   const { _name, ...perfData } = result;
+   return perfData;
 }

587-595: Use async adapter check for robust WebGPU detection.

The current check navigator.gpu only verifies the WebGPU API exists, not whether a compatible adapter is available. According to WebGPU best practices, await navigator.gpu?.requestAdapter() should be called to confirm a usable adapter exists; requestAdapter() resolves to null when the browser cannot provide a suitable adapter.

Since loadModel is already async, this check can be awaited without architectural changes.

♻️ More robust WebGPU detection
-   if (this.config.preferWebGPU) {
-     if (navigator.gpu) {
-       this.useWebGPU = true;
-     } else {
-       this.logger().warn(
-         'WebGPU backend requested but WebGPU is not available, falling back to CPU'
-       );
-     }
-   }
+   if (this.config.preferWebGPU) {
+     try {
+       const adapter = await navigator.gpu?.requestAdapter();
+       if (adapter) {
+         this.useWebGPU = true;
+       } else {
+         this.logger().warn(
+           'WebGPU backend requested but no compatible adapter found, falling back to CPU'
+         );
+       }
+     } catch {
+       this.logger().warn(
+         'WebGPU backend requested but WebGPU is not available, falling back to CPU'
+       );
+     }
+   }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8778d7b and a8a9546.

⛔ Files ignored due to path filters (5)
  • examples/main/package-lock.json is excluded by !**/package-lock.json
  • package-lock.json is excluded by !**/package-lock.json
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
  • src/webgpu-single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (24)
  • .github/workflows/deploy-examples-main.yml
  • CMakeLists.txt
  • cpp/actions.hpp
  • cpp/glue.hpp
  • cpp/test_glue.cpp
  • cpp/wllama.cpp
  • examples/main/package.json
  • examples/main/src/components/ChatScreen.tsx
  • examples/main/src/components/ModelScreen.tsx
  • examples/main/src/utils/types.ts
  • examples/main/src/utils/wllama.context.tsx
  • examples/main/tsconfig.app.json
  • llama.cpp
  • package.json
  • scripts/build_wasm.sh
  • scripts/docker-compose.yml
  • src/glue/messages.ts
  • src/multi-thread/wllama.js
  • src/single-thread/wllama.js
  • src/webgpu-single-thread/wllama.js
  • src/wllama.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js
  • tsconfig.build.json
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-04-18T08:24:14.434Z
Learnt from: ngxson
Repo: ngxson/wllama PR: 0
File: :0-0
Timestamp: 2025-04-18T08:24:14.434Z
Learning: The file `generate_wasm_from_cdn.ts` in the wllama project is a generated script that gets stored on CDN, making it impossible to use `require('../package.json')` to dynamically access version information. This is why version references in this file need to be manually updated when the package version changes.

Applied to files:

  • scripts/build_wasm.sh
  • src/wllama.ts
  • package.json
📚 Learning: 2025-12-29T03:44:02.649Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.

Applied to files:

  • cpp/wllama.cpp
  • src/wllama.ts
📚 Learning: 2025-12-29T03:44:00.085Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:00.085Z
Learning: In src/workers-code/llama-cpp.js, constrain the isAsync flag passed to callWrapper: set isAsync to true only for functions included in the JSPI_EXPORTS build (specifically wllama_start and wllama_action). For other exported functions such as wllama_exit, wllama_debug, and wllama_malloc, always use isAsync=false since they do not perform async operations at the C/WASM level, even though callWrapper may expose an async function. This should be verifiable by reviewing the build exports and ensuring these functions are not marked as asynchronous in the wrapper when invoked.

Applied to files:

  • src/workers-code/llama-cpp.js
🧬 Code graph analysis (5)
examples/main/src/components/ChatScreen.tsx (1)
src/wllama.ts (1)
  • PerfContextData (134-143)
examples/main/src/utils/wllama.context.tsx (2)
src/wllama.ts (1)
  • Wllama (298-1473)
examples/main/src/config.ts (1)
  • WLLAMA_CONFIG_PATHS (9-12)
src/wllama.ts (1)
src/glue/messages.ts (2)
  • GlueMsgPerfContextRes (1384-1394)
  • GlueMsgPerfResetRes (1402-1405)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
  • GLUE_VERSION (6-6)
src/workers-code/llama-cpp.js (2)
cpp/generate_glue_prototype.js (1)
  • name (39-39)
src/worker.ts (4)
  • wllamaStart (119-127)
  • wllamaAction (129-141)
  • wllamaExit (143-153)
  • wllamaDebug (155-162)
🔇 Additional comments (29)
src/workers-code/llama-cpp.js (2)

230-247: LGTM!

The callWrapper function correctly handles the isAsync flag:

  • Passes { async: true } to cwrap only when needed for JSPI support
  • The conditional await inside the wrapper ensures synchronous functions execute without unnecessary promise wrapping at the native level
  • Error handling is appropriate
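The pattern can be sketched in isolation (assumed shape and names; the real wrapper in llama-cpp.js differs in details):

```javascript
// isAsync-aware wrapper sketch: callers always receive an async
// function, but only JSPI-exported natives are awaited at the
// native boundary.
function makeCallWrapper(nativeFn, isAsync) {
  return async (...args) =>
    isAsync ? await nativeFn(...args) : nativeFn(...args);
}

// Stand-ins for exports: a plain one and a promise-returning (JSPI) one.
const mallocLike = (n) => n * 2;
const actionLike = async (n) => n + 1;

const wrappedMalloc = makeCallWrapper(mallocLike, false);
const wrappedAction = makeCallWrapper(actionLike, true);
```

Both wrapped functions are awaitable from the caller's side; only the isAsync=true path corresponds to functions listed in JSPI_EXPORTS, matching the constraint recorded in the learnings.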

269-283: LGTM! The isAsync flags are correctly applied.

Based on learnings, isAsync=true should only be set for functions included in the JSPI_EXPORTS build (wllama_start and wllama_action). The configuration here correctly reflects this:

Function isAsync Correct
wllama_malloc false
wllama_start true
wllama_action true
wllama_exit false
wllama_debug false
examples/main/package.json (2)

30-30: Appropriate addition for WebGPU type support.

Adding @webgpu/types as a devDependency is correct for providing TypeScript type definitions for WebGPU APIs. This aligns with the tsconfig changes that include this in the types array.


16-16: No action required. The version bump from ^0.2.2 to ^0.5.3 has no documented breaking changes. The v0.5.x release adds formatting support for chat templates (additive changes), and the example app already has a workaround in place for DeepSeek template compatibility issues.

package.json (1)

51-51: LGTM!

Adding @webgpu/types at the root level is necessary for the library's TypeScript compilation. The version is consistent with the example app's devDependency.

llama.cpp (1)

1-1: Verify submodule commit and review upstream changes before merging.

Ensure the new commit exists in the upstream llama.cpp repository and review changes between 4abef75f2cf2eee75eb5083b30a94cf981587394 and 9e41884dce4719d8d901058613ea7a700dc3697d for breaking API changes, security implications, or new dependencies that might affect the integration with this project's WebGPU implementation.

examples/main/tsconfig.app.json (1)

8-8: Note: types array restricts automatic type inclusion.

When the types array is specified, only the listed packages are included. If other @types/* packages are needed (e.g., @types/node), they must be explicitly added here. If this restriction is intentional for strict type control, this is fine.

scripts/docker-compose.yml (2)

27-28: LGTM - Single-thread build configuration.

Correctly derives flags from the shared base.


36-37: LGTM - Multi-thread build with pthread configuration.

The pthread flags are correctly appended, and PTHREAD_POOL_SIZE=Module["pthreadPoolSize"] allows runtime configuration of the thread pool size.

tsconfig.build.json (1)

30-31: LGTM - WebGPU types and synthetic imports enabled.

Both additions are appropriate:

  • allowSyntheticDefaultImports enables cleaner import syntax for CommonJS interop
  • types: ["@webgpu/types"] provides WebGPU type definitions for the library build

@webgpu/types is properly listed in devDependencies.

scripts/build_wasm.sh (1)

5-5: LGTM - Version bump to support JSPI features.

The Emscripten SDK update to 4.0.20 aligns with the JSPI (JavaScript Promise Integration) flags in the docker-compose.yml build configuration (-sJSPI -sJSPI_EXPORTS=['wllama_start','wllama_action']). The Docker image tags exist on Docker Hub, including the ARM64 variant properly appended at line 16.

examples/main/src/utils/types.ts (1)

16-20: LGTM!

The usingWebGPU field addition to RuntimeInfo is consistent with the existing interface pattern and properly integrates with the WebGPU runtime detection in wllama.context.tsx.

cpp/wllama.cpp (1)

121-122: LGTM!

The new perf_context and perf_reset action handlers follow the established WLLAMA_ACTION macro pattern consistently with other actions in this file.

examples/main/src/components/ModelScreen.tsx (1)

332-336: LGTM!

The WebGPU status indicator follows the established pattern for displaying runtime information alongside the existing Multithread and Chat template indicators.

cpp/test_glue.cpp (1)

15-43: LGTM!

The test correctly validates the serialization round-trip for the new use_webgpu field, ensuring the value is properly preserved through serialize/deserialize operations.

examples/main/src/utils/wllama.context.tsx (2)

61-71: LGTM!

The WebGPU preference is correctly configured in both the initial instance creation and resetWllamaInstance, ensuring consistent behavior. The actual WebGPU usage state is properly retrieved at runtime via usingWebGPU() method.


176-180: LGTM!

The runtime info correctly captures the actual WebGPU usage state (not just the preference) by calling wllamaInstance.usingWebGPU() after model loading completes.

examples/main/src/components/ChatScreen.tsx (2)

77-80: LGTM!

The formatTokPerSec utility correctly handles the edge case of ms <= 0 to prevent division by zero, and the calculation is accurate.
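A sketch of such a guard (hypothetical implementation; the component's actual formatting may differ):

```javascript
// Tokens/s formatter: returns a placeholder when elapsed time is
// zero or negative, otherwise tokens divided by elapsed seconds.
function formatTokPerSec(nTokens, elapsedMs) {
  if (elapsedMs <= 0) return 'n/a';
  return ((nTokens * 1000) / elapsedMs).toFixed(2);
}
```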


197-221: LGTM!

The performance metrics UI is well-structured with proper conditional rendering for error states, appropriate button disabling during busy/generating states, and clean display of prefill/decode metrics.

cpp/actions.hpp (3)

800-819: LGTM!

The action_perf_context implementation correctly handles the null context case and properly maps all performance counter fields from llama_perf_context_data to the response structure.


821-833: LGTM!

The action_perf_reset implementation is clean with proper null context handling.


164-177: No issue: devices array is safely used and copied by llama.cpp.

The stack-allocated devices array is passed immediately to llama_model_load_from_splits(), which copies the device list into the model's internal storage during the call. The original array does not need to outlive the function call, and no dangling pointer issue exists here.

cpp/glue.hpp (2)

497-499: LGTM!

The new use_webgpu and no_perf fields are correctly positioned and typed, consistent with the TypeScript definitions in src/glue/messages.ts.


824-851: LGTM!

The new performance context message types are correctly defined with handler names matching the TypeScript definitions and field structures aligning with PerfContextData interface.

src/glue/messages.ts (2)

1-6: Note: This is a generated file.

The file header indicates it's generated by cpp/generate_glue_prototype.js. Ensure changes to message definitions are made through the generator script rather than manual edits to maintain consistency between C++ and TypeScript.


1378-1405: LGTM!

The new performance context interfaces are correctly defined and consistent with the C++ message structures in cpp/glue.hpp.

src/wllama.ts (3)

619-626: LGTM!

The logic correctly disables multi-threading when WebGPU is active, with an informative warning message explaining the reason.


662-664: LGTM!

The n_gpu_layers setting of 999 for WebGPU (to offload all layers) and 0 for CPU is the correct approach for backend selection.


464-467: LGTM!

The usingWebGPU() accessor follows the same pattern as other state accessors like isMultithread(), with proper model-loaded check.


Comment on lines +4 to +6
option(GGML_WEBGPU "Enable GGML WebGPU backend" ON)
option(GGML_WEBGPU_JSPI "Enable GGML WebGPU JSPI support" ON)
option(LLAMA_WASM_MEM64 "Enable 64-bit memory for WebAssembly builds" OFF)
@coderabbitai coderabbitai bot commented Jan 15, 2026

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

JSPI JavaScript Promise Integration browser support 2025 2026

💡 Result:

Summary — JSPI (WebAssembly JavaScript Promise Integration) browser support (2025–Jan 15, 2026)

  • Status: Standardization underway (W3C/WebAssembly WG; JSPI has advanced through the spec process and is implemented experimentally). (v8.dev)
  • Chrome / Chromium: Implemented experimentally in V8. Available via feature flag / origin trial during 2024–2025 and shipped behind flags in Chrome stable/channel builds (noted availability in Chrome 126+ for the new API and coverage in Chrome 137 articles). Feature-detection: 'Suspender' in WebAssembly (or newer API checks as described in V8 docs). (developer.chrome.com)
  • Edge: Chromium-based Edge tracks Chromium’s implementation; JSPI mentioned in Edge platform release notes (support entries in Edge 137 notes). Origin-trial participation was offered. (learn.microsoft.com)
  • Firefox: V8 docs and V8 blog posts note availability/experimental builds in Firefox (references to Firefox 139 in V8 materials); support status in Firefox may be experimental behind flags/versions — check specific Firefox nightly/dev release notes for exact builds. (v8.dev)
  • Safari / WebKit: No broad shipping announcement found (as of Jan 15, 2026). JSPI has been implemented first in Chromium/V8; WebKit status appears not to have shipped stable support — verify WebKit bug tracker / Safari release notes for any updates before relying on it in production. (v8.dev)

Practical notes:

  • In 2024–2025 JSPI was primarily available via origin trials or feature flags in Chromium-based browsers; by mid–late 2025 some Chromium channels documented shipping behavior (see Chrome/Edge notes). Confirm current availability by testing feature-detection in the target browser and checking vendor release notes for the exact version you need. (developer.chrome.com)

If you want, I can:

  • Check the exact support matrix for specific browser versions you care about (Chrome/Edge/Firefox/Safari) and list commands/flags to enable JSPI locally.

Options are well-defined; reconsider GGML_WEBGPU_JSPI default due to browser compatibility.

The CMake options are correctly placed before add_subdirectory(llama.cpp). However, GGML_WEBGPU_JSPI defaults to ON, and JSPI browser support remains limited as of January 2026: Chrome/Edge require experimental flags, Firefox 139 has experimental support, and Safari/WebKit have not shipped JSPI support. This creates a significant compatibility gap for Safari users. Consider defaulting to OFF for broader compatibility, or explicitly document JSPI browser requirements and the experimental nature of this feature.

🤖 Prompt for AI Agents
In `@CMakeLists.txt` around lines 4 - 6, The GGML_WEBGPU_JSPI option currently
defaults to ON which breaks compatibility for many browsers; change the CMake
option declaration for GGML_WEBGPU_JSPI to default to OFF (i.e., set its default
value to OFF) and update the surrounding CMake comment or project documentation
to explicitly state JSPI is experimental and list the browser requirements/flags
(e.g., Chrome/Edge experimental flags, Firefox 139 experimental support,
Safari/WebKit not supported) so consumers know how to enable it intentionally.

@reeselevine (Author) commented:

It's a good point, but there are still some unresolved issues with using ASYNCIFY. We could try to handle it, but it also seems like WebKit is planning to support JSPI at some point (WebKit/standards-positions#422).

@coderabbitai coderabbitai bot commented:

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the WebKit standards position on JSPI to understand the status
gh api repos/WebKit/standards-positions/issues/422 --jq '{state: .state, title: .title, created: .created_at, updated: .updated_at, body: .body, comments: .comments}' 2>/dev/null || echo "Failed to fetch issue details"

# Get recent comments on the issue to understand current status
gh api repos/WebKit/standards-positions/issues/422/comments --jq '.[-3:] | .[] | {created: .created_at, author: .user.login, body: .body}' 2>/dev/null || echo "Failed to fetch comments"

Length of output: 4287


@reeselevine, that's a fair point! The tradeoff between ASYNCIFY issues and JSPI browser support is a reasonable engineering decision. I can see the WebKit standards position issue shows positive progress—kmiller68 from WebKit commented in September 2025 that "we've withdrawn our objections to this API," which is a good signal for future support.

Given that ASYNCIFY has unresolved issues and JSPI enables better functionality for the majority of browsers that do support it (Chrome, Edge, Firefox experimental), defaulting to ON makes sense for your use case.

One suggestion: consider adding a note in the README or build documentation about JSPI browser requirements so users on Safari/WebKit-based browsers understand why WebGPU might not work for them yet, and can disable the option if needed. Something like:

## WebGPU Backend
- Requires JSPI (JavaScript Promise Integration) support
- Supported: Chrome/Edge (with flags), Firefox 139+ (experimental)
- Not yet supported: Safari/WebKit (under consideration)
- To disable: `cmake -DGGML_WEBGPU_JSPI=OFF`

But that's just a nice-to-have for user experience—the default is justified given the technical constraints.


🧠 Learnings used
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.

@reeselevine (Author) commented:

Ok, I guess if we want to support Safari, we will need to build separate wasm blobs without JSPI/WebGPU for contexts that don't support it. Unfortunately, it looks like the wasm built with JSPI isn't compatible with Safari, even if the WebGPU backend is not used.

I can look into supporting ASYNCIFY a bit and if I get it working compare the performance. Potentially there's a path where only ASYNCIFY is used until JSPI is more widely supported. Or wllama will have to ship more wasm binaries, ones with JSPI and ones without.

@ngxson ngxson (Owner) commented Jan 16, 2026

Yeah I think we can have the third build named webgpu, that is basically webgpu + single-thread + asyncify.

Anyways, I'm still not sure why CI doesn't run automatically on this PR 🤔 I just need the npm run test to pass as it should already cover chrome + firefox.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/wllama.ts`:
- Around line 601-621: The code can set this.nbThreads to 0 when
navigator.hardwareConcurrency is 1; change the hwConcurrency and nbThreads
calculation to enforce a minimum of 1: compute hwConcurrency = Math.max(1,
Math.floor((navigator.hardwareConcurrency || 1) / 2)) and then set
this.nbThreads = Math.max(1, config.n_threads ?? hwConcurrency); keep the
existing check that sets this.useMultiThread = true only when this.nbThreads > 1
and retain the warning branch when it falls back to single-thread; update
references to this.nbThreads (used later as n_threads) accordingly.
- Around line 134-143: getPerfContext() currently returns the raw
GlueMsgPerfContextRes which includes the internal _name: "pctx_res" field and
leaks it through the public PerfContextData API; modify getPerfContext() to
transform/map the GlueMsgPerfContextRes into a PerfContextData object by copying
only the public fields (success, t_start_ms, t_load_ms, t_p_eval_ms, t_eval_ms,
n_p_eval, n_eval, n_reused) and omitting _name before returning so callers
receive the clean PerfContextData shape.
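The mapping the second item asks for can be sketched as a plain field copy. The field names are taken from the review text above; `toPerfContextData` is a hypothetical helper, not the actual wllama implementation:

```javascript
// Hedged sketch: copy only the public perf fields from the glue
// response and drop the internal _name marker, so callers see a clean
// PerfContextData shape.
function toPerfContextData(res) {
  const {
    success,
    t_start_ms,
    t_load_ms,
    t_p_eval_ms,
    t_eval_ms,
    n_p_eval,
    n_eval,
    n_reused,
  } = res;
  return {
    success,
    t_start_ms,
    t_load_ms,
    t_p_eval_ms,
    t_eval_ms,
    n_p_eval,
    n_eval,
    n_reused,
  };
}
```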
🧹 Nitpick comments (1)
src/wllama.ts (1)

587-599: Consider adding try-catch around WebGPU adapter request.

While requestAdapter() typically returns null on failure, wrapping it in try-catch would guard against edge cases where the call throws (browser bugs, permission errors, etc.).

Suggested defensive approach
```diff
    if (this.config.preferWebGPU) {
      if (navigator.gpu) {
-       if(await navigator.gpu.requestAdapter()) {
-           this.useWebGPU = true;
-       } else {
-         this.logger().warn('WebGPU backend requested but no adapter found, falling back to CPU');
+       try {
+         if (await navigator.gpu.requestAdapter()) {
+           this.useWebGPU = true;
+         } else {
+           this.logger().warn('WebGPU backend requested but no adapter found, falling back to CPU');
+         }
+       } catch (e) {
+         this.logger().warn('WebGPU adapter request failed, falling back to CPU', e);
        }
      } else {
        this.logger().warn(
          'WebGPU backend requested but WebGPU is not available, falling back to CPU'
        );
      }
    }
```

Comment on lines +601 to 621
```ts
if (await isSupportMultiThread()) {
  if (this.pathConfig['multi-thread/wllama.wasm']) {
    const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
    this.nbThreads = config.n_threads ?? hwConcurrency;
    if (this.nbThreads > 1) {
      this.useMultiThread = true;
    } else {
      this.logger().warn(
        'Falling back single-thread due to n_threads configuration or limited hardware concurrency'
      );
    }
  } else {
    this.logger().warn(
      'Missing paths to "multi-thread/wllama.wasm", falling back to single-thread'
    );
  }
} else {
  this.logger().warn(
    'Multi-threads are not supported in this environment, falling back to single-thread'
  );
}
```

⚠️ Potential issue | 🟡 Minor

Edge case: nbThreads may become 0 when hardware concurrency is limited.

When navigator.hardwareConcurrency is 1, hwConcurrency becomes Math.floor(1/2) = 0. If config.n_threads is not specified, nbThreads will be set to 0, which could cause unexpected behavior when passed to the worker (line 671: n_threads: this.nbThreads).

Proposed fix: ensure minimum of 1 thread
```diff
-      const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
-      this.nbThreads = config.n_threads ?? hwConcurrency;
+      const hwConcurrency = Math.max(1, Math.floor((navigator.hardwareConcurrency || 1) / 2));
+      this.nbThreads = config.n_threads ?? hwConcurrency;
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
 if (await isSupportMultiThread()) {
   if (this.pathConfig['multi-thread/wllama.wasm']) {
-    const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
+    const hwConcurrency = Math.max(1, Math.floor((navigator.hardwareConcurrency || 1) / 2));
     this.nbThreads = config.n_threads ?? hwConcurrency;
     if (this.nbThreads > 1) {
       this.useMultiThread = true;
     } else {
       this.logger().warn(
         'Falling back single-thread due to n_threads configuration or limited hardware concurrency'
       );
     }
   } else {
     this.logger().warn(
       'Missing paths to "multi-thread/wllama.wasm", falling back to single-thread'
     );
   }
 } else {
   this.logger().warn(
     'Multi-threads are not supported in this environment, falling back to single-thread'
   );
 }
```
🤖 Prompt for AI Agents
In `@src/wllama.ts` around lines 601 - 621, The code can set this.nbThreads to 0
when navigator.hardwareConcurrency is 1; change the hwConcurrency and nbThreads
calculation to enforce a minimum of 1: compute hwConcurrency = Math.max(1,
Math.floor((navigator.hardwareConcurrency || 1) / 2)) and then set
this.nbThreads = Math.max(1, config.n_threads ?? hwConcurrency); keep the
existing check that sets this.useMultiThread = true only when this.nbThreads > 1
and retain the warning branch when it falls back to single-thread; update
references to this.nbThreads (used later as n_threads) accordingly.
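The clamp the prompt describes can be checked in isolation with a tiny helper; `computeThreads` is a hypothetical extraction of the two affected lines, not code from wllama:

```javascript
// Hedged sketch of the proposed fix: without the outer Math.max, a
// machine reporting hardwareConcurrency === 1 floors to 0 threads.
function computeThreads(hardwareConcurrency, configThreads) {
  const hwConcurrency = Math.max(1, Math.floor((hardwareConcurrency || 1) / 2));
  return Math.max(1, configThreads ?? hwConcurrency);
}

console.log(computeThreads(1, undefined)); // 1 (was 0 before the clamp)
console.log(computeThreads(8, undefined)); // 4 (half the reported cores)
console.log(computeThreads(8, 2)); // 2 (explicit n_threads wins)
```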

@ngxson
Owner

ngxson commented Jan 16, 2026

@coderabbitai pause

@coderabbitai

coderabbitai bot commented Jan 16, 2026

✅ Actions performed

Reviews paused.

@reeselevine
Author

Not sure how to open a PR targeting this branch in this repository, but I've added ASYNCIFY support and some discussion points here: reeselevine#1

@reeselevine
Author

@ngxson I know you're probably busy but would be interested in getting some of this integrated soon.

@ngxson
Owner

ngxson commented Feb 25, 2026

@reeselevine sorry for the delay, I'm going back to wllama this week.

For now, npm run test fails so I still cannot merge this yet, but will have a look now.
