
WebGPU integration#198

Closed
reeselevine wants to merge 20 commits into ngxson:master from reeselevine:master

Conversation


@reeselevine commented Dec 23, 2025

Adds initial WebGPU integration to wllama. There are still a few issues to be resolved, but opening this PR now so we can start working on them!

To add WebGPU support, this PR does the following:

  • Adds the WebGPU backend to the single thread WASM build, and adds a configuration option to wllama. If WebGPU is requested and available, it uses the WebGPU backend in the single thread WASM build. Otherwise, it uses either the multi-thread build or single thread build with the CPU backend.
  • Instead of having consumers of wllama set n_gpu_layers, only the backend is specified in the config, and if WebGPU is requested/available, n_gpu_layers is set to 999. (This is how I've been running GPU backends locally, is this the right way to do it?)
  • A few build flags and some code had to be changed to account for the WebGPU asynchronous operations. In particular, some wllama functions had to be included in JSPI_EXPORTS so that they are properly handled. {'async' : true} also has to be passed to Module.cwrap for these functions. We also found that it was necessary to export HEAPU8, even for CPU-only builds (maybe this is due to updating emscripten versions?).
  • I also added a Github action to deploy to Github Pages, so that anyone can easily host examples/main. For example, I have it here: https://reeselevine.github.io/wllama/
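The cwrap handling described above can be sketched roughly as follows. This is an illustrative stand-in, not wllama's actual implementation: `EmscriptenModuleLike` and `makeWrapper` are hypothetical names, and only the `{ async: true }` option to `Module.cwrap` is taken from the PR description.

```typescript
// Illustrative sketch: wrapping Emscripten's cwrap so that JSPI-exported
// functions return Promises. Only functions included in JSPI_EXPORTS
// need { async: true }; the rest can stay synchronous.
type CWrapOpts = { async?: boolean };

interface EmscriptenModuleLike {
  cwrap(
    name: string,
    returnType: string,
    argTypes: string[],
    opts?: CWrapOpts
  ): (...args: unknown[]) => unknown;
}

function makeWrapper(
  Module: EmscriptenModuleLike,
  name: string,
  isAsync: boolean
): (...args: unknown[]) => Promise<unknown> {
  // Pass { async: true } only for JSPI-exported functions so Emscripten
  // returns a Promise when the WASM side suspends.
  const fn = Module.cwrap(
    name,
    'number',
    ['number'],
    isAsync ? { async: true } : undefined
  );
  // Uniform async interface: callers always await, but only JSPI
  // functions actually suspend at the C/WASM level.
  return async (...args: unknown[]) => fn(...args);
}
```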

Issues to be resolved

  • Wllama still does not work with WASM 64-bit builds, which is a problem because ggml's backend get_memory function uses a size_t for free/total, which in WASM 32-bit builds overflows when the available memory is > 2^32 - 1 bytes. This is a problem on many machines, e.g., my M3 reports 2^32 bytes. So, in the built wasm in this PR, I actually hardcoded the return values from get_memory to 2^32 - 1. To address this issue, either (1) we need wllama to support 64-bit builds, which is not trivial (some interfaces between JavaScript/WASM need to change I think) and might come with other issues, or (2), change the ggml interface here to not use a size_t, or report the available memory in something other than bytes. This also gets to another question, which is that I haven't yet looked into how wllama is storing the models in memory, and if this can be optimized to avoid storing the model on the CPU side in memory if it's running on the GPU, or if it can currently support models larger than 4 GB on machines with more memory.
  • Not all models work with the WebGPU backend yet. For example, some in the model-picker list fail because of overlapping buffer ranges in WebGPU. This is an issue I'm aware of, I just haven't had time yet to implement fixes in llama.cpp, since it requires special compilation of shaders with overlapping buffer ranges.
  • There are some UI improvements in the example I'd like to add, as well as reporting of prefill/decode speeds (like WebLLM does).
  • The code is not yet as performant as WebLLM. I think we can get there, and I think llama.cpp has an advantage over WebLLM in being able to scalably provide performance across many different machines; this is a long-term project that I'm working on. On the bright side, on my machine it is much faster than the CPU-only build, using models like https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF.
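The 32-bit overflow described in the first bullet can be reproduced outside WASM: a wasm32 `size_t` stores values modulo 2^32, so a machine reporting exactly 2^32 bytes of memory wraps to 0. A minimal TypeScript illustration, using a `Uint32Array` cell as a stand-in for a 32-bit `size_t`:

```typescript
// A Uint32Array cell truncates values modulo 2^32, mimicking a size_t
// in a wasm32 build.
const sizeT = new Uint32Array(1);

sizeT[0] = 2 ** 32 - 1; // 4294967295: the value hardcoded in this PR
const clamped = sizeT[0];

sizeT[0] = 2 ** 32; // a machine reporting 2^32 bytes of free memory...
const wrapped = sizeT[0]; // ...wraps around to 0
```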

Summary by CodeRabbit

  • New Features
    • WebGPU GPU acceleration for improved model inference performance with automatic CPU fallback
    • Real-time performance metrics displaying tokens/second and detailed timing information
    • Performance context visibility including token counts and cache reuse data
    • WebGPU hardware acceleration status indicator in the model interface
    • Performance context reset capability for benchmarking and testing



coderabbitai bot commented Dec 23, 2025

📝 Walkthrough


This pull request introduces WebGPU backend support with CPU fallback, adds performance monitoring APIs (perf context/reset), upgrades the GLUE protocol from v1 to v2 with new message types, integrates WebGPU UI indicators and performance metrics display in the frontend, and updates build dependencies and EMSDK versions.

Changes

Cohort / File(s) Summary
Backend Integration & Protocol V2
cpp/actions.hpp, cpp/glue.hpp, cpp/test_glue.cpp, cpp/wllama.cpp
Extended struct app_t with WebGPU device field. Reworked backend selection: chooses WebGPU if requested, otherwise CPU, with error handling. Added perf context/reset actions. GLUE_VERSION bumped to 2; glue_msg_load_req gains use_webgpu and no_perf fields (replaces n_gpu_layers). New perf message types added to protocol. Test updated to validate WebGPU field.
Message Protocol (TypeScript)
src/glue/messages.ts
GLUE_VERSION updated to 2. LoadReq interface extended with use_webgpu and no_perf fields. New public message interfaces for perf operations: GlueMsgPerfContextReq/Res and GlueMsgPerfResetReq/Res. Union type GlueMsg extended accordingly.
Core Library & Worker
src/wllama.ts, src/workers-code/llama-cpp.js
Added preferWebGPU and noPerf options to WllamaConfig. Wllama initialization now WebGPU-first with CPU fallback. New public API methods: usingWebGPU(), getPerfContext(), resetPerfContext(). callWrapper enhanced to support async operations via new 4th parameter.
Frontend UI & Types
examples/main/src/components/ModelScreen.tsx, examples/main/src/components/ChatScreen.tsx, examples/main/src/utils/wllama.context.tsx, examples/main/src/utils/types.ts
Added WebGPU indicator display in ModelCard. ChatScreen now displays perf metrics (Prefill/Decode tok/s) with refresh and reset buttons. RuntimeInfo interface extended with usingWebGPU boolean. Wllama instantiation includes preferWebGPU: true.
Build Configuration & Dependencies
CMakeLists.txt, scripts/build_wasm.sh, scripts/docker-compose.yml, package.json, examples/main/package.json, tsconfig.build.json, examples/main/tsconfig.app.json, .github/workflows/deploy-examples-main.yml
EMSDK_IMAGE_TAG updated to 4.0.20. docker-compose refactored: SHARED_EMCC_FLAGS replaces SHARED_EMCC_CFLAGS. CMake options added: GGML_WEBGPU, GGML_WEBGPU_JSPI, LLAMA_WASM_MEM64. @webgpu/types added as devDependency. TypeScript configs updated with WebGPU types. Jinja dependency updated from ^0.2.2 to ^0.5.3. GitHub Pages deployment workflow added.
Submodule Update
llama.cpp
Submodule pointer updated to commit 9e41884dce.

Sequence Diagrams

sequenceDiagram
    participant Frontend as Frontend<br/>(React App)
    participant Wllama as Wllama<br/>(TypeScript)
    participant Worker as Web Worker<br/>(llama-cpp.js)
    participant Backend as C++ Backend<br/>(via WASM)

    Frontend->>Wllama: init(config: {preferWebGPU: true})
    Wllama->>Wllama: Check WebGPU availability
    alt WebGPU Available
        Wllama->>Wllama: useWebGPU = true
    else WebGPU Unavailable
        Wllama->>Wllama: useWebGPU = false (CPU fallback)
    end
    Wllama->>Worker: load model {use_webgpu, nbThreads}
    Worker->>Backend: wllama_action(load_req)
    Backend->>Backend: Select device (WebGPU or CPU)
    Backend-->>Worker: Load complete
    Worker-->>Wllama: Load response
    Wllama-->>Frontend: Initialized with usingWebGPU()
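The selection flow in the diagram above can be condensed into a small decision function. This is a sketch under assumptions: `selectBackend` and the build labels are illustrative names, not wllama's actual API; only the ordering (WebGPU first, then multi-thread CPU, then single-thread CPU) comes from the diagram.

```typescript
type BuildVariant = 'webgpu-single-thread' | 'multi-thread' | 'single-thread';

// Condensed sketch of the init flow: WebGPU first (single-thread, since
// JSPI does not combine with pthreads), then multi-thread CPU, then the
// single-thread CPU fallback.
function selectBackend(
  preferWebGPU: boolean,
  webGPUAvailable: boolean, // e.g. navigator.gpu !== undefined
  multiThreadSupported: boolean
): BuildVariant {
  if (preferWebGPU && webGPUAvailable) return 'webgpu-single-thread';
  return multiThreadSupported ? 'multi-thread' : 'single-thread';
}
```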
sequenceDiagram
    participant Frontend as Frontend<br/>(ChatScreen)
    participant Wllama as Wllama
    participant Worker as Web Worker
    participant Backend as C++ Backend

    Frontend->>Frontend: After model inference completion
    Frontend->>Wllama: getPerfContext()
    Wllama->>Worker: Send perf_context request
    Worker->>Backend: wllama_action(perf_context_req)
    Backend->>Backend: Collect perf metrics<br/>(t_start_ms, t_eval_ms, etc.)
    Backend-->>Worker: perf_context_res
    Worker-->>Wllama: PerfContextData
    Wllama-->>Frontend: {t_start_ms, t_eval_ms, n_eval, ...}
    Frontend->>Frontend: Display Prefill/Decode tok/s
    Frontend->>Wllama: resetPerfContext() (on Reset button)
    Wllama->>Worker: Send perf_reset request
    Worker->>Backend: wllama_action(perf_reset_req)
    Backend-->>Worker: perf_reset_res {success}
    Worker-->>Wllama: Reset complete
    Wllama-->>Frontend: Success response

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 A WebGPU dream in silicon fur,
From CPU's steady hum to GPU's whir,
Now perf metrics dance on the screen,
The fastest LLama I've ever seen!
With protocol bumps and async refrain,
This rabbit's proud of our GPU gain! 🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'WebGPU integration' accurately and concisely summarizes the main change: adding WebGPU backend support to the wllama project across multiple files and components.



📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 492c423 and a8a9546.

⛔ Files ignored due to path filters (2)
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (2)
  • llama.cpp
  • src/workers-code/generated.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • llama.cpp


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/glue.hpp (1)

450-452: Missing break statement causes fall-through bug in deserialize.

The GLUE_DTYPE_RAW case is missing a break statement, causing unintended fall-through to GLUE_DTYPE_ARRAY_BOOL. This is a pre-existing bug not introduced by this PR, but worth noting.

🔎 Proposed fix
     case GLUE_DTYPE_RAW:
       ((glue_raw *)field)->parse(input);
+      break;
     case GLUE_DTYPE_ARRAY_BOOL:
🧹 Nitpick comments (3)
scripts/docker-compose.yml (2)

22-22: Reconsider -sEXPORT_ALL=1 flag.

The -sEXPORT_ALL=1 flag exports all C functions, which makes the explicit -sEXPORTED_FUNCTIONS list redundant. This can increase binary size and potentially expose internal functions that don't need to be public.

Consider removing -sEXPORT_ALL=1 and relying solely on the explicit export list for better control and smaller binary size.

🔎 Suggested adjustment
-        export SHARED_EMCC_FLAGS="--no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions -sEXPORT_ALL=1 -sEXPORT_ES6=0 -sMODULARIZE=0 -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 -sFORCE_FILESYSTEM=1 -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument"
+        export SHARED_EMCC_FLAGS="--no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions -sEXPORT_ES6=0 -sMODULARIZE=0 -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 -sFORCE_FILESYSTEM=1 -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument"

26-27: Document JSPI browser compatibility requirements in project documentation.

JSPI (JavaScript Promise Integration) is currently experimental, with limited mainstream browser support: Chrome has an origin trial (enabled through Chrome 128), Microsoft Edge's origin trial has expired (as of July 22, 2025), and Firefox/Safari remain in tracking mode without shipping. The WebAssembly API has also undergone iterations (e.g., removal of explicit Suspender objects in 2024–2025). Add a compatibility note to the project's README or documentation to inform users of these browser limitations before using the WebGPU build target.

src/wllama.ts (1)

562-587: Consider skipping multi-thread check when WebGPU is used.

When useWebGPU is true, the multi-thread support check (line 569) and subsequent logic are unnecessary since WebGPU always uses single-thread mode (lines 582-584, 587). Consider moving the isSupportMultiThread() call inside a conditional to avoid the async check when WebGPU is active.

🔎 Proposed refactor
 const useWebGPU = this.config.backend === 'webgpu' && navigator.gpu !== undefined;
 if (this.config.backend === 'webgpu' && !useWebGPU) {
   this.logger().warn(
     'WebGPU backend requested but WebGPU is not available, falling back to CPU'
   );
 }
 // detect if we can use multi-thread
-let supportMultiThread = await isSupportMultiThread();
-if (!useWebGPU && !supportMultiThread) {
+let supportMultiThread = false;
+if (!useWebGPU) {
+  supportMultiThread = await isSupportMultiThread();
+  if (!supportMultiThread) {
-  this.logger().warn(
-    'Multi-threads are not supported in this environment, falling back to single-thread'
-  );
+    this.logger().warn(
+      'Multi-threads are not supported in this environment, falling back to single-thread'
+    );
+  }
 }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8778d7b and 690da82.

⛔ Files ignored due to path filters (5)
  • examples/main/package-lock.json is excluded by !**/package-lock.json
  • package-lock.json is excluded by !**/package-lock.json
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
  • src/webgpu-single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (21)
  • .github/workflows/deploy-examples-main.yml
  • cpp/actions.hpp
  • cpp/glue.hpp
  • cpp/test_glue.cpp
  • examples/main/package.json
  • examples/main/src/config.ts
  • examples/main/src/utils/wllama.context.tsx
  • examples/main/tsconfig.app.json
  • llama.cpp
  • package.json
  • scripts/build_wasm.sh
  • scripts/docker-compose.yml
  • src/glue/messages.ts
  • src/multi-thread/wllama.js
  • src/single-thread/wllama.js
  • src/webgpu-single-thread/wllama.js
  • src/wllama.ts
  • src/worker.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js
  • tsconfig.build.json
💤 Files with no reviewable changes (1)
  • src/worker.ts
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-18T08:24:14.434Z
Learnt from: ngxson
Repo: ngxson/wllama PR: 0
File: :0-0
Timestamp: 2025-04-18T08:24:14.434Z
Learning: The file `generate_wasm_from_cdn.ts` in the wllama project is a generated script that gets stored on CDN, making it impossible to use `require('../package.json')` to dynamically access version information. This is why version references in this file need to be manually updated when the package version changes.

Applied to files:

  • package.json
  • scripts/build_wasm.sh
  • scripts/docker-compose.yml
🧬 Code graph analysis (3)
src/wllama.ts (1)
src/utils.ts (1)
  • isSupportMultiThread (140-156)
src/workers-code/llama-cpp.js (1)
src/worker.ts (4)
  • wllamaStart (118-126)
  • wllamaAction (128-140)
  • wllamaExit (142-152)
  • wllamaDebug (154-161)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
  • GLUE_VERSION (6-6)
🔇 Additional comments (22)
scripts/docker-compose.yml (1)

35-36: Multi-thread build configuration looks correct.

The multi-thread build properly disables WebGPU (since JSPI doesn't work with pthreads) and applies appropriate pthread flags. The quote escaping in PTHREAD_POOL_SIZE=Module[\\\"pthreadPoolSize\\\"] is complex but necessary for the multi-layer shell/docker environment.

examples/main/src/utils/wllama.context.tsx (1)

62-71: LGTM: Backend configuration is properly integrated.

The backend option is consistently applied in both the initial Wllama instance creation and the reset function, ensuring uniform configuration throughout the application lifecycle.

tsconfig.build.json (1)

30-31: LGTM: WebGPU types correctly configured.

The addition of @webgpu/types to the types array properly enables WebGPU type definitions for the build, aligning with the devDependency added in package.json.

package.json (1)

51-51: Update @webgpu/types to the latest stable version.

Version 0.1.68 appears to exceed the latest published version (0.1.66 as of December 23, 2025 per Snyk). Verify the correct version from the npm registry and ensure it matches the intended dependency.

llama.cpp (1)

1-1: Code snippet is missing; unable to verify submodule update claims.

The review comment references a submodule pointer change but provides no actual diff or code context to review. Verification of WebGPU alignment and backward compatibility requires examining the specific commits involved. Include the git diff output in the review to enable proper assessment.

examples/main/package.json (2)

30-30: LGTM!

Adding @webgpu/types as a devDependency correctly supports TypeScript type checking for WebGPU APIs used in this PR.


16-16: Verify breaking changes in @huggingface/jinja version 0.5.3.

The upgrade from ^0.2.2 to ^0.5.3 is a significant version jump. The codebase usage is straightforward (Template constructor and render method), but the specific breaking changes between versions cannot be definitively determined without access to the package changelog. Consider checking the @huggingface/jinja GitHub repository releases or changelog to confirm API compatibility.

examples/main/tsconfig.app.json (1)

8-8: LGTM!

The types configuration correctly includes @webgpu/types to provide WebGPU type declarations, complementing the devDependency addition.

examples/main/src/config.ts (1)

95-96: LGTM!

The default backend of 'webgpu' aligns with the PR objectives. The fallback logic to CPU when WebGPU is unavailable should be handled by the Wllama class backend selection mechanism.

cpp/actions.hpp (1)

22-32: LGTM - Clean struct extension for device tracking.

Adding ggml_backend_dev_t device to app_t properly tracks the selected backend device for the model's lifetime.

cpp/glue.hpp (2)

24-24: LGTM - Version bump correctly reflects breaking protocol change.

The GLUE_VERSION bump from 1 to 2 properly indicates the breaking change from n_gpu_layers (int) to use_webgpu (bool), ensuring version mismatch detection between TypeScript and C++ sides.


497-497: LGTM - Protocol change from n_gpu_layers to use_webgpu.

The field change from n_gpu_layers (int) to use_webgpu (bool) aligns with the new backend selection model and matches the TypeScript side in src/glue/messages.ts.

src/glue/messages.ts (3)

6-6: LGTM - Version bump matches C++ side.

The GLUE_VERSION = 2 aligns with cpp/glue.hpp, ensuring protocol version compatibility.
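A version bump like this is typically enforced with an explicit handshake check, so the `n_gpu_layers` → `use_webgpu` change fails loudly instead of silently misparsing fields. A generic sketch (the `checkGlueVersion` name is an assumption; wllama's actual version handling may differ):

```typescript
const GLUE_VERSION = 2;

// Reject messages from a peer built against a different protocol
// version; a mismatch means field layouts may no longer line up.
function checkGlueVersion(remoteVersion: number): void {
  if (remoteVersion !== GLUE_VERSION) {
    throw new Error(
      `GLUE protocol mismatch: expected v${GLUE_VERSION}, got v${remoteVersion}`
    );
  }
}
```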


46-50: LGTM - Protocol field change is consistent.

The use_webgpu boolean field correctly replaces the previous n_gpu_layers integer, matching the C++ struct definition.


1006-1006: LGTM - Interface correctly updated for use_webgpu.

The TypeScript interface reflects the protocol change with the use_webgpu: boolean field.

.github/workflows/deploy-examples-main.yml (2)

31-36: Likely an incorrect or invalid review comment.


23-29: No action needed. The workflow correctly handles the @wllama/wllama file:// dependency through npm ci, which resolves local package references. The WASM files (src/single-thread/wllama.wasm, src/multi-thread/wllama.wasm) are committed to the repository, so they're available to examples/main without requiring a separate root package build step. The root package's build:wasm script uses Docker and is only needed when regenerating WASM from C++ source, which is not part of this deployment workflow.

Likely an incorrect or invalid review comment.

scripts/build_wasm.sh (1)

5-5: Update EMSDK version with awareness of WebGPU binding deprecation.

The Emscripten 4.0.20 upgrade deprecates the builtin -sUSE_WEBGPU binding in favor of the external emdawnwebgpu port for WebGPU support. Ensure the build configuration migrates to the recommended emdawnwebgpu approach rather than relying on the deprecated builtin binding. JSPI support is promoted in this release and is now the recommended async integration mechanism.

cpp/test_glue.cpp (1)

15-41: LGTM! Test correctly updated for protocol change.

The test accurately reflects the protocol change from n_gpu_layers (numeric) to use_webgpu (boolean). Serialization and deserialization assertions are consistent with the new field type.

src/wllama.ts (2)

77-80: LGTM! Backend configuration option is well-defined.

The new backend option is properly typed with a clear union type and includes helpful documentation.


618-623: LGTM! Load request correctly updated.

The use_webgpu flag is properly passed to the load operation, consistent with the protocol change from n_gpu_layers to use_webgpu.

src/workers-code/llama-cpp.js (1)

230-247: LGTM! Async wrapper implementation is correct.

The updated callWrapper function properly handles both synchronous and asynchronous functions by conditionally passing { async: true } to cwrap and using await only when needed. The uniform async wrapper interface ensures consistency.

@reeselevine
Author

^ latest commits added a new API to wllama that surfaces performance data (tokens/second), and for now, I updated the wasm to support a max size of 4GB, or whatever the machine supports

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/glue.hpp (2)

450-453: Missing break statement causes fall-through bug.

The GLUE_DTYPE_RAW case is missing a break statement, causing it to fall through to GLUE_DTYPE_ARRAY_BOOL. This will cause incorrect parsing of raw fields.

🔎 Proposed fix
     case GLUE_DTYPE_RAW:
       ((glue_raw *)field)->parse(input);
+      break;
     case GLUE_DTYPE_ARRAY_BOOL:

464-466: Missing break statement at end of switch.

The GLUE_DTYPE_ARRAY_RAW case is missing a break statement. While it's the last case before the closing brace, adding the break is good practice for maintainability.

🔎 Proposed fix
     case GLUE_DTYPE_ARRAY_RAW:
       ((glue_arr_raw *)field)->parse(input);
+      break;
     }
🧹 Nitpick comments (2)
examples/main/src/components/ChatScreen.tsx (1)

190-195: Consider using e.key instead of deprecated e.keyCode.

keyCode is deprecated. For better compatibility and clarity, use e.key === 'Enter'.

🔎 Suggested fix
               onKeyDown={(e) => {
-                if (e.keyCode == 13 && e.shiftKey == false) {
+                if (e.key === 'Enter' && !e.shiftKey) {
                   e.preventDefault();
                   onSubmit();
                 }
               }}
src/wllama.ts (1)

600-601: Minor typo: hwConccurency should be hwConcurrency.

The variable name has an extra 'c'. This is a local variable so it doesn't affect functionality, but correcting the typo improves readability.

🔎 Suggested fix
-        const hwConccurency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
-        this.nbThreads = config.n_threads ?? hwConccurency;
+        const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
+        this.nbThreads = config.n_threads ?? hwConcurrency;
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be3305e and f5a4058.

⛔ Files ignored due to path filters (2)
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (6)
  • cpp/actions.hpp
  • cpp/glue.hpp
  • cpp/wllama.cpp
  • examples/main/src/components/ChatScreen.tsx
  • src/glue/messages.ts
  • src/wllama.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-29T03:44:02.649Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.

Applied to files:

  • src/wllama.ts
📚 Learning: 2025-04-18T08:24:14.434Z
Learnt from: ngxson
Repo: ngxson/wllama PR: 0
File: :0-0
Timestamp: 2025-04-18T08:24:14.434Z
Learning: The file `generate_wasm_from_cdn.ts` in the wllama project is a generated script that gets stored on CDN, making it impossible to use `require('../package.json')` to dynamically access version information. This is why version references in this file need to be manually updated when the package version changes.

Applied to files:

  • src/wllama.ts
🧬 Code graph analysis (3)
examples/main/src/components/ChatScreen.tsx (1)
src/wllama.ts (1)
  • PerfContextData (134-143)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
  • GLUE_VERSION (6-6)
src/wllama.ts (2)
src/utils.ts (1)
  • isSupportMultiThread (140-156)
src/glue/messages.ts (2)
  • GlueMsgPerfContextRes (1378-1388)
  • GlueMsgPerfResetRes (1396-1399)
🔇 Additional comments (8)
cpp/wllama.cpp (1)

121-122: LGTM!

The new perf_context and perf_reset actions are correctly integrated into the dispatch table, following the established WLLAMA_ACTION macro pattern used by other actions.

examples/main/src/components/ChatScreen.tsx (1)

51-75: LGTM on perf data fetching and reset logic.

The refreshPerf and resetPerf functions properly handle async operations with loading state (perfBusy) and error handling. The UI correctly displays perf metrics (tok/s for prefill and decode) and provides a reset button with appropriate disabled states.
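The tok/s figures mentioned above are derived from the perf context counters. A hedged sketch of the arithmetic, assuming llama.cpp-style field names (`t_p_eval_ms`/`n_p_eval` for prefill, `t_eval_ms`/`n_eval` for decode); the exact shape of wllama's `PerfContextData` may differ:

```typescript
interface PerfCounters {
  t_p_eval_ms: number; // total prompt-processing (prefill) time, ms
  n_p_eval: number;    // prompt tokens evaluated
  t_eval_ms: number;   // total token-generation (decode) time, ms
  n_eval: number;      // tokens generated
}

// tokens / milliseconds * 1000 = tokens per second; guard against
// division by zero before any inference has run.
function tokensPerSecond(perf: PerfCounters): { prefill: number; decode: number } {
  return {
    prefill: perf.t_p_eval_ms > 0 ? (perf.n_p_eval / perf.t_p_eval_ms) * 1000 : 0,
    decode: perf.t_eval_ms > 0 ? (perf.n_eval / perf.t_eval_ms) * 1000 : 0,
  };
}
```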

cpp/glue.hpp (1)

823-850: LGTM on new perf message structures.

The glue_msg_perf_context_req/res and glue_msg_perf_reset_req/res structures are correctly defined with appropriate handler IDs and fields that match the TypeScript interfaces.

cpp/actions.hpp (2)

164-179: LGTM on backend selection and device configuration.

The WebGPU/CPU backend selection logic is well-implemented:

  • WebGPU path sets n_gpu_layers = 999 for full offloading
  • CPU path sets n_gpu_layers = 0
  • Proper error handling if no backend device is available
  • The devices[] array is correctly NULL-terminated, addressing the previous review feedback

799-832: LGTM on perf context actions.

Both action_perf_context and action_perf_reset correctly:

  • Parse the request using PARSE_REQ macro
  • Check for null app.ctx before accessing performance data
  • Return appropriate success/failure responses
  • Follow the established action pattern in the codebase
src/glue/messages.ts (1)

1-6: LGTM on GLUE protocol updates.

The GLUE_VERSION bump to 2 and all associated changes (use_webgpu, no_perf fields, perf message types) are consistent with the C++ glue.hpp definitions. Since this file is auto-generated, the consistency is expected and correct.

src/wllama.ts (2)

587-615: LGTM on WebGPU-first initialization flow.

The logic correctly:

  1. Checks preferWebGPU config and navigator.gpu availability
  2. Falls back to CPU with a warning if WebGPU is unavailable
  3. Only attempts multi-thread setup when not using WebGPU
  4. Properly handles missing multi-thread wasm path with a warning

1369-1390: LGTM on performance context API methods.

The getPerfContext and resetPerfContext methods are correctly implemented:

  • Both check that the model is loaded via checkModelLoaded()
  • They use the appropriate glue message types
  • Return types align with the defined interfaces

@ngxson
Owner

ngxson commented Jan 6, 2026

Sorry for the late response, I'll try to look into it this week.

Re. 32/64-bit builds, I think we can try to fix it in a dedicated PR. Some parts of the code currently assume the address to be 32-bit (notably the code that "streams" the file into the wasm memory context), but it shouldn't be too complicated to change them to 64-bit.

My main concern is that 64-bit can potentially break Safari, but we can probably consider dropping Safari support, or forcing the single-thread build if needed. Also, I don't think the single-thread build will benefit from 64-bit (larger models), so we can probably constrain 64-bit to be available only on the multi-threaded build.

@reeselevine
Author

No worries. As for single-thread vs. multi-thread 64-bit, when you review you'll see that I only added WebGPU support to the single-thread build, because the increased performance comes from offloading to the GPU, not from running multiple threads on the CPU. For example, on my machine I get ~32 t/s decode right now with the single-thread WebGPU backend, and ~10 t/s decode with the multi-thread CPU backend on a Llama-3.2-1B-Instruct-Q4_0 model.

So we might want to come up with a plan for what wasm backends wllama wants to support, without having too many of them. I'm not sure what the full browser support matrix for WebGPU/multi-thread looks like, but in my mind it definitely makes sense to have a single-thread 32-bit build as a fallback, with or without WebGPU support. Then the question is whether a multi-thread 64-bit build is necessary, or if a single-thread 64-bit build with WebGPU support should be enough for most use cases.

@ngxson
Owner

ngxson commented Jan 8, 2026

What I'm wondering is whether multi-thread + WebGPU improves performance. I suspect ops that are not supported by WebGPU would benefit from it, as they fall back to the CPU implementation.

But I agree that this can be a later improvement. For now, I think having just 3 builds can be enough:

  • single-thread CPU
  • multi-thread CPU
  • webgpu (currently single-thread, but can be multi-thread in the future); we can also enable 64-bit only on this build

Edit: see my comment below

ngxson (Owner) left a comment

Looking again at the change, it seems there's no harm in adding WebGPU support to all the builds, so I think the only consideration here is whether to enable 64-bit.

I think we can distribute 2 x 32-bit builds (single/multi-thread) plus one extra 64-bit build (single or multi-thread, depending on which works better with WebGPU).

coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/test_glue.cpp (1)

15-43: Add test coverage for the no_perf field.

The test covers the new use_webgpu field but not the no_perf field also added to glue_msg_load_req (see cpp/glue.hpp line 499). Complete test coverage requires testing all new fields.

🧪 Proposed test addition
 void test_load_req() {
   glue_msg_load_req req;
   req.use_mmap.value = true;
   req.n_gpu_layers.value = 32;
   req.use_webgpu.value = true;
+  req.no_perf.value = false;
   req.seed.value = 42;
   req.n_ctx.value = 2048;
   req.embeddings.value = false;
   req.pooling_type.value = "mean";

   req.handler.serialize(outbuf);
   FILE* fp = fopen("dump.bin", "wb");
   fwrite(outbuf.data.data(), 1, outbuf.data.size(), fp);
   fclose(fp);

   printf("\n----------\n\n");

   glue_msg_load_req req2;
   glue_inbuf inbuf(outbuf.data.data());
   req2.handler.deserialize(inbuf);

   assert(req2.use_mmap.value == true);
   assert(req2.n_gpu_layers.value == 32);
   assert(req2.use_webgpu.value == true);
+  assert(req2.no_perf.value == false);
   assert(req2.seed.value == 42);
   assert(req2.n_ctx.value == 2048);
   assert(req2.embeddings.value == false);
   assert(req2.pooling_type.value == "mean");
 }
🧹 Nitpick comments (2)
CMakeLists.txt (1)

4-6: Consider adding documentation for the new CMake options.

The new options lack help text. Adding descriptions would improve usability for developers configuring the build.

📝 Suggested documentation
-option(GGML_WEBGPU "Enable GGML WebGPU backend" ON)
-option(GGML_WEBGPU_JSPI "Enable GGML WebGPU JSPI support" ON)
-option(LLAMA_WASM_MEM64 "Enable 64-bit memory for WebAssembly builds" OFF)
+option(GGML_WEBGPU "Enable GGML WebGPU backend for GPU acceleration in WASM builds" ON)
+option(GGML_WEBGPU_JSPI "Enable JSPI (JavaScript Promise Integration) for async WebGPU operations" ON)
+option(LLAMA_WASM_MEM64 "Enable 64-bit memory addressing for WebAssembly (experimental, may break Safari)" OFF)
src/wllama.ts (1)

597-627: Consider checking WebGPU preference before multi-thread detection.

The current flow checks multi-thread support and configuration before disabling it for WebGPU (lines 620-626). While correct, checking this.useWebGPU earlier would avoid unnecessary multi-thread detection when WebGPU is active.

♻️ Optional refactoring to skip multi-thread detection when using WebGPU

Move the WebGPU check before the multi-thread detection:

+    // TODO: investigate why WebGPU + multi-threading causes performance issues
+    if (this.useWebGPU) {
+      this.logger().warn(
+        'Multi-threading is not yet supported with WebGPU backend'
+      );
+      this.nbThreads = 1;
+    } else if (await isSupportMultiThread()) {
       if (this.pathConfig['multi-thread/wllama.wasm']) {
         const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
         this.nbThreads = config.n_threads ?? hwConcurrency;
         if (this.nbThreads > 1) {
           this.useMultiThread = true;
         } else {
           this.logger().warn(
             'Falling back single-thread due to n_threads configuration or limited hardware concurrency'
           );
         }
       } else {
         this.logger().warn(
           'Missing paths to "multi-thread/wllama.wasm", falling back to single-thread'
         );
       }
     } else {
       this.logger().warn(
         'Multi-threads are not supported in this environment, falling back to single-thread'
       );
     }
-
-    // TODO: investigate why WebGPU + multi-threading causes performance issues
-    if (this.useWebGPU) {
-      this.logger().warn(
-        'Disabling multi-threading when using WebGPU backend'
-      );
-      this.useMultiThread = false;
-      this.nbThreads = 1;
-    }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5a4058 and 492c423.

⛔ Files ignored due to path filters (2)
  • src/multi-thread/wllama.wasm is excluded by !**/*.wasm
  • src/single-thread/wllama.wasm is excluded by !**/*.wasm
📒 Files selected for processing (14)
  • .github/workflows/deploy-examples-main.yml
  • CMakeLists.txt
  • cpp/actions.hpp
  • cpp/glue.hpp
  • cpp/test_glue.cpp
  • examples/main/src/components/ChatScreen.tsx
  • llama.cpp
  • scripts/docker-compose.yml
  • src/glue/messages.ts
  • src/multi-thread/wllama.js
  • src/webgpu-single-thread/wllama.js
  • src/wllama.ts
  • src/workers-code/generated.ts
  • src/workers-code/llama-cpp.js
🚧 Files skipped from review as they are similar to previous changes (3)
  • llama.cpp
  • examples/main/src/components/ChatScreen.tsx
  • scripts/docker-compose.yml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-29T03:44:02.649Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.

Applied to files:

  • src/wllama.ts
📚 Learning: 2025-12-29T03:44:00.085Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:00.085Z
Learning: In src/workers-code/llama-cpp.js, constrain the isAsync flag passed to callWrapper: set isAsync to true only for functions included in the JSPI_EXPORTS build (specifically wllama_start and wllama_action). For other exported functions such as wllama_exit, wllama_debug, and wllama_malloc, always use isAsync=false since they do not perform async operations at the C/WASM level, even though callWrapper may expose an async function. This should be verifiable by reviewing the build exports and ensuring these functions are not marked as asynchronous in the wrapper when invoked.

Applied to files:

  • src/workers-code/llama-cpp.js
🧬 Code graph analysis (4)
cpp/actions.hpp (1)
cpp/glue.hpp (1)
  • data (87-90)
src/wllama.ts (2)
src/utils.ts (1)
  • isSupportMultiThread (140-156)
src/glue/messages.ts (2)
  • GlueMsgPerfContextRes (1384-1394)
  • GlueMsgPerfResetRes (1402-1405)
src/workers-code/llama-cpp.js (2)
cpp/generate_glue_prototype.js (1)
  • name (39-39)
src/worker.ts (4)
  • wllamaStart (118-126)
  • wllamaAction (128-140)
  • wllamaExit (142-152)
  • wllamaDebug (154-161)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
  • GLUE_VERSION (6-6)
🔇 Additional comments (20)
.github/workflows/deploy-examples-main.yml (2)

3-4: Verify if manual-only deployment is intentional.

The workflow uses workflow_dispatch (manual trigger only). If you want automatic deployment on pushes to the examples, consider adding a push trigger.

If automatic deployment is desired, here's the modification:

🔄 Add automatic trigger on push
 on:
   workflow_dispatch:
+  push:
+    branches:
+      - master
+    paths:
+      - 'examples/main/**'

32-32: Pinned action SHA is properly tied to stable releases (v4.7.6 and v4).

The commit SHA 9d877eea73427180ae43cf98e8914934fe157a1a corresponds to release v4.7.6 and is also tagged with the major version tag v4, confirming this is a stable release pinning rather than an arbitrary commit.

cpp/glue.hpp (3)

24-24: LGTM! Protocol version bump is correct.

The version bump to 2 is appropriate given the addition of new fields and message types. This matches the TypeScript side (src/glue/messages.ts line 6).


497-499: LGTM! New LoadReq fields are properly structured.

The use_webgpu and no_perf boolean fields are correctly added using the GLUE_FIELD macro and match the TypeScript definitions.


824-851: LGTM! Performance context messages are well-structured.

The new performance monitoring messages follow the existing pattern and provide comprehensive timing and evaluation metrics. The field types and names are consistent with the TypeScript definitions (src/glue/messages.ts lines 859-929).

src/glue/messages.ts (2)

6-6: LGTM! TypeScript glue definitions are consistent with C++.

The changes correctly mirror the C++ side (cpp/glue.hpp):

  • Protocol version bump to 2 matches
  • use_webgpu and no_perf fields added to LoadReq in the correct order
  • Performance context message types and fields are consistent
  • All new message types properly added to the GlueMsg union

Also applies to: 46-60, 859-929, 1087-1089, 1378-1405, 1458-1458


1-3: No action needed—the file was properly generated and committed correctly.

The generator script cpp/generate_glue_prototype.js exists and was committed in the same changeset as src/glue/messages.ts (commit d9112bd). Both files show correct timestamps and the working tree is clean. The file header is accurate and the content matches the expected output structure from the generator (TypeScript interfaces with GLUE_MESSAGE_PROTOTYPES). The generation process was followed as intended.

src/workers-code/llama-cpp.js (2)

230-247: LGTM: callWrapper async flag implementation is correct.

The updated callWrapper signature correctly accepts isAsync as the 4th parameter and passes { async: true } to Module.cwrap when needed. The returned wrapper appropriately awaits async functions while handling sync functions synchronously. Error handling is preserved in both paths.

Based on learnings, this aligns with the JSPI_EXPORTS build configuration pattern.


269-283: LGTM: Wrapper configurations align with JSPI_EXPORTS.

The isAsync flags are correctly set:

  • wllamaMalloc, wllamaExit, wllamaDebug: isAsync=false (not in JSPI_EXPORTS)
  • wllamaStart, wllamaAction: isAsync=true (included in JSPI_EXPORTS)

Based on learnings, these functions are correctly configured to match the async operations at the C/WASM level.
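To make the pattern the review describes concrete, here is a minimal sketch of the `callWrapper` / `isAsync` split. The `Module.cwrap` stub below only exists so the sketch is self-contained and runnable outside Emscripten; the real `Module` is the Emscripten runtime, and the string return values are placeholders for actual WASM calls:

```javascript
// Stand-in for Emscripten's Module: cwrap returns an async JS function
// when { async: true } is passed (JSPI builds), a plain function otherwise.
const Module = {
  cwrap(name, ret, argTypes, opts) {
    return opts && opts.async
      ? async () => `async:${name}`
      : () => `sync:${name}`;
  },
};

function callWrapper(name, ret, argTypes, isAsync) {
  // Only pass { async: true } for functions in JSPI_EXPORTS
  // (wllama_start / wllama_action); others stay synchronous at the
  // C/WASM level even though the outer wrapper is async.
  const fn = Module.cwrap(name, ret, argTypes, isAsync ? { async: true } : undefined);
  return async (...args) => (isAsync ? await fn(...args) : fn(...args));
}

const wllamaStart = callWrapper('wllama_start', 'number', [], true);
const wllamaMalloc = callWrapper('wllama_malloc', 'number', ['number'], false);
```

The key point is that the `{ async: true }` flag must track the JSPI_EXPORTS build list exactly; mismatches either break suspension or add needless promise overhead.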

cpp/actions.hpp (4)

11-11: LGTM: Backend support infrastructure added correctly.

The include for ggml-backend.h and the new device field in app_t are necessary for WebGPU backend selection. The field is safely initialized to nullptr.

Also applies to: 24-24


187-187: Verify that no_perf field is always initialized.

Similar to the use_webgpu concern, req.no_perf.value is accessed without checking req.no_perf.not_null(). Ensure this field has a default value or is always set by callers.

The verification script in the previous comment will also check the no_perf field definition.


800-833: LGTM: Performance context actions implemented correctly.

Both action_perf_context and action_perf_reset follow the established pattern:

  • Proper null checks for app.ctx
  • Appropriate error handling (returning success=false)
  • Correct integration with llama.cpp performance APIs

164-177: No issue found — use_webgpu is a non-nullable field.

The field is defined with GLUE_FIELD(bool, use_webgpu) (not GLUE_FIELD_NULLABLE), marked as non-nullable in the message definition, and always provided by callers. Direct access to .value without .not_null() is the correct pattern for non-nullable fields and is consistent with how other non-nullable fields like n_ctx_auto, seed, and no_perf are accessed elsewhere in the same file.

src/wllama.ts (7)

26-27: LGTM: New configuration options and interfaces are well-documented.

The new imports, config options (preferWebGPU, noPerf), and PerfContextData interface are clearly documented and align with the PR's WebGPU integration objectives.

Also applies to: 79-88, 134-143


307-307: LGTM: WebGPU state tracking follows established patterns.

The useWebGPU field and usingWebGPU() method follow the same pattern as useMultiThread and isMultithread(), with appropriate state initialization and model-loaded guards.

Also applies to: 464-467


459-462: LGTM: Simplified thread count accessor.

The method now returns this.nbThreads directly, removing previous conditional logic. This is consistent with the updated multi-thread handling in loadModel.


587-595: LGTM: Proper WebGPU feature detection with fallback.

The code correctly checks for navigator.gpu availability and provides a clear warning when WebGPU is requested but unavailable. Good defensive programming.
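The detection-with-fallback flow described above can be sketched roughly like this (hypothetical `resolveBackend` name; the real logic lives inside `loadModel`, and the `typeof navigator` guard keeps the sketch runnable outside browsers):

```javascript
// Rough sketch of WebGPU feature detection with a CPU fallback.
// `preferWebGPU` mirrors the config option added in this PR.
function resolveBackend(preferWebGPU) {
  const hasWebGPU =
    typeof navigator !== 'undefined' && typeof navigator.gpu !== 'undefined';
  if (preferWebGPU && !hasWebGPU) {
    console.warn('WebGPU requested but not available, falling back to CPU');
  }
  return preferWebGPU && hasWebGPU ? 'webgpu' : 'cpu';
}
```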


639-644: LGTM: Proxy initialization simplified.

The proxy now receives this.nbThreads directly, which is consistent with the updated thread detection logic and eliminates conditional complexity at the initialization point.


658-687: LGTM: Load request properly initialized with WebGPU flags.

The load request correctly populates:

  • use_webgpu: Always set to a boolean value
  • no_perf: Uses nullish coalescing with false as default
  • n_gpu_layers: Set to 999 for WebGPU (as noted in PR objectives), 0 otherwise
  • n_threads: Uses the pre-computed this.nbThreads

This confirms that the fields flagged in cpp/actions.hpp should have safe values, as they're always initialized here.
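Based on that description, the request construction presumably reduces to something like this sketch (field names match the glue messages; `buildLoadReq` and the `config` shape are illustrative, not the actual wllama API):

```javascript
// Sketch of how the load-request fields described above fit together.
// `config` is a hypothetical LoadModelConfig-like object.
function buildLoadReq(config, useWebGPU, nbThreads) {
  return {
    use_webgpu: useWebGPU,             // always a concrete boolean
    no_perf: config.noPerf ?? false,   // nullish coalescing with false default
    n_gpu_layers: useWebGPU ? 999 : 0, // offload all layers when on GPU
    n_threads: nbThreads,              // pre-computed thread count
  };
}
```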


1382-1403: LGTM: Performance context API correctly implemented.

Both getPerfContext() and resetPerfContext() follow the established pattern:

  • Proper model-loaded guards
  • Correct action names matching C++ handlers
  • Return types aligned with message definitions

@reeselevine
Author

I added WebGPU to the multi-thread build. However, I found that using the multi-thread build led to a pretty significant performance decrease running in my browser. I don't fully understand how llama.cpp routes things in multi-thread mode yet, but I know the WebGPU backend has some inefficiencies/issues with that right now. It's something we definitely need to work on going forward in llama.cpp.

I imagine this can be improved, and like you said would allow non-GPU operations to run more efficiently, but for now I disabled using the multi-thread build with WebGPU and added a TODO in the code to look into this: https://github.com/ngxson/wllama/pull/198/files#diff-dc8126d70d08a19299c727046d114b88791f7fecac5c7735b905ec184bb07436R619.

Otherwise, to support 32-bit builds and memory limits, I just opened this PR in llama.cpp: ggml-org/llama.cpp#18707. The wasm blobs here are built against that code, once that solution or something else we come up with is merged in llama.cpp I'll update the submodule here.
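The overflow behind that llama.cpp PR is easy to reproduce: a 32-bit `size_t` wraps as soon as free memory reaches 2^32 bytes, which is why the current blobs clamp the reported value to 2^32 - 1. A small illustration (the clamp here mirrors the interim workaround, not the final upstream fix):

```javascript
// 2^32 bytes of free memory truncates to 0 in an unsigned 32-bit value,
// so the interim fix clamps the reported memory to 2^32 - 1.
const freeBytes = 2 ** 32;          // e.g. 4 GiB reported by the device
const asUint32 = freeBytes >>> 0;   // what a wasm32 size_t would hold
const clamped = Math.min(freeBytes, 2 ** 32 - 1); // interim workaround
console.log(asUint32, clamped);     // 0 4294967295
```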

Owner

@ngxson ngxson left a comment

Looking good! We can merge when ggml-org/llama.cpp#18707 is upstream and blobs are rebuilt.

One thing to note is that I will also need to design an API for lazy-loading tensors. Currently, tensors already offloaded to GPU will still have a copy inside wasm memory, which is wasteful.

@SuperPauly

Looking good! We can merge when ggml-org/llama.cpp#18707 is upstream and blobs are rebuilt.

One thing to note is that I will also need to design an API for lazy-loading tensors. Currently, tensors already offloaded to GPU will still have a copy inside wasm memory, which is wasteful.

Looks like llama.cpp #18707 has been merged to main. Can this PR be merged now?

@ngxson
Owner

ngxson commented Jan 15, 2026

Not quite sure why CI is not run on PRs, I didn't notice that until now. Unfortunately the CI currently fails due to this error:

(screenshot of the CI error)

And npm run build fails:

(screenshot of the build error)

I'll see if I can fix this quickly

@ngxson
Owner

ngxson commented Jan 15, 2026

@reeselevine I cannot push to your branch because it was created from your master, can you create a new PR from another branch?

@reeselevine
Author

New PR opened here: #201
