Conversation
…awnwebgpu port, removing remote port file from repo
📝 Walkthrough

This pull request introduces WebGPU backend support with CPU fallback, adds performance monitoring APIs (perf context/reset), upgrades the GLUE protocol from v1 to v2 with new message types, integrates WebGPU UI indicators and performance-metrics display in the frontend, and updates build dependencies and EMSDK versions.
Sequence Diagrams

```mermaid
sequenceDiagram
    participant Frontend as Frontend<br/>(React App)
    participant Wllama as Wllama<br/>(TypeScript)
    participant Worker as Web Worker<br/>(llama-cpp.js)
    participant Backend as C++ Backend<br/>(via WASM)
    Frontend->>Wllama: init(config: {preferWebGPU: true})
    Wllama->>Wllama: Check WebGPU availability
    alt WebGPU Available
        Wllama->>Wllama: useWebGPU = true
    else WebGPU Unavailable
        Wllama->>Wllama: useWebGPU = false (CPU fallback)
    end
    Wllama->>Worker: load model {use_webgpu, nbThreads}
    Worker->>Backend: wllama_action(load_req)
    Backend->>Backend: Select device (WebGPU or CPU)
    Backend-->>Worker: Load complete
    Worker-->>Wllama: Load response
    Wllama-->>Frontend: Initialized with usingWebGPU()
```
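The availability check in this flow reduces to probing `navigator.gpu` before choosing a backend. A minimal sketch of that decision, with illustrative names rather than wllama's exact internals (a real check would typically also attempt `navigator.gpu.requestAdapter()`):

```typescript
// Decide whether to use the WebGPU backend, falling back to CPU.
// `navigator.gpu` is only defined in WebGPU-capable browsers.
type Backend = 'webgpu' | 'cpu';

function selectBackend(preferWebGPU: boolean, gpu: unknown): Backend {
  // A defined `navigator.gpu` object means the WebGPU API is exposed;
  // production code would additionally await gpu.requestAdapter().
  if (preferWebGPU && gpu !== undefined && gpu !== null) {
    return 'webgpu';
  }
  return 'cpu'; // CPU fallback when WebGPU is unavailable or not requested
}
```

In a browser this would be invoked as `selectBackend(true, (navigator as any).gpu)`; passing the capability in as a parameter keeps the decision testable outside a browser.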
```mermaid
sequenceDiagram
    participant Frontend as Frontend<br/>(ChatScreen)
    participant Wllama as Wllama
    participant Worker as Web Worker
    participant Backend as C++ Backend
    Frontend->>Frontend: After model inference completion
    Frontend->>Wllama: getPerfContext()
    Wllama->>Worker: Send perf_context request
    Worker->>Backend: wllama_action(perf_context_req)
    Backend->>Backend: Collect perf metrics<br/>(t_start_ms, t_eval_ms, etc.)
    Backend-->>Worker: perf_context_res
    Worker-->>Wllama: PerfContextData
    Wllama-->>Frontend: {t_start_ms, t_eval_ms, n_eval, ...}
    Frontend->>Frontend: Display Prefill/Decode tok/s
    Frontend->>Wllama: resetPerfContext() (on Reset button)
    Wllama->>Worker: Send perf_reset request
    Worker->>Backend: wllama_action(perf_reset_req)
    Backend-->>Worker: perf_reset_res {success}
    Worker-->>Wllama: Reset complete
    Wllama-->>Frontend: Success response
```
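The Prefill/Decode tok/s figures in the "Display" step fall out of the perf-context counters directly. A hedged sketch of the arithmetic: `t_eval_ms` and `n_eval` appear in the diagram, while the prefill-side names `t_p_eval_ms` and `n_p_eval` are assumed here to follow llama.cpp's convention and may differ from wllama's actual field names:

```typescript
// Convert perf-context counters into tokens-per-second figures.
interface PerfCounters {
  t_p_eval_ms: number; // total prompt (prefill) time in ms -- assumed name
  n_p_eval: number;    // prompt tokens evaluated -- assumed name
  t_eval_ms: number;   // total decode time in ms
  n_eval: number;      // decode tokens evaluated
}

function tokensPerSecond(tokens: number, ms: number): number {
  // Guard against division by zero before any tokens have been timed.
  return ms > 0 ? (tokens / ms) * 1000 : 0;
}

function formatPerf(p: PerfCounters): { prefill: number; decode: number } {
  return {
    prefill: tokensPerSecond(p.n_p_eval, p.t_p_eval_ms),
    decode: tokensPerSecond(p.n_eval, p.t_eval_ms),
  };
}
```

For example, 64 decoded tokens over 2000 ms works out to 32 tok/s, matching the kind of figure quoted later in this thread.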
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
📜 Recent review details

Configuration used: defaults
Review profile: CHILL
Plan: Pro

⛔ Files ignored due to path filters (2)
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/glue.hpp (1)
450-452: Missing `break` statement causes fall-through bug in deserialize.

The `GLUE_DTYPE_RAW` case is missing a `break` statement, causing unintended fall-through to `GLUE_DTYPE_ARRAY_BOOL`. This is a pre-existing bug not introduced by this PR, but worth noting.

🔎 Proposed fix

```diff
 case GLUE_DTYPE_RAW:
   ((glue_raw *)field)->parse(input);
+  break;
 case GLUE_DTYPE_ARRAY_BOOL:
```
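TypeScript switches share the same fall-through semantics as C++, so the bug is easy to demonstrate outside the glue code. The sketch below is purely illustrative, not wllama code; it shows how a missing `break` makes a second handler run on the same input:

```typescript
// Simulate a dispatch where a missing `break` lets one case
// fall through into the next handler.
function dispatch(tag: 'raw' | 'bool', withBreak: boolean): string[] {
  const calls: string[] = [];
  switch (tag) {
    case 'raw':
      calls.push('parse_raw');
      if (withBreak) break; // the fix: stop after the matching handler
      // falls through when withBreak is false
    case 'bool':
      calls.push('parse_bool'); // runs unintentionally without the break
      break;
  }
  return calls;
}
```

Without the break, a "raw" field gets handed to the bool parser as well, which is exactly the corruption the review describes for `glue_raw` payloads.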
🧹 Nitpick comments (3)
scripts/docker-compose.yml (2)
22-22: Reconsider the `-sEXPORT_ALL=1` flag.

The `-sEXPORT_ALL=1` flag exports all C functions, which makes the explicit `-sEXPORTED_FUNCTIONS` list redundant. This can increase binary size and potentially expose internal functions that don't need to be public. Consider removing `-sEXPORT_ALL=1` and relying solely on the explicit export list for better control and smaller binary size.

🔎 Suggested adjustment

```diff
-export SHARED_EMCC_FLAGS="--no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions -sEXPORT_ALL=1 -sEXPORT_ES6=0 -sMODULARIZE=0 -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 -sFORCE_FILESYSTEM=1 -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument"
+export SHARED_EMCC_FLAGS="--no-entry -O3 -msimd128 -DNDEBUG -flto=full -frtti -fwasm-exceptions -sEXPORT_ES6=0 -sMODULARIZE=0 -sINITIAL_MEMORY=128MB -sMAXIMUM_MEMORY=4096MB -sALLOW_MEMORY_GROWTH=1 -sFORCE_FILESYSTEM=1 -sEXPORTED_FUNCTIONS=_main,_wllama_malloc,_wllama_start,_wllama_action,_wllama_exit,_wllama_debug -sEXPORTED_RUNTIME_METHODS=ccall,cwrap,HEAPU8 -sNO_EXIT_RUNTIME=1 -Wno-unused-command-line-argument"
```
26-27: Document JSPI browser compatibility requirements in project documentation.

JSPI (JavaScript Promise Integration) is currently experimental, with limited mainstream browser support: Chrome has an origin trial (enabled through Chrome 128), Microsoft Edge's origin trial has expired (as of July 22, 2025), and Firefox/Safari remain in tracking mode without shipping. The WebAssembly API has also undergone iterations (e.g., removal of explicit Suspender objects in 2024-2025). Add a compatibility note to the project's README or documentation to inform users of these browser limitations before using the WebGPU build target.
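Given how quickly the JSPI API has shifted, a runtime feature probe is a reasonable complement to documentation. A hedged sketch that checks for the `WebAssembly.Suspending` and `WebAssembly.promising` entry points exposed by current JSPI drafts (engines with only the older Suspender-based API, or no JSPI at all, report false):

```typescript
// Detect JSPI (JavaScript Promise Integration) support at runtime.
// Current drafts expose WebAssembly.Suspending and WebAssembly.promising;
// older engines (or non-JSPI engines) lack one or both.
function hasJSPI(): boolean {
  const wa = (globalThis as any).WebAssembly;
  return (
    typeof wa !== 'undefined' &&
    typeof wa.Suspending === 'function' &&
    typeof wa.promising === 'function'
  );
}
```

A loader could use this to pick the JSPI/WebGPU wasm only when the probe passes and otherwise fall back to the plain CPU build.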
src/wllama.ts (1)
562-587: Consider skipping the multi-thread check when WebGPU is used.

When `useWebGPU` is true, the multi-thread support check (line 569) and subsequent logic are unnecessary since WebGPU always uses single-thread mode (lines 582-584, 587). Consider moving the `isSupportMultiThread()` call inside a conditional to avoid the async check when WebGPU is active.

🔎 Proposed refactor

```diff
 const useWebGPU = this.config.backend === 'webgpu' && navigator.gpu !== undefined;
 if (this.config.backend === 'webgpu' && !useWebGPU) {
   this.logger().warn(
     'WebGPU backend requested but WebGPU is not available, falling back to CPU'
   );
 }
 // detect if we can use multi-thread
-let supportMultiThread = await isSupportMultiThread();
-if (!useWebGPU && !supportMultiThread) {
-  this.logger().warn(
-    'Multi-threads are not supported in this environment, falling back to single-thread'
-  );
-}
+let supportMultiThread = false;
+if (!useWebGPU) {
+  supportMultiThread = await isSupportMultiThread();
+  if (!supportMultiThread) {
+    this.logger().warn(
+      'Multi-threads are not supported in this environment, falling back to single-thread'
+    );
+  }
+}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (5)
- `examples/main/package-lock.json` is excluded by `!**/package-lock.json`
- `package-lock.json` is excluded by `!**/package-lock.json`
- `src/multi-thread/wllama.wasm` is excluded by `!**/*.wasm`
- `src/single-thread/wllama.wasm` is excluded by `!**/*.wasm`
- `src/webgpu-single-thread/wllama.wasm` is excluded by `!**/*.wasm`
📒 Files selected for processing (21)
- `.github/workflows/deploy-examples-main.yml`
- `cpp/actions.hpp`
- `cpp/glue.hpp`
- `cpp/test_glue.cpp`
- `examples/main/package.json`
- `examples/main/src/config.ts`
- `examples/main/src/utils/wllama.context.tsx`
- `examples/main/tsconfig.app.json`
- `llama.cpp`
- `package.json`
- `scripts/build_wasm.sh`
- `scripts/docker-compose.yml`
- `src/glue/messages.ts`
- `src/multi-thread/wllama.js`
- `src/single-thread/wllama.js`
- `src/webgpu-single-thread/wllama.js`
- `src/wllama.ts`
- `src/worker.ts`
- `src/workers-code/generated.ts`
- `src/workers-code/llama-cpp.js`
- `tsconfig.build.json`
💤 Files with no reviewable changes (1)
- src/worker.ts
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-04-18T08:24:14.434Z
Learnt from: ngxson
Repo: ngxson/wllama PR: 0
File: :0-0
Timestamp: 2025-04-18T08:24:14.434Z
Learning: The file `generate_wasm_from_cdn.ts` in the wllama project is a generated script that gets stored on CDN, making it impossible to use `require('../package.json')` to dynamically access version information. This is why version references in this file need to be manually updated when the package version changes.
Applied to files:
- `package.json`
- `scripts/build_wasm.sh`
- `scripts/docker-compose.yml`
🧬 Code graph analysis (3)
src/wllama.ts (1)
src/utils.ts (1)
`isSupportMultiThread` (140-156)
src/workers-code/llama-cpp.js (1)
src/worker.ts (4)
`wllamaStart` (118-126), `wllamaAction` (128-140), `wllamaExit` (142-152), `wllamaDebug` (154-161)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
`GLUE_VERSION` (6-6)
🔇 Additional comments (22)
scripts/docker-compose.yml (1)
35-36: Multi-thread build configuration looks correct.

The multi-thread build properly disables WebGPU (since JSPI doesn't work with pthreads) and applies appropriate pthread flags. The quote escaping in `PTHREAD_POOL_SIZE=Module[\\\"pthreadPoolSize\\\"]` is complex but necessary for the multi-layer shell/docker environment.

examples/main/src/utils/wllama.context.tsx (1)
62-71: LGTM: Backend configuration is properly integrated.

The backend option is consistently applied in both the initial Wllama instance creation and the reset function, ensuring uniform configuration throughout the application lifecycle.
tsconfig.build.json (1)
30-31: LGTM: WebGPU types correctly configured.

The addition of `@webgpu/types` to the types array properly enables WebGPU type definitions for the build, aligning with the devDependency added in package.json.

package.json (1)
51-51: Update @webgpu/types to the latest stable version.

Version 0.1.68 appears to exceed the latest published version (0.1.66 as of December 23, 2025, per Snyk). Verify the correct version from the npm registry and ensure it matches the intended dependency.
llama.cpp (1)
1-1: Code snippet is missing; unable to verify submodule update claims.

The review comment references a submodule pointer change but provides no actual diff or code context to review. Verification of WebGPU alignment and backward compatibility requires examining the specific commits involved. Include the git diff output in the review to enable proper assessment.
examples/main/package.json (2)
30-30: LGTM!

Adding `@webgpu/types` as a devDependency correctly supports TypeScript type checking for WebGPU APIs used in this PR.
16-16: Verify breaking changes in @huggingface/jinja version 0.5.3.

The upgrade from `^0.2.2` to `^0.5.3` is a significant version jump. The codebase usage is straightforward (Template constructor and render method), but the specific breaking changes between versions cannot be definitively determined without access to the package changelog. Consider checking the @huggingface/jinja GitHub repository releases or changelog to confirm API compatibility.

examples/main/tsconfig.app.json (1)
8-8: LGTM!

The types configuration correctly includes `@webgpu/types` to provide WebGPU type declarations, complementing the devDependency addition.

examples/main/src/config.ts (1)
95-96: LGTM!

The default backend of `'webgpu'` aligns with the PR objectives. The fallback logic to CPU when WebGPU is unavailable should be handled by the `Wllama` class backend selection mechanism.

cpp/actions.hpp (1)
22-32: LGTM - Clean struct extension for device tracking.

Adding `ggml_backend_dev_t device` to `app_t` properly tracks the selected backend device for the model's lifetime.

cpp/glue.hpp (2)
24-24: LGTM - Version bump correctly reflects breaking protocol change.

The `GLUE_VERSION` bump from 1 to 2 properly indicates the breaking change from `n_gpu_layers` (int) to `use_webgpu` (bool), ensuring version-mismatch detection between the TypeScript and C++ sides.
497-497: LGTM - Protocol change from n_gpu_layers to use_webgpu.

The field change from `n_gpu_layers` (int) to `use_webgpu` (bool) aligns with the new backend selection model and matches the TypeScript side in `src/glue/messages.ts`.

src/glue/messages.ts (3)
6-6: LGTM - Version bump matches C++ side.

The `GLUE_VERSION = 2` aligns with `cpp/glue.hpp`, ensuring protocol version compatibility.
46-50: LGTM - Protocol field change is consistent.

The `use_webgpu` boolean field correctly replaces the previous `n_gpu_layers` integer, matching the C++ struct definition.
1006-1006: LGTM - Interface correctly updated for use_webgpu.

The TypeScript interface reflects the protocol change with the `use_webgpu: boolean` field.

.github/workflows/deploy-examples-main.yml (2)
31-36: Likely an incorrect or invalid review comment.
23-29: No action needed.

The workflow correctly handles the `@wllama/wllama` file:// dependency through `npm ci`, which resolves local package references. The WASM files (`src/single-thread/wllama.wasm`, `src/multi-thread/wllama.wasm`) are committed to the repository, so they're available to `examples/main` without requiring a separate root package build step. The root package's `build:wasm` script uses Docker and is only needed when regenerating WASM from C++ source, which is not part of this deployment workflow.

Likely an incorrect or invalid review comment.
scripts/build_wasm.sh (1)
5-5: Update EMSDK version with awareness of WebGPU binding deprecation.

The Emscripten 4.0.20 upgrade deprecates the builtin `-sUSE_WEBGPU` binding in favor of the external `emdawnwebgpu` port for WebGPU support. Ensure the build configuration migrates to the recommended emdawnwebgpu approach rather than relying on the deprecated builtin binding. JSPI support is promoted in this release and is now the recommended async integration mechanism.

cpp/test_glue.cpp (1)
15-41: LGTM! Test correctly updated for protocol change.

The test accurately reflects the protocol change from `n_gpu_layers` (numeric) to `use_webgpu` (boolean). Serialization and deserialization assertions are consistent with the new field type.

src/wllama.ts (2)
77-80: LGTM! Backend configuration option is well-defined.

The new `backend` option is properly typed with a clear union type and includes helpful documentation.
618-623: LGTM! Load request correctly updated.

The `use_webgpu` flag is properly passed to the load operation, consistent with the protocol change from `n_gpu_layers` to `use_webgpu`.

src/workers-code/llama-cpp.js (1)
230-247: LGTM! Async wrapper implementation is correct.

The updated `callWrapper` function properly handles both synchronous and asynchronous functions by conditionally passing `{ async: true }` to `cwrap` and using `await` only when needed. The uniform async wrapper interface ensures consistency.
^ latest commits added a new API to wllama that surfaces performance data (tokens/second), and for now, I updated the wasm to support a max size of 4GB, or whatever the machine supports
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/glue.hpp (2)
450-453: Missing `break` statement causes fall-through bug.

The `GLUE_DTYPE_RAW` case is missing a `break` statement, causing it to fall through to `GLUE_DTYPE_ARRAY_BOOL`. This will cause incorrect parsing of raw fields.

🔎 Proposed fix

```diff
 case GLUE_DTYPE_RAW:
   ((glue_raw *)field)->parse(input);
+  break;
 case GLUE_DTYPE_ARRAY_BOOL:
```
464-466: Missing `break` statement at end of switch.

The `GLUE_DTYPE_ARRAY_RAW` case is missing a `break` statement. While it's the last case before the closing brace, adding the `break` is good practice for maintainability.

🔎 Proposed fix

```diff
 case GLUE_DTYPE_ARRAY_RAW:
   ((glue_arr_raw *)field)->parse(input);
+  break;
 }
```
🧹 Nitpick comments (2)
examples/main/src/components/ChatScreen.tsx (1)
190-195: Consider using `e.key` instead of deprecated `e.keyCode`.

`keyCode` is deprecated. For better compatibility and clarity, use `e.key === 'Enter'`.

🔎 Suggested fix

```diff
 onKeyDown={(e) => {
-  if (e.keyCode == 13 && e.shiftKey == false) {
+  if (e.key === 'Enter' && !e.shiftKey) {
     e.preventDefault();
     onSubmit();
   }
 }}
```

src/wllama.ts (1)
600-601: Minor typo: `hwConccurency` should be `hwConcurrency`.

The variable name has an extra 'c'. This is a local variable so it doesn't affect functionality, but correcting the typo improves readability.

🔎 Suggested fix

```diff
-const hwConccurency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
-this.nbThreads = config.n_threads ?? hwConccurency;
+const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
+this.nbThreads = config.n_threads ?? hwConcurrency;
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `src/multi-thread/wllama.wasm` is excluded by `!**/*.wasm`
- `src/single-thread/wllama.wasm` is excluded by `!**/*.wasm`
📒 Files selected for processing (6)
- `cpp/actions.hpp`
- `cpp/glue.hpp`
- `cpp/wllama.cpp`
- `examples/main/src/components/ChatScreen.tsx`
- `src/glue/messages.ts`
- `src/wllama.ts`
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-29T03:44:02.649Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.
Applied to files:
src/wllama.ts
📚 Learning: 2025-04-18T08:24:14.434Z
Learnt from: ngxson
Repo: ngxson/wllama PR: 0
File: :0-0
Timestamp: 2025-04-18T08:24:14.434Z
Learning: The file `generate_wasm_from_cdn.ts` in the wllama project is a generated script that gets stored on CDN, making it impossible to use `require('../package.json')` to dynamically access version information. This is why version references in this file need to be manually updated when the package version changes.
Applied to files:
src/wllama.ts
🧬 Code graph analysis (3)
examples/main/src/components/ChatScreen.tsx (1)
src/wllama.ts (1)
`PerfContextData` (134-143)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
`GLUE_VERSION` (6-6)
src/wllama.ts (2)
src/utils.ts (1)
`isSupportMultiThread` (140-156)

src/glue/messages.ts (2)
`GlueMsgPerfContextRes` (1378-1388), `GlueMsgPerfResetRes` (1396-1399)
🔇 Additional comments (8)
cpp/wllama.cpp (1)
121-122: LGTM!

The new `perf_context` and `perf_reset` actions are correctly integrated into the dispatch table, following the established `WLLAMA_ACTION` macro pattern used by other actions.

examples/main/src/components/ChatScreen.tsx (1)
51-75: LGTM on perf data fetching and reset logic.

The `refreshPerf` and `resetPerf` functions properly handle async operations with loading state (`perfBusy`) and error handling. The UI correctly displays perf metrics (tok/s for prefill and decode) and provides a reset button with appropriate disabled states.

cpp/glue.hpp (1)
823-850: LGTM on new perf message structures.

The `glue_msg_perf_context_req/res` and `glue_msg_perf_reset_req/res` structures are correctly defined with appropriate handler IDs and fields that match the TypeScript interfaces.

cpp/actions.hpp (2)
164-179: LGTM on backend selection and device configuration.

The WebGPU/CPU backend selection logic is well-implemented:

- WebGPU path sets `n_gpu_layers = 999` for full offloading
- CPU path sets `n_gpu_layers = 0`
- Proper error handling if no backend device is available
- The `devices[]` array is correctly NULL-terminated, addressing the previous review feedback
799-832: LGTM on perf context actions.

Both `action_perf_context` and `action_perf_reset` correctly:

- Parse the request using the `PARSE_REQ` macro
- Check for null `app.ctx` before accessing performance data
- Return appropriate success/failure responses
- Follow the established action pattern in the codebase
src/glue/messages.ts (1)
1-6: LGTM on GLUE protocol updates.

The GLUE_VERSION bump to 2 and all associated changes (`use_webgpu`, `no_perf` fields, perf message types) are consistent with the C++ glue.hpp definitions. Since this file is auto-generated, the consistency is expected and correct.

src/wllama.ts (2)
src/wllama.ts (2)
587-615: LGTM on WebGPU-first initialization flow.The logic correctly:
- Checks
preferWebGPUconfig andnavigator.gpuavailability- Falls back to CPU with a warning if WebGPU is unavailable
- Only attempts multi-thread setup when not using WebGPU
- Properly handles missing multi-thread wasm path with a warning
1369-1390: LGTM on performance context API methods.The
getPerfContextandresetPerfContextmethods are correctly implemented:
- Both check that the model is loaded via
checkModelLoaded()- They use the appropriate glue message types
- Return types align with the defined interfaces
Sorry for the late response, I'll try to look into it this week. Re. 32/64-bit builds, I think we can try to fix it in a dedicated PR. Some parts of the code currently assume the address to be 32-bit (noticeably the code to "stream" the file to wasm memory context), but it shouldn't be too complicated to change them to 64-bit. My main concern is that 64-bit can potentially break Safari, but we can probably consider dropping Safari support, or force using the single-thread build if needed. Also, I think the single-thread build won't benefit from 64-bit (larger models), so we can probably constrain 64-bit to only be available on the multi-threaded build.
No worries. As for single-thread vs. multi-thread 64-bit, when you review you'll see that I only added WebGPU support to the single-thread build, because the increased performance comes from offloading to the GPU, not from running multiple threads on the CPU. For example, on my machine I get ~32 t/s decode right now with the single-thread WebGPU backend, and ~10 t/s decode with the multi-thread CPU backend. So we might want to come up with a plan for what wasm backends wllama wants to support, without having too many of them. I'm not sure what the full matrix of browser support for WebGPU/multi-thread is, but in my mind it definitely makes sense to have a single-thread 32-bit build fallback, with or without WebGPU support. Then the question is whether a multi-thread 64-bit build is necessary, or if a single-thread 64-bit build with WebGPU support should be enough for most use cases.
What I'm wondering is whether multi-thread + WebGPU improves performance. I suspect ops that are not supported by WebGPU will benefit from that, as they will fall back to the CPU implementation. But I agree that this can be a later improvement. For now, I think having just 3 builds can be enough:
Edit: see my comment below
ngxson
left a comment
looking again at the change, it seems there's no harm adding webgpu support to all the builds. so I think the only consideration here is whether to enable 64-bit
I think we can distribute 2 x 32-bit builds (single/multi-thread) plus one extra 64-bit build (can be single or multi depending on which one works better on webgpu)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/test_glue.cpp (1)
15-43: Add test coverage for the `no_perf` field.

The test covers the new `use_webgpu` field but not the `no_perf` field also added to `glue_msg_load_req` (see cpp/glue.hpp line 499). Complete test coverage requires testing all new fields.

🧪 Proposed test addition

```diff
 void test_load_req() {
   glue_msg_load_req req;
   req.use_mmap.value = true;
   req.n_gpu_layers.value = 32;
   req.use_webgpu.value = true;
+  req.no_perf.value = false;
   req.seed.value = 42;
   req.n_ctx.value = 2048;
   req.embeddings.value = false;
   req.pooling_type.value = "mean";
   req.handler.serialize(outbuf);

   FILE* fp = fopen("dump.bin", "wb");
   fwrite(outbuf.data.data(), 1, outbuf.data.size(), fp);
   fclose(fp);

   printf("\n----------\n\n");

   glue_msg_load_req req2;
   glue_inbuf inbuf(outbuf.data.data());
   req2.handler.deserialize(inbuf);
   assert(req2.use_mmap.value == true);
   assert(req2.n_gpu_layers.value == 32);
   assert(req2.use_webgpu.value == true);
+  assert(req2.no_perf.value == false);
   assert(req2.seed.value == 42);
   assert(req2.n_ctx.value == 2048);
   assert(req2.embeddings.value == false);
   assert(req2.pooling_type.value == "mean");
 }
```
🧹 Nitpick comments (2)
CMakeLists.txt (1)
4-6: Consider adding documentation for the new CMake options.

The new options lack help text. Adding descriptions would improve usability for developers configuring the build.
📝 Suggested documentation
```diff
-option(GGML_WEBGPU "Enable GGML WebGPU backend" ON)
-option(GGML_WEBGPU_JSPI "Enable GGML WebGPU JSPI support" ON)
-option(LLAMA_WASM_MEM64 "Enable 64-bit memory for WebAssembly builds" OFF)
+option(GGML_WEBGPU "Enable GGML WebGPU backend for GPU acceleration in WASM builds" ON)
+option(GGML_WEBGPU_JSPI "Enable JSPI (JavaScript Promise Integration) for async WebGPU operations" ON)
+option(LLAMA_WASM_MEM64 "Enable 64-bit memory addressing for WebAssembly (experimental, may break Safari)" OFF)
```

src/wllama.ts (1)
597-627: Consider checking WebGPU preference before multi-thread detection.

The current flow checks multi-thread support and configuration before disabling it for WebGPU (lines 620-626). While correct, checking `this.useWebGPU` earlier would avoid unnecessary multi-thread detection when WebGPU is active.

♻️ Optional refactoring to skip multi-thread detection when using WebGPU
Move the WebGPU check before the multi-thread detection:
```diff
+// TODO: investigate why WebGPU + multi-threading causes performance issues
+if (this.useWebGPU) {
+  this.logger().warn(
+    'Multi-threading is not yet supported with WebGPU backend'
+  );
+  this.nbThreads = 1;
+} else if (await isSupportMultiThread()) {
   if (this.pathConfig['multi-thread/wllama.wasm']) {
     const hwConcurrency = Math.floor((navigator.hardwareConcurrency || 1) / 2);
     this.nbThreads = config.n_threads ?? hwConcurrency;
     if (this.nbThreads > 1) {
       this.useMultiThread = true;
     } else {
       this.logger().warn(
         'Falling back single-thread due to n_threads configuration or limited hardware concurrency'
       );
     }
   } else {
     this.logger().warn(
       'Missing paths to "multi-thread/wllama.wasm", falling back to single-thread'
     );
   }
 } else {
   this.logger().warn(
     'Multi-threads are not supported in this environment, falling back to single-thread'
   );
 }
-
-// TODO: investigate why WebGPU + multi-threading causes performance issues
-if (this.useWebGPU) {
-  this.logger().warn(
-    'Disabling multi-threading when using WebGPU backend'
-  );
-  this.useMultiThread = false;
-  this.nbThreads = 1;
-}
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `src/multi-thread/wllama.wasm` is excluded by `!**/*.wasm`
- `src/single-thread/wllama.wasm` is excluded by `!**/*.wasm`
📒 Files selected for processing (14)
- `.github/workflows/deploy-examples-main.yml`
- `CMakeLists.txt`
- `cpp/actions.hpp`
- `cpp/glue.hpp`
- `cpp/test_glue.cpp`
- `examples/main/src/components/ChatScreen.tsx`
- `llama.cpp`
- `scripts/docker-compose.yml`
- `src/glue/messages.ts`
- `src/multi-thread/wllama.js`
- `src/webgpu-single-thread/wllama.js`
- `src/wllama.ts`
- `src/workers-code/generated.ts`
- `src/workers-code/llama-cpp.js`
🚧 Files skipped from review as they are similar to previous changes (3)
- llama.cpp
- examples/main/src/components/ChatScreen.tsx
- scripts/docker-compose.yml
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-12-29T03:44:02.649Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:02.649Z
Learning: In wllama's src/workers-code/llama-cpp.js, the isAsync parameter in callWrapper should only be true for functions included in JSPI_EXPORTS build configuration (wllama_start and wllama_action). Other functions like wllama_exit, wllama_debug, and wllama_malloc should use isAsync=false even though the callWrapper returns an async function, because they don't perform async operations at the C/WASM level.
Applied to files:
src/wllama.ts
📚 Learning: 2025-12-29T03:44:00.085Z
Learnt from: reeselevine
Repo: ngxson/wllama PR: 198
File: src/workers-code/llama-cpp.js:269-279
Timestamp: 2025-12-29T03:44:00.085Z
Learning: In src/workers-code/llama-cpp.js, constrain the isAsync flag passed to callWrapper: set isAsync to true only for functions included in the JSPI_EXPORTS build (specifically wllama_start and wllama_action). For other exported functions such as wllama_exit, wllama_debug, and wllama_malloc, always use isAsync=false since they do not perform async operations at the C/WASM level, even though callWrapper may expose an async function. This should be verifiable by reviewing the build exports and ensuring these functions are not marked as asynchronous in the wrapper when invoked.
Applied to files:
src/workers-code/llama-cpp.js
🧬 Code graph analysis (4)
cpp/actions.hpp (1)
cpp/glue.hpp (1)
`data` (87-90)
src/wllama.ts (2)
src/utils.ts (1)
`isSupportMultiThread` (140-156)

src/glue/messages.ts (2)
`GlueMsgPerfContextRes` (1384-1394), `GlueMsgPerfResetRes` (1402-1405)
src/workers-code/llama-cpp.js (2)
cpp/generate_glue_prototype.js (1)
`name` (39-39)

src/worker.ts (4)
`wllamaStart` (118-126), `wllamaAction` (128-140), `wllamaExit` (142-152), `wllamaDebug` (154-161)
cpp/glue.hpp (1)
src/glue/messages.ts (1)
`GLUE_VERSION` (6-6)
🔇 Additional comments (20)
.github/workflows/deploy-examples-main.yml (2)
3-4: Verify if manual-only deployment is intentional.

The workflow uses `workflow_dispatch` (manual trigger only). If you want automatic deployment on pushes to the examples, consider adding a `push` trigger. If automatic deployment is desired, here's the modification:
🔄 Add automatic trigger on push
```diff
 on:
   workflow_dispatch:
+  push:
+    branches:
+      - master
+    paths:
+      - 'examples/main/**'
```
32-32: Pinned action SHA is properly tied to stable releases (v4.7.6 and v4).

The commit SHA `9d877eea73427180ae43cf98e8914934fe157a1a` corresponds to release v4.7.6 and is also tagged with the major version tag v4, confirming this is a stable-release pinning rather than an arbitrary commit.

cpp/glue.hpp (3)
24-24: LGTM! Protocol version bump is correct.

The version bump to 2 is appropriate given the addition of new fields and message types. This matches the TypeScript side (src/glue/messages.ts line 6).
497-499: LGTM! New LoadReq fields are properly structured.

The `use_webgpu` and `no_perf` boolean fields are correctly added using the GLUE_FIELD macro and match the TypeScript definitions.
824-851: LGTM! Performance context messages are well-structured.

The new performance-monitoring messages follow the existing pattern and provide comprehensive timing and evaluation metrics. The field types and names are consistent with the TypeScript definitions (src/glue/messages.ts lines 859-929).

src/glue/messages.ts (2)
src/glue/messages.ts (2)
6-6: LGTM! TypeScript glue definitions are consistent with C++.

The changes correctly mirror the C++ side (cpp/glue.hpp):

- Protocol version bump to 2 matches
- `use_webgpu` and `no_perf` fields added to LoadReq in the correct order
- Performance context message types and fields are consistent
- All new message types properly added to the `GlueMsg` union

Also applies to: 46-60, 859-929, 1087-1089, 1378-1405, 1458-1458
1-3: No action needed — the file was properly generated and committed correctly.

The generator script `cpp/generate_glue_prototype.js` exists and was committed in the same changeset as `src/glue/messages.ts` (commit d9112bd). Both files show correct timestamps and the working tree is clean. The file header is accurate and the content matches the expected output structure from the generator (TypeScript interfaces with `GLUE_MESSAGE_PROTOTYPES`). The generation process was followed as intended.

src/workers-code/llama-cpp.js (2)
230-247: LGTM: callWrapper async flag implementation is correct.

The updated callWrapper signature correctly accepts isAsync as the 4th parameter and passes `{ async: true }` to Module.cwrap when needed. The returned wrapper appropriately awaits async functions while handling sync functions synchronously. Error handling is preserved in both paths.

Based on learnings, this aligns with the JSPI_EXPORTS build configuration pattern.
269-283: LGTM: Wrapper configurations align with JSPI_EXPORTS.

The isAsync flags are correctly set:

- `wllamaMalloc`, `wllamaExit`, `wllamaDebug`: `isAsync=false` (not in JSPI_EXPORTS)
- `wllamaStart`, `wllamaAction`: `isAsync=true` (included in JSPI_EXPORTS)

Based on learnings, these functions are correctly configured to match the async operations at the C/WASM level.
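The split described here can be sketched independently of Emscripten. In the sketch below, `cwrapLike` is a hypothetical stand-in for `Module.cwrap` (not the real Emscripten function), used only to show why the `{ async: true }` option and the `await` must be gated on the same `isAsync` flag:

```typescript
type NumFn = (...a: number[]) => number;

// Hypothetical stand-in for Emscripten's Module.cwrap: with { async: true }
// it returns a promise-returning stub, otherwise the plain function.
function cwrapLike(
  fn: NumFn,
  opts?: { async: boolean }
): (...a: number[]) => number | Promise<number> {
  return opts?.async ? (...a: number[]) => Promise.resolve(fn(...a)) : fn;
}

// Mirrors the reviewed pattern: request an async cwrap only for JSPI
// exports, and await the result only on that same path.
function callWrapper(fn: NumFn, isAsync: boolean) {
  const wrapped = cwrapLike(fn, isAsync ? { async: true } : undefined);
  return async (...a: number[]): Promise<number> => {
    const ret = wrapped(...a);
    return isAsync ? await (ret as Promise<number>) : (ret as number);
  };
}
```

Both paths still present a uniform promise-based interface to callers, which is why marking `wllamaExit` and friends as `isAsync=false` is safe even though the wrapper itself is an async function.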
cpp/actions.hpp (4)
11-11: LGTM: Backend support infrastructure added correctly.

The include for `ggml-backend.h` and the new `device` field in `app_t` are necessary for WebGPU backend selection. The field is safely initialized to `nullptr`.

Also applies to: 24-24
187-187: Verify that the no_perf field is always initialized.

Similar to the `use_webgpu` concern, `req.no_perf.value` is accessed without checking `req.no_perf.not_null()`. Ensure this field has a default value or is always set by callers.

The verification script in the previous comment will also check the `no_perf` field definition.
800-833: LGTM: Performance context actions implemented correctly. Both `action_perf_context` and `action_perf_reset` follow the established pattern:

- Proper null checks for `app.ctx`
- Appropriate error handling (returning `success=false`)
- Correct integration with llama.cpp performance APIs
164-177: No issue found; `use_webgpu` is a non-nullable field. The field is defined with `GLUE_FIELD(bool, use_webgpu)` (not `GLUE_FIELD_NULLABLE`), marked as non-nullable in the message definition, and always provided by callers. Direct access to `.value` without `.not_null()` is the correct pattern for non-nullable fields and is consistent with how other non-nullable fields like `n_ctx_auto`, `seed`, and `no_perf` are accessed elsewhere in the same file.

src/wllama.ts (7)
26-27: LGTM: New configuration options and interfaces are well-documented. The new imports, config options (`preferWebGPU`, `noPerf`), and `PerfContextData` interface are clearly documented and align with the PR's WebGPU integration objectives.

Also applies to: 79-88, 134-143
307-307: LGTM: WebGPU state tracking follows established patterns. The `useWebGPU` field and `usingWebGPU()` method follow the same pattern as `useMultiThread` and `isMultithread()`, with appropriate state initialization and model-loaded guards.

Also applies to: 464-467
459-462: LGTM: Simplified thread count accessor. The method now returns `this.nbThreads` directly, removing previous conditional logic. This is consistent with the updated multi-thread handling in `loadModel`.
587-595: LGTM: Proper WebGPU feature detection with fallback. The code correctly checks for `navigator.gpu` availability and provides a clear warning when WebGPU is requested but unavailable. Good defensive programming.
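A minimal sketch of that detection-with-fallback logic; the function and parameter names here are illustrative, not wllama's actual API, and `nav` is injected so the logic is testable outside a browser.

```typescript
// Hypothetical sketch of WebGPU feature detection with CPU fallback.
function detectWebGPU(preferWebGPU: boolean, nav: { gpu?: unknown }): boolean {
  if (!preferWebGPU) return false;
  if (!nav.gpu) {
    // WebGPU was requested but the browser does not expose navigator.gpu.
    console.warn("WebGPU requested but unavailable; falling back to CPU");
    return false;
  }
  return true;
}
```

In a real browser the check is simply `!!navigator.gpu`; injecting the navigator-like object keeps the fallback path unit-testable.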
639-644: LGTM: Proxy initialization simplified. The proxy now receives `this.nbThreads` directly, which is consistent with the updated thread detection logic and eliminates conditional complexity at the initialization point.
658-687: LGTM: Load request properly initialized with WebGPU flags. The load request correctly populates:

- `use_webgpu`: always set to a boolean value
- `no_perf`: uses nullish coalescing with `false` as default
- `n_gpu_layers`: set to 999 for WebGPU (as noted in PR objectives), 0 otherwise
- `n_threads`: uses the pre-computed `this.nbThreads`

This confirms that the fields flagged in cpp/actions.hpp should have safe values, as they're always initialized here.
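The field population described above condenses to a small sketch; the interface and helper below are illustrative, and the real load request carries many more fields.

```typescript
// Illustrative subset of the load request fields discussed above.
interface LoadReqSketch {
  use_webgpu: boolean;
  no_perf: boolean;
  n_gpu_layers: number;
  n_threads: number;
}

function buildLoadReq(
  useWebGPU: boolean,
  noPerf: boolean | undefined,
  nbThreads: number,
): LoadReqSketch {
  return {
    use_webgpu: useWebGPU,             // always a concrete boolean
    no_perf: noPerf ?? false,          // nullish coalescing supplies the default
    n_gpu_layers: useWebGPU ? 999 : 0, // 999 offloads all layers when on GPU
    n_threads: nbThreads,              // pre-computed thread count
  };
}
```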
1382-1403: LGTM: Performance context API correctly implemented. Both `getPerfContext()` and `resetPerfContext()` follow the established pattern:

- Proper model-loaded guards
- Correct action names matching C++ handlers
- Return types aligned with message definitions
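To illustrate how a frontend can turn these metrics into the Prefill/Decode tok/s it displays, here is a hedged sketch: the field names are assumed to mirror llama.cpp's `llama_perf_context_data`, and the exact `PerfContextData` shape in wllama may differ.

```typescript
// Deriving tokens/second from perf-context timings (times in milliseconds).
interface PerfContextSketch {
  t_p_eval_ms: number; // total prompt-processing (prefill) time
  n_p_eval: number;    // prompt tokens evaluated
  t_eval_ms: number;   // total decode time
  n_eval: number;      // tokens decoded
}

function tokensPerSecond(perf: PerfContextSketch) {
  return {
    prefill: (perf.n_p_eval / perf.t_p_eval_ms) * 1000,
    decode: (perf.n_eval / perf.t_eval_ms) * 1000,
  };
}
```

A UI would call `getPerfContext()` after generation, feed the result through a helper like this, and call `resetPerfContext()` when the user resets the session.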
I added WebGPU to the multi-thread build. However, I found that using the multi-thread build led to a pretty significant performance decrease running in my browser. I don't fully understand how llama.cpp routes things in multi-thread mode yet, but I know the WebGPU backend has some inefficiencies/issues with that right now. It's something we definitely need to work on/fix going forward in llama.cpp. I imagine this can be improved, and like you said it would allow non-GPU operations to run more efficiently, but for now I disabled using the multi-thread build with WebGPU and added a TODO in the code to look into this: https://github.com/ngxson/wllama/pull/198/files#diff-dc8126d70d08a19299c727046d114b88791f7fecac5c7735b905ec184bb07436R619. Otherwise, to support 32-bit builds and memory limits, I opened this PR in llama.cpp: ggml-org/llama.cpp#18707. The wasm blobs here are built against that code; once that solution (or something else we come up with) is merged in llama.cpp, I'll update the submodule here.
ngxson left a comment
Looking good! We can merge when ggml-org/llama.cpp#18707 is upstream and blobs are rebuilt.
One thing to note is that I will also need to design an API for lazy-loading tensors. Currently, tensors already offloaded to GPU will still have a copy inside wasm memory, which is wasteful.
Looks like llama.cpp #18707 has been merged to main. Can this PR be merged now?
@reeselevine I cannot push to your branch because it was created from your |
New PR opened here: #201


Adds initial WebGPU integration to wllama. There are still a few issues to be resolved, but opening this PR now so we can start working on them!
To add WebGPU support, this PR does the following:
- Rather than exposing `n_gpu_layers` directly, only the backend is specified in the config, and if WebGPU is requested/available, `n_gpu_layers` is set to 999. (This is how I've been running GPU backends locally, is this the right way to do it?)
- Async WASM functions are added to `JSPI_EXPORTS` so that they are properly handled. `{'async' : true}` also has to be passed to `Module.cwrap` for these functions. We also found that it was necessary to export `HEAPU8`, even for CPU-only builds (maybe this is due to updating emscripten versions?).
- WebGPU is enabled in `examples/main`. For example, I have it here: https://reeselevine.github.io/wllama/

Issues to be resolved
- The WebGPU backend reports device memory as `size_t` for free/total, which in WASM 32-bit builds overflows when the available memory is > 2^32 - 1 bytes. This is a problem on many machines, e.g., my M3 reports 2^32 bytes. So, in the built wasm in this PR, I actually hardcoded the return values from `get_memory` to 2^32 - 1. To address this issue, either (1) we need wllama to support 64-bit builds, which is not trivial (some interfaces between JavaScript/WASM need to change, I think) and might come with other issues, or (2) we change the ggml interface here to not use a `size_t`, or report the available memory in something other than bytes. This also gets to another question: I haven't yet looked into how wllama stores models in memory, whether that can be optimized to avoid keeping a CPU-side copy of the model when it's running on the GPU, or whether it can currently support models larger than 4 GB on machines with more memory.
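The overflow itself is easy to demonstrate with 32-bit truncation; this sketch just mimics what a 32-bit `size_t` does to a 4 GiB memory report (the hardcoded values are from the discussion above, not from any wllama API).

```typescript
// Mimic a 32-bit size_t: JavaScript's `>>> 0` truncates to an unsigned 32-bit int.
const reportedBytes = 2 ** 32;          // e.g. an M3 reporting 4 GiB available
const asSizeT32 = reportedBytes >>> 0;  // wraps to 0 in a 32-bit build
const clamped = 2 ** 32 - 1;            // the hardcoded workaround in this PR
```

Wrapping to 0 is the worst case: the backend would believe no device memory is available at all, which is why the blobs in this PR clamp to 2^32 - 1 instead.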