
ggml-cpu: add q4_0 repack support for wasm #18858

Draft
aviallon wants to merge 1 commit into ggml-org:master from aviallon:feat/wasm-repack

Conversation

@aviallon
Contributor

Add LLM-written WASM simd128 implementations for ggml_quantize_mat_q8_0_4x4, ggml_quantize_mat_q8_0_4x8, ggml_gemv_q4_0_4x4_q8_0, ggml_gemm_q4_0_4x4_q8_0, ggml_gemv_q4_0_8x8_q8_0, and ggml_gemm_q4_0_8x8_q8_0.
Tested from a custom Wllama build.

@aviallon
Contributor Author

@ngxson this may be of interest to you. Note: it is required to modify wllama's build slightly to use this, as emscripten overrides CMAKE_SYSTEM_PROCESSOR.
Needed options:

  • set(GGML_CPU_REPACK ON CACHE BOOL "enable ggml CPU repack optimizations")
  • set(LLAMA_WASM_MEM64 OFF CACHE BOOL "disable MEMORY64 for wllama")
  • emcmake cmake -DEMSCRIPTEN_SYSTEM_PROCESSOR=wasm …
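Taken together, the options above could also be passed directly at configure time instead of via CACHE directives in CMakeLists.txt. A minimal sketch (the build directory name is an assumption, and the trailing options of the emcmake call are elided in the original comment):

```shell
# Sketch: configure a wllama-style wasm build with q4_0 repack enabled.
# Only the three -D options come from the comment above; everything
# else (build dir, omitted trailing flags) is assumed.
emcmake cmake -B build \
  -DGGML_CPU_REPACK=ON \
  -DLLAMA_WASM_MEM64=OFF \
  -DEMSCRIPTEN_SYSTEM_PROCESSOR=wasm
```

Passing `-D` on the command line and using `set(... CACHE ...)` in CMakeLists.txt are equivalent here; the CACHE form is needed only because emscripten overrides CMAKE_SYSTEM_PROCESSOR and wllama's own CMakeLists sets defaults.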

@ngxson
Collaborator

ngxson commented Jan 15, 2026

If I understand correctly, this requires MEM64 to be disabled, right?

CC @reeselevine for visibility

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 15, 2026
@reeselevine
Collaborator

Interesting, looks like the change in ngxson/wllama@492c423 to move the flags to CMakeLists.txt is what caused the 64-bit build to be enabled by default, since LLAMA_WASM_MEM64 is on by default in llama.cpp right now. So the cache directive is needed to override it, unless it's specified on the command line.

I don't think the simd implementations in this PR would require disabling 64-bit generally, right? It's just that right now, wllama doesn't support the 64-bit builds yet.

@reeselevine
Collaborator

Or actually, I realize ngxson/wllama#200 doesn't include that flag, because the WebGPU integration PR hasn't been merged yet. Maybe the cache directive actually isn't needed?

@aviallon
Contributor Author

> Interesting, looks like the change in ngxson/wllama@492c423 to move the flags to CMakeLists.txt is what caused the 64-bit build to be enabled by default, since LLAMA_WASM_MEM64 is on by default in llama.cpp right now. So the cache directive is needed to override it, unless it's specified on the command line.
>
> I don't think the simd implementations in this PR would require disabling 64-bit generally, right? It's just that right now, wllama doesn't support the 64-bit builds yet.

I actually tried building with 64-bit support enabled, and got errors even when running with node.js directly.

@aviallon
Contributor Author

For the record, with that plus the llama.cpp version bump and -ffast-math -fno-finite-math-only, I get ~50% faster prompt processing (PP) compared to current wllama.
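The flags mentioned above could be supplied at configure time roughly like this (a sketch; whether wllama sets them in its CMakeLists or on the command line is an assumption):

```shell
# Sketch: append the fast-math flags from the comment above to a wasm build.
# -fno-finite-math-only re-enables defined Inf/NaN handling, which
# would otherwise be dropped by -ffast-math.
emcmake cmake -B build \
  -DCMAKE_C_FLAGS="-ffast-math -fno-finite-math-only" \
  -DCMAKE_CXX_FLAGS="-ffast-math -fno-finite-math-only"
```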

3 participants