
Wgpu Backend #56

Open

KimHenrikOtte wants to merge 549 commits into EricLBuehler:main from KimHenrikOtte:wgpu_cleanup

Conversation

@KimHenrikOtte

No description provided.

@EricLBuehler
Owner

@KimHenrikOtte thanks for the PR. This is super exciting, please let me know when you are ready for review.

@EricLBuehler EricLBuehler marked this pull request as ready for review March 23, 2025 13:33
@EricLBuehler
Owner

@KimHenrikOtte is this ready for an initial review?

greenrazer and others added 26 commits April 29, 2025 21:35
* tracing page

* warned about asynchronous execution

* cleanup

* added Nsight Systems recommendation
* Add a scattered kv cache.

* Update some comments.
* add Qwen3.rs

* fixed compile error

* attempting to get pr 2903 working with qwen weights

* different qwen variants working

* added moe model

* clippy

* added additional eos token

* translated Korean comments to English as well as I can

* removed specialized Qwen3RmsNorm and replaced with generic Candle RmsNorm

* replaced the custom repeat_kv implementation with candle's repeat_kv (see the sketch after this commit message)

* replace linear with linear_b in attention initialization

* replaced custom kv_cache implementation with candle kv_cache

* style

* replaced explicit broadcast add with normal add in decoder layer

* removed keeping the Rotary embedding layer in the model struct

* used the tie_word_embeddings bool from the config instead of relying on the existence of lm head weights in CausalLM

* removed duplicate code from qwen3_moe

* removed sliding window from qwen3 attention

* removed MoE code

* removed unused option

* Fixed Typo

Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* fixed tie word embeddings to use the correct embedding weights instead of the opposite

---------

Co-authored-by: Max <naturale@hufs.ac.kr>
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
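A minimal sketch of the two swaps called out above, assuming candle's public `candle_transformers::utils::repeat_kv` and `candle_nn::linear_b` helpers (names and shapes are illustrative, not the PR's exact code):

```rust
use candle_core::{Result, Tensor};
use candle_nn::{linear_b, Linear, VarBuilder};
use candle_transformers::utils::repeat_kv;

struct Attention {
    q_proj: Linear,
}

impl Attention {
    fn new(hidden: usize, num_heads: usize, head_dim: usize, vb: VarBuilder) -> Result<Self> {
        // linear_b takes an explicit bias flag, replacing separate
        // linear/linear_no_bias code paths in the attention init.
        let q_proj = linear_b(hidden, num_heads * head_dim, false, vb.pp("q_proj"))?;
        Ok(Self { q_proj })
    }

    fn expand_kv(&self, k: Tensor, num_heads: usize, num_kv_heads: usize) -> Result<Tensor> {
        // candle's repeat_kv duplicates each KV head n_rep times for
        // grouped-query attention, replacing the custom implementation.
        repeat_kv(k, num_heads / num_kv_heads)
    }
}
```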
* Indexing with max-value results in zero/no-op (see the sketch after this commit message).

* Add some testing.

* Also adapt the metal kernels.

* Another test.

* Fix.
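A hedged sketch of the semantics described above, assuming it applies to gather: an index equal to the index dtype's maximum reads back zero instead of erroring (the exact rule lives in the CPU/CUDA/Metal kernels these commits touch):

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let src = Tensor::new(&[10u32, 20, 30], &dev)?;
    // u32::MAX marks "no source element"; the gathered value should be 0.
    let idx = Tensor::new(&[0u32, u32::MAX, 2], &dev)?;
    let out = src.gather(&idx, 0)?;
    println!("{out}"); // expected: [10, 0, 30]
    Ok(())
}
```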
* fixed quantized_phi3 implementation

* quantized_qwen3 implementation

* Update quantized_phi3.rs

* Update quantized_phi3.rs

* add quantized_qwen3 example

* Clippy fixes.

* Cleanup.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>
* added resize to candle-onnx, not currently working

* changed unreachable to bail, and bailed when both scales and sizes are set (see the sketch after this commit message)

* cleanup and added other unused options for this op

* cleanup

* fixed image loading to make output work

* cleanup and removed unused variables

* removed path creation code, and changed unwrap to ?
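A sketch of the mutual-exclusion check mentioned above, assuming candle's `bail!` macro; the function and variable names are illustrative, not the candle-onnx code:

```rust
use candle_core::{bail, Result, Tensor};

// Per the ONNX spec, Resize takes either `scales` or `sizes`, never both.
fn check_resize_inputs(scales: Option<&Tensor>, sizes: Option<&Tensor>) -> Result<()> {
    if scales.is_some() && sizes.is_some() {
        bail!("Resize: only one of 'scales' and 'sizes' may be set")
    }
    Ok(())
}
```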
* optimize KV cache to reduce GPU memory usage

* revert to using candle_nn::kv_cache::KvCache with initial capacity of 512
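For reference, a minimal sketch of that initialization, assuming `candle_nn::kv_cache::KvCache::new(dim, max_seq_len)`:

```rust
use candle_nn::kv_cache::KvCache;

fn new_layer_cache() -> KvCache {
    // Concatenate along dim 2 (the sequence axis of [b, heads, seq, head_dim])
    // with a capacity of 512 instead of pre-allocating the full context
    // window; this is the GPU memory saving the commit is after.
    KvCache::new(2, 512)
}
```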
* OLMo 2 model

* Update olmo-2 to example

* Clippy fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
* fixed docs quantized-qwen3 README

* fixed docs quantized-qwen2-instruct README
* Add phi-4 support.

* Long-rope support.

* Get clippy to be happy:
- Add `dot()` for vector/matrix products
- Implement the `Frobenius` norm
- Add `mv()` for matrix-vector multiply
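For reference, what two of these ops compute, spelled out with basic tensor ops (a sketch only; the commit presumably exposes dedicated methods):

```rust
use candle_core::{Result, Tensor};

// ||M||_F = sqrt(sum of squared entries)
fn frobenius(m: &Tensor) -> Result<Tensor> {
    m.sqr()?.sum_all()?.sqrt()
}

// dot(u, v) = sum_i u_i * v_i; mv(M, v) is then the matrix-vector product M v.
fn dot(u: &Tensor, v: &Tensor) -> Result<Tensor> {
    u.mul(v)?.sum_all()
}
```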
* onnx attention

* setup an example, adding and fixing onnx ops bit by bit

* model working, output is garbage data

* trilu working (see the sketch after this commit message)

* close but not quite, issues still with scatterND

* closer but the outputs are still slightly wrong

* added tests for trilu and scatterND

* lint

* readme

* clippy

* removed unnecessary comments

* changed device selection, took hyperparameters from model config
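A hedged sketch of what trilu computes for lower=1, k=0, assuming candle's `Tensor::tril2` mask helper (square input for brevity; the ONNX op also supports a diagonal offset k):

```rust
use candle_core::{DType, Device, Result, Tensor};

// Zero out everything above the main diagonal of a square matrix.
fn trilu_lower(x: &Tensor, dev: &Device) -> Result<Tensor> {
    let (n, _) = x.dims2()?; // square input assumed for this sketch
    let mask = Tensor::tril2(n, DType::F32, dev)?; // 1s on and below the diagonal
    x.broadcast_mul(&mask)
}
```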
* qwen-moe rebase

* lint

* fixed rebase error

* swapped the normal MoE model with the CausalMoE model in the example, and swapped the tie word embeddings if statement

* updated readme
* Update KvCache initialization in Qwen3 model to use a fixed max position embedding value of 512

* add doc
* add: wip RNN parameters

* fix: corrected access to tensor dim in rnn

* add: rnn function call

* merged files

* added parameter parsing

* update: rnn parameter parsing

* remove: ONNX descriptions

* update: implemented basic operations

* update: removed comment

* add: RNN test

* update: prepared test values

* fix: operations on tensors

* update: passing tests

* add: test gen script

* changed error message

---------

authored-by: misadowsk <michalsad.protondynamic@gmail.com>
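The recurrence these commits implement is the one from the ONNX RNN spec, H_t = f(X_t·Wᵀ + H_{t-1}·Rᵀ + Wb + Rb), with tanh as the default activation. A single-step sketch (not the PR's code):

```rust
use candle_core::{Result, Tensor};

// One step of the default (tanh) ONNX RNN recurrence; shapes follow the spec:
// x_t: [batch, input], h_prev: [batch, hidden], w: [hidden, input], r: [hidden, hidden].
fn rnn_step(x_t: &Tensor, h_prev: &Tensor, w: &Tensor, r: &Tensor, wb: &Tensor, rb: &Tensor) -> Result<Tensor> {
    let gates = (x_t.matmul(&w.t()?)? + h_prev.matmul(&r.t()?)?)?;
    gates.broadcast_add(wb)?.broadcast_add(rb)?.tanh()
}
```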
* feat: added Elu operator (see the sketch after this commit message)

* feat: added hard swish

* added more tests for hard swish

* cleaned up

---------

authored-by: misadowsk <michalsad.protondynamic@gmail.com>
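For reference, the two activations these commits add: ELU(x) = x for x > 0, else α(eˣ − 1), and HardSwish(x) = x · clamp(x/6 + 1/2, 0, 1). A HardSwish sketch from candle primitives (assuming this is the definition the op lowers to; ELU matches candle-core's existing `elu(alpha)`):

```rust
use candle_core::{Result, Tensor};

// HardSwish(x) = x * clamp(x/6 + 0.5, 0, 1)
fn hard_swish(x: &Tensor) -> Result<Tensor> {
    let gate = ((x / 6.0)? + 0.5)?.clamp(0.0, 1.0)?;
    x.mul(&gate)
}
```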
SpenserCai and others added 30 commits January 21, 2026 22:15
…ggingface#3313)

* mlx gemm opt init

* fix bug

* update

* opt

* opt

* update more shape to matmul benchmark

* remove metal_matmul_benchmark

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Update deps

* add imageproc text feature

* Fix compilation

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
)

* direct transfer for cuda

* allocate some random int data

* implement dummy device, test same-device logic

* update to latest cudarc version

* add cuda and metal checks

* change dependencies to use newer cudarc version

* reduce test to check for different devices

* Fix deps

* Clippy, fix workflow

* Format

* Fix handling for cuda

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
* Use upstream bindgen_cuda crate. Use separate builders for each step. Remove some cargo build warning messages

* bump bindgen_cuda

* fix

* Fix transfer_cuda_to_device test
- fixed MutexGuard-held-across-await clippy warnings for the native build (not yet solved for wasm)
- added a copy to staging_buffer at the end of flush_gpu_command
- fixed ternary_op_wgpu test (allow u8 buffer creation)
- fix warning in candle-wasm-tests
- removed wgpu feature from candle-test example
- improved doc comments
- restructured public api
- ShaderLoader::load now returns a Cow (a shader loader might return a static shader string, or generate one in place); see the sketch below
- Simplified quantized shader loading
- Fixed `MutexGuard` across async methods; on WASM, the locks will be dropped before the await point
- Added multi-thread test
# Conflicts:
#	candle-core/Cargo.toml
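A hedged sketch of the ShaderLoader shape that bullet describes; the trait signature is assumed from the description, not copied from the PR:

```rust
use std::borrow::Cow;

trait ShaderLoader {
    // A loader can hand back a &'static str for compiled-in WGSL, or a
    // String for shaders generated on the fly; Cow covers both without cloning.
    fn load(&self, name: &str) -> Cow<'static, str>;
}

struct StaticShaders;

impl ShaderLoader for StaticShaders {
    fn load(&self, _name: &str) -> Cow<'static, str> {
        // Static, compiled-in shader source (placeholder WGSL).
        Cow::Borrowed("@compute @workgroup_size(64) fn main() {}")
    }
}
```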
