
Wgpu Backend #56

Open

KimHenrikOtte wants to merge 549 commits into EricLBuehler:main from KimHenrikOtte:wgpu_cleanup

Conversation

@KimHenrikOtte

No description provided.

@EricLBuehler
Owner

@KimHenrikOtte thanks for the PR. This is super exciting, please let me know when you are ready for review.

@EricLBuehler EricLBuehler marked this pull request as ready for review March 23, 2025 13:33
@EricLBuehler
Owner

@KimHenrikOtte is this ready for an initial review?

greenrazer and others added 26 commits April 29, 2025 21:35
* tracing page

* warned about asynchronous execution

* cleanup

* added Nsight Systems recommendation
* Add a scattered kv cache.

* Update some comments.
* add Qwen3.rs

* fixed compile error

* attempting to get pr 2903 working with qwen weights

* different qwen variants working

* added moe model

* clippy

* added additional eos token

* translated Korean comments to English as well as I can

* removed specialized Qwen3RmsNorm and replaced with generic Candle RmsNorm

* replaced the custom repeat_kv implementation with candle's repeat_kv (see the sketch after this commit message)

* replace linear with linear_b in attention initialization

* replaced custom kv_cache implementation with candle kv_cache

* style

* replaced explicit broadcast add with normal add in decoder layer

* removed keeping the Rotary embedding layer in the model struct

* used the tie_word_embeddings bool from the config instead of relying on the existence of lm head weights in CausalLM

* removed duplicate code from qwen3_moe

* removed sliding window from qwen3 attention

* removed MoE code

* removed unused option

* Fixed Typo

Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>

* fixed tie word embeddings to use the correct embedding weights instead of the opposite

---------

Co-authored-by: Max <naturale@hufs.ac.kr>
Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
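A minimal sketch of the two swaps called out above, assuming candle's public `candle_transformers::utils::repeat_kv` and `candle_nn::linear_b` helpers (names and shapes are illustrative, not the PR's exact code):

```rust
use candle_core::{Result, Tensor};
use candle_nn::{linear_b, Linear, VarBuilder};
use candle_transformers::utils::repeat_kv;

struct Attention {
    q_proj: Linear,
}

impl Attention {
    fn new(hidden: usize, num_heads: usize, head_dim: usize, vb: VarBuilder) -> Result<Self> {
        // linear_b takes an explicit bias flag, replacing separate
        // linear/linear_no_bias code paths in the attention init.
        let q_proj = linear_b(hidden, num_heads * head_dim, false, vb.pp("q_proj"))?;
        Ok(Self { q_proj })
    }

    fn expand_kv(&self, k: Tensor, num_heads: usize, num_kv_heads: usize) -> Result<Tensor> {
        // candle's repeat_kv duplicates each KV head n_rep times for
        // grouped-query attention, replacing the custom implementation.
        repeat_kv(k, num_heads / num_kv_heads)
    }
}
```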
* Indexing with max-value results in zero/no-op (see the sketch after this commit message).

* Add some testing.

* Also adapt the metal kernels.

* Another test.

* Fix.
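A hedged sketch of the semantics described above, assuming it applies to gather: an index equal to the index dtype's maximum reads back zero instead of erroring (the exact rule lives in the CPU/CUDA/Metal kernels these commits touch):

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    let src = Tensor::new(&[10u32, 20, 30], &dev)?;
    // u32::MAX marks "no source element"; the gathered value should be 0.
    let idx = Tensor::new(&[0u32, u32::MAX, 2], &dev)?;
    let out = src.gather(&idx, 0)?;
    println!("{out}"); // expected: [10, 0, 30]
    Ok(())
}
```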
* fixed quantized_phi3 implementation

* quantized_qwen3 implementation

* Update quantized_phi3.rs

* Update quantized_phi3.rs

* add quantized_qwen3 example

* Clippy fixes.

* Cleanup.

---------

Co-authored-by: Laurent <laurent.mazare@gmail.com>
* added resize to candle-onnx, not currently working

* changed unreachable to bail, and bailed when both scales and sizes are set (see the sketch after this commit message)

* cleanup and added other unused options for this op

* cleanup

* fixed image loading to make output work

* cleanup and removed unused variables

* removed path creation code, and changed unwrap to ?
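A sketch of the mutual-exclusion check mentioned above, assuming candle's `bail!` macro; the function and variable names are illustrative, not the candle-onnx code:

```rust
use candle_core::{bail, Result, Tensor};

// Per the ONNX spec, Resize takes either `scales` or `sizes`, never both.
fn check_resize_inputs(scales: Option<&Tensor>, sizes: Option<&Tensor>) -> Result<()> {
    if scales.is_some() && sizes.is_some() {
        bail!("Resize: only one of 'scales' and 'sizes' may be set")
    }
    Ok(())
}
```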
* optimize KV cache to reduce GPU memory usage

* revert to using candle_nn::kv_cache::KvCache with initial capacity of 512
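For reference, a minimal sketch of that initialization, assuming `candle_nn::kv_cache::KvCache::new(dim, max_seq_len)`:

```rust
use candle_nn::kv_cache::KvCache;

fn new_layer_cache() -> KvCache {
    // Concatenate along dim 2 (the sequence axis of [b, heads, seq, head_dim])
    // with a capacity of 512 instead of pre-allocating the full context
    // window; this is the GPU memory saving the commit is after.
    KvCache::new(2, 512)
}
```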
* OLMo 2 model

* Update olmo-2 to example

* Clippy fix.

---------

Co-authored-by: laurent <laurent.mazare@gmail.com>
* fixed docs quantized-qwen3 README

* fixed docs quantized-qwen2-instruct README
* Add phi-4 support.

* Long-rope support.

* Get clippy to be happy:
- Add `dot()` for vector/matrix products
- Implement the `Frobenius` norm
- Add `mv()` for matrix-vector multiply
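For reference, what two of these ops compute, spelled out with basic tensor ops (a sketch only; the commit presumably exposes dedicated methods):

```rust
use candle_core::{Result, Tensor};

// ||M||_F = sqrt(sum of squared entries)
fn frobenius(m: &Tensor) -> Result<Tensor> {
    m.sqr()?.sum_all()?.sqrt()
}

// dot(u, v) = sum_i u_i * v_i; mv(M, v) is then the matrix-vector product M v.
fn dot(u: &Tensor, v: &Tensor) -> Result<Tensor> {
    u.mul(v)?.sum_all()
}
```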
* onnx attention

* setup an example, adding and fixing onnx ops bit by bit

* model working, output is garbage data

* trilu working (see the sketch after this commit message)

* close but not quite, issues still with scatterND

* closer but the outputs are still slightly wrong

* added tests for trilu and scatterND

* lint

* readme

* clippy

* removed unnecessary comments

* changed device selection, took hyperparameters from model config
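A hedged sketch of what trilu computes for lower=1, k=0, assuming candle's `Tensor::tril2` mask helper (square input for brevity; the ONNX op also supports a diagonal offset k):

```rust
use candle_core::{DType, Device, Result, Tensor};

// Zero out everything above the main diagonal of a square matrix.
fn trilu_lower(x: &Tensor, dev: &Device) -> Result<Tensor> {
    let (n, _) = x.dims2()?; // square input assumed for this sketch
    let mask = Tensor::tril2(n, DType::F32, dev)?; // 1s on and below the diagonal
    x.broadcast_mul(&mask)
}
```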
* qwen-moe rebase

* lint

* fixed rebase error

* swapped the normal MoE model with the CausalMoE model in the example, and swapped the tie word embeddings if statement

* updated readme
* Update KvCache initialization in Qwen3 model to use a fixed max position embedding value of 512

* add doc
* add: wip RNN parameters

* fix: corrected access to tensor dim in rnn

* add: rnn function call

* merged files

* added parameter parsing

* update: rnn parameter parsing

* remove: ONNX descriptions

* update: implemented basic operations

* update: removed comment

* add: RNN test

* update: prepared test values

* fix: operations on tensors

* update: passing tests

* add: test gen script

* changed error message

---------

authored-by: misadowsk <michalsad.protondynamic@gmail.com>
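The recurrence these commits implement is the one from the ONNX RNN spec, H_t = f(X_t·Wᵀ + H_{t-1}·Rᵀ + Wb + Rb), with tanh as the default activation. A single-step sketch (not the PR's code):

```rust
use candle_core::{Result, Tensor};

// One step of the default (tanh) ONNX RNN recurrence; shapes follow the spec:
// x_t: [batch, input], h_prev: [batch, hidden], w: [hidden, input], r: [hidden, hidden].
fn rnn_step(x_t: &Tensor, h_prev: &Tensor, w: &Tensor, r: &Tensor, wb: &Tensor, rb: &Tensor) -> Result<Tensor> {
    let gates = (x_t.matmul(&w.t()?)? + h_prev.matmul(&r.t()?)?)?;
    gates.broadcast_add(wb)?.broadcast_add(rb)?.tanh()
}
```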
* feat: added Elu operator (see the sketch after this commit message)

* feat: added hard swish

* added more tests for hard swish

* cleaned up

---------

authored-by: misadowsk <michalsad.protondynamic@gmail.com>
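For reference, the two activations these commits add: ELU(x) = x for x > 0, else α(eˣ − 1), and HardSwish(x) = x · clamp(x/6 + 1/2, 0, 1). A HardSwish sketch from candle primitives (assuming this is the definition the op lowers to; ELU matches candle-core's existing `elu(alpha)`):

```rust
use candle_core::{Result, Tensor};

// HardSwish(x) = x * clamp(x/6 + 0.5, 0, 1)
fn hard_swish(x: &Tensor) -> Result<Tensor> {
    let gate = ((x / 6.0)? + 0.5)?.clamp(0.0, 1.0)?;
    x.mul(&gate)
}
```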
SpenserCai and others added 30 commits January 21, 2026 22:15
…ggingface#3313)

* mlx gemm opt init

* fix bug

* update

* opt

* opt

* update more shape to matmul benchmark

* remove metal_matmul_benchmark

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Update deps

* add imageproc text feature

* Fix compilation

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
)

* direct transfer for cuda

* allocate some random int data

* implement dummy device, test same-device logic

* update to latest cudarc version

* add cuda and metal checks

* change dependencies to use newer cudarc version

* reduce test to check for different devices

* Fix deps

* Clippy, fix workflow

* Format

* Fix handling for cuda

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
* Use upstream bindgen_cuda crate. Use separate builders for each step. Remove some cargo build warning messages

* bump bindgen_cuda

* fix

* Fix transfer_cuda_to_device test
- fixed MutexGuard-held-across-await clippy warnings for the native build (not yet solved for wasm)
- added a copy to staging_buffer at the end of flush_gpu_command
- fixed ternary_op_wgpu test (allow u8 buffer creation)
- fix warning in candle-wasm-tests
- removed wgpu feature from candle-test example
- improved doc comments
- restructured public api
- ShaderLoader::load now returns a Cow (a shader loader might return a static shader string, or generate one in place); see the sketch below
- Simplified quantized shader loading
- Fixed `MutexGuard` across async methods; on WASM, the locks will be dropped before the await point
- Added multi-thread test
# Conflicts:
#	candle-core/Cargo.toml
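A hedged sketch of the ShaderLoader shape that bullet describes; the trait signature is assumed from the description, not copied from the PR:

```rust
use std::borrow::Cow;

trait ShaderLoader {
    // A loader can hand back a &'static str for compiled-in WGSL, or a
    // String for shaders generated on the fly; Cow covers both without cloning.
    fn load(&self, name: &str) -> Cow<'static, str>;
}

struct StaticShaders;

impl ShaderLoader for StaticShaders {
    fn load(&self, _name: &str) -> Cow<'static, str> {
        // Static, compiled-in shader source (placeholder WGSL).
        Cow::Borrowed("@compute @workgroup_size(64) fn main() {}")
    }
}
```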
