[pull] main from huggingface:main#41

Open
pull[bot] wants to merge 345 commits into EricLBuehler:main from huggingface:main

Conversation


@pull pull bot commented Nov 19, 2024

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

LaurentMazare and others added 27 commits April 3, 2025 09:12
* Start updating to cudarc 0.14.

* Adapt a couple more things.

* And a couple more fixes.

* More tweaks.

* And a couple more fixes.

* Bump the major version number.

* Proper module system for the cuda kernels.

* Proper ptx loading.

* Launch the sort kernel.

* Custom op.

* Start using the builder pattern.

* More builder.

* More builder.

* Get candle-core to compile.

* Get the tests to pass.

* Get candle-nn to work too.

* Support for custom cuda functions.

* cudnn fixes.

* Get flash attn to run.

* Switch the crate versions to be alpha.

* Bump the ug dependency.
* added chatGLM readme

* changed wording in readme

* added readme for chinese-clip

* added readme for convmixer

* added readme for custom ops

* added readme for efficientnet

* added readme for llama

* added readme to mnist-training

* added readme to musicgen

* added readme to quantized-phi

* added readme to starcoder2

* added readme to whisper-microphone

* added readme to yi

* added readme to yolo-v3

* added readme to whisper-microphone

* added space to example in glm4 readme

* fixed mamba example readme to run mamba instead of mamba-minimal

* removed slash escape character

* changed moondream image to yolo-v8 example image

* added procedure for making the reinforcement-learning example work with a virtual environment on my machine

* added simple one line summaries to the example readmes without

* changed non-existent image to yolo example's bike.jpg

* added backslash to sam command

* removed trailing - from siglip

* added SoX to silero-vad example readme

* replaced procedure for uv on mac with warning that uv isn't currently compatible with pyo3

* added example to falcon readme

* added --which arg to stella-en-v5 readme

* fixed image path in vgg readme

* fixed the image path in the vit readme

* Update README.md

* Update README.md

* Update README.md

---------

Co-authored-by: Laurent Mazare <laurent.mazare@gmail.com>
* Fix for clippy 1.86.

* More clippy fixes.

* More fixes.
* Add the CSM model.

* Add some code to load the model.

* Load the text tokenizer.

* Add frame generation.

* Get the sampling to work.

* Rope fix.

* Autoregressive generation.

* Generate some audio file.

* Use the actual prompt.

* Support multiple turns.

* Add a very barebone readme.

* Move some of the shared bits to the model.
* Add the SNAC audio tokenizer.

* More snac.

* Again more snac.

* Add some example code for snac.

* Get the weights to load.

* Add to the snac model.

* Fixes.

* Get round-tripping to work.

* Save/load code files.

* Clippy fix.

* Fmt fix.
* Initial commit: model weights working, prediction incorrect

* moved distilbertformaskedlm into distilbert modeling file

* made maskedLM like bert example, still incorrect predictions

* finally not getting NaNs, fixed attention mask

* getting correct output sentences

* get top k predictions

* fixed output formatting slightly

* added default arg for model_id

* lint

* moved masked token example code from distilbertformaskedlm example to distilbert example

* lint

* removed distilbertformaskedlm example

* cleanup

* clippy

* removed embedding normalization from example

* made output and model dependent on args instead of prompt

* lint

* replaced or_ok anyhow error with anyhow context

* changed error message for mask token not found
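The NaN fix above involved the attention mask. A common cause of NaNs in masked attention (offered here as a plausible illustration, not a claim about this exact bug) is masking with -inf, which makes softmax divide by zero on fully-masked rows. A minimal std-only Rust sketch of the usual remedy, an additive mask with a large finite negative offset; all names are illustrative, not candle's API:

```rust
/// Softmax with an additive attention mask: masked positions receive a
/// large finite negative offset instead of -inf, so even a fully-masked
/// row keeps a positive normalizer and never yields NaN.
fn masked_softmax(scores: &[f64], keep: &[bool]) -> Vec<f64> {
    const NEG: f64 = -1e9;
    let shifted: Vec<f64> = scores
        .iter()
        .zip(keep)
        .map(|(&s, &k)| if k { s } else { s + NEG })
        .collect();
    // Subtract the row max for numerical stability before exponentiating.
    let max = shifted.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = shifted.iter().map(|&s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    // The masked position gets essentially zero probability...
    let p = masked_softmax(&[1.0, 2.0], &[true, false]);
    assert!((p[0] - 1.0).abs() < 1e-6);
    // ...and even a fully-masked row stays finite (no NaN).
    let q = masked_softmax(&[1.0, 2.0], &[false, false]);
    assert!(q.iter().all(|v| v.is_finite()));
    println!("ok");
}
```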
* Cuda cleanup.

* More fixes.
* Avoid using batched-matmul in nn::Linear.

* Also avoid batched matmul in conv1d.

* Also tweak the conv2d.

* Batched tests.

* Also cover conv2d.
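The optimization the commits above describe rests on a simple identity: a linear layer's weight is shared across the batch, so a batched (b, m, k) x (k, n) product equals a single flat (b*m, k) x (k, n) matmul over the reshaped input. A std-only Rust sketch checking that equivalence with a naive row-major matmul (illustrative code, not candle's implementation):

```rust
/// Naive row-major matmul: (m x k) * (k x n) -> (m x n).
fn matmul(a: &[f64], b: &[f64], m: usize, k: usize, n: usize) -> Vec<f64> {
    let mut out = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut s = 0.0;
            for p in 0..k {
                s += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = s;
        }
    }
    out
}

fn main() {
    let (b, m, k, n) = (2, 2, 3, 2);
    let x: Vec<f64> = (0..b * m * k).map(|v| v as f64).collect();
    let w: Vec<f64> = (0..k * n).map(|v| (v as f64) * 0.5).collect();

    // Batched: one (m x k) * (k x n) matmul per batch element.
    let batched: Vec<f64> = (0..b)
        .flat_map(|bi| matmul(&x[bi * m * k..(bi + 1) * m * k], &w, m, k, n))
        .collect();
    // Flattened: a single (b*m x k) * (k x n) matmul, same weight.
    let flat = matmul(&x, &w, b * m, k, n);
    assert_eq!(batched, flat);
    println!("ok");
}
```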
* Add the Orpheus TTS.

* Add a small readme.

* Token fix.

* Support more voices.

* Clippy fixes.
* Support for cudnn conv1d.

* More conv1d work.

* Get the conv1d to work with cudnn.

* Cleanup.
* Exclude candle-book to avoid some CI failures.

* Remove the book CIs.
* Set the algo.

* Expose the cudnn preferred algo for conv ops.
* Gumbel-Softmax sampling.

* Add a sampling test.

* Share the gumbel-softmax bits.
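Gumbel-softmax sampling builds on the Gumbel-max trick: adding independent noise g = -ln(-ln(u)) to the logits and taking the argmax draws an index with probability softmax(logits). A minimal std-only Rust sketch; the function name and the caller-supplied uniforms are illustrative, not candle's API:

```rust
/// Gumbel-max sampling: perturb each logit with Gumbel noise
/// g = -ln(-ln(u)) and take the argmax. `uniforms` are open-interval
/// (0, 1) samples, one per logit, supplied by the caller so the
/// example stays deterministic.
fn gumbel_max_sample(logits: &[f64], uniforms: &[f64]) -> usize {
    logits
        .iter()
        .zip(uniforms)
        .map(|(&l, &u)| l - (-u.ln()).ln())
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let logits = [2.0, 0.5, -1.0];
    // With identical noise on every logit, the largest logit wins.
    assert_eq!(gumbel_max_sample(&logits, &[0.5, 0.5, 0.5]), 0);
    // Extreme noise can flip the draw to a low-probability index.
    assert_eq!(gumbel_max_sample(&logits, &[0.01, 0.5, 0.999]), 2);
    println!("ok");
}
```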
* Use cudarc 0.16.

* Allow for disabling event tracking.

* Tweaks.

* Bump the ug version.

* And bump the candle version too.
…ed CONTRIBUTING.md (#2897)

* added CONTRIBUTING.md to candle-book

* added description to candle-book introduction

* Updated formatting and added different features to candle-book installation

* mnist guide first draft candle-book

* updated mnist guide syntax and grammar for candle-book

* changed HelloWorld - Mnist to Tutorial - Mnist in SUMMARY.md

* updated intro to mnist guide in candle-book
* Retrieve the current positions for rotating KV caches.

* Add the function to the kv cache too.

* More testing.
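To make "current positions" concrete, here is a hypothetical std-only sketch of a rotating (ring-buffer) KV cache: once more tokens have been appended than the cache holds, new entries overwrite the oldest slot, and each slot must report which token position it now contains. The struct and method names are invented for illustration and are not candle's kv_cache API:

```rust
/// Hypothetical rotating KV cache tracking only bookkeeping state.
struct RotatingCache {
    max_len: usize,
    seen: usize, // total tokens appended so far
}

impl RotatingCache {
    fn append(&mut self, n: usize) {
        self.seen += n;
    }

    /// Token position stored in each cache slot, in slot order.
    fn current_positions(&self) -> Vec<usize> {
        let len = self.seen.min(self.max_len);
        (0..len)
            .map(|slot| {
                if self.seen <= self.max_len {
                    slot
                } else {
                    // Slots before the write head hold the newest tokens;
                    // slots at or after it still hold older ones.
                    let head = self.seen % self.max_len;
                    if slot < head {
                        self.seen - head + slot
                    } else {
                        self.seen - head - self.max_len + slot
                    }
                }
            })
            .collect()
    }
}

fn main() {
    // Under capacity: slots simply hold positions 0..seen.
    let mut c = RotatingCache { max_len: 4, seen: 0 };
    c.append(3);
    assert_eq!(c.current_positions(), vec![0, 1, 2]);
    // Over capacity: tokens 4 and 5 overwrote slots 0 and 1.
    c.append(3); // seen = 6
    assert_eq!(c.current_positions(), vec![4, 5, 2, 3]);
    println!("ok");
}
```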
donjuanplatinum and others added 30 commits December 31, 2025 12:14
* add HuberLoss and add Loss Trait

* 1. remove the LaTeX comment in loss.rs
2. add huberloss test

* change the huberloss Loss trait into the same approach as the other loss functions in this file

* cargo fmt

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
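For context on the HuberLoss addition: the Huber loss is quadratic for residuals within delta and linear beyond it, which keeps gradients bounded on outliers. A std-only Rust sketch of the standard formula (illustrative, not the candle-nn implementation):

```rust
/// Huber loss for a single residual `d` with threshold `delta`:
/// 0.5*d^2 when |d| <= delta, otherwise delta*(|d| - 0.5*delta).
fn huber(d: f64, delta: f64) -> f64 {
    let a = d.abs();
    if a <= delta {
        0.5 * d * d
    } else {
        delta * (a - 0.5 * delta)
    }
}

/// Mean Huber loss over predictions vs. targets.
fn huber_loss(pred: &[f64], target: &[f64], delta: f64) -> f64 {
    let n = pred.len() as f64;
    pred.iter()
        .zip(target)
        .map(|(p, t)| huber(p - t, delta))
        .sum::<f64>()
        / n
}

fn main() {
    // Small residual: quadratic branch, 0.5 * 0.5^2 = 0.125.
    assert!((huber(0.5, 1.0) - 0.125).abs() < 1e-12);
    // Large residual: linear branch, 1.0 * (2.0 - 0.5) = 1.5.
    assert!((huber(2.0, 1.0) - 1.5).abs() < 1e-12);
    // Mean over a batch.
    assert!((huber_loss(&[0.5, 2.0], &[0.0, 0.0], 1.0) - 0.8125).abs() < 1e-12);
    println!("ok");
}
```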
* feat!: Make `ug` dep optional

* fix(example/mnist-training): Run all epochs

* doc(`candle-ug`): Crate documentation

* fix: feature-gate the `ComputePipeline` import

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* init z-image

* fixed patchify, unpatchify and latent

* update z_image examples readme

* fixed clippy and rustfmt

* fixed z_image example readme links

* support sdpa and flash-attn in Z-Image and fixed sdpa clippy warning

* fix some readme

* Update candle-transformers/src/models/z_image/transformer.rs

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>

* support --model in example

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* replace cutlass submodule references with explicit build step

* address review comment: use Box leak trick also in attn-v3

* address review comment:

use "git clone --depth 1" instead of sparse checkout

for compatibility with older git versions

* correct version in candle-flash-attn-build/Cargo.toml

* add top-level candle-flash-attn-build crate

* rustfmt

---------

Co-authored-by: Jacob Gorm Hansen <jhansen@dropbox.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Changed CONVT1D_OP to CONVT2D_OP for conv_transpose2d_bf16

* removed the test, as the cause of the bug is apparent.
* quantized and full SmolLM3

* include chrono for prompt

* resolve pub consist and unused var

* formatted

* last spacing in format

* add credits

* chat template

* integrate new chat template for smollm3 example

* fmt and clippy

* improve documentation / correct chat API in doc

* skip compile on documentation

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* minimal example

* toyed with cache

* serve cli

* cleaned up index

* readme

* cleaner interface

* improved formatting

* serve format

* index header unsloth

* demonstrate reasoning and format prompt

* no delay on button disable

* thinking based prompt; boolean switch; handle thinking reentry UI

* remove clear memory profile

* quant-qwen3 wasm using chat template

* thinking and no thinking tweakable mid conversation

* add discussion

* refactored example to use chat template; removed deprecated fns

* clean up inputs and fmt

* remove unused logging import

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* feat: add quantized lfm2 model support

* fmt

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
…3285)

* feat: reduce prefers min dtype based on stride size

* fix: cargo format fix

* Add temporary large/small reduce benchmarks

* Improved / simplified strided indexing

* Begin untangling pow2 concept from indexer

* remove unused get_strided_index_u64

* Make indexer_t handle both cont and strided. indexer.last_dim is a constant

* Remove Pow2Meta

* Remove redundant contiguous_indexer

* Remove redundant strided index impl

* Remove pow2 from kernel call/signature

* Some size_t -> ulong changes

* Use indexer_t for all reduce based kernels

* contiguous indexer last_dim default is 0

* u16 indexing does not provide speedup, and is an unlikely use case

* Let indexer_t dictate indexing dtype

* Introduce finalize concept. Tidy up redundant fns/macros

* Tidying up

* Use store concept instead of finalize

* Simplify arg reduce macros

* Tidying up

* Add reduce kernels for large tensors. Use existing reduce macro to add implementations

Remove large arg reduce kernels as we currently only support writing uint arg reduce results.

* Remove u64 indexed reduce kernels (u32 should suffice). Also max block_dim is 1024, so removing 2048 case from reduce kernels

* Remove suffix from reduce macro

* Remove IDX from reduce macros

* Tidy up reduce kernel call code

* Explicit u32 in reduce call code. Remove IDX from arg reduce macros

* Remove small reduce benchmark

---------

Co-authored-by: Ivar Flakstad <69173633+ivarflakstad@users.noreply.github.com>
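The indexer work above centers on one operation: translating a linear element index into a memory offset through dims and strides. A std-only Rust sketch of that translation (a generic formulation, not the actual kernel code):

```rust
/// Map a linear element index (row-major traversal of `dims`) to a memory
/// offset using per-dimension `strides`. For contiguous strides this is
/// the identity; for transposed layouts it walks the actual memory.
fn strided_index(mut idx: usize, dims: &[usize], strides: &[usize]) -> usize {
    let mut offset = 0;
    // Peel off coordinates from the innermost (fastest-varying) dim outward.
    for (&d, &s) in dims.iter().zip(strides).rev() {
        offset += (idx % d) * s;
        idx /= d;
    }
    offset
}

fn main() {
    // Contiguous 2x3 tensor: strides [3, 1] make this the identity map.
    assert_eq!(strided_index(4, &[2, 3], &[3, 1]), 4);
    // Transposed view (strides [1, 2]): element (1, 1) lives at offset 3.
    assert_eq!(strided_index(4, &[2, 3], &[1, 2]), 3);
    println!("ok");
}
```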

* mlx gemm opt init

* fix bug

* update

* opt

* opt

* update more shape to matmul benchmark

* remove metal_matmul_benchmark

---------

Co-authored-by: ivarflakstad <69173633+ivarflakstad@users.noreply.github.com>
* Update deps

* add imageproc text feature

* Fix compilation

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
* direct transfer for cuda

* allocate some random int data

* implement dummy device, test same-device logic

* update to latest cudarc version

* add cuda and metal checks

* change dependencies to use newer cudarc version

* reduce test to check for different devices

* Fix deps

* Clippy, fix workflow

* Format

* Fix handling for cuda

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
* Use upstream bindgen_cuda crate. Use separate builders for each step. Remove some cargo build warning messages

* bump bindgen_cuda

* fix

* Fix transfer_cuda_to_device test
* Use cudaforge for kernel build

* Fix clippy

* Update cudaforge to v0.1.2

* Fix build candle-examples

---------

Co-authored-by: Eric Buehler <ericlbuehler@gmail.com>
Labels

⤵️ pull merge-conflict Resolve conflicts manually
