
Conversation

@haricot
Contributor

@haricot haricot commented Jan 12, 2026

I used the cudarc driver API to automatically detect compute capabilities at build time, which seems more practical than relying on the CUDA_COMPUTE_CAP environment variable:

  • Works out-of-the-box without user configuration
  • Automatically detects multi-GPU setups
  • Falls back to CUDA_COMPUTE_CAP env var if driver init fails (e.g., in CI)

If you prefer a different approach (e.g., nvidia-smi or env var only), I'm happy to adjust.
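
Roughly, the detection logic looks like this. This is a simplified sketch rather than the exact code in this PR, and it assumes a cudarc version that exposes CudaDevice::{count, new, attribute} and the sys::CUdevice_attribute enum (newer cudarc releases rename some of this API):

// Sketch: detect the compute capability of every visible GPU via the cudarc
// driver API, falling back to the CUDA_COMPUTE_CAP env var if the driver
// cannot be initialized (e.g. in CI).
fn detect_compute_caps() -> Vec<usize> {
    use cudarc::driver::{sys::CUdevice_attribute, CudaDevice};

    // Fallback path: parse CUDA_COMPUTE_CAP (e.g. "61", "75", "86").
    let from_env = || -> Vec<usize> {
        std::env::var("CUDA_COMPUTE_CAP")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .map(|cc| vec![cc])
            .unwrap_or_default()
    };

    // Driver init or device enumeration failed: use the env var instead.
    let count = match CudaDevice::count() {
        Ok(n) if n > 0 => n as usize,
        _ => return from_env(),
    };

    let mut caps = Vec::new();
    for ordinal in 0..count {
        if let Ok(dev) = CudaDevice::new(ordinal) {
            let major = dev
                .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR)
                .unwrap_or(0);
            let minor = dev
                .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR)
                .unwrap_or(0);
            caps.push((major * 10 + minor) as usize);
        }
    }
    if caps.is_empty() { from_env() } else { caps }
}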

Currently, the generator's compute_cap method depends on Narsil/bindgen_cuda#18 being merged. And if Narsil/bindgen_cuda#16 is merged as well, CUBIN generation could be extended to multiple architectures to speed up startup and improve optimization.
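
For illustration only (this is not bindgen_cuda's actual API), the detected capabilities could then be turned into one nvcc -gencode pair per architecture, so a CUBIN is emitted for each detected GPU:

// Illustrative helper: map detected compute capabilities to nvcc `-gencode`
// flag pairs, one per architecture, so a CUBIN is built for each detected GPU.
fn gencode_flags(compute_caps: &[usize]) -> Vec<String> {
    let mut caps = compute_caps.to_vec();
    caps.sort_unstable();
    caps.dedup();
    caps.iter()
        .flat_map(|cc| {
            [
                "-gencode".to_string(),
                format!("arch=compute_{cc},code=sm_{cc}"),
            ]
        })
        .collect()
}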

@haricot haricot changed the title from "fix candle-kernels build for CC < 700 (depends merging Narsil/bindgen_cuda#18)" to "fix candle-kernels build for CC < 700" on Jan 12, 2026
Member

@ivarflakstad ivarflakstad left a comment


Thank you for this!
Looks good to me 👌

@guoqingbao could you double check the build steps related to the moe kernels? ☺️

@guoqingbao
Contributor

Thank you for this! Looks good to me 👌

@guoqingbao could you double check the build steps related to the moe kernels? ☺️

This also looks good to me.

@haricot
Contributor Author

haricot commented Jan 25, 2026

related #3331

@DrJesseGlass
Contributor

DrJesseGlass commented Jan 31, 2026

Just want to quickly point out that there are several issues. This won't resolve #3331, because BF16 WMMA has a stricter requirement: SM 80+ (which #3349 resolves). But it does resolve FP16 WMMA, which requires SM 70+. This PR and #3349 are complementary.

I have a branch, created several months back, which is a subset of this work: https://github.com/DrJesseGlass/candle/tree/oldgpu/no-changes, targeting Pascal sm_61. I had assumed we weren't so interested in backwards compatibility.

This seems to merely need a cargo fmt to merge. But I wanted to make it apparent that I can readily provide the atomicAdd polyfill for half on Pascal (CC < 70).

@haricot
Contributor Author

haricot commented Feb 1, 2026

@DrJesseGlass Thank you for your feedback. I understand your need to disable BF16 WMMA for pre-Ampere GPUs.

My goal is to ensure backward compatibility via #2704, where I've added ALLOW_LEGACY_BF16 and ALLOW_LEGACY_FP8, as well as moe_hfma2, a WMMA fallback for CC < 70-80 (tests pass, but real-world testing is still needed). This should therefore work now.
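
As a purely illustrative sketch (not the actual #2704 implementation), opt-in env vars like these could gate extra nvcc defines from the build script:

// Purely illustrative: gate legacy code paths behind opt-in env vars so that
// pre-Ampere (CC < 80) BF16 and pre-Ada (CC < 89) FP8 fallbacks are only
// compiled in when the user explicitly asks for them.
fn legacy_defines(compute_cap: usize) -> Vec<String> {
    let mut defines = Vec::new();
    if compute_cap < 80 && std::env::var("ALLOW_LEGACY_BF16").is_ok() {
        defines.push("-DALLOW_LEGACY_BF16".to_string());
    }
    if compute_cap < 89 && std::env::var("ALLOW_LEGACY_FP8").is_ok() {
        defines.push("-DALLOW_LEGACY_FP8".to_string());
    }
    defines
}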

The `atomicAdd` polyfill for `__half` is already in candle:

#if __CUDA_ARCH__ < 700
// https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
// The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.
// Solution adapted from https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh#L96-L119
// __device__ __half atomicAdd(__half *address, __half val) {
//   unsigned int *address_as_ui = (unsigned int *) ((char *)address - ((size_t)address & 2));
//   unsigned int old = *address_as_ui;
//   unsigned int assumed;
//   bool unaligned = (size_t) address & 2;
//   do {
//     assumed = old;
//     unsigned int hsum;
//     hsum = unaligned ? (old >> 16) : (old & 0xffff);
//     hsum = __half_as_ushort(__ushort_as_half(hsum) + val);
//     old = atomicCAS(address_as_ui, assumed,
//                     unaligned ? (old & 0xffff) | (hsum << 16) : (old & 0xffff0000) | hsum);
//   } while (assumed != old);
//   return __ushort_as_half(unaligned ? (old >> 16) : (old & 0xffff));
// }
#endif

I also added a polyfill for bfloat16 atomicAdd for CC < 800, along with min/max variants for bf16 atomicAdd.

The current PR aims for CC < 700 compatibility, but primarily prepares a step toward automatically supporting heterogeneous multi-GPU setups.

In the future, generating kernels via slang to enable automatic merge operations and backward compatibility via stensor could be a solution, though with less room for manual optimization.

@haricot
Contributor Author

haricot commented Feb 5, 2026

Closing in favor of #2704 and cudaforge. For WMMA, see either #3349 or #2704 (the hfma2 reply).

@haricot haricot closed this Feb 5, 2026
