
Conversation

@haricot
Contributor

@haricot haricot commented Jan 12, 2026

I used the cudarc driver API to automatically detect compute capabilities at build time, which seems more practical than relying on the CUDA_COMPUTE_CAP environment variable:

  • Works out-of-the-box without user configuration
  • Automatically detects multi-GPU setups
  • Falls back to CUDA_COMPUTE_CAP env var if driver init fails (e.g., in CI)

If you prefer a different approach (e.g., nvidia-smi or env var only), I'm happy to adjust.
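
Roughly, the detection logic looks like this. This is a simplified sketch rather than the exact code in this PR, and it assumes a cudarc version that exposes CudaDevice::{count, new, attribute} and the sys::CUdevice_attribute enum (newer cudarc releases rename some of this API):

// Sketch: detect the compute capability of every visible GPU via the cudarc
// driver API, falling back to the CUDA_COMPUTE_CAP env var if the driver
// cannot be initialized (e.g. in CI).
fn detect_compute_caps() -> Vec<usize> {
    use cudarc::driver::{sys::CUdevice_attribute, CudaDevice};

    // Fallback path: parse CUDA_COMPUTE_CAP (e.g. "61", "75", "86").
    let from_env = || -> Vec<usize> {
        std::env::var("CUDA_COMPUTE_CAP")
            .ok()
            .and_then(|v| v.parse::<usize>().ok())
            .map(|cc| vec![cc])
            .unwrap_or_default()
    };

    // Driver init or device enumeration failed: use the env var instead.
    let count = match CudaDevice::count() {
        Ok(n) if n > 0 => n as usize,
        _ => return from_env(),
    };

    let mut caps = Vec::new();
    for ordinal in 0..count {
        if let Ok(dev) = CudaDevice::new(ordinal) {
            let major = dev
                .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR)
                .unwrap_or(0);
            let minor = dev
                .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR)
                .unwrap_or(0);
            caps.push((major * 10 + minor) as usize);
        }
    }
    if caps.is_empty() { from_env() } else { caps }
}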

Currently, the generator's compute_cap method depends on Narsil/bindgen_cuda#18 being merged. And if Narsil/bindgen_cuda#16 is merged as well, CUBIN generation could be extended to multiple architectures to speed up startup and improve optimization.
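
For illustration only (this is not bindgen_cuda's actual API), the detected capabilities could then be turned into one nvcc -gencode pair per architecture, so a CUBIN is emitted for each detected GPU:

// Illustrative helper: map detected compute capabilities to nvcc `-gencode`
// flag pairs, one per architecture, so a CUBIN is built for each detected GPU.
fn gencode_flags(compute_caps: &[usize]) -> Vec<String> {
    let mut caps = compute_caps.to_vec();
    caps.sort_unstable();
    caps.dedup();
    caps.iter()
        .flat_map(|cc| {
            [
                "-gencode".to_string(),
                format!("arch=compute_{cc},code=sm_{cc}"),
            ]
        })
        .collect()
}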

@haricot haricot changed the title from "fix candle-kernels build for CC < 700 (depends merging Narsil/bindgen_cuda#18)" to "fix candle-kernels build for CC < 700" on Jan 12, 2026
Member

@ivarflakstad ivarflakstad left a comment


Thank you for this!
Looks good to me 👌

@guoqingbao could you double check the build steps related to the moe kernels? ☺️

@guoqingbao
Contributor

Thank you for this! Looks good to me 👌

@guoqingbao could you double check the build steps related to the moe kernels? ☺️

This also looks good to me.

@haricot
Contributor Author

haricot commented Jan 25, 2026

related #3331

@DrJesseGlass
Contributor

DrJesseGlass commented Jan 31, 2026

Just want to quickly point out that there are several issues. This won't resolve #3331, because BF16 WMMA has a stricter requirement: SM 80+ (which #3349 resolves). But it does resolve FP16 WMMA, which requires SM 70+. This PR and #3349 are complementary.

I have a branch, created several months back, which is a subset of this work: https://github.com/DrJesseGlass/candle/tree/oldgpu/no-changes, targeting Pascal sm_61. I had assumed we weren't so interested in backwards compatibility.

This seems to merely need a cargo fmt to merge. But I wanted to make it apparent that I can readily provide the atomicAdd polyfill for half on Pascal (CC < 70).

@haricot
Contributor Author

haricot commented Feb 1, 2026

@DrJesseGlass Thank you for your feedback. I understand your need to disable BF16 WMMA for pre-Ampere GPUs.

My goal is to ensure backward compatibility via #2704, where I've added ALLOW_LEGACY_BF16 and ALLOW_LEGACY_FP8, as well as moe_hfma2, a WMMA fallback for CC < 70-80 (tests pass, but real-world testing is still needed). This should therefore work now.
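
As a purely illustrative sketch (not the actual #2704 implementation), opt-in env vars like these could gate extra nvcc defines from the build script:

// Purely illustrative: gate legacy code paths behind opt-in env vars so that
// pre-Ampere (CC < 80) BF16 and pre-Ada (CC < 89) FP8 fallbacks are only
// compiled in when the user explicitly asks for them.
fn legacy_defines(compute_cap: usize) -> Vec<String> {
    let mut defines = Vec::new();
    if compute_cap < 80 && std::env::var("ALLOW_LEGACY_BF16").is_ok() {
        defines.push("-DALLOW_LEGACY_BF16".to_string());
    }
    if compute_cap < 89 && std::env::var("ALLOW_LEGACY_FP8").is_ok() {
        defines.push("-DALLOW_LEGACY_FP8".to_string());
    }
    defines
}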

The `atomicAdd` polyfill for `__half` is already in candle:

#if __CUDA_ARCH__ < 700
// https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd
// The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.
// Solution adapted from https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh#L96-L119
// __device__ __half atomicAdd(__half *address, __half val) {
//   unsigned int *address_as_ui = (unsigned int *) ((char *)address - ((size_t)address & 2));
//   unsigned int old = *address_as_ui;
//   unsigned int assumed;
//   bool unaligned = (size_t) address & 2;
//   do {
//     assumed = old;
//     unsigned int hsum;
//     hsum = unaligned ? (old >> 16) : (old & 0xffff);
//     hsum = __half_as_ushort(__ushort_as_half(hsum) + val);
//     old = atomicCAS(address_as_ui, assumed,
//                     unaligned ? (old & 0xffff) | (hsum << 16) : (old & 0xffff0000) | hsum);
//   } while (assumed != old);
//   return __ushort_as_half(unaligned ? (old >> 16) : (old & 0xffff));
// }
#endif

I also added a polyfill for bfloat16 atomicAdd for CC < 800, along with min/max variants for bf16 atomicAdd.

The current PR aims for CC < 700 compatibility, but primarily prepares a step toward automatically supporting heterogeneous multi-GPU setups.

In the future, generating kernels via slang to enable automatic merge operations and backward compatibility via stensor could be a solution, though with less room for manual optimization.

@haricot
Contributor Author

haricot commented Feb 5, 2026

Closing in favor of #2704 and cudaforge. For WMMA, see either #3349 or #2704 (the hfma2 reply).

@haricot haricot closed this Feb 5, 2026
