fix candle-kernels build for CC < 700 #3300
Conversation
ivarflakstad left a comment
Thank you for this!
Looks good to me 👌
@guoqingbao could you double check the build steps related to the moe kernels?
This also looks good to me.

related #3331
Just want to quickly point out a few issues. This won't resolve #3331, because BF16 WMMA has a stricter requirement, SM 80+ (which #3349 resolves), but it does fix FP16 WMMA, which requires SM 70+. This PR and #3349 are complementary. I have a branch created several months back, https://github.com/DrJesseGlass/candle/tree/oldgpu/no-changes, that covers a subset of this for Pascal sm_61, but I assumed we weren't that interested in backwards compatibility. This seems to only need a `cargo fmt` to merge, but I wanted to make it clear that I can readily provide the atomicAdd polyfill for half on Pascal (CC < 70).
@DrJesseGlass Thank you for your feedback. I understand your need to disable BF16 WMMA for pre-Ampere GPUs; my goal is to ensure backward compatibility via #2704, where I've added polyfills. The `atomicAdd` function for `__half` is already in candle (candle-kernels/src/compatibility.cuh, lines 38 to 59 at 3b39794).
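For readers following along without the embedded snippet: this kind of polyfill typically emulates the missing native instruction with a 32-bit `atomicCAS` loop over the word containing the 16-bit value. A minimal sketch of the technique (not the verbatim contents of compatibility.cuh) might look like:

```cuda
#include <cuda_fp16.h>

// Sketch only: emulate atomicAdd on __half for sm < 70 by CAS-looping on the
// 32-bit word that contains the 16-bit value; the add itself runs in float.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 700
__device__ __half atomicAdd(__half *address, __half val) {
    unsigned int *base = (unsigned int *)((size_t)address & ~(size_t)2);
    bool upper = ((size_t)address & 2) != 0;  // which half-word we target
    unsigned int old = *base, assumed;
    do {
        assumed = old;
        unsigned short cur = upper ? (unsigned short)(assumed >> 16)
                                   : (unsigned short)(assumed & 0xffffu);
        float sum = __half2float(__ushort_as_half(cur)) + __half2float(val);
        unsigned short upd = __half_as_ushort(__float2half(sum));
        unsigned int merged = upper
            ? ((assumed & 0x0000ffffu) | ((unsigned int)upd << 16))
            : ((assumed & 0xffff0000u) | (unsigned int)upd);
        old = atomicCAS(base, assumed, merged);
    } while (assumed != old);
    unsigned short prev = upper ? (unsigned short)(old >> 16)
                                : (unsigned short)(old & 0xffffu);
    return __ushort_as_half(prev);  // old value, matching atomicAdd semantics
}
#endif
```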
I also added a polyfill for bfloat16 `atomicAdd` for CC < 800, along with bf16 min/max handling. The current PR aims for CC < 700 compatibility, but it primarily prepares a step toward automatically supporting heterogeneous multi-GPU setups. In the future, generating kernels via slang to enable automatic merge operations and backward compatibility via stensor could be a solution, though with less room for manual optimization.
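As a rough illustration of the CC < 800 min/max case mentioned above: pre-Ampere parts lack native bf16 min/max instructions, so the comparison can be done after converting to float. The function names here are illustrative, not the ones in the PR, and NaN handling is omitted for brevity:

```cuda
#include <cuda_bf16.h>

// Sketch only: pre-sm_80 has no native bf16 min/max, so compare in float.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 800
__device__ __nv_bfloat16 bf16_max(__nv_bfloat16 a, __nv_bfloat16 b) {
    return __bfloat162float(a) > __bfloat162float(b) ? a : b;
}
__device__ __nv_bfloat16 bf16_min(__nv_bfloat16 a, __nv_bfloat16 b) {
    return __bfloat162float(a) < __bfloat162float(b) ? a : b;
}
#endif
```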
I used the cudarc driver API to automatically detect the compute capability at build time, which seems more practical than relying on the CUDA_COMPUTE_CAP environment variable:
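A minimal sketch of that detection in a `build.rs`, assuming cudarc's safe driver API exposes device attributes via `CudaDevice::attribute` (the env-var precedence and fallback behavior here are illustrative, not necessarily what the PR does):

```rust
// build.rs — sketch only; assumes cudarc is a build-dependency and that a
// CUDA driver is available when the crate is compiled.
use cudarc::driver::sys::CUdevice_attribute;
use cudarc::driver::CudaDevice;

fn detect_compute_cap() -> Option<usize> {
    // Probe device 0 for its compute capability, e.g. 6.1 -> 61.
    let dev = CudaDevice::new(0).ok()?;
    let major = dev
        .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR)
        .ok()?;
    let minor = dev
        .attribute(CUdevice_attribute::CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR)
        .ok()?;
    Some((major * 10 + minor) as usize)
}

fn main() {
    // Prefer an explicit CUDA_COMPUTE_CAP, then fall back to device probing.
    let cap = std::env::var("CUDA_COMPUTE_CAP")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .or_else(detect_compute_cap);
    if let Some(cap) = cap {
        println!("cargo:rustc-env=CUDA_COMPUTE_CAP={cap}");
    }
}
```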
If you prefer a different approach (e.g., nvidia-smi or env var only), I'm happy to adjust.
Currently, the generator's compute_cap method depends on Narsil/bindgen_cuda#18 being merged. And if Narsil/bindgen_cuda#16 is also merged, it would become possible to extend CUBIN generation to multiple architectures, accelerating startup and optimization.