
Conversation

@asglover

Summary

Fixes #3366

  • Add #ifndef NO_BF16_KERNEL guards to moe_wmma.cu — moe_wmma_gguf.cu already had these guards, but moe_wmma.cu was missing them, causing compilation failures on pre-Ampere GPUs
  • Wire up -DNO_BF16_KERNEL in build.rs — detect compute capability via cudaforge::detect_compute_cap() and pass the define when compute cap < 80

Problem

bf16 WMMA fragment types (nv_bfloat16 with nvcuda::wmma) require compute capability >= 8.0 (Ampere). On sm_75 (Turing/T4) and older GPUs, compiling these fragments produces "incomplete type" errors:

error: incomplete type "nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 16, 16, 16, nv_bfloat16, nvcuda::wmma::row_major>" is not allowed

The NO_BF16_KERNEL preprocessor guard was already present in moe_wmma_gguf.cu but:

  1. It was never actually passed by the build script
  2. moe_wmma.cu was missing the guard entirely
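The guard pattern itself is an ordinary preprocessor conditional. A minimal sketch of its assumed shape (the kernel name and body here are illustrative, not copied from moe_wmma.cu):

```cuda
#include <mma.h>

// When nvcc is invoked with -DNO_BF16_KERNEL (i.e. compute cap < 80),
// the preprocessor drops this block entirely, so the incomplete bf16
// fragment types are never instantiated.
#ifndef NO_BF16_KERNEL
#include <cuda_bf16.h>
using namespace nvcuda;

__global__ void moe_bf16_kernel(/* ... */) {
    // These fragment specializations are only complete on sm_80+.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16,
                   wmma::row_major> a_frag;
    // ... rest of the kernel ...
}
#endif // NO_BF16_KERNEL
```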

Test plan

  • Build candle-kernels on a T4 (sm_75) — compiles successfully with bf16 WMMA kernels excluded
  • Build on an A100/A10 (sm_80+) — bf16 WMMA kernels should still be compiled as before

bf16 WMMA fragment types (nv_bfloat16 with nvcuda::wmma) are only
supported on sm_80+ (Ampere and later). On older architectures like
sm_75 (Turing/T4), compiling these fragments produces "incomplete type"
errors.

moe_wmma_gguf.cu already had #ifndef NO_BF16_KERNEL guards but
moe_wmma.cu was missing them, and the build script never passed the
-DNO_BF16_KERNEL define.

This commit:
- Adds matching #ifndef NO_BF16_KERNEL guards to moe_wmma.cu
- Updates build.rs to detect compute capability via cudaforge and
  pass -DNO_BF16_KERNEL when building for GPUs with compute cap < 80
@asglover
Author

This is a Claude-generated PR, but it does fix the lack of guards on the new MoE kernels. I'm happy to pull it or modify it to make it mergeable and up to your standards. I have tested that it allows candle-kernels to be built on GitHub runners.


Development

Successfully merging this pull request may close these issues.

Candle Kernels tries to build wmma kernels on unsupported hardware.
