@SpenserCai

Support build cubin & Add unittest

This PR adds support for compiling kernels to cubin with bindgen_cuda. Some use cases need cubin output directly from bindgen_cuda, so this PR adds it while preserving backward compatibility. All unit tests pass, and compiling candle-kernels to cubin also succeeds.

Test Result

Unit test

   Finished `test` profile [unoptimized + debuginfo] target(s) in 0.13s
     Running unittests src/lib.rs (target/debug/deps/bindgen_cuda-5ef708f2e28b6cdd)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running tests/build_cubin_format_test.rs (target/debug/deps/build_cubin_format_test-7b7273e571d76e15)

running 1 test
test test_cubin_bindings_format ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.17s

     Running tests/build_cubin_test.rs (target/debug/deps/build_cubin_test-9ae47ea18fa439e2)

running 1 test
test test_build_cubin ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.19s

     Running tests/build_ptx_format_test.rs (target/debug/deps/build_ptx_format_test-0ed271a1a58016d6)

running 1 test
test test_ptx_bindings_format ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.17s

     Running tests/build_test.rs (target/debug/deps/build_test-d94cd1b6361e80f1)

running 1 test
test test_build_ptx ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.16s

   Doc-tests bindgen_cuda

running 17 tests
test src/lib.rs - (line 41) ... ignored
test src/lib.rs - (line 62) ... ignored
test src/lib.rs - (line 98) - compile ... ok
test src/lib.rs - Builder::kernel_paths_glob (line 138) - compile ... ok
test src/lib.rs - Builder::include_paths_glob (line 150) - compile ... ok
test src/lib.rs - Builder::arg (line 173) - compile ... ok
test src/lib.rs - Builder::include_paths (line 129) - compile ... ok
test src/lib.rs - (line 87) - compile ... ok
test src/lib.rs - Builder::build_ptx (line 311) - compile ... ok
test src/lib.rs - Builder::out_dir (line 164) - compile ... ok
test src/lib.rs - (line 31) - compile ... ok
test src/lib.rs - Builder::build_lib (line 195) - compile ... ok
test src/lib.rs - Builder::build_cubin (line 419) - compile ... ok
test src/lib.rs - Builder::cuda_root (line 183) - compile ... ok
test src/lib.rs - (line 52) - compile ... ok
test src/lib.rs - Builder::watch (line 110) - compile ... ok
test src/lib.rs - Builder::kernel_paths (line 96) - compile ... ok

test result: ok. 15 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out; finished in 0.03s

Building cubin in candle-kernels

   Compiling bindgen_cuda v0.1.5 (https://github.com/SpenserCai/bindgen_cuda?branch=support_build_cubin#f89d081f)
   Compiling candle-kernels v0.9.2-alpha.1 (https://github.com/SpenserCai/candle?branch=cuda_kerenl_suport_cubin#c7e54ce1)
   Compiling synstructure v0.13.2
   Compiling zerocopy-derive v0.8.31
   Compiling bytemuck_derive v1.10.2
   Compiling serde_derive v1.0.228
   Compiling thiserror-impl v1.0.69
   Compiling tracing-attributes v0.1.31
   Compiling num_enum_derive v0.7.5
   Compiling displaydoc v0.2.5
   Compiling zerofrom-derive v0.1.6
   Compiling yoke-derive v0.7.5
   Compiling num_enum v0.7.5
   Compiling bytemuck v1.24.0
   Compiling zerofrom v0.1.6
   Compiling yoke v0.7.5
   Compiling num-complex v0.4.6
   Compiling dyn-stack v0.13.2
   Compiling tracing v0.1.43
   Compiling num v0.4.3
   Compiling safetensors v0.4.5
   Compiling safetensors v0.6.2
   Compiling ppv-lite86 v0.2.21
   Compiling rand_chacha v0.9.0
   Compiling rand v0.9.2
   Compiling rand_distr v0.5.1
   Compiling half v2.7.1
   Compiling gemm-common v0.18.2
   Compiling float8 v0.3.0
   Compiling gemm-f32 v0.18.2
   Compiling gemm-c32 v0.18.2
   Compiling gemm-f64 v0.18.2
   Compiling gemm-c64 v0.18.2
   Compiling gemm-f16 v0.18.2
   Compiling gemm v0.18.2
   Compiling ug v0.5.0
   Compiling float8 v0.5.0
   Compiling ug-cuda v0.5.0
   Compiling candle-core v0.9.2-alpha.1 (https://github.com/SpenserCai/candle?branch=cuda_kerenl_suport_cubin#c7e54ce1)
   Compiling cuda_kernel_test v0.1.0 (/home/ubuntu/data/dev/candle_support_cubin/cuda_kernel_test)
    Finished `release` profile [optimized] target(s) in 1m 34s
     Running `target/release/cuda_kernel_test`
=== CUDA Kernel Test Suite ===

🔧 Feature: CUDA enabled
✓ CUDA device created successfully

--- Device Information ---
Device Type: CUDA GPU
GPU ID: 0

Kernel Module Analysis:
  Sample Module: AFFINE
  Module Size: 50664 bytes

  ✓ Format Detected: CUBIN (ELF binary)
  Magic Number: 0x7f 'E' 'L' 'F'
  CUDA Module Format: CUBIN (pre-compiled binary)
  Note: Using architecture-specific optimized kernels
-------------------------

=== Running Kernel Tests ===

1. Testing Binary Kernels (BINARY module)
   Addition: [1,2,3,4] + [5,6,7,8] = [6.0, 8.0, 10.0, 12.0]
   Subtraction: [5,6,7,8] - [1,2,3,4] = [4.0, 4.0, 4.0, 4.0]
   Multiplication: [1,2,3,4] * [5,6,7,8] = [5.0, 12.0, 21.0, 32.0]
   Division: [5,6,7,8] / [1,2,3,4] = [5.0, 3.0, 2.3333333, 2.0]

2. Testing Unary Kernels (UNARY module)
   Sqrt: sqrt([1,4,9,16]) = [1.0, 2.0, 3.0, 4.0]
   Negation: -[1,4,9,16] = [-1.0, -4.0, -9.0, -16.0]
   Exp: exp([0,1,2])[1.0, 2.72, 7.39]

3. Testing Cast Kernels (CAST module)
   F32 -> F16 cast successful
   ✓ F16 -> F32 cast successful: [1.0, 2.0, 3.0, 4.0]
   F32 -> BF16 cast successful
   ✓ BF16 -> F32 cast successful: [1.0, 2.0, 3.0, 4.0]

4. Testing Reduce Kernels (REDUCE module)
   Sum all: sum([[1,2,3],[4,5,6]]) = 21
   Sum dim 0: [5.0, 7.0, 9.0]
   Sum dim 1: [6.0, 15.0]
   Mean all: 3.5
   Max dim 1: [3.0, 6.0]
   Min dim 1: [1.0, 4.0]

5. Testing Matrix Multiplication
   2x2 matmul: [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19.0, 22.0], [43.0, 50.0]]
   64x128 @ 128x32 = 64x32 matmul successful
   ✓ Batch matmul (4, 16x32 @ 32x16) = (4, 16x16) successful

✅ All kernel tests passed!
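
For reference, the format check reported in the "Kernel Module Analysis" output above can be approximated as follows. This is only a minimal sketch, not the code used by cuda_kernel_test: `ModuleFormat` and `detect_module_format` are hypothetical names introduced here, and the PTX branch is an assumed heuristic based on PTX being text that carries a `.version` directive.

```rust
/// Kernel module formats distinguished by this sketch (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ModuleFormat {
    /// Pre-compiled, architecture-specific binary (an ELF file).
    Cubin,
    /// Textual PTX that the driver compiles at load time.
    Ptx,
    /// Neither of the above could be recognized.
    Unknown,
}

/// The four-byte ELF magic number: 0x7f 'E' 'L' 'F'.
const ELF_MAGIC: [u8; 4] = [0x7f, b'E', b'L', b'F'];

/// Classify an embedded kernel module by looking at its bytes.
/// Hypothetical helper: cubin files are ELF binaries, so the magic
/// number is checked first; the PTX check is only a heuristic.
pub fn detect_module_format(bytes: &[u8]) -> ModuleFormat {
    if bytes.starts_with(&ELF_MAGIC) {
        return ModuleFormat::Cubin;
    }
    // Assumption: valid PTX is UTF-8 text containing a `.version` directive.
    match std::str::from_utf8(bytes) {
        Ok(text) if text.contains(".version") => ModuleFormat::Ptx,
        _ => ModuleFormat::Unknown,
    }
}
```

Calling `detect_module_format` on the bytes of an embedded module such as AFFINE would be expected to return `ModuleFormat::Cubin` for a build like the one shown above.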

@SpenserCai
Author

@Narsil @ivarflakstad This PR enables precompiling cubin files so they can be loaded in Candle. The corresponding cudarc support is already complete (chelsea0x3b/cudarc#505). If possible, could you please review this PR? Thank you very much.

@SpenserCai
Author

@haricot @ivarflakstad Fixed

@SpenserCai SpenserCai requested a review from haricot December 22, 2025 10:41
Contributor

@haricot haricot left a comment

Approved: tested locally with:
NVCC_CCBIN="gcc-14" cargo test --test build_cubin_test -- --nocapture --> OK
Thank you!

A thought for a separate PR: we could add support for a CUDA_ROOT_LEGACY or NVCC_BIN variable that points to a legacy CUDA root or to a specific nvcc binary. This is useful when the driver is newer than nvcc (e.g. driver CUDA 13.0 vs. nvcc 12.9).
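
To make the idea concrete, here is a rough sketch of how such a lookup might work. Nothing here exists in bindgen_cuda today: `resolve_nvcc` is a hypothetical helper, and the precedence (NVCC_BIN first, then CUDA_ROOT_LEGACY, then plain `nvcc` from PATH) is only an assumption.

```rust
use std::path::PathBuf;

/// Pick the nvcc binary to invoke (hypothetical helper).
/// `lookup` returns the value of an environment variable, if set.
/// Taking it as a closure keeps the function easy to test.
pub fn resolve_nvcc<F>(lookup: F) -> PathBuf
where
    F: Fn(&str) -> Option<String>,
{
    // Assumed precedence 1: an explicit path to a specific nvcc binary.
    if let Some(bin) = lookup("NVCC_BIN").filter(|v| !v.is_empty()) {
        return PathBuf::from(bin);
    }
    // Assumed precedence 2: a legacy CUDA root, with nvcc under `bin/`.
    if let Some(root) = lookup("CUDA_ROOT_LEGACY").filter(|v| !v.is_empty()) {
        return PathBuf::from(root).join("bin").join("nvcc");
    }
    // Fallback: rely on whatever `nvcc` is found on PATH.
    PathBuf::from("nvcc")
}
```

In a build script this could be called as `resolve_nvcc(|key| std::env::var(key).ok())`.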
