My CUDA SGEMM kernel for square row-major matrices, featuring shared-memory block tiling, thread coarsening, vectorized float4 memory access, and double buffering. Benchmarked against cuBLAS on an NVIDIA A100-SXM4-40GB, the kernel reaches up to 15.69 TFLOPS, or 87.14% of cuBLAS throughput at N=4096.
This project implements a custom CUDA SGEMM kernel for square matrices stored in row-major order. The kernel is designed as a learning-oriented but performance-conscious implementation of modern GEMM optimization ideas:
- block tiling at the thread-block level
- thread tiling / register blocking at the per-thread level
- shared-memory staging for A and B tiles
- double-buffered shared memory across K-tiles
- register-level ping-pong across `bk` iterations
- vectorized `float4` global/shared accesses where the layout permits
- A-tile transposition in shared memory to make fixed-`k`, varying-`m` fragments contiguous and more bank-friendly during the outer-product update
The code is tuned for Ampere-class GPUs (sm_80), and the benchmark currently covers square problem sizes N = 4096 and N = 8192.
The current benchmark uses:
`BM = 128`, `BN = 128`, `BK = 8`, `TM = 8`, `TN = 8`
which means:
- one thread block computes a `128 x 128` tile of `C`
- one thread accumulates an `8 x 8` register tile
- the K dimension is streamed in chunks of `8`
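Under these parameters the launch configuration is fully determined. The sketch below shows one way to wire it up; the kernel name, parameter order, and helper `launch` function are illustrative, not the actual signatures in `my-sgemm-kernel.cu`:

```cuda
// Hypothetical kernel signature matching the tuned configuration.
template <int BM, int BN, int BK, int TM, int TN>
__global__ void sgemm_kernel(int N, const float *A, const float *B, float *C);

// Each block has (BM/TM) * (BN/TN) = 16 * 16 = 256 threads and
// covers one 128 x 128 tile of C; the grid tiles the whole matrix.
void launch(int N, const float *A, const float *B, float *C) {
    dim3 block((128 / 8) * (128 / 8));   // 256 threads per block
    dim3 grid(N / 128, N / 128);         // one block per output tile
    sgemm_kernel<128, 128, 8, 8, 8><<<grid, block>>>(N, A, B, C);
}
```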
Each thread block computes one BM x BN tile of the output matrix, while each thread accumulates a TM x TN sub-tile entirely in registers. This increases arithmetic intensity and reduces repeated memory traffic.
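The per-thread output mapping can be sketched as follows (variable names are illustrative; the real kernel's indexing may differ in detail):

```cuda
// Inside the kernel, for BM = BN = 128, TM = TN = 8:
// each thread owns a TM x TN accumulator kept entirely in registers.
float acc[TM][TN] = {0.0f};

const int tx = threadIdx.x % (BN / TN);   // thread column within the block
const int ty = threadIdx.x / (BN / TN);   // thread row within the block

// Global coordinates of this thread's TM x TN sub-tile of C:
const int row0 = blockIdx.y * BM + ty * TM;
const int col0 = blockIdx.x * BN + tx * TN;
```

Because `acc` lives in registers, each of the `TM * TN = 64` outputs is updated `N` times without touching memory, which is where the arithmetic-intensity gain comes from.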
Tiles of A and B are cooperatively loaded from global memory into shared memory. This allows the block to reuse data across many fused multiply-add operations.
A is read from global memory in row-major order, but stored in shared memory as [BK][BM]. This reorganizes the data so that, for a fixed k, the thread can load several consecutive m-direction values of A with a single float4 read. In practice, this also gives a more bank-friendly access pattern during the outer-product microkernel.
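A minimal sketch of the transposed store, assuming hypothetical `loadRow`/`loadCol` indices (row and k-offset of the `float4` this thread loads) and a `kTile` offset into the K dimension:

```cuda
// As is stored k-major: As[k][m], so fixed-k, varying-m is contiguous.
__shared__ float As[BK][BM];

// One 128-bit read of four consecutive k-values from row-major A...
float4 a = reinterpret_cast<const float4 *>(
    &A[(blockIdx.y * BM + loadRow) * N + kTile + loadCol])[0];

// ...scattered into four different rows of As: this is the transpose.
As[loadCol + 0][loadRow] = a.x;
As[loadCol + 1][loadRow] = a.y;
As[loadCol + 2][loadRow] = a.z;
As[loadCol + 3][loadRow] = a.w;
```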
Shared memory is split into two stages. While the current stage is being consumed, the next K-tile is prefetched and then written into the inactive stage, reducing pipeline stalls between outer iterations.
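The stage toggling can be sketched as below; `loadTile` and `computeTile` are hypothetical helpers standing in for the kernel's actual load and outer-product code:

```cuda
// Two shared-memory stages; compute from stage `buf` while the next
// K-tile is written into stage `buf ^ 1`.
__shared__ float As[2][BK][BM];
__shared__ float Bs[2][BK][BN];

int buf = 0;
loadTile(As[0], Bs[0], /*kTile=*/0);           // prefetch the first K-tile
__syncthreads();

for (int kTile = BK; kTile < N; kTile += BK) {
    loadTile(As[buf ^ 1], Bs[buf ^ 1], kTile); // prefetch the next stage
    computeTile(As[buf], Bs[buf]);             // consume the current stage
    __syncthreads();                           // next stage is now ready
    buf ^= 1;                                  // swap stages
}
computeTile(As[buf], Bs[buf]);                 // final K-tile
```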
Within one K-tile, the kernel uses a two-buffer fragment scheme in registers: while computing with the current bk, it preloads the next bk+1 fragment from shared memory. The final bk = BK - 1 slice is then finished directly from registers before the next shared-memory stage is activated.
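A sketch of the fragment ping-pong, assuming the `acc` register tile from above and a hypothetical `loadFrag` helper that reads one `bk` slice of A and B fragments from shared memory:

```cuda
// Two fragment buffers in registers: compute on one, prefetch the other.
float aFrag[2][TM], bFrag[2][TN];

loadFrag(aFrag[0], bFrag[0], /*bk=*/0);
#pragma unroll
for (int bk = 0; bk < BK - 1; ++bk) {
    // Prefetch the bk+1 fragments while computing on bk.
    loadFrag(aFrag[(bk + 1) % 2], bFrag[(bk + 1) % 2], bk + 1);
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            acc[m][n] += aFrag[bk % 2][m] * bFrag[bk % 2][n];
}
// The last slice (bk = BK - 1) is finished straight from registers.
for (int m = 0; m < TM; ++m)
    for (int n = 0; n < TN; ++n)
        acc[m][n] += aFrag[(BK - 1) % 2][m] * bFrag[(BK - 1) % 2][n];
```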
The implementation uses float4 loads/stores where addresses are contiguous, reducing instruction count and improving bandwidth utilization.
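The idiom, shown here on hypothetical `row`/`col` indices, is a `reinterpret_cast` to `float4`, which turns four 32-bit transactions into one 128-bit one (the address must be 16-byte aligned):

```cuda
// One LDG.128 instead of four LDG.32 when reading row-major B:
float4 b = reinterpret_cast<const float4 *>(&B[row * N + col])[0];

// The same idea applies to the final store of four accumulated results:
reinterpret_cast<float4 *>(&C[row * N + col])[0] =
    make_float4(acc0, acc1, acc2, acc3);
```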
.
├── my-sgemm-kernel.cu # hand-written SGEMM kernel
├── benchmark.cu # cuBLAS comparison benchmark and CSV output
├── Makefile # build script (targets sm_80)
└── results.csv # benchmark results
Requirements:
- NVIDIA GPU with CUDA support
- CUDA Toolkit / `nvcc`
- cuBLAS
Build the benchmark:

```
make
```

This compiles `benchmark.cu` with:

```
nvcc -O3 -arch=sm_80 --use_fast_math -lcublas
```

Run it:

```
./benchmark
```

The benchmark writes results to `results.csv` and prints a summary table to stdout.
Hardware used for the attached results:
- GPU: NVIDIA A100-SXM4-40GB
- Precision: FP32 SGEMM
- Compared against: cuBLAS `cublasSgemm`
- Problem sizes: `N = 4096, 8192`
The benchmark computes throughput as:

```
TFLOPS = 2 * N^3 / (t * 1e12)
```

where `t` is the average runtime in seconds and `2 * N^3` is the standard FLOP count for an N x N SGEMM (one multiply and one add per inner-product term).
| N | My kernel (ms) | My kernel (TFLOPS) | cuBLAS (ms) | cuBLAS (TFLOPS) | % of cuBLAS |
|---|---|---|---|---|---|
| 4096 | 8.7590 | 15.6912 | 7.6327 | 18.0066 | 87.14% |
| 8192 | 62.3727 | 17.6281 | 59.2369 | 18.5613 | 94.97% |
- The kernel sustains 15.69 TFLOPS at `N = 4096`.
- At `N = 8192`, it reaches 17.63 TFLOPS.
- The best result in the current benchmark is 94.97% of cuBLAS throughput on A100 (at `N = 8192`), while the smaller `N = 4096` case reaches 87.14% of cuBLAS.
This implementation is intentionally focused on a tuned square-matrix benchmark configuration rather than a fully general GEMM library. The current kernel assumes:
- row-major input/output layout
- square matrices in the benchmark driver
- `width` is divisible by `BM`, `BN`, and `BK`
- launch configuration matches the template parameters
- the chosen tiling parameters satisfy the vectorized access pattern used in the kernel