My CUDA SGEMM kernel for square row-major matrices, featuring shared-memory block tiling, thread coarsening, vectorized float4 memory access, and double buffering. Benchmarked against cuBLAS on an NVIDIA A100-SXM4-40GB, the kernel reaches up to 15.69 TFLOPS, or 87.14% of cuBLAS throughput at N=4096.
This project implements a custom CUDA SGEMM kernel for square matrices stored in row-major order. The kernel is designed as a learning-oriented but performance-conscious implementation of modern GEMM optimization ideas:
- block tiling at the thread-block level
- thread tiling / register blocking at the per-thread level
- shared-memory staging for A and B tiles
- double-buffered shared memory across K-tiles
- register-level ping-pong across `bk` iterations
- vectorized `float4` global/shared accesses where the layout permits
- A-tile transposition in shared memory to make fixed-`k`, varying-`m` fragments contiguous and more bank-friendly during the outer-product update
The code is tuned for Ampere-class GPUs (sm_80), and the benchmark currently covers square problem sizes N = 4096 and N = 8192.
The current benchmark uses:
`BM = 128`, `BN = 128`, `BK = 8`, `TM = 8`, `TN = 8`
which means:
- one thread block computes a `128 x 128` tile of `C`
- one thread accumulates an `8 x 8` register tile
- the K dimension is streamed in chunks of `8`
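Under these parameters the launch configuration is fully determined. The sketch below shows one way to wire it up; the kernel name, parameter order, and helper `launch` function are illustrative, not the actual signatures in `my-sgemm-kernel.cu`:

```cuda
// Hypothetical kernel signature matching the tuned configuration.
template <int BM, int BN, int BK, int TM, int TN>
__global__ void sgemm_kernel(int N, const float *A, const float *B, float *C);

// Each block has (BM/TM) * (BN/TN) = 16 * 16 = 256 threads and
// covers one 128 x 128 tile of C; the grid tiles the whole matrix.
void launch(int N, const float *A, const float *B, float *C) {
    dim3 block((128 / 8) * (128 / 8));   // 256 threads per block
    dim3 grid(N / 128, N / 128);         // one block per output tile
    sgemm_kernel<128, 128, 8, 8, 8><<<grid, block>>>(N, A, B, C);
}
```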
Each thread block computes one BM x BN tile of the output matrix, while each thread accumulates a TM x TN sub-tile entirely in registers. This increases arithmetic intensity and reduces repeated memory traffic.
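The per-thread output mapping can be sketched as follows (variable names are illustrative; the real kernel's indexing may differ in detail):

```cuda
// Inside the kernel, for BM = BN = 128, TM = TN = 8:
// each thread owns a TM x TN accumulator kept entirely in registers.
float acc[TM][TN] = {0.0f};

const int tx = threadIdx.x % (BN / TN);   // thread column within the block
const int ty = threadIdx.x / (BN / TN);   // thread row within the block

// Global coordinates of this thread's TM x TN sub-tile of C:
const int row0 = blockIdx.y * BM + ty * TM;
const int col0 = blockIdx.x * BN + tx * TN;
```

Because `acc` lives in registers, each of the `TM * TN = 64` outputs is updated `N` times without touching memory, which is where the arithmetic-intensity gain comes from.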
Tiles of A and B are cooperatively loaded from global memory into shared memory. This allows the block to reuse data across many fused multiply-add operations.
A is read from global memory in row-major order, but stored in shared memory as [BK][BM]. This reorganizes the data so that, for a fixed k, the thread can load several consecutive m-direction values of A with a single float4 read. In practice, this also gives a more bank-friendly access pattern during the outer-product microkernel.
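A minimal sketch of the transposed store, assuming hypothetical `loadRow`/`loadCol` indices (row and k-offset of the `float4` this thread loads) and a `kTile` offset into the K dimension:

```cuda
// As is stored k-major: As[k][m], so fixed-k, varying-m is contiguous.
__shared__ float As[BK][BM];

// One 128-bit read of four consecutive k-values from row-major A...
float4 a = reinterpret_cast<const float4 *>(
    &A[(blockIdx.y * BM + loadRow) * N + kTile + loadCol])[0];

// ...scattered into four different rows of As: this is the transpose.
As[loadCol + 0][loadRow] = a.x;
As[loadCol + 1][loadRow] = a.y;
As[loadCol + 2][loadRow] = a.z;
As[loadCol + 3][loadRow] = a.w;
```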
Shared memory is split into two stages. While the current stage is being consumed, the next K-tile is prefetched and then written into the inactive stage, reducing pipeline stalls between outer iterations.
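The stage toggling can be sketched as below; `loadTile` and `computeTile` are hypothetical helpers standing in for the kernel's actual load and outer-product code:

```cuda
// Two shared-memory stages; compute from stage `buf` while the next
// K-tile is written into stage `buf ^ 1`.
__shared__ float As[2][BK][BM];
__shared__ float Bs[2][BK][BN];

int buf = 0;
loadTile(As[0], Bs[0], /*kTile=*/0);           // prefetch the first K-tile
__syncthreads();

for (int kTile = BK; kTile < N; kTile += BK) {
    loadTile(As[buf ^ 1], Bs[buf ^ 1], kTile); // prefetch the next stage
    computeTile(As[buf], Bs[buf]);             // consume the current stage
    __syncthreads();                           // next stage is now ready
    buf ^= 1;                                  // swap stages
}
computeTile(As[buf], Bs[buf]);                 // final K-tile
```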
Within one K-tile, the kernel uses a two-buffer fragment scheme in registers: while computing with the current bk, it preloads the next bk+1 fragment from shared memory. The final bk = BK - 1 slice is then finished directly from registers before the next shared-memory stage is activated.
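A sketch of the fragment ping-pong, assuming the `acc` register tile from above and a hypothetical `loadFrag` helper that reads one `bk` slice of A and B fragments from shared memory:

```cuda
// Two fragment buffers in registers: compute on one, prefetch the other.
float aFrag[2][TM], bFrag[2][TN];

loadFrag(aFrag[0], bFrag[0], /*bk=*/0);
#pragma unroll
for (int bk = 0; bk < BK - 1; ++bk) {
    // Prefetch the bk+1 fragments while computing on bk.
    loadFrag(aFrag[(bk + 1) % 2], bFrag[(bk + 1) % 2], bk + 1);
    for (int m = 0; m < TM; ++m)
        for (int n = 0; n < TN; ++n)
            acc[m][n] += aFrag[bk % 2][m] * bFrag[bk % 2][n];
}
// The last slice (bk = BK - 1) is finished straight from registers.
for (int m = 0; m < TM; ++m)
    for (int n = 0; n < TN; ++n)
        acc[m][n] += aFrag[(BK - 1) % 2][m] * bFrag[(BK - 1) % 2][n];
```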
The implementation uses float4 loads/stores where addresses are contiguous, reducing instruction count and improving bandwidth utilization.
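The idiom, shown here on hypothetical `row`/`col` indices, is a `reinterpret_cast` to `float4`, which turns four 32-bit transactions into one 128-bit one (the address must be 16-byte aligned):

```cuda
// One LDG.128 instead of four LDG.32 when reading row-major B:
float4 b = reinterpret_cast<const float4 *>(&B[row * N + col])[0];

// The same idea applies to the final store of four accumulated results:
reinterpret_cast<float4 *>(&C[row * N + col])[0] =
    make_float4(acc0, acc1, acc2, acc3);
```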
.
├── my-sgemm-kernel.cu # hand-written SGEMM kernel
├── benchmark.cu # cuBLAS comparison benchmark and CSV output
├── Makefile # build script (targets sm_80)
└── results.csv # benchmark results
Requirements:
- NVIDIA GPU with CUDA support
- CUDA Toolkit / `nvcc`
- cuBLAS
Build the benchmark:

```
make
```

This compiles `benchmark.cu` with:

```
nvcc -O3 -arch=sm_80 --use_fast_math -lcublas
```

Run it:

```
./benchmark
```

The benchmark writes results to `results.csv` and prints a summary table to stdout.
Hardware used for the attached results:
- GPU: NVIDIA A100-SXM4-40GB
- Precision: FP32 SGEMM
- Compared against: cuBLAS `cublasSgemm`
- Problem sizes: `N = 4096, 8192`
The benchmark computes throughput as:

```
TFLOPS = 2 * N^3 / (t * 1e12)
```

where `t` is the average runtime in seconds and `2 * N^3` is the standard FLOP count for an N x N SGEMM (one multiply and one add per inner-product term).
| N | My kernel (ms) | My kernel (TFLOPS) | cuBLAS (ms) | cuBLAS (TFLOPS) | % of cuBLAS |
|---|---|---|---|---|---|
| 4096 | 8.7590 | 15.6912 | 7.6327 | 18.0066 | 87.14% |
| 8192 | 62.3727 | 17.6281 | 59.2369 | 18.5613 | 94.97% |
- The kernel sustains 15.69 TFLOPS at `N = 4096`.
- At `N = 8192`, it reaches 17.63 TFLOPS.
- The best result in the current benchmark is 94.97% of cuBLAS throughput on A100 (at `N = 8192`), while the smaller `N = 4096` case reaches 87.14% of cuBLAS.
This implementation is intentionally focused on a tuned square-matrix benchmark configuration rather than a fully general GEMM library. The current kernel assumes:
- row-major input/output layout
- square matrices in the benchmark driver
- `width` is divisible by `BM`, `BN`, and `BK`
- launch configuration matches the template parameters
- the chosen tiling parameters satisfy the vectorized access pattern used in the kernel