my-sgemm-lab

My CUDA SGEMM kernel for square row-major matrices, featuring shared-memory block tiling, thread coarsening, vectorized float4 memory access, and double buffering. Benchmarked against cuBLAS on an NVIDIA A100-SXM4-40GB, the kernel reaches up to 15.69 TFLOPS, or 87.14% of cuBLAS throughput at N=4096.

Overview

This project implements a custom CUDA SGEMM kernel for

$C \leftarrow A \times B + C$

for square matrices stored in row-major order. The kernel is designed as a learning-oriented but performance-conscious implementation of modern GEMM optimization ideas:

  • block tiling at thread-block level
  • thread tiling / register blocking at per-thread level
  • shared-memory staging for A and B tiles
  • double-buffered shared memory across K-tiles
  • register-level fragment ping-pong across the bk slices within a K-tile
  • vectorized float4 global/shared accesses where the layout permits
  • A-tile transposition in shared memory to make fixed-k, varying-m fragments contiguous and more bank-friendly during the outer-product update

The code is tuned for Ampere-class GPUs (sm_80) and currently benchmarks square problem sizes 4096 and 8192.

Kernel configuration

The current benchmark uses:

  • BM = 128
  • BN = 128
  • BK = 8
  • TM = 8
  • TN = 8

which means:

  • one thread block computes a 128 x 128 tile of C
  • one thread accumulates an 8 x 8 register tile
  • the K dimension is streamed in chunks of 8
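
The launch configuration implied by these tile sizes can be sketched as follows (kernel name, template signature, and argument names are illustrative, not the actual identifiers in my-sgemm-kernel.cu):

```cuda
// Sketch of the launch configuration implied by the tile sizes.
constexpr int BM = 128, BN = 128, BK = 8, TM = 8, TN = 8;

// Each block has (BM / TM) * (BN / TN) = 16 * 16 = 256 threads,
// and the grid covers the N x N output in 128 x 128 tiles of C.
dim3 block((BM / TM) * (BN / TN));   // 256 threads per block
dim3 grid(N / BN, N / BM);           // assumes N divisible by 128
sgemm_kernel<BM, BN, BK, TM, TN><<<grid, block>>>(d_A, d_B, d_C, N);
```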

Implementation highlights

1. Hierarchical tiling

Each thread block computes one BM x BN tile of the output matrix, while each thread accumulates a TM x TN sub-tile entirely in registers. This increases arithmetic intensity and reduces repeated memory traffic.
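
The per-thread register blocking is the classic outer-product update; a sketch (variable names are illustrative, not taken from the actual kernel):

```cuda
// Per-thread accumulation of a TM x TN sub-tile of C in registers.
// frag_a holds TM values of A for a fixed k; frag_b holds TN values of B.
float acc[TM][TN] = {0.0f};
float frag_a[TM], frag_b[TN];

for (int k = 0; k < BK; ++k) {
    // Load fragments for this k from shared memory (indexing elided).
    for (int i = 0; i < TM; ++i) frag_a[i] = sA[k][thread_row * TM + i];
    for (int j = 0; j < TN; ++j) frag_b[j] = sB[k][thread_col * TN + j];

    // Rank-1 (outer-product) update: TM * TN FMAs per fragment pair,
    // which is what raises arithmetic intensity per shared-memory load.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            acc[i][j] += frag_a[i] * frag_b[j];
}
```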

2. Shared-memory staging

Tiles of A and B are cooperatively loaded from global memory into shared memory. This allows the block to reuse data across many fused multiply-add operations.
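
A cooperative load of one B tile might look like this (a sketch only; index names such as k_tile and block_col are assumed, not the kernel's own):

```cuda
// Cooperative load of one BK x BN tile of B into shared memory.
// Each of the block's threads copies a disjoint slice of the tile.
__shared__ float sB[BK][BN];
int tid = threadIdx.x;
for (int idx = tid; idx < BK * BN; idx += blockDim.x) {
    int r = idx / BN, c = idx % BN;
    sB[r][c] = B[(k_tile * BK + r) * N + block_col * BN + c];
}
__syncthreads();   // tile must be fully staged before any thread consumes it
```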

3. A-tile transposition in shared memory

A is read from global memory in row-major order, but stored in shared memory as [BK][BM]. This reorganizes the data so that, for a fixed k, the thread can load several consecutive m-direction values of A with a single float4 read. In practice, this also gives a more bank-friendly access pattern during the outer-product microkernel.
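
The load-and-transpose step can be sketched like this (indices such as load_row and load_col are illustrative assumptions):

```cuda
// Read 4 consecutive k-values of one row of A with a single float4,
// then scatter them into shared memory transposed as sA[BK][BM].
__shared__ float sA[BK][BM];
float4 v = *reinterpret_cast<const float4*>(
    &A[(block_row * BM + load_row) * N + k_tile * BK + load_col * 4]);
sA[load_col * 4 + 0][load_row] = v.x;
sA[load_col * 4 + 1][load_row] = v.y;
sA[load_col * 4 + 2][load_row] = v.z;
sA[load_col * 4 + 3][load_row] = v.w;

// Payoff at compute time: for a fixed k, the TM = 8 m-direction values a
// thread needs sit contiguously in sA[k][...], so two float4 reads fill
// the whole A fragment.
```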

4. Double buffering across K-tiles

Shared memory is split into two stages. While the current stage is being consumed, the next K-tile is prefetched and then written into the inactive stage, reducing pipeline stalls between outer iterations.
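
The two-stage pipeline can be sketched as the following loop skeleton (load_tile and compute_tile are placeholder helpers, not functions in the actual source):

```cuda
// Double-buffered K-loop skeleton (illustrative).
__shared__ float sA[2][BK][BM];
__shared__ float sB[2][BK][BN];

int stage = 0;
load_tile(sA[0], sB[0], /*k_tile=*/0);   // prologue: fill stage 0
__syncthreads();

for (int k_tile = 0; k_tile < N / BK; ++k_tile) {
    if (k_tile + 1 < N / BK)             // prefetch into the inactive stage
        load_tile(sA[stage ^ 1], sB[stage ^ 1], k_tile + 1);
    compute_tile(sA[stage], sB[stage]);  // consume the current stage
    __syncthreads();                     // prefetch done before switching
    stage ^= 1;
}
```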

5. Register-level ping-pong inside a K-tile

Within one K-tile, the kernel uses a two-buffer fragment scheme in registers: while computing with the current bk, it preloads the next bk+1 fragment from shared memory. The final bk = BK - 1 slice is then finished directly from registers before the next shared-memory stage is activated.
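
A sketch of the fragment ping-pong (load_frag and fma_outer_product are placeholder helpers):

```cuda
// Two register fragment buffers alternate across bk (illustrative).
float frag_a[2][TM], frag_b[2][TN];
load_frag(frag_a[0], frag_b[0], /*bk=*/0);        // preload first slice

for (int bk = 0; bk < BK - 1; ++bk) {
    // Prefetch the bk+1 fragments while computing with the bk fragments,
    // overlapping shared-memory reads with FMA work.
    load_frag(frag_a[(bk + 1) % 2], frag_b[(bk + 1) % 2], bk + 1);
    fma_outer_product(acc, frag_a[bk % 2], frag_b[bk % 2]);
}
// Epilogue: the last slice bk = BK - 1 is finished from registers.
fma_outer_product(acc, frag_a[(BK - 1) % 2], frag_b[(BK - 1) % 2]);
```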

6. Vectorized memory access

The implementation uses float4 loads/stores where addresses are contiguous, reducing instruction count and improving bandwidth utilization.
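
In CUDA this is the standard reinterpret-cast idiom, which compiles to one 128-bit LDG/STG instead of four 32-bit ones; it requires the address to be 16-byte aligned, which is one reason the divisibility assumptions below matter:

```cuda
// Vectorized 128-bit access (addresses must be 16-byte aligned).
float4 v = *reinterpret_cast<const float4*>(&A[row * N + col]);
// ...and the matching vectorized store:
*reinterpret_cast<float4*>(&C[row * N + col]) = v;
```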

Repository structure

.
├── my-sgemm-kernel.cu  # hand-written SGEMM kernel
├── benchmark.cu        # cuBLAS comparison benchmark and CSV output
├── Makefile            # build script (targets sm_80)
└── results.csv         # benchmark results

Build

Requirements:

  • NVIDIA GPU with CUDA support
  • CUDA Toolkit / nvcc
  • cuBLAS

Build the benchmark:

make

This compiles benchmark.cu with:

nvcc -O3 -arch=sm_80 --use_fast_math -lcublas

Run

./benchmark

The benchmark writes results to results.csv and prints a summary table to stdout.

Benchmark setup

Hardware used for the attached results:

  • GPU: NVIDIA A100-SXM4-40GB
  • Precision: FP32 SGEMM
  • Compared against: cuBLAS cublasSgemm
  • Problem sizes: N = 4096, 8192

The benchmark computes throughput as:

$\text{TFLOPS} = \frac{2N^3}{t \cdot 10^{12}}$

where t is the average runtime in seconds.
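
As a worked check, plugging in the measured N = 4096 runtime from the results below:

$\text{TFLOPS} = \frac{2 \cdot 4096^3}{8.7590 \times 10^{-3} \cdot 10^{12}} \approx 15.69$

which matches the tabulated 15.6912 TFLOPS.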

Results

N     My kernel (ms)  My kernel (TFLOPS)  cuBLAS (ms)  cuBLAS (TFLOPS)  % of cuBLAS
4096  8.7590          15.6912             7.6327       18.0066          87.14%
8192  62.3727         17.6281             59.2369      18.5613          94.97%

Takeaways

  • At N=4096, the kernel sustains 15.69 TFLOPS, or 87.14% of cuBLAS throughput.
  • At N=8192, it reaches 17.63 TFLOPS, or 94.97% of cuBLAS throughput, the best result in the current benchmark.

Notes and current assumptions

This implementation is intentionally focused on a tuned square-matrix benchmark configuration rather than a fully general GEMM library. The current kernel assumes:

  • row-major input/output layout
  • square matrices in the benchmark driver
  • the matrix dimension N is divisible by BM, BN, and BK
  • launch configuration matches the template parameters
  • the chosen tiling parameters satisfy the vectorized access pattern used in the kernel
