This project was created to explore kernel optimization in CUDA and to show the performance differences between several matrix multiplication optimization techniques. Check out the blog post this project is based on: https://siboehm.com/articles/22/CUDA-MMM
All of the functions above main are wrapper functions that allow the kernels to be of type kernel info. These functions also calculate the block and grid sizes and allow all of the kernels to be grouped into one vector for easier testing.
In main, some matrix sizes are initialized, some information about the GPU being used is printed, and then the kernels are all added to a vector to be passed into the testing functions. To test different matrix sizes, simply add your matrix sizes in main and call compare_kernels(kernels, size1, size2, size3); as shown at the end of main.
All of the kernels can be found in this folder. Here is a high-level explanation of the optimization each kernel is based on.
-
Naive Kernel
The most basic matrix multiplication using CUDA. It is important to remember that CUDA kernels are written from the perspective of a single thread, so there is no double-loop format for matrix multiplication inside the kernel; each thread computes one output element on its own.
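As a concrete illustration, a naive kernel might look like the following sketch (names and the exact signature are illustrative, not necessarily what this repo uses):

```cuda
// Naive SGEMM sketch: computes C = A * B for an M x K matrix A and a
// K x N matrix B, both stored row-major.
__global__ void sgemm_naive(int M, int N, int K,
                            const float *A, const float *B, float *C) {
    // Each thread is responsible for exactly one element of C.
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < M && col < N) {
        float acc = 0.0f;
        // The only loop is over the shared dimension K; the usual outer
        // double loop over rows and columns is replaced by the thread grid.
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```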
-
GMEM Coalescing
The most important thing to understand about GMEM coalescing is that sequential memory accesses by the threads in a warp can be grouped and executed together. This means that if the 32 threads of a warp (threads 0 through 31) access 32 consecutive elements (indices 0 through 31), the number of memory transactions can be minimized.
Another way to understand why the naive kernel is inefficient is by looking at the output matrix. The naive kernel assigns threads so that the threads in a warp lie in the same column of the output. This prevents memory coalescing because columns are not sequential in memory.
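The fix can be sketched as a remapping of the flat thread index so that consecutive threads in a warp get consecutive columns (BLOCKSIZE and the grid mapping here are assumptions for illustration, e.g. a 32-wide tile launched with BLOCKSIZE * BLOCKSIZE threads per block):

```cuda
// Coalescing sketch: consecutive threadIdx.x values map to consecutive
// `col` values, i.e. consecutive addresses in B and C.
template <const int BLOCKSIZE>
__global__ void sgemm_coalesced(int M, int N, int K,
                                const float *A, const float *B, float *C) {
    const int row = blockIdx.y * BLOCKSIZE + threadIdx.x / BLOCKSIZE;
    const int col = blockIdx.x * BLOCKSIZE + threadIdx.x % BLOCKSIZE;

    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) {
            // B[k * N + col] and C[row * N + col] are now accessed
            // contiguously across the warp.
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}
```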
-
Shared Memory Coalescing
Shared memory is L1-level memory that resides on each streaming multiprocessor (SM).
Using the keyword __shared__ in CUDA allows this on-chip memory to be used explicitly.
This can allow for a great increase in speed because data that is used by many threads on an SM can be stored in SMEM instead of being re-fetched from global memory.
Implementation: a chunk of matrix A and a chunk of matrix B are loaded into shared memory (one float per thread). Then as much work as possible is done on these two chunks, with each thread still in charge of one entry of C. When all the work on a pair of chunks is done, the chunks are moved along the columns of A and the rows of B. This can be thought of as the naive kernel but operating on blocks of numbers (chunks) instead of single numbers.
Note: occupancy is the ratio of active warps per SM to the maximum number of warps per SM.
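The chunk-sliding idea above can be sketched like this (TILE is an assumed tile width such as 32, and the matrix dimensions are assumed to be multiples of TILE to keep the sketch short):

```cuda
// Shared-memory tiling sketch: each thread still owns one element of C,
// but the K-loop is split into TILE-wide chunks staged through SMEM.
template <const int TILE>
__global__ void sgemm_smem(int M, int N, int K,
                           const float *A, const float *B, float *C) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    // Slide the chunks along the columns of A and the rows of B.
    for (int t = 0; t < K; t += TILE) {
        // Each thread loads one float of each chunk.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Do as much work as possible on the staged chunks.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```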
-
1D BlockTiling
This kernel is very similar to SMEM coalescing, but each thread computes a column of values in the output matrix. This allows for more efficient use of SMEM.
This kernel also divides the output matrix into rectangles rather than squares. This allows each thread to compute a column of the output without loading SMEM with a square tile that could be too large.
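The inner loop might look like the following sketch (TM, BK, BN, As, Bs, threadRow, and threadCol are assumed names in the style of the blog post; TM is the number of C entries each thread computes, e.g. 8):

```cuda
// 1D blocktiling inner-loop sketch: after a BK-wide chunk of A and B has
// been staged in SMEM, each thread accumulates TM results instead of 1.
float threadResults[TM] = {0.0f};
for (int k = 0; k < BK; ++k) {
    // One B value is loaded into a register and reused TM times.
    const float bTmp = Bs[k * BN + threadCol];
    for (int resIdx = 0; resIdx < TM; ++resIdx) {
        threadResults[resIdx] +=
            As[(threadRow * TM + resIdx) * BK + k] * bTmp;
    }
}
```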
-
2D BlockTiling
This kernel is similar to the one above but applies the same optimization along the second dimension as well, so that each thread calculates a small 2D block of outputs in C instead of a single column.
While it seems at first that having each thread calculate more outputs would slow the kernel down, it greatly reduces the number of loads per output element of C, which increases the overall speed of the kernel by reducing memory-load overhead.
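A sketch of the 2D version of the inner loop (TM x TN is the per-thread output tile, e.g. 8 x 8; all names are assumptions in the style of the blog post): each thread caches a column slice of A and a row slice of B in registers, then forms their outer product, so each SMEM load is amortized over TM (or TN) outputs.

```cuda
// 2D blocktiling inner-loop sketch.
float regA[TM], regB[TN];
float threadResults[TM * TN] = {0.0f};
for (int k = 0; k < BK; ++k) {
    // Cache slices of the SMEM chunks in registers.
    for (int i = 0; i < TM; ++i)
        regA[i] = As[(threadRow * TM + i) * BK + k];
    for (int j = 0; j < TN; ++j)
        regB[j] = Bs[k * BN + threadCol * TN + j];
    // Outer product: TM * TN multiply-adds per SMEM-load pass.
    for (int i = 0; i < TM; ++i)
        for (int j = 0; j < TN; ++j)
            threadResults[i * TN + j] += regA[i] * regB[j];
}
```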
-
Vectorize
The first optimization in this kernel is transposing the chunk of A. This gives access to sequential elements of A (coalesced memory access), like the earlier kernels already had with B, so that A can be loaded 128 bits at a time instead of 32 bits as before. This is accomplished by transposing the small chunk of A as it is loaded into SMEM, so A is transposed without adding any extra transposition passes.
The next optimization is vectorizing loads and stores to and from GMEM. This essentially means that four floats are combined into a single float4 and moved at once to speed up GMEM access times.
The syntax reinterpret_cast<float4*> reinterprets a float pointer as a float4 pointer. This tells the compiler that the data will be aligned so that it can be loaded in 128-bit chunks (one float4) instead of 32-bit chunks (one float).
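Both ideas together can be sketched like this (innerRowA, innerColA, BM, and the SMEM layout of As are assumed names/conventions for illustration; the pointer must be 16-byte aligned for the float4 load to be valid):

```cuda
// One float4 load moves 128 bits, replacing four separate 32-bit loads.
const float4 tmp =
    reinterpret_cast<const float4 *>(&A[innerRowA * K + innerColA * 4])[0];
// Store the four floats transposed into SMEM (transpose-on-load for A),
// so later reads of As walk sequential addresses.
As[(innerColA * 4 + 0) * BM + innerRowA] = tmp.x;
As[(innerColA * 4 + 1) * BM + innerRowA] = tmp.y;
As[(innerColA * 4 + 2) * BM + innerRowA] = tmp.z;
As[(innerColA * 4 + 3) * BM + innerRowA] = tmp.w;
```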
-
WarpTiling
The warptiling kernel is much more complex; it adds another layer of computation at the warp level that helps threads in a warp share registers efficiently. This reduces shared memory bank conflicts by having threads work with register data instead of constantly accessing shared memory.
The kernel implements hierarchical tiling where each warp computes on a specific sub-region, allowing better resource utilization. This, combined with earlier optimizations for global memory coalescing and shared memory blocking, allows the warptiling kernel to achieve performance that can match or exceed cuBLAS for specific matrix sizes and hardware configurations.
This function tests a single kernel. It performs 10 warmup runs, then runs the kernel a specified number of times, uses CUDA events to benchmark it, and compares the output to a passed reference (h_C_cpu_reference) to ensure accuracy. It then returns all of the relevant info.
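The CUDA-events timing pattern can be sketched as follows (error checking omitted; num_runs, gridDim, blockDim, and the device pointers are placeholder names, not necessarily the ones used here):

```cuda
// Record events on the stream around the launches, then read back the
// elapsed GPU time in milliseconds.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
for (int i = 0; i < num_runs; ++i)
    kernel<<<gridDim, blockDim>>>(M, N, K, d_A, d_B, d_C);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait until the stop event has happened

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
float avg_ms = ms / num_runs;

cudaEventDestroy(start);
cudaEventDestroy(stop);
```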
This function is used to compare a vector of kernels passed to it. It first creates a random matrix for testing, then does one CPU matrix multiplication for accuracy testing. Next it calls test_kernel on each kernel and populates a results vector. These results are used to output the speed and accuracy of each kernel in the comparisons section.
A file that implements cuBLAS matrix multiplication so that it can be used as a benchmark for the other kernels.