A personal experimental CUDA tensor core library for efficient GEMM operations.
FlashTile2 currently provides the following core primitives:
- Global Memory (global.h): Global tile views with row/column-major layouts
- Shared Memory (share.h): Shared memory tiles with optional swizzling
- Register Memory (reg.h): Register tiles for tensor core operations
- G2S (Global to Shared) (copy/g2s.h): Asynchronous copy from global to shared memory
- S2R (Shared to Register) (copy/s2r.h): Efficient shared to register transfers using ldmatrix
- R2G (Register to Global) (copy/r2g.h): Register to global memory stores
- MMA Operations (compute.h): Tensor core matrix multiply-accumulate
- Math Utilities (math.h): SIMD operations and vectorized math
- SwizzleLayout (swizzle.h): Compile-time swizzle pattern generation for bank conflict avoidance
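
To make the bank-conflict point concrete, here is a minimal host-side sketch of an XOR swizzle. It is not the code in swizzle.h: the function names and the 128-byte row / 16-byte chunk parameters are assumptions chosen for illustration. Shared memory has 32 banks of 4 bytes, so a 16-byte chunk covers a group of 4 banks and a 128-byte row covers all 8 groups. When 8 consecutive rows read the same logical chunk (the access pattern of one 8x8 ldmatrix tile), an unswizzled layout sends every row to the same bank group, while XOR-ing the chunk index with the row index spreads them over all 8.

```cpp
// Illustrative parameters (assumed, not taken from FlashTile2):
// each tile row is 128 bytes, split into eight 16-byte chunks.
constexpr int kRowBytes = 128;
constexpr int kChunkBytes = 16;
constexpr int kChunksPerRow = kRowBytes / kChunkBytes;  // 8

// Byte offset of (row, chunk) in a plain row-major tile.
constexpr int linear_offset(int row, int chunk) {
    return row * kRowBytes + chunk * kChunkBytes;
}

// Byte offset after XOR-ing the chunk index with the low bits of the row.
constexpr int swizzled_offset(int row, int chunk) {
    return row * kRowBytes + (chunk ^ (row % kChunksPerRow)) * kChunkBytes;
}

// Number of distinct 4-bank groups touched when rows [0, 8) each read
// the same logical chunk. 1 means an 8-way conflict, 8 means none.
constexpr int distinct_bank_groups(bool swizzle, int chunk) {
    unsigned mask = 0;
    for (int row = 0; row < 8; ++row) {
        const int off = swizzle ? swizzled_offset(row, chunk)
                                : linear_offset(row, chunk);
        mask |= 1u << ((off % kRowBytes) / kChunkBytes);
    }
    int count = 0;
    for (; mask != 0; mask &= mask - 1) ++count;  // count set bits
    return count;
}

// Checked at compile time: same column without swizzle collides,
// with swizzle every row lands in its own bank group.
static_assert(distinct_bank_groups(false, 3) == 1, "unswizzled rows collide");
static_assert(distinct_bank_groups(true, 3) == 8, "swizzled rows are spread out");
```

The block has no main; the static_asserts are evaluated when it is compiled, for example with `g++ -std=c++20 -fsyntax-only`. Because XOR with a fixed row value is a bijection on chunk indices, each row still stores all eight chunks, only in a permuted order.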
Here's a simple example of a tiled GEMM kernel using FlashTile2:
#include "flashtile2/flashtile2.h"
using namespace flashtile2;
// Problem and block dimensions
constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;
// Define GEMM traits
using Traits = kernel::TiledGemmTraits<
    Shape<kM, kN, kK>,           // Problem shape
    Shape<BLK_M, BLK_N, BLK_K>,  // Block shape
    half,                        // Input type
    float,                       // Accumulator type
    MMA_Atom_16x16x16,           // MMA atom
    TileLayout<2, 2>,            // Warp layout, here we use a grid of 2x2 warps
    true,                        // Use swizzle
    128                          // Swizzle bytes
>;
// Launch kernel
dim3 grid(kN / BLK_N, kM / BLK_M);
dim3 block(Traits::kNumThreads);
kernel::tiled_gemm<Traits><<<grid, block, Traits::kSmemSize>>>(d_A, d_B, d_C);

- CUDA Toolkit: 12.4+
- CMake: 3.25+
- C++ Compiler: C++20 support required
- GPU: Compute capability 8.0+ (Ampere, Ada Lovelace, Hopper)
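
The launch configuration in the GEMM example above can be sanity-checked by hand. The sketch below repeats the example's constants and derives the grid size, thread count, K-loop trip count, and a shared memory estimate. The names kGridX, kGridY, kThreads, kKIters, and kSmemBytes are hypothetical, and the shared memory figure assumes a single-buffered A tile plus B tile of half values; the real Traits::kNumThreads and Traits::kSmemSize may be computed differently.

```cpp
#include <cstddef>

// Constants copied from the GEMM example.
constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;

// Assumptions for this sketch: TileLayout<2, 2> means 2x2 warps of
// 32 threads, and the input type (half) is 2 bytes wide.
constexpr int kWarpsM = 2, kWarpsN = 2;
constexpr int kWarpSize = 32;
constexpr std::size_t kInputBytes = 2;

// The example's grid uses integer division, so it only covers the
// whole problem when each dimension divides evenly.
static_assert(kM % BLK_M == 0 && kN % BLK_N == 0 && kK % BLK_K == 0);

constexpr int kGridX = kN / BLK_N;                       // 8 blocks along N
constexpr int kGridY = kM / BLK_M;                       // 4 blocks along M
constexpr int kThreads = kWarpsM * kWarpsN * kWarpSize;  // 128 threads per block
constexpr int kKIters = kK / BLK_K;                      // 32 steps along K

// Assumed layout: one BLK_M x BLK_K tile of A and one BLK_K x BLK_N
// tile of B resident in shared memory at a time (no double buffering).
constexpr std::size_t kSmemBytes =
    (BLK_M * BLK_K + BLK_K * BLK_N) * kInputBytes;       // 49152 bytes

static_assert(kGridX == 8 && kGridY == 4);
static_assert(kThreads == 128);
static_assert(kKIters == 32);
static_assert(kSmemBytes == 48 * 1024);
```

Under these assumptions the tiles occupy exactly 48 KB. A configuration that needs more than 48 KB of dynamic shared memory (for example, with double buffering) has to opt in through cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize before the launch.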
git clone https://github.com/DucHUNG312/flashtile2.git
cd flashtile2

mkdir build && cd build
cmake -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=89 \
    ..
ninja

ctest --output-on-failure

./benchmark/gemm_bench

This project draws inspiration from the following resources:
- TileFusion - An experimental C++ macro kernel template library for tile processing. https://github.com/microsoft/TileFusion
- CUTLASS - CUDA Templates for Linear Algebra Subroutines by NVIDIA. https://github.com/NVIDIA/cutlass
- NVIDIA CUDA Programming Guide - PTX ISA and Tensor Core Programming Documentation. https://docs.nvidia.com/cuda/parallel-thread-execution