FlashTile2

A personal experimental CUDA tensor core library for efficient GEMM operations.

Features

FlashTile2 currently provides the following core primitives:

Memory Hierarchy

  • Global Memory (global.h): Global tile views with row/column-major layouts
  • Shared Memory (share.h): Shared memory tiles with optional swizzling
  • Register Memory (reg.h): Register tiles for tensor core operations
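As a rough illustration of what a row-major versus column-major tile view means for addressing, here is a minimal host-side sketch. The `Layout` enum and `tile_offset` helper are hypothetical names chosen for this example and are not taken from `global.h`.

```cpp
#include <cstddef>

// Hypothetical layout tag for illustration; not FlashTile2's actual type.
enum class Layout { RowMajor, ColMajor };

// Linear element offset of (row, col) inside a rows x cols tile.
// Row-major stores each row contiguously; column-major stores each column contiguously.
constexpr std::size_t tile_offset(Layout layout, std::size_t row, std::size_t col,
                                  std::size_t rows, std::size_t cols) {
    return layout == Layout::RowMajor ? row * cols + col
                                      : col * rows + row;
}

// Element (1, 2) of a 4 x 8 tile.
static_assert(tile_offset(Layout::RowMajor, 1, 2, 4, 8) == 10, "1 * 8 + 2");
static_assert(tile_offset(Layout::ColMajor, 1, 2, 4, 8) == 9, "2 * 4 + 1");
```

The same logical element lands at a different address under each layout, which is why a tile view has to carry its layout along with its shape.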

Data Movement

  • G2S (Global to Shared) (copy/g2s.h): Asynchronous copy from global to shared memory
  • S2R (Shared to Register) (copy/s2r.h): Shared-to-register loads using the ldmatrix instruction
  • R2G (Register to Global) (copy/r2g.h): Register to global memory stores

Compute

  • MMA Operations (compute.h): Tensor core matrix multiply-accumulate
  • Math Utilities (math.h): SIMD operations and vectorized math
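For readers new to tensor cores, the arithmetic behind a 16x16x16 multiply-accumulate step is simply D = A * B + C on 16x16 tiles. The scalar CPU reference below is a sketch of that arithmetic only: `mma_reference` and `Tile` are hypothetical names, `float` stands in for `half` inputs, and nothing here reflects how `compute.h` is implemented.

```cpp
#include <array>

constexpr int kAtom = 16;                        // tile edge of a 16x16x16 MMA step
using Tile = std::array<float, kAtom * kAtom>;   // row-major 16 x 16 tile

// Scalar reference for one multiply-accumulate step: D = A * B + C.
inline Tile mma_reference(const Tile& a, const Tile& b, const Tile& c) {
    Tile d = c;  // start from the accumulator
    for (int i = 0; i < kAtom; ++i) {
        for (int j = 0; j < kAtom; ++j) {
            for (int k = 0; k < kAtom; ++k) {
                d[i * kAtom + j] += a[i * kAtom + k] * b[k * kAtom + j];
            }
        }
    }
    return d;
}
```

A reference like this is mainly useful for checking GPU results in tests, allowing a small tolerance once the inputs are real half-precision values.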

Swizzle

  • SwizzleLayout (swizzle.h): Compile-time swizzle pattern generation for bank conflict avoidance
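To show why swizzling avoids bank conflicts, here is a minimal sketch of one common XOR-based scheme for 128-byte rows split into 16-byte chunks. It assumes 32 shared memory banks of 4 bytes each. The function names are hypothetical, and the exact pattern generated by `SwizzleLayout` may differ.

```cpp
constexpr int kChunkBytes   = 16;                       // one 128-bit access
constexpr int kRowBytes     = 128;                      // e.g. 64 half values per row
constexpr int kChunksPerRow = kRowBytes / kChunkBytes;  // 8 chunks per row

// Byte offset of logical chunk (row, chunk) after XOR swizzling:
// the chunk index is XORed with the row index modulo 8.
constexpr int swizzled_offset(int row, int chunk) {
    const int physical_chunk = chunk ^ (row % kChunksPerRow);
    return row * kRowBytes + physical_chunk * kChunkBytes;
}

// Bank of the first 4-byte word at a byte offset (32 banks, 4 bytes wide).
constexpr int bank_of(int byte_offset) { return (byte_offset / 4) % 32; }

// True if the same logical chunk read from 8 consecutive rows touches
// 8 distinct groups of 4 banks, i.e. the accesses do not collide.
constexpr bool column_is_conflict_free(int chunk) {
    unsigned seen = 0;
    for (int row = 0; row < kChunksPerRow; ++row) {
        const int group = bank_of(swizzled_offset(row, chunk)) / 4;  // 0..7
        if (seen & (1u << group)) return false;
        seen |= 1u << group;
    }
    return true;
}

static_assert(column_is_conflict_free(0), "chunk 0 spreads over all bank groups");
static_assert(column_is_conflict_free(5), "chunk 5 spreads over all bank groups");
```

Without the XOR, the same chunk in 8 consecutive rows maps to the same 4 banks, because each 128-byte row spans all 32 banks exactly once. With it, each row uses a different bank group.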

Quick Start

Here's a simple example of a tiled GEMM kernel using FlashTile2:

#include "flashtile2/flashtile2.h"

using namespace flashtile2;

// Problem and block dimensions
constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;

// Define GEMM traits
using Traits = kernel::TiledGemmTraits<
    Shape<kM, kN, kK>,           // Problem shape
    Shape<BLK_M, BLK_N, BLK_K>,  // Block shape
    half,                        // Input type
    float,                       // Accumulator type
    MMA_Atom_16x16x16,           // MMA atom
    TileLayout<2, 2>,            // Warp layout: a 2x2 grid of warps per block
    true,                        // Use swizzle
    128                          // Swizzle bytes
>;

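// Launch kernel. d_A, d_B, and d_C are device pointers allocated by the
// caller (e.g. with cudaMalloc). The integer divisions below assume kM and
// kN are multiples of BLK_M and BLK_N.
dim3 grid(kN / BLK_N, kM / BLK_M);
dim3 block(Traits::kNumThreads);
kernel::tiled_gemm<Traits><<<grid, block, Traits::kSmemSize>>>(d_A, d_B, d_C);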
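The launch parameters in this example can be sanity-checked with plain host-side arithmetic. The sketch below rests on assumptions: one warp is 32 threads, each block runs 2x2 warps, and shared memory holds a single A tile and a single B tile of 2-byte inputs with no multi-stage buffering. The `launch_sketch` namespace and its constants are hypothetical, and the real `Traits::kNumThreads` and `Traits::kSmemSize` may be computed differently.

```cpp
#include <cstddef>

namespace launch_sketch {

constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;
constexpr int kWarpsM = 2, kWarpsN = 2, kWarpSize = 32;

// The grid computation uses integer division, so each dimension must divide evenly.
static_assert(kM % BLK_M == 0 && kN % BLK_N == 0 && kK % BLK_K == 0,
              "problem shape must be a multiple of the block shape");

constexpr int kGridX      = kN / BLK_N;                     // blocks along N
constexpr int kGridY      = kM / BLK_M;                     // blocks along M
constexpr int kNumThreads = kWarpsM * kWarpsN * kWarpSize;  // threads per block

// Assumption: one BLK_M x BLK_K tile of A plus one BLK_K x BLK_N tile of B,
// 2 bytes per element (half), single-buffered.
constexpr std::size_t kInputBytes = 2;
constexpr std::size_t kSmemBytes =
    (static_cast<std::size_t>(BLK_M) * BLK_K +
     static_cast<std::size_t>(BLK_K) * BLK_N) * kInputBytes;

static_assert(kGridX == 8 && kGridY == 4, "8 x 4 grid of blocks");
static_assert(kNumThreads == 128, "4 warps of 32 threads");
static_assert(kSmemBytes == 48 * 1024, "48 KiB of shared memory");

}  // namespace launch_sketch
```

Under these assumptions the tiles fill exactly 48 KiB. A kernel that requests more dynamic shared memory than that, for example with multi-stage buffering, has to opt in via `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize` before launch.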

Requirements

  • CUDA Toolkit: 12.4+
  • CMake: 3.25+
  • C++ Compiler: C++20 support required
  • GPU: Compute capability 8.0+ (Ampere, Ada Lovelace, Hopper)

Installation

Clone the repository

git clone https://github.com/DucHUNG312/flashtile2.git
cd flashtile2

Build

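mkdir build && cd build
# 89 targets Ada Lovelace; set CMAKE_CUDA_ARCHITECTURES to match your GPU
# (for example 80 or 86 for Ampere, 90 for Hopper).
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  ..
ninja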

Run tests

ctest --output-on-failure

Run benchmarks

./benchmark/gemm_bench

References

This project draws inspiration from the following resources:

  1. TileFusion - An experimental C++ macro kernel template library for tile processing. https://github.com/microsoft/TileFusion

  2. CUTLASS - CUDA Templates for Linear Algebra Subroutines by NVIDIA. https://github.com/NVIDIA/cutlass

  3. NVIDIA PTX ISA documentation - Parallel Thread Execution ISA reference, including tensor core instructions. https://docs.nvidia.com/cuda/parallel-thread-execution
