FlashTile2

A personal experimental CUDA tensor core library for efficient GEMM operations.

Features

FlashTile2 currently provides the following core primitives:

Memory Hierarchy

Global Memory (global.h): Global tile views with row/column-major layouts
Shared Memory (share.h): Shared memory tiles with optional swizzling
Register Memory (reg.h): Register tiles for tensor core operations
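
To make the row-major and column-major tile views concrete, here is a minimal host-side sketch of how such a view could map a tile coordinate to a linear offset. The names `Layout`, `TileView`, and `row_major_tile` are hypothetical and chosen for illustration; they are not taken from global.h, whose actual interface may differ.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not FlashTile2's actual API.
enum class Layout { kRowMajor, kColMajor };

// Non-owning view of a kRows x kCols tile inside a larger matrix.
template <typename T, int kRows, int kCols, Layout kLayout>
struct TileView {
    T* data;         // first element of the tile
    std::size_t ld;  // leading dimension of the parent matrix, in elements

    // Linear offset of element (r, c) relative to `data`.
    std::size_t offset(int r, int c) const {
        return kLayout == Layout::kRowMajor
                   ? static_cast<std::size_t>(r) * ld + c
                   : static_cast<std::size_t>(c) * ld + r;
    }

    T& operator()(int r, int c) const { return data[offset(r, c)]; }
};

// View the tile at block coordinates (bi, bj) of a row-major matrix.
template <int kRows, int kCols, typename T>
TileView<T, kRows, kCols, Layout::kRowMajor> row_major_tile(
    T* matrix, std::size_t ld, int bi, int bj) {
    return {matrix + static_cast<std::size_t>(bi) * kRows * ld +
                static_cast<std::size_t>(bj) * kCols,
            ld};
}
```

Because the tile shape and layout are template parameters, the only runtime state is a pointer and a leading dimension, which is the usual reason tile libraries encode shapes at compile time.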

Data Movement

G2S (Global to Shared) (copy/g2s.h): Asynchronous copy from global to shared memory
S2R (Shared to Register) (copy/s2r.h): Efficient shared to register transfers using ldmatrix
R2G (Register to Global) (copy/r2g.h): Register to global memory stores
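
The S2R path relies on the PTX `ldmatrix` instruction, whose lane-to-data mapping is easy to get wrong. The sketch below models that mapping on the host for `ldmatrix.x4` with 16-bit elements: each lane supplies the address of one 8-element row, and each lane receives two adjacent elements from every 8x8 matrix. The function and struct names are hypothetical, and the quadrant order used for a 16x16 tile is an assumption (it follows the A-operand fragment order of `mma.m16n8k16`), not something confirmed from copy/s2r.h.

```cpp
// Hypothetical helpers modelling ldmatrix.x4 on the host; not FlashTile2 code.

// Which 8x8 matrix, and which row of it, a lane supplies the address for.
struct RowAddr { int matrix; int row; };
constexpr RowAddr ldmatrix_x4_source(int lane) {
    return {lane / 8, lane % 8};
}

// Which elements of each 8x8 matrix end up in a lane's 32-bit register.
struct Fragment { int row; int col0; int col1; };
constexpr Fragment ldmatrix_fragment(int lane) {
    return {lane / 4, 2 * (lane % 4), 2 * (lane % 4) + 1};
}

// Element offset a lane should point at to load a 16x16 row-major tile with
// leading dimension `ld` (in elements). Assumed quadrant order: top-left,
// bottom-left, top-right, bottom-right.
constexpr int ldmatrix_x4_elem_offset(int lane, int ld) {
    return (lane % 16) * ld + 8 * (lane / 16);
}
```

In a kernel, the offset from the last function would be added to the shared memory tile base (after any swizzling) and converted to a shared-space address before issuing the instruction.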

Compute

MMA Operations (compute.h): Tensor core matrix multiply-accumulate
Math Utilities (math.h): SIMD operations and vectorized math

Swizzle

SwizzleLayout (swizzle.h): Compile-time swizzle pattern generation for bank conflict avoidance
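
As an illustration of what a compile-time swizzle does, the sketch below applies a common XOR pattern to a tile whose rows are 128 bytes wide: the index of each 16-byte chunk is XORed with the row index modulo 8. This is one standard scheme, assumed here for illustration; the pattern that swizzle.h actually generates may differ, and all names in the block are hypothetical.

```cpp
// Hypothetical XOR swizzle for 128-byte rows; not FlashTile2's actual code.
constexpr int kChunkBytes = 16;                         // one 128-bit access
constexpr int kRowBytes = 128;                          // bytes per tile row
constexpr int kChunksPerRow = kRowBytes / kChunkBytes;  // 8

// Permute the 16-byte chunk index within a row. XOR is its own inverse, so
// the same function maps logical to physical chunks and back.
constexpr int swizzled_chunk(int row, int chunk) {
    return chunk ^ (row % kChunksPerRow);
}

// Physical byte offset of the logical byte (row, byte_in_row).
constexpr int swizzled_byte_offset(int row, int byte_in_row) {
    return row * kRowBytes +
           swizzled_chunk(row, byte_in_row / kChunkBytes) * kChunkBytes +
           byte_in_row % kChunkBytes;
}

// Shared memory has 32 banks, each 4 bytes wide.
constexpr int bank_of(int byte_offset) {
    return (byte_offset / 4) % 32;
}

// Without swizzling, chunk 0 of rows 0 and 3 both start at bank 0.
static_assert(bank_of(0 * kRowBytes) == bank_of(3 * kRowBytes), "conflict");
// With swizzling, row 3 is shifted to bank 12.
static_assert(bank_of(swizzled_byte_offset(3, 0)) == 12, "conflict removed");
```

With a 128-byte row stride, every row starts at bank 0, so eight rows read at the same column would serialize. After the XOR, the eight rows land on eight different chunks and together cover all 32 banks.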

Quick Start

Here's a simple example of a tiled GEMM kernel using FlashTile2:

#include "flashtile2/flashtile2.h"

using namespace flashtile2;

// Problem and block dimensions
constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;

// Define GEMM traits
using Traits = kernel::TiledGemmTraits<
    Shape<kM, kN, kK>,        // Problem shape
    Shape<BLK_M, BLK_N, BLK_K>, // Block shape
    half,                      // Input type
    float,                     // Accumulator type
    MMA_Atom_16x16x16,         // MMA atom
    TileLayout<2, 2>,          // Warp layout, here we use a grid of 2x2 warps
    true,                      // Use swizzle
    128                        // Swizzle bytes
>;

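// Launch kernel: one thread block per BLK_M x BLK_N tile of C.
// d_A, d_B, and d_C are device pointers allocated by the caller.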
dim3 grid(kN / BLK_N, kM / BLK_M);
dim3 block(Traits::kNumThreads);
kernel::tiled_gemm<Traits><<<grid, block, Traits::kSmemSize>>>(d_A, d_B, d_C);
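
The launch parameters in this example follow from the shapes by simple arithmetic, which the host-side sketch below spells out. The shared memory figure assumes a single, unpadded buffer for one A tile and one B tile of 2-byte `half` elements; whether `Traits::kSmemSize` is computed this way is not confirmed by the README, and every constant name other than the problem and block dimensions is hypothetical.

```cpp
#include <cstddef>

// Problem and block dimensions from the Quick Start example.
constexpr int kM = 1024, kN = 1024, kK = 2048;
constexpr int BLK_M = 256, BLK_N = 128, BLK_K = 64;

// Hypothetical helper constants, not FlashTile2 identifiers.
constexpr int kWarpsM = 2, kWarpsN = 2;  // TileLayout<2, 2>
constexpr int kWarpSize = 32;
constexpr int kInputBytes = 2;           // sizeof(half)

// The example grid uses plain division, so the shapes must divide evenly.
static_assert(kM % BLK_M == 0 && kN % BLK_N == 0 && kK % BLK_K == 0,
              "block shape must evenly divide the problem shape");

constexpr int kGridX = kN / BLK_N;                           // 8
constexpr int kGridY = kM / BLK_M;                           // 4
constexpr int kNumThreads = kWarpsM * kWarpsN * kWarpSize;   // 128
constexpr int kKIterations = kK / BLK_K;                     // 32

// Assumption: one A tile plus one B tile, no padding, no multi-stage buffers.
constexpr std::size_t kSmemBytes =
    static_cast<std::size_t>(BLK_M * BLK_K + BLK_K * BLK_N) * kInputBytes;

static_assert(kSmemBytes == 48 * 1024, "exactly 48 KB under this assumption");
```

Under these assumptions the kernel needs exactly 48 KB, which is the default per-block limit for dynamic shared memory. A configuration that needs more (for example, larger tiles or multi-stage buffering) has to raise the limit with `cudaFuncSetAttribute` and `cudaFuncAttributeMaxDynamicSharedMemorySize` before launching.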

Requirements

CUDA Toolkit: 12.4+
CMake: 3.25+
C++ Compiler: C++20 support required
GPU: Compute capability 8.0+ (Ampere, Ada Lovelace, Hopper)

Installation

Clone the repository

git clone https://github.com/DucHUNG312/flashtile2.git
cd flashtile2

Build

mkdir build && cd build
cmake -G Ninja \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=89 \
  ..
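ninja

The value 89 targets Ada Lovelace GPUs (compute capability 8.9). Set CMAKE_CUDA_ARCHITECTURES to 80 for A100-class Ampere GPUs or 90 for Hopper.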

Run tests

ctest --output-on-failure

Run benchmarks

./benchmark/gemm_bench

References

This project draws inspiration from the following resources:

TileFusion - An experimental C++ macro kernel template library for tile processing. https://github.com/microsoft/TileFusion
CUTLASS - CUDA Templates for Linear Algebra Subroutines by NVIDIA. https://github.com/NVIDIA/cutlass
NVIDIA PTX ISA documentation - Parallel Thread Execution ISA reference, including the tensor core instructions. https://docs.nvidia.com/cuda/parallel-thread-execution
