
Florian Mattana

GPU Kernel Engineer

I write CUDA kernels at the PTX level for inference workloads on consumer Blackwell GPUs (SM120) where no wrappers, no libraries, and no tooling exist yet.

C/C++ · CUDA · Inline PTX · Nsight Compute · Nsight Systems · Tensor Cores · FP4 · SM120

Blog · LinkedIn · Twitter


Why SM120

Most GPU kernel work targets datacenter chips — B200, B100, H100. I target consumer Blackwell (RTX 5070 Ti, SM120): different MMA instructions, single-scale scale_vec::1X, FP4 packed in 8-bit containers, and zero cuda::ptx wrapper support. If it runs on SM120, I wrote the PTX by hand.
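To make the FP4-in-8-bit-container format mentioned above concrete, here is a minimal host-side sketch (plain C++, not kernel code) of E2M1 encode/decode, two-per-byte packing, and UE8M0 scale decode. Function names are illustrative; the value tables follow the OCP Microscaling (MX) formats, and rounding tie-breaking is simplified to nearest-value.

```cpp
#include <cassert>
#include <cstdint>
#include <cmath>

// E2M1 (FP4): 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
// Indexing by the 3 magnitude bits gives: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Round x to the nearest E2M1 code (bit layout: s | e1 e0 | m).
uint8_t e2m1_encode(float x) {
    uint8_t sign = (x < 0.0f) ? 0x8 : 0x0;
    float mag = std::fabs(x);
    int best = 0;
    for (int i = 1; i < 8; ++i)
        if (std::fabs(kE2M1[i] - mag) < std::fabs(kE2M1[best] - mag)) best = i;
    return sign | static_cast<uint8_t>(best);
}

float e2m1_decode(uint8_t code) {
    float v = kE2M1[code & 0x7];
    return (code & 0x8) ? -v : v;
}

// Two FP4 values share one 8-bit container: element 0 in the low nibble.
uint8_t e2m1_pack2(float lo, float hi) {
    return static_cast<uint8_t>(e2m1_encode(lo) | (e2m1_encode(hi) << 4));
}

// UE8M0 block scale: 8 unsigned exponent bits, no mantissa; value = 2^(e - 127).
float ue8m0_decode(uint8_t e) {
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}
```

Because each byte holds two elements, loads and MMA fragment mapping operate on packed pairs; unpacking happens only at the point of use.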


🔬 Current Project


FP4 Fused Attention

Fused GEMM → softmax → GEMM attention kernel using FP4 E2M1 quantization with UE8M0 block scaling. Inline PTX mma.sync — the scores matrix stays in registers between the two GEMMs.

📄 Full technical writeup →
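For orientation, the math the kernel fuses can be written as a scalar reference (a readability sketch of GEMM → softmax → GEMM, not the SM120 kernel; the per-row score buffer here plays the role of the scores tile the fused kernel keeps in registers):

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Scalar reference for the fused pipeline: S = Q*K^T, P = softmax(S) row-wise,
// O = P*V. In the fused kernel, S/P never round-trip through global memory.
// Shapes: Q is M x D, K is N x D, V is N x D, O is M x D (row-major).
std::vector<float> attention_ref(const std::vector<float>& Q,
                                 const std::vector<float>& K,
                                 const std::vector<float>& V,
                                 int M, int N, int D) {
    std::vector<float> O(M * D, 0.0f);
    for (int i = 0; i < M; ++i) {
        // GEMM 1: one row of scores S[i][:].
        std::vector<float> s(N);
        float mx = -INFINITY;
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int d = 0; d < D; ++d) acc += Q[i*D + d] * K[j*D + d];
            s[j] = acc;
            mx = std::fmax(mx, acc);
        }
        // Softmax with max subtraction for numerical stability.
        float denom = 0.0f;
        for (int j = 0; j < N; ++j) { s[j] = std::exp(s[j] - mx); denom += s[j]; }
        // GEMM 2: O[i][:] = (s / denom) * V.
        for (int j = 0; j < N; ++j)
            for (int d = 0; d < D; ++d)
                O[i*D + d] += (s[j] / denom) * V[j*D + d];
    }
    return O;
}
```

The point of fusion is exactly the intermediate in this sketch: unfused, S and P cost two extra global-memory round trips per tile.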


🔧 Projects

CUDA-Kernels — GEMM · Reduction · Prefix Scan · Softmax · Flash Attention. Built from scratch, profiled with NCU.

Best GEMM → 58.8% of cuBLAS on RTX 5070 Ti

GPU Profiling Guide — 20,000+ words on Nsight Systems & Nsight Compute. Napkin math · Roofline · Bottleneck classification · Source tab deep dive.
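In the napkin-math spirit of the guide, a tiny helper for arithmetic intensity and a bandwidth-bound time floor (the 672 GB/s in the usage note is the RTX 5070 Ti figure quoted in this README; the one-pass traffic model is the usual simplifying assumption):

```cpp
#include <cassert>
#include <cmath>

// Napkin math: arithmetic intensity of a GEMM C = A*B (M x K times K x N),
// assuming each matrix moves through DRAM exactly once.
double gemm_intensity(double M, double N, double K, double bytes_per_elem) {
    double flops = 2.0 * M * N * K;                        // one FMA = 2 FLOPs
    double bytes = bytes_per_elem * (M*K + K*N + M*N);     // A + B + C traffic
    return flops / bytes;                                  // FLOP per byte
}

// Bandwidth-bound lower bound on runtime, in seconds.
double mem_bound_time(double bytes, double bw_gb_s) {
    return bytes / (bw_gb_s * 1e9);
}
```

For a square FP32 GEMM this reduces to K/6 FLOP/byte — about 683 at 4096³ — which sits far right of the ridge point on a roofline for a 672 GB/s part, i.e. compute-bound whenever the data really does move only once.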


🤝 Open-Source

| Project | What I Did |
| --- | --- |
| model-kernels | 5 compilation + 2 precision fixes on INT8 fused attention — max error 1.69 → 1.37 · 2 PRs merged |
| 🔀 ThunderKittens (Stanford HazyResearch) | Narrowing-conversion fix in base-type packing |
| 🔍 FlashInfer | SM120 benchmarks · Identified #34988 — fused FP4 kernel slower than unfused |

📝 Articles

Building an FP4 Fused Attention Kernel for Consumer Blackwell GPUs — from hardware instructions to working kernel: inline PTX, block scaling, register budget, MMA fragment mapping.

Exploring PTX: A Close Look at Tile Optimization in CUDA

From Silicon to Thread Identity: How CUDA Threads Know Who They Are


Hardware: RTX 5070 Ti · SM120 · 12 GB GDDR7 · 672 GB/s

Pinned

HazyResearch/ThunderKittens — Tile primitives for speedy kernels (CUDA)

fp4-fused-attention-sm120 — FP4 fused attention kernel for NVIDIA SM120 GPUs (RTX 5080, RTX 5070 Ti, RTX 5070, RTX 5060 Ti) using inline PTX (C++)