GPU Kernel Engineer
I write CUDA kernels at the PTX level for inference workloads on consumer Blackwell GPUs (SM120), where no wrappers, libraries, or tooling exist yet.
C/C++ · CUDA · Inline PTX · Nsight Compute · Nsight Systems · Tensor Cores · FP4 · SM120
Most GPU kernel work targets datacenter chips — B200, B100, H100.
I target consumer Blackwell (RTX 5070 Ti, SM120): different MMA instructions, single-scale scale_vec::1X, FP4 packed in 8-bit containers, and zero cuda::ptx wrapper support.
If it runs on SM120, I wrote the PTX by hand.
Fused GEMM → softmax → GEMM attention kernel using FP4 E2M1 quantization with UE8M0 block scaling. Inline PTX.
CUDA-Kernels: GEMM · Reduction · Prefix Scan · Softmax · Flash Attention. Built from scratch, profiled with NCU.
GPU Profiling Guide: 20,000+ words on Nsight Systems & Nsight Compute. Napkin math · Roofline · Bottleneck classification · Source tab deep dive.
| | Project | What I Did |
|---|---|---|
| ✅ | model-kernels | 5 compilation + 2 precision fixes on INT8 fused attention — max error 1.69 → 1.37 · 2 PRs merged |
| 🔀 | ThunderKittens · Stanford HazyResearch | Narrowing-conversion fix in base-type packing |
| 🔍 | FlashInfer | SM120 benchmarks · Identified #34988 — fused FP4 kernel slower than unfused |
Building an FP4 Fused Attention Kernel for Consumer Blackwell GPUs
From hardware instructions to working kernel: inline PTX, block scaling, register budget, MMA fragment mapping.
From Silicon to Thread Identity: How CUDA Threads Know Who They Are

