GPU Kernel Engineer
I write CUDA kernels at the PTX level for inference workloads on consumer Blackwell GPUs (SM120), where no wrappers, libraries, or tooling exist yet.
C/C++ · CUDA · Inline PTX · Nsight Compute · Nsight Systems · Tensor Cores · FP4 · SM120
Most GPU kernel work targets datacenter chips — B200, B100, H100.
I target consumer Blackwell (RTX 5070 Ti, SM120): different MMA instructions, single-scale scale_vec::1X, FP4 packed in 8-bit containers, and zero cuda::ptx wrapper support.
If it runs on SM120, I wrote the PTX by hand.
Fused GEMM → softmax → GEMM attention kernel using FP4 E2M1 quantization with UE8M0 block scaling. Inline PTX.
CUDA-Kernels: GEMM · Reduction · Prefix Scan · Softmax · Flash Attention. Built from scratch, profiled with NCU.
GPU Profiling Guide: 20,000+ words on Nsight Systems & Nsight Compute. Napkin math · Roofline · Bottleneck classification · Source tab deep dive.
| | Project | What I Did |
|---|---|---|
| ✅ | model-kernels | 5 compilation + 2 precision fixes on INT8 fused attention — max error 1.69 → 1.37 · 2 PRs merged |
| 🔀 | ThunderKittens · Stanford HazyResearch | Narrowing-conversion fix in base-type packing |
| 🔍 | FlashInfer | SM120 benchmarks · Identified #34988 — fused FP4 kernel slower than unfused |
Building an FP4 Fused Attention Kernel for Consumer Blackwell GPUs
From hardware instructions to working kernel: inline PTX, block scaling, register budget, MMA fragment mapping.
From Silicon to Thread Identity: How CUDA Threads Know Who They Are

