A personal GPU kernel lab for running kernels across every major framework on NVIDIA Blackwell (B200) using Modal.
One command to run anything:
```bash
modal run scripts/run.py --task kernels/cuda/vector_add.cu
modal run scripts/run.py --task kernels/triton/vector_add.py
modal run scripts/run.py --task kernels/cute/vector_add.cu
modal run scripts/run.py --task kernels/cute_dsl/vector_add.py
modal run scripts/run.py --task kernels/cutlass/gemm.cu
modal run scripts/run.py --task kernels/quack/rmsnorm.py
modal run scripts/run.py --task repos/cutlass/examples/python/CuTeDSL/blackwell/dense_gemm.py
```

| Framework | Directory | Type | Notes |
|---|---|---|---|
| CUDA | `kernels/cuda/` | `.cu` | Raw CUDA, compiled with nvcc |
| CuTe C++ | `kernels/cute/` | `.cu` | Layout algebra, part of CUTLASS 3.x |
| CUTLASS | `kernels/cutlass/` | `.cu` | High-performance GEMM/conv primitives |
| Triton | `kernels/triton/` | `.py` | OpenAI's Python GPU kernel DSL |
| CuTe DSL | `kernels/cute_dsl/` | `.py` | NVIDIA's Python JIT compiler for CuTe |
| Quack | `kernels/quack/` | `.py` | Production CuTe DSL kernels (rmsnorm, softmax, etc.) |
CUTLASS repo examples can be run directly from repos/cutlass/ — they resolve against the image's cloned copy, not your local files.
```
gpulab/
├── kernels/              # your kernels — mounted fresh on every run
│   ├── cuda/
│   ├── cute/
│   ├── cute_dsl/
│   ├── cutlass/
│   ├── triton/
│   └── quack/
├── repos/                # local copies for reading — never uploaded to Modal
│   └── cutlass/
├── src/
│   └── gpulab/
│       ├── __init__.py
│       ├── compiler.py   # compile flags per backend
│       ├── runner.py     # dispatch logic
│       └── modal_app.py  # Modal image + remote function
├── scripts/
│   └── run.py            # CLI entrypoint
├── pyproject.toml
└── .gitignore
```
```bash
git clone https://github.com/your-handle/gpulab.git
cd gpulab
conda create -n gpulab python=3.12
conda activate gpulab
pip install -e .
pip install modal
modal setup
```

`modal setup` opens a browser window to authenticate. Once done, your credentials are saved locally.

Verify it works:

```bash
modal profile list
modal run scripts/run.py --task kernels/cuda/vector_add.cu
```

The first run builds the container image — this takes a few minutes. Every run after that is fast. You should see:
```
╭──────────────────── kernels/cuda/vector_add.cu ────────────────────╮
│ cuda vector_add: n=1048576 c[0]=3.0 PASSED                         │
╰────────────────────────────────────────────────────────────────────╯
```
```bash
# CUDA
modal run scripts/run.py --task kernels/cuda/vector_add.cu

# CuTe C++
modal run scripts/run.py --task kernels/cute/vector_add.cu

# CUTLASS
modal run scripts/run.py --task kernels/cutlass/gemm.cu

# Triton
modal run scripts/run.py --task kernels/triton/vector_add.py

# CuTe DSL
modal run scripts/run.py --task kernels/cute_dsl/vector_add.py

# Quack
modal run scripts/run.py --task kernels/quack/rmsnorm.py

# Official CUTLASS examples from the repo
modal run scripts/run.py --task repos/cutlass/examples/python/CuTeDSL/blackwell/dense_gemm.py

# Extra nvcc flags (e.g. for profiling)
modal run scripts/run.py --task kernels/cuda/vector_add.cu --flags="-lineinfo"
```

Write a standard `int main()`. Drop the file in the right directory and run it — the backend is auto-detected from the folder name.
```cuda
// kernels/cuda/my_kernel.cu
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(...) { ... }

int main() {
    // allocate, launch, verify
    printf("PASSED\n");
    return 0;
}
```

```bash
modal run scripts/run.py --task kernels/cuda/my_kernel.cu
```

Define a `run(**params)` function that returns a string. The runner calls this function.
```python
# kernels/triton/my_kernel.py
import torch
import triton
import triton.language as tl

@triton.jit
def my_kernel(...):
    ...

def run(**params):
    # set up tensors, launch kernel, verify
    return "my_kernel: PASSED"
```

```bash
modal run scripts/run.py --task kernels/triton/my_kernel.py
```

If your script runs computation at the module level (no `run()` function), that is fine too — the runner handles it gracefully.
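The dispatch contract above (call `run(**params)` when it exists, otherwise execute the module top-level and capture its output) can be sketched as follows. This is a hypothetical simplification, not the actual `src/gpulab/runner.py`:

```python
import contextlib
import importlib.util
import io

def run_python_task(path, **params):
    """Load a kernel script; call run(**params) if defined,
    otherwise fall back to its module-level output."""
    spec = importlib.util.spec_from_file_location("kernel_task", path)
    module = importlib.util.module_from_spec(spec)
    # Capture stdout so module-level prints still produce a report string.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        spec.loader.exec_module(module)
    if hasattr(module, "run"):
        return module.run(**params)
    return buf.getvalue().strip()
```

Either style of kernel file therefore yields one result string for the Rich panel.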
The runner infers which backend to use from the file path:
| Path contains | Backend | What happens |
|---|---|---|
| `cuda/` | cuda | `nvcc -arch=sm_100` |
| `cute/` | cutlass | nvcc + CUTLASS includes |
| `cutlass/` | cutlass | nvcc + CUTLASS includes + `-std=c++17` |
| `cute_dsl/` | cute_dsl | Python import |
| `triton/` | triton | Python import |
| `quack/` | quack | Python import |
| `repos/` | repos_cutlass | Resolved against image's `/root/cutlass` |
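The table amounts to a small path-component lookup. A hypothetical simplification of the logic in `src/gpulab/runner.py` (names here are illustrative):

```python
from pathlib import Path

# Insertion order matters: "repos" must win over "cutlass" for paths like
# repos/cutlass/..., and both cute/ and cutlass/ map to the CUTLASS toolchain.
BACKEND_BY_DIR = {
    "repos": "repos_cutlass",
    "cuda": "cuda",
    "cute": "cutlass",
    "cutlass": "cutlass",
    "cute_dsl": "cute_dsl",
    "triton": "triton",
    "quack": "quack",
}

def detect_backend(task):
    """Infer the backend from a task path's directory components."""
    parts = Path(task).parts
    for name, backend in BACKEND_BY_DIR.items():
        if name in parts:
            return backend
    raise ValueError(f"cannot infer backend from path: {task}")
```

Matching whole path components (rather than substrings) keeps `cute_dsl/` from accidentally matching `cute`.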
The container image is built once and cached. It contains:
- `nvidia/cuda:13.1.1-cudnn-devel-ubuntu24.04` (base)
- `torch`, `triton`, `numpy`, `rich`, `jax[cuda12]`, `nvidia-cutlass-dsl[cu13]`, `quack-kernels[cu13]`
- CUTLASS cloned at `/root/cutlass`
Only kernels/ and src/ are uploaded on every run — both are small and fast. repos/ is never uploaded.
When does the image rebuild? Only when you change modal_app.py — adding a new pip package, a new run_commands, etc. Changes to kernels/ and src/ never trigger a rebuild.
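For orientation, the image definition in `modal_app.py` presumably follows Modal's builder-chain pattern along these lines (a sketch: the package list is abbreviated from the description above, and the function name is illustrative):

```python
import modal

image = (
    modal.Image.from_registry(
        "nvidia/cuda:13.1.1-cudnn-devel-ubuntu24.04", add_python="3.12"
    )
    .pip_install("torch", "triton", "numpy", "rich")
    .run_commands("git clone --depth 1 https://github.com/NVIDIA/cutlass.git /root/cutlass")
)

app = modal.App("gpulab", image=image)

@app.function(gpu="B200", timeout=600)
def run_task(task: str, flags: str = "") -> str:
    ...  # compile + dispatch happens remotely
```

Modal caches each step in the chain as a layer, so changing any line invalidates that layer and everything after it, which is why edits to `modal_app.py` rebuild the image while edits to `kernels/` and `src/` do not.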
Follow these four steps:
1. Install it in the image — edit `src/gpulab/modal_app.py`:

   ```python
   .pip_install("new-package")
   # or
   .run_commands("git clone https://github.com/org/repo.git /root/repo")
   ```

2. Add include paths — edit `src/gpulab/compiler.py` if it is a C++ framework:

   ```python
   INCLUDES["new_backend"] = ["-I/root/repo/include"]
   ```

3. Add a compile function — edit `src/gpulab/compiler.py`:

   ```python
   def compile_new_backend(src, binary, extra_flags):
       _nvcc(src, binary, INCLUDES["new_backend"], extra_flags)
   ```

4. Add backend detection — edit `src/gpulab/runner.py`:

   ```python
   if "new_backend" in parts:
       return "new_backend"
   ```

This triggers one image rebuild. After that, every run with the new framework is fast.
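As an illustration of step 3, the `_nvcc` helper plausibly just assembles a flag list and shells out. The helper name `build_nvcc_cmd` and the exact flag order are assumptions, not the real `compiler.py`:

```python
import subprocess

ARCH = ["-arch=sm_100"]  # Blackwell default, as used elsewhere in this repo

def build_nvcc_cmd(src, binary, includes, extra_flags=()):
    # Assemble the full nvcc invocation for one kernel:
    # arch flags, backend include paths, user flags, output, source.
    return ["nvcc", *ARCH, *includes, *extra_flags, "-o", binary, src]

def _nvcc(src, binary, includes, extra_flags=()):
    # Compile, raising CalledProcessError if nvcc fails.
    subprocess.run(build_nvcc_cmd(src, binary, includes, extra_flags), check=True)
```

Keeping command assembly separate from execution makes the flag logic easy to unit-test without a GPU.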
For repos you want to clone locally and read but run from the image:
```bash
# Clone locally for reading
git clone https://github.com/org/repo repos/repo
```

Add to image in `modal_app.py`:

```python
.run_commands("git clone --depth 1 https://github.com/org/repo.git /root/repo")
```

The local copy in `repos/` is for reading the code. The container uses its own clone.
modal command not found or wrong version
Make sure which modal and which python both point to your conda env:
```bash
conda activate gpulab
which modal    # should contain gpulab
which python   # should contain gpulab
```

If `modal` points to `~/.local/bin/modal`, add your conda env to PATH:

```bash
echo 'export PATH="$CONDA_PREFIX/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```

Files not found in container
If you get FileNotFoundError for a file that exists locally, the most common cause is a nested .git directory blocking Modal's file sync. Fix:
```bash
rm -rf repos/myrepo/.git
```

Or better: do not mount `repos/` at all — clone repos into the image instead.
cute/tensor.hpp: No such file or directory
Your kernel is in kernels/cuda/ but includes CuTe headers. Move it to kernels/cute/ — the runner will automatically add CUTLASS include paths.
CUTLASS compile errors about C++17
The compile_cutlass function already passes -std=c++17. If you are calling compile_cuda on a CUTLASS kernel, make sure the file is in kernels/cutlass/ or kernels/cute/.
Quack crashes with No module named 'jax.numpy'
JAX is missing. Make sure jax[cuda12] is in your pip_install() in modal_app.py and rebuild the image.
All kernels are compiled for sm_100 (NVIDIA Blackwell, B200/B300). To target a different GPU, update ARCH in src/gpulab/compiler.py:
```python
# H100 (Hopper)
ARCH = ["-arch=sm_90", "-gencode", "arch=compute_90,code=sm_90"]

# A100 (Ampere)
ARCH = ["-arch=sm_80", "-gencode", "arch=compute_80,code=sm_80"]
```

And update `gpu="B200"` in `src/gpulab/modal_app.py` to match.
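If you switch targets often, the repeated flag pair can be generated from a single compute-capability number. A hypothetical helper, not part of the current `compiler.py`:

```python
def arch_flags(cc):
    # Build the nvcc arch flag pair for one compute capability,
    # e.g. 80 = Ampere A100, 90 = Hopper H100, 100 = Blackwell B200.
    return [f"-arch=sm_{cc}", "-gencode", f"arch=compute_{cc},code=sm_{cc}"]

ARCH = arch_flags(100)  # Blackwell B200
```

Then retargeting is a one-number change instead of editing two flag strings in lockstep.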
- Modal for the GPU infrastructure
- NVIDIA CUTLASS for CuTe and CUTLASS
- Dao-AILab Quack for production CuTe DSL kernels
- OpenAI Triton for the Python GPU kernel DSL