Standalone CUDA/C++20 inference engine for RenderFormer (SIGGRAPH 2025)
Scenes rendered by the CUDA engine at 512×512 — Cornell box, Veach MIS, Stanford bunny, room interior.
Project Page | Paper | Model | Official Python Code
Loads model weights directly from safetensors, renders scenes from HDF5 — no Python or PyTorch runtime required. Includes an interactive real-time viewer with trackball camera.
- Standalone C++20/CUDA — zero Python dependencies at runtime, loads safetensors + HDF5 directly
- ~19× faster per view than the PyTorch reference (78ms vs ~1.5s at 512×512)
- Near real-time at 320×320 (~34ms/view, 30 FPS)
- Interactive viewer — GLFW/OpenGL with trackball camera, double-buffered async rendering
- cuDNN fused SDPA — tensor-core flash attention for cross-attention layers
- cuDNN implicit GEMM — all DPT conv2d/conv_transpose2d via cuDNN backend
- CUDA Graph capture — 732-node graph replay with zero CPU launch overhead
- Full fp16 inference — mixed-precision encoder + full fp16 decoder with 2-pass online softmax
- KV cache — cross-attention K,V computed once after encoder, reused across all views
- Zero-alloc hot path — all GPU buffers preallocated at session creation
Benchmarked on RTX 3080 Ti (12GB), model renderformer-v1.1-swin-large (483M params):
| Stage | Python (PyTorch fp16) | CUDA Engine (512×512) | CUDA Engine (320×320) |
|---|---|---|---|
| Encoder (one-time) | ~4.5s | 112ms | 112ms |
| Per-view render | ~1.5s | 78ms | 34ms |
| Backend stack | PyTorch + FlashAttn + CUDA | cuBLAS + cuDNN + CUDA | same |
~19× faster per view than PyTorch at 512×512. Near real-time (~30 FPS) at 320×320.
Multi-scene benchmark (512×512)
| Scene | Triangles | Per-View | PSNR vs Python (EXR) |
|---|---|---|---|
| veach-mis | 4,575 | 80ms | 27.4 dB |
| cbox | 5,633 | 78ms | 26.8 dB |
| cbox-bunny | 6,209 | 79ms | 45.8 dB |
| room | 7,141 | 84ms | 33.5 dB |
Resolution scaling (cbox, 5633 triangles)
| Resolution | Per-View | Pixel Tokens |
|---|---|---|
| 256×256 | 21.5ms | 1,024 |
| 320×320 | 33.6ms | 1,600 |
| 384×384 | 44ms | 2,304 |
| 512×512 | 78ms | 4,096 |
Per-view time scales linearly with pixel token count (~20μs/token).
```bash
git clone --recursive https://github.com/BANANASJIM/renderformer-cuda.git
cd renderformer-cuda
./scripts/setup.sh
```

The setup script will check dependencies, download model weights (~1.9 GB), detect your GPU architecture, and build.
```bash
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86  # 86=Ampere, 89=Ada, 90=Hopper
cmake --build build -j$(nproc)
```

- CUDA Toolkit 12+ (cuBLAS, cuDNN)
- HDF5 C library
- C++20 compiler (GCC 11+ or Clang 14+)
- CMake 3.24+
- GLFW3 + OpenGL (optional, for interactive viewer)
Install on Ubuntu/Debian
```bash
sudo apt install cmake libhdf5-dev zlib1g-dev libglfw3-dev
# CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
# cuDNN: https://developer.nvidia.com/cudnn
```

Install on Arch Linux
```bash
sudo pacman -S cmake hdf5 zlib glfw
# CUDA + cuDNN from official NVIDIA repos or AUR
```

```bash
# Convert scene JSON → HDF5 (requires Python + renderformer repo)
python3 scene_processor/convert_scene.py examples/cbox.json --output_h5_path cbox.h5

# Render with CUDA engine
./build/renderformer \
    --model models/model.safetensors \
    --scene cbox.h5 \
    --output output/ \
    --res 512
```

Outputs per view: view_N.exr (HDR linear) and view_N.png (LDR, AgX tonemapped).
```bash
./build/renderformer_app \
    --model models/model.safetensors \
    --scene cbox.h5
```

Controls: mouse drag to orbit, scroll to zoom, WASD to pan, R to reset camera.
```
src/
  core/      cuda_utils.h, tensor.h           — GPU memory, streams, cuBLAS handles
  io/        safetensors, hdf5, image_writer  — model/scene loading, EXR/PNG output
  kernels/   rmsnorm, rope, nerf_pe,          — fused CUDA kernels
             softmax, activations,
             ray_gen, coord_transform
  layers/    linear, attention,               — neural network layers
             transformer, dpt
  model/     config.h, renderformer           — model config, weight mapping
  pipeline/  render_session, renderer         — session-based rendering API
  app/       viewer, camera                   — interactive GLFW/OpenGL viewer
test/        test_compare, test_encoder       — numerical validation tests
scripts/     setup.sh, validate.sh            — build and validation scripts
```
- Mixed-precision encoder — fp32 hidden state, fp16 GEMMs via tensor cores, fp32 RMSNorm + residual
- Full fp16 decoder — all GEMMs in fp16; softmax accumulated in fp32 with a fused fp16 cast
- cuDNN fused SDPA for cross-attention — single tensor-core kernel for Q×K^T→softmax→×V
- cuBLAS batched GEMM for swin self-attention — cuDNN SDPA too slow for small 64×64 windows
- cuDNN implicit GEMM for all DPT convolutions — replaces im2col + cuBLAS GEMM
- CUDA Graph capture (732 nodes) — zero CPU-side launch overhead per frame
- Cross-attention KV cache — K,V computed once after encoder, pre-padded for aligned GEMM
- Session-based API — all GPU buffers preallocated at session creation, zero mallocs in hot path
PSNR 27–46 dB (HDR EXR) vs PyTorch fp32 reference across test scenes. The fp16 precision loss is imperceptible in rendered images.
MIT License. See LICENSE for details.
Model weights are from microsoft/renderformer-v1.1-swin-large (MIT License).
```bibtex
@inproceedings{zeng2025renderformer,
  title     = {RenderFormer: Transformer-based Neural Rendering of Triangle Meshes with Global Illumination},
  author    = {Chong Zeng and Yue Dong and Pieter Peers and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2025 Conference Papers},
  year      = {2025}
}
```

- RenderFormer — original PyTorch implementation by Microsoft Research
- NVIDIA cuDNN Frontend — header-only C++ library for cuDNN graph API
- stb_image_write — PNG output
- tinyexr — EXR HDR output
