CUDA Renderer Performance: Vectorized Memory Access and Register Pressure #63
Description
Profiling render_kernel on an RTX 4060 Laptop GPU (sm_89) with Nsight Compute reveals two independent performance bottlenecks. Both have been confirmed at the hardware instruction level via cuobjdump. This issue tracks the investigation and fixes.
Environment
- GPU: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89)
- CUDA: 12.4
- Profiling tool: Nsight Compute 2024.1.1
- Kernel: render_kernel — path tracer using Cook-Torrance BRDF, BVH traversal, NEE (next-event estimation)
Bottleneck 1 — Scalar Global Memory Loads (Priority)
Evidence
cuobjdump -sass on libpbr_cuda.a shows every global memory load in the kernel is a 32-bit scalar LDG.E:

```
LDG.E R6, [R4.64]
LDG.E R8, [R4.64+0x4]
LDG.E R9, [R4.64+0x8]
```
Reading a single Vec3 (e.g. MaterialData::baseColor) generates three separate global memory instructions. With float4 the compiler should emit a single vectorized LDG.E.128:

```
LDG.E.128 R6, [R4.64]
```
Nsight Compute confirms the impact:
| Metric | Value |
|---|---|
| L1TEX Global Load Access Pattern | 4.0 of 32 bytes utilized per sector |
| Uncoalesced Global Accesses | 651,005,289 excessive sectors (36% of total) |
| Est. Speedup | 29.21% |
Bottleneck 2 — Register Pressure from Dual Traversal Stacks
Root Cause
Both traceRayBVH and traceShadowRayBVH declare int stack[64] — 256 bytes per thread. When the compiler inlines both functions into pathTracer_CookTorrance, both stacks coexist in the register file simultaneously (512 bytes per thread). The hardware cannot fit enough warps per SM, destroying the GPU's ability to hide memory latency by switching warps. Excess stack spills to local memory with strided, uncoalesced access patterns.
The actual required stack depth for a median-split BVH with MAX_LEAF_SIZE = 4:

```
depth = ceil(log2(N / 4))

  1,000 triangles -> depth 8
 10,000 triangles -> depth 12
100,000 triangles -> depth 15
```

stack[64] is never needed.
Fix
core_renderer.hpp — reduce stack size and prevent inlining:
```cpp
// Prevent compiler from merging both stacks into one activation frame
fgt_device_gpu __noinline__ bool traceRayBVH(...) {
    int stack[32];  // was 64
    ...
}

fgt_device_gpu __noinline__ bool traceShadowRayBVH(...) {
    int stack[32];  // was 64
    ...
}
```
The __noinline__ attribute is the more important change — it ensures the two stacks never coexist in registers simultaneously regardless of stack size.
To determine the exact required depth, instrument BVHBuilder::buildRecursive:

```cpp
int buildRecursive(..., int depth = 0) {
    m_maxDepth = std::max(m_maxDepth, depth);
    ...
    buildRecursive(..., depth + 1);
}
```
Print m_maxDepth after build and set GPU stack to that value plus a small buffer (e.g. +4).
Additional Fixes (Low Effort)
F_Schlick — replace powf with manual multiply
```cpp
// Before — full math library call
float pow5 = powf(1.0f - VoH, 5.0f);

// After — 4 multiplications
float x  = 1.0f - VoH;
float x2 = x * x;
float pow5 = x2 * x2 * x;
```
sampleHemisphere — eliminate acosf
```cpp
// Before
float theta = acosf(sqrtf(1.0f - u));
float xs = sinf(theta) * cosf(phi);
float ys = sinf(theta) * sinf(phi);
float zs = cosf(theta);

// After — identical distribution, no acosf
float zs = sqrtf(1.0f - u);
float r  = sqrtf(u);
float xs = r * cosf(phi);
float ys = r * sinf(phi);
```
AABB::center() — arithmetic bug
```cpp
// Before — dividing by 0.5 multiplies by 2
mid = mid / 0.5;

// After
mid = mid * 0.5f;
```
SAH partition — uncomment
BVHBuilder::partition has a complete SAH implementation commented out, replaced by a median split. SAH (surface area heuristic) typically produces a better-balanced, shallower tree, which directly reduces the required BVH stack depth and cuts traversal divergence. Uncomment it.