CUDA Renderer Performance: Vectorized Memory Access and Register Pressure #63

@juanchuletas

Description

Profiling render_kernel on an RTX 4060 Laptop GPU (sm_89) with Nsight Compute reveals two independent performance bottlenecks. Both have been confirmed at the hardware instruction level via cuobjdump. This issue tracks the investigation and fixes.


Environment

  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89)
  • CUDA: 12.4
  • Profiling tool: Nsight Compute 2024.1.1
  • Kernel: render_kernel — path tracer using Cook-Torrance BRDF, BVH traversal, NEE

Bottleneck 1 — Scalar Global Memory Loads (Priority)

Evidence

cuobjdump -sass on libpbr_cuda.a shows every global memory load in the kernel is a 32-bit scalar LDG.E:

LDG.E R6, [R4.64]
LDG.E R8, [R4.64+0x4]
LDG.E R9, [R4.64+0x8]

Reading a single Vec3 (e.g. MaterialData::baseColor) generates three separate global memory instructions. With float4 the compiler should emit a single LDG.E.128:

LDG.E.128 R6, [R4.64]

Nsight Compute confirms the impact:

Metric                              Value
L1TEX Global Load Access Pattern    4.0 of 32 bytes utilized per sector
Uncoalesced Global Accesses         651,005,289 excessive sectors (36% of total)
Est. Speedup                        29.21%
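One way to get the single 128-bit load without rewriting the math everywhere is to pad the struct to a 16-byte boundary; a minimal host-side sketch, where Vec3Padded is a hypothetical name rather than the repo's actual type:

```cpp
#include <cassert>

// Hypothetical padded vector: alignas(16) plus an explicit pad float
// gives the same size and alignment as float4, which is what lets
// nvcc fetch all components with one LDG.E.128 on the device side.
struct alignas(16) Vec3Padded {
    float x, y, z;
    float pad_;  // unused; brings sizeof to 16 bytes
};

static_assert(sizeof(Vec3Padded) == 16, "same size as float4");
static_assert(alignof(Vec3Padded) == 16, "16-byte aligned for vector loads");
```

The cost is 33% more memory per vector; whether that trade is worth it for MaterialData depends on how hot the loads are, which the metrics above suggest they are.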

Bottleneck 2 — Register Pressure from BVH Traversal Stacks

Root Cause

Both traceRayBVH and traceShadowRayBVH declare int stack[64] — 256 bytes per thread. When the compiler inlines both functions into pathTracer_CookTorrance, both stacks coexist in the register file simultaneously (512 bytes per thread). The hardware cannot fit enough warps per SM, destroying the GPU's ability to hide memory latency by switching warps. Excess stack spills to local memory with strided, uncoalesced access patterns.
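Back-of-envelope arithmetic makes the occupancy impact concrete. The per-SM limits below are assumed sm_89 figures (verify with cudaGetDeviceProperties):

```cpp
#include <cassert>

// Assumed sm_89 per-SM limits: 65,536 32-bit registers, 1,536 resident threads.
constexpr int kRegsPerSM       = 65536;
constexpr int kMaxThreadsPerSM = 1536;

// Two inlined int stack[64] arrays: 2 * 64 * 4 = 512 bytes per thread.
constexpr int kStackBytes      = 2 * 64 * 4;
constexpr int kRegsIfPromoted  = kStackBytes / 4;              // 128 registers
constexpr int kThreadsThatFit  = kRegsPerSM / kRegsIfPromoted; // 512 threads
```

Even if the compiler promoted both stacks entirely to registers, only 512 of 1,536 threads would fit per SM (one-third occupancy) before counting any other live state; in practice it spills to local memory instead.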

The actual required stack depth for a median-split BVH with MAX_LEAF_SIZE = 4 is:

depth = ceil(log2(N / 4))

1,000 triangles -> depth 8
10,000 triangles -> depth 12
100,000 triangles -> depth 15

stack[64] is never needed.
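The formula can be checked with a few lines of host code (bvhStackDepth is an illustrative helper, not part of the repo):

```cpp
#include <cassert>
#include <cmath>

// depth = ceil(log2(N / MAX_LEAF_SIZE)) for a balanced median-split BVH.
// Assumes numTriangles > maxLeafSize (otherwise the tree is a single leaf).
int bvhStackDepth(int numTriangles, int maxLeafSize = 4) {
    return static_cast<int>(
        std::ceil(std::log2(static_cast<double>(numTriangles) / maxLeafSize)));
}
```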

Fix

core_renderer.hpp — reduce stack size and prevent inlining:

// Prevent compiler from merging both stacks into one activation frame
fgt_device_gpu __noinline__ bool traceRayBVH(...) {
    int stack[32];  // was 64
    ...
}

fgt_device_gpu __noinline__ bool traceShadowRayBVH(...) {
    int stack[32];  // was 64
    ...
}

The __noinline__ attribute is the more important change — it ensures the two stacks never coexist in registers simultaneously regardless of stack size.

To determine the exact required depth, instrument BVHBuilder::buildRecursive:

int buildRecursive(..., int depth = 0) {
    m_maxDepth = std::max(m_maxDepth, depth);
    ...
    buildRecursive(..., depth + 1);
}

Print m_maxDepth after build and set GPU stack to that value plus a small buffer (e.g. +4).
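As a sanity check before wiring in the instrumentation, the median split can be simulated on the host. splitDepth is a hypothetical helper; the real depth depends on BVHBuilder's actual partitioning:

```cpp
#include <algorithm>
#include <cassert>

// Simulate median splits down to leaves of <= maxLeaf primitives and
// return the deepest recursion, mirroring the m_maxDepth counter above.
int splitDepth(int n, int maxLeaf = 4, int depth = 0) {
    if (n <= maxLeaf) return depth;
    int left = n / 2;
    return std::max(splitDepth(left, maxLeaf, depth + 1),
                    splitDepth(n - left, maxLeaf, depth + 1));
}
```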


Additional Fixes (Low Effort)

F_Schlick — replace powf with manual multiply

// Before — full math library call
float pow5 = powf(1.0f - VoH, 5.0f);

// After — 4 multiplications
float x = 1.0f - VoH;
float x2 = x * x;
float pow5 = x2 * x2 * x;
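Put together, a sketch of the full Schlick term using the multiply-based pow5. The F0/VoH names follow the snippet above; the signature itself is illustrative:

```cpp
#include <cassert>
#include <cmath>

// Schlick Fresnel approximation: F = F0 + (1 - F0) * (1 - VoH)^5,
// with the fifth power built from four multiplications instead of powf.
float F_Schlick(float F0, float VoH) {
    float x    = 1.0f - VoH;
    float x2   = x * x;
    float pow5 = x2 * x2 * x;
    return F0 + (1.0f - F0) * pow5;
}
```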

sampleHemisphere — eliminate acosf

// Before
float theta = acosf(sqrtf(1.0f - u));
float xs = sinf(theta) * cosf(phi);
float ys = sinf(theta) * sinf(phi);
float zs = cosf(theta);

// After — identical distribution, no acosf
float zs = sqrtf(1.0f - u);
float r = sqrtf(u);
float xs = r * cosf(phi);
float ys = r * sinf(phi);
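As a self-contained illustration (struct and signature are hypothetical, not the repo's types), the acosf-free form:

```cpp
#include <cassert>
#include <cmath>

struct Dir { float x, y, z; };

// Cosine-weighted hemisphere sample without acosf; u in [0,1) and
// phi in [0, 2*pi) are the uniform random inputs from the snippet above.
Dir sampleHemisphere(float u, float phi) {
    float zs = std::sqrt(1.0f - u);  // cos(theta)
    float r  = std::sqrt(u);         // sin(theta)
    return { r * std::cos(phi), r * std::sin(phi), zs };
}
```

The direction is unit length by construction: x² + y² + z² = u·cos²φ + u·sin²φ + (1 − u) = 1.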

AABB::center() — arithmetic bug

// Before — divides by 0.5 which multiplies by 2
mid = mid / 0.5;

// After
mid = mid * 0.5f;
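A quick demonstration of the bug with illustrative values (min = 1, max = 3 along one axis, so the true midpoint is 2):

```cpp
#include <cassert>

// Dividing the min+max sum by 0.5 doubles it instead of halving it.
float centerWrong(float lo, float hi) { return (lo + hi) / 0.5f; }  // 4x the midpoint
float centerRight(float lo, float hi) { return (lo + hi) * 0.5f; }  // true midpoint
```

A wrong AABB center skews every median split, so this bug also degrades BVH quality, not just the center query itself.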

SAH partition — uncomment

BVHBuilder::partition has a complete SAH implementation commented out, replaced by median split. SAH produces a shallower, more balanced tree which directly reduces required BVH stack depth and reduces traversal divergence. Uncomment it.
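For reference, the standard SAH split cost that the commented-out code presumably evaluates; the names and default constants here are illustrative and may differ from the repo's version:

```cpp
#include <cassert>
#include <cmath>

// Surface-area heuristic: traversal cost plus each child's intersection
// cost weighted by its hit probability (surface-area ratio to the parent).
float sahCost(float areaLeft, float areaRight, float areaParent,
              int countLeft, int countRight,
              float cTraverse = 1.0f, float cIntersect = 2.0f) {
    return cTraverse
         + (areaLeft  / areaParent) * countLeft  * cIntersect
         + (areaRight / areaParent) * countRight * cIntersect;
}
```

The split minimizing this cost is taken per node; when every candidate split costs more than intersecting the primitives directly, the node becomes a leaf.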

