Optimize Path Tracer Performance with Emissive Geometry #48

@juanchuletas

Description

Problem

After implementing emissive geometry sampling (#41), render performance has degraded significantly:

| Resolution      | Before Emissives | After Emissives   |
|-----------------|------------------|-------------------|
| 4K @ 200spp     | ~20s             | ~80s (estimated)  |
| 1080p @ 200spp  | ~5s              | ~20s              |

Root cause: Each path bounce now shoots an additional shadow ray to a sampled emissive triangle, effectively doubling BVH traversal cost per bounce.

Proposed Optimizations

1. Progressive Rendering Loop (High Priority)

Current implementation:

// Single kernel launch - 200 samples computed per pixel in one go
render_kernel<<<grid, block>>>(..., samplesPerPixel=200, ...);

Problems:

  • GPU timeout risk on long renders
  • No progress feedback
  • No preview capability
  • curand_init() called per pixel (expensive, ~1000 cycles)

Proposed implementation:

// Multiple kernel launches - 1 sample per launch, accumulate results
for (int sample = 0; sample < totalSamples; sample++) {
    render_kernel<<<grid, block>>>(..., sampleIndex=sample);

    // Optional: progress callback, preview update
    if (sample % 10 == 0) {
        reportProgress(sample, totalSamples);
    }
}
// Finalize: divide accumulated buffer by sample count
finalize_kernel<<<grid, block>>>(framebuffer, imageSize, 1.0f / totalSamples);

Benefits:

  • No GPU timeout (each launch is fast)
  • Real-time progress feedback
  • Can display progressive preview during render
  • Enables future adaptive sampling
  • Remove curand_init() entirely - use fungt::RNG seeded by sampleIndex

2. Shadow Ray Early Exit (High Priority)

Current: traceRayBVH() finds the closest hit, which is unnecessary for shadow rays; any occluder along the segment is enough to know the light is blocked.

Proposed: Add traceShadowRayBVH() that returns on ANY hit:

fgt_device_gpu bool traceShadowRayBVH(
    const fungt::Ray& ray,
    const Triangle* tris,
    const BVHNode* bvhNodes,
    int numNodes,
    float maxDist)
{
    int stack[64];
    int stackPtr = 0;
    stack[stackPtr++] = 0;

    while (stackPtr > 0) {
        int nodeIdx = stack[--stackPtr];
        const BVHNode& node = bvhNodes[nodeIdx];

        if (!Intersection::intersectAABB(ray, node.m_boundingBox, 0.001f, maxDist))
            continue;

        if (node.isLeaf()) {
            for (int i = 0; i < node.triCount; i++) {
                int triIdx = node.firstTriIdx + i;
                HitData temp;
                if (Intersection::MollerTrumbore(ray, tris[triIdx], 0.001f, maxDist, temp)) {
                    return true;  // EARLY EXIT: any occluder blocks the light
                }
            }
        } else {
            stack[stackPtr++] = node.leftChild;
            stack[stackPtr++] = node.rightChild;
        }
    }
    return false;
}

Expected impact: 30-50% faster shadow ray tests.


3. Limit NEE to First N Bounces (Medium Priority)

For most scenes, the direct-light contribution gathered by NEE after the second or third bounce is negligible.

// Only do NEE on first 3 bounces
if (numOfEmissiveTris > 0 && bounce < 3) {
    // NEE sampling code
}

Expected impact: ~40% reduction in shadow rays (assuming an average path depth of ~5, restricting NEE to the first 3 bounces skips roughly 2 in 5 shadow rays).


4. Remove Unused curandState (Quick Win)

Current kernel initializes both RNGs but only uses one:

fungt::RNG rng(idx * 1337ULL + 123ULL);      // Used
curandState randomState;
curand_init(seed + idx, 0, 0, &randomState);  // NOT USED - 1000 cycles wasted

Remove curandState entirely and update pathTracer_CookTorrance signature.


5. Block Size Tuning (Quick Win)

Path tracing has heavily divergent branching. Smaller blocks do not reduce intra-warp divergence (warps are always 32 threads), but they can give the scheduler more flexibility and reduce the cost of long-tail warps in a block; worth benchmarking.

// Current
dim3 block(16, 16);  // 256 threads

// Proposed
dim3 block(8, 8); // 64 threads - test performance


6. Shared Memory for Light Data (Low Priority)

Cache emissive triangle indices in shared memory:

__shared__ int sharedEmissive[64];

int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (tid < numOfEmissiveTris && tid < 64) {
    sharedEmissive[tid] = emissiveTris[tid];
}
__syncthreads();

Expected impact: Minimal (5-10%) - emissive list is small and likely L2 cached.


Tasks

  • Implement progressive rendering loop in RenderScene()
  • Add finalize_kernel() for GPU-side division
  • Implement traceShadowRayBVH() with early exit
  • Limit NEE to first 3 bounces
  • Remove curandState from kernel and path tracer
  • Benchmark 8x8 vs 16x16 block sizes
  • (Optional) Add shared memory caching for emissive indices

Acceptance Criteria

  • 1080p @ 200spp renders in < 10s (50% improvement)
  • Progress feedback during render
  • No GPU timeout on 4K @ 500spp
  • No visual quality regression

Technical Notes

Progressive Accumulation Math

Each kernel launch adds one sample:

framebuffer[idx] += sample_contribution

Final normalization:

framebuffer[idx] /= totalSamples

RNG Seeding for Progressive Rendering

Each sample needs unique randomness:

fungt::RNG rng(pixelIndex * 1337ULL + sampleIndex * 7919ULL);

Using coprime multipliers (7919 is prime; 1337 = 7 × 191) helps decorrelate seeds across the pixel and sample dimensions.

Labels

enhancement performance path-tracer cuda
