Optimize Path Tracer Performance with Emissive Geometry #48
Description
Problem
After implementing emissive geometry sampling (#41), render performance has degraded significantly:
| Resolution | Before Emissives | After Emissives |
|---|---|---|
| 4K @ 200spp | ~20s | ~80s (estimated) |
| 1080p @ 200spp | ~5s | ~20s |
Root cause: Each path bounce now shoots an additional shadow ray to a sampled emissive triangle, effectively doubling BVH traversal cost per bounce.
Proposed Optimizations
1. Progressive Rendering Loop (High Priority)
Current implementation:
```cpp
// Single kernel launch - 200 samples computed per pixel in one go
render_kernel<<<grid, block>>>(..., samplesPerPixel=200, ...);
```
Problems:
- GPU timeout risk on long renders
- No progress feedback
- No preview capability
- `curand_init()` called per pixel (expensive, ~1000 cycles)
Proposed implementation:
```cpp
// Multiple kernel launches - 1 sample per launch, accumulate results
for (int sample = 0; sample < totalSamples; sample++) {
    render_kernel<<<grid, block>>>(..., sampleIndex=sample);
    // Optional: progress callback, preview update
    if (sample % 10 == 0) { reportProgress(sample, totalSamples); }
}

// Finalize: divide accumulated buffer by sample count
finalize_kernel<<<grid, block>>>(framebuffer, imageSize, 1.0f / totalSamples);
```
Benefits:
- No GPU timeout (each launch is fast)
- Real-time progress feedback
- Can display progressive preview during render
- Enables future adaptive sampling
- Removes `curand_init()` entirely: `fungt::RNG` can be seeded by `sampleIndex`
2. Shadow Ray Early Exit (High Priority)
Current: `traceRayBVH()` finds the closest hit, which is unnecessary work for shadow rays - they only need to know whether any occluder exists.
Proposed: Add traceShadowRayBVH() that returns on ANY hit:
```cpp
fgt_device_gpu bool traceShadowRayBVH(
    const fungt::Ray& ray, const Triangle* tris,
    const BVHNode* bvhNodes, int numNodes, float maxDist)
{
    int stack[64];
    int stackPtr = 0;
    stack[stackPtr++] = 0;

    while (stackPtr > 0) {
        int nodeIdx = stack[--stackPtr];
        const BVHNode& node = bvhNodes[nodeIdx];

        if (!Intersection::intersectAABB(ray, node.m_boundingBox, 0.001f, maxDist))
            continue;

        if (node.isLeaf()) {
            for (int i = 0; i < node.triCount; i++) {
                int triIdx = node.firstTriIdx + i;
                HitData temp;
                if (Intersection::MollerTrumbore(ray, tris[triIdx], 0.001f, maxDist, temp)) {
                    return true; // EARLY EXIT
                }
            }
        } else {
            stack[stackPtr++] = node.leftChild;
            stack[stackPtr++] = node.rightChild;
        }
    }
    return false;
}
```
Expected impact: 30-50% faster shadow ray tests.
3. Limit NEE to First N Bounces (Medium Priority)
Emissive contribution after bounce 2-3 is negligible for most scenes.
```cpp
// Only do NEE on first 3 bounces
if (numOfEmissiveTris > 0 && bounce < 3) {
    // NEE sampling code
}
```
Expected impact: ~40% reduction in shadow rays.
4. Remove Unused curandState (Quick Win)
Current kernel initializes both RNGs but only uses one:
```cpp
fungt::RNG rng(idx * 1337ULL + 123ULL);        // Used
curandState randomState;
curand_init(seed + idx, 0, 0, &randomState);   // NOT USED - ~1000 cycles wasted
```
Remove curandState entirely and update pathTracer_CookTorrance signature.
5. Block Size Tuning (Quick Win)
Path tracing kernels are branch-divergent and register-heavy. Block size does not change warp divergence itself (warps are always 32 threads), but smaller blocks can improve occupancy and give the scheduler more flexibility, so both sizes should be benchmarked.
```cpp
// Current
dim3 block(16, 16); // 256 threads

// Proposed
dim3 block(8, 8);   // 64 threads - test performance
```
6. Shared Memory for Light Data (Low Priority)
Cache emissive triangle indices in shared memory:
```cpp
__shared__ int sharedEmissive[64];
int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (tid < numOfEmissiveTris && tid < 64) {
    sharedEmissive[tid] = emissiveTris[tid];
}
__syncthreads();
```
Expected impact: Minimal (5-10%) - emissive list is small and likely L2 cached.
Tasks
- Implement progressive rendering loop in `RenderScene()`
- Add `finalize_kernel()` for GPU-side division
- Implement `traceShadowRayBVH()` with early exit
- Limit NEE to first 3 bounces
- Remove `curandState` from kernel and path tracer
- Benchmark 8x8 vs 16x16 block sizes
- (Optional) Add shared memory caching for emissive indices
Acceptance Criteria
- 1080p @ 200spp renders in < 10s (50% improvement)
- Progress feedback during render
- No GPU timeout on 4K @ 500spp
- No visual quality regression
Technical Notes
Progressive Accumulation Math
Each kernel launch adds one sample:
```cpp
framebuffer[idx] += sample_contribution;
```
Final normalization:
```cpp
framebuffer[idx] /= totalSamples;
```
RNG Seeding for Progressive Rendering
Each sample needs unique randomness:
```cpp
fungt::RNG rng(pixelIndex * 1337ULL + sampleIndex * 7919ULL);
```
Using prime multipliers ensures good distribution across both pixel and sample dimensions.
Related
- Depends on: Emissive Geometry Sampling (Area Lights) #41 (Emissive Geometry Sampling)
- Related to: Future adaptive sampling implementation
Labels
enhancement performance path-tracer cuda