Optimize Path Tracer Performance with Emissive Geometry #48

@juanchuletas

Description

Problem

After implementing emissive geometry sampling (#41), render performance has degraded significantly:

| Resolution      | Before Emissives | After Emissives   |
|-----------------|------------------|-------------------|
| 4K @ 200spp     | ~20s             | ~80s (estimated)  |
| 1080p @ 200spp  | ~5s              | ~20s              |

Root cause: Each path bounce now shoots an additional shadow ray to a sampled emissive triangle, effectively doubling BVH traversal cost per bounce.

Proposed Optimizations

1. Progressive Rendering Loop (High Priority)

Current implementation:

// Single kernel launch - 200 samples computed per pixel in one go
render_kernel<<<grid, block>>>(..., samplesPerPixel=200, ...);

Problems:

  • GPU timeout risk on long renders
  • No progress feedback
  • No preview capability
  • curand_init() called per pixel (expensive, ~1000 cycles)

Proposed implementation:

// Multiple kernel launches - 1 sample per launch, accumulate results
for (int sample = 0; sample < totalSamples; sample++) {
    render_kernel<<<grid, block>>>(..., sampleIndex=sample);

    // Optional: progress callback, preview update
    if (sample % 10 == 0) {
        reportProgress(sample, totalSamples);
    }
}
// Finalize: divide accumulated buffer by sample count
finalize_kernel<<<grid, block>>>(framebuffer, imageSize, 1.0f / totalSamples);

Benefits:

  • No GPU timeout (each launch is fast)
  • Real-time progress feedback
  • Can display progressive preview during render
  • Enables future adaptive sampling
  • Remove curand_init() entirely - use fungt::RNG seeded by sampleIndex

2. Shadow Ray Early Exit (High Priority)

Current: traceRayBVH() finds the closest hit, which is unnecessary for shadow rays; any occluder along the segment is enough to know the light is blocked.

Proposed: Add traceShadowRayBVH() that returns on ANY hit:

fgt_device_gpu bool traceShadowRayBVH(
    const fungt::Ray& ray,
    const Triangle* tris,
    const BVHNode* bvhNodes,
    int numNodes,
    float maxDist)
{
    int stack[64];
    int stackPtr = 0;
    stack[stackPtr++] = 0;

    while (stackPtr > 0) {
        int nodeIdx = stack[--stackPtr];
        const BVHNode& node = bvhNodes[nodeIdx];

        if (!Intersection::intersectAABB(ray, node.m_boundingBox, 0.001f, maxDist))
            continue;

        if (node.isLeaf()) {
            for (int i = 0; i < node.triCount; i++) {
                int triIdx = node.firstTriIdx + i;
                HitData temp;
                if (Intersection::MollerTrumbore(ray, tris[triIdx], 0.001f, maxDist, temp)) {
                    return true;  // EARLY EXIT: any occluder blocks the light
                }
            }
        } else {
            stack[stackPtr++] = node.leftChild;
            stack[stackPtr++] = node.rightChild;
        }
    }
    return false;
}

Expected impact: 30-50% faster shadow ray tests.


3. Limit NEE to First N Bounces (Medium Priority)

For most scenes, the direct-light contribution gathered by NEE after the second or third bounce is negligible.

// Only do NEE on first 3 bounces
if (numOfEmissiveTris > 0 && bounce < 3) {
    // NEE sampling code
}

Expected impact: ~40% reduction in shadow rays (assuming an average path depth of ~5, restricting NEE to the first 3 bounces skips roughly 2 in 5 shadow rays).


4. Remove Unused curandState (Quick Win)

Current kernel initializes both RNGs but only uses one:

fungt::RNG rng(idx * 1337ULL + 123ULL);      // Used
curandState randomState;
curand_init(seed + idx, 0, 0, &randomState);  // NOT USED - 1000 cycles wasted

Remove curandState entirely and update pathTracer_CookTorrance signature.


5. Block Size Tuning (Quick Win)

Path tracing has heavily divergent branching. Smaller blocks do not reduce intra-warp divergence (warps are always 32 threads), but they can give the scheduler more flexibility and reduce the cost of long-tail warps in a block; worth benchmarking.

// Current
dim3 block(16, 16);  // 256 threads

// Proposed
dim3 block(8, 8); // 64 threads - test performance


6. Shared Memory for Light Data (Low Priority)

Cache emissive triangle indices in shared memory:

__shared__ int sharedEmissive[64];

int tid = threadIdx.y * blockDim.x + threadIdx.x;
if (tid < numOfEmissiveTris && tid < 64) {
    sharedEmissive[tid] = emissiveTris[tid];
}
__syncthreads();

Expected impact: Minimal (5-10%) - emissive list is small and likely L2 cached.


Tasks

  • Implement progressive rendering loop in RenderScene()
  • Add finalize_kernel() for GPU-side division
  • Implement traceShadowRayBVH() with early exit
  • Limit NEE to first 3 bounces
  • Remove curandState from kernel and path tracer
  • Benchmark 8x8 vs 16x16 block sizes
  • (Optional) Add shared memory caching for emissive indices

Acceptance Criteria

  • 1080p @ 200spp renders in < 10s (50% improvement)
  • Progress feedback during render
  • No GPU timeout on 4K @ 500spp
  • No visual quality regression

Technical Notes

Progressive Accumulation Math

Each kernel launch adds one sample:

framebuffer[idx] += sample_contribution

Final normalization:

framebuffer[idx] /= totalSamples

RNG Seeding for Progressive Rendering

Each sample needs unique randomness:

fungt::RNG rng(pixelIndex * 1337ULL + sampleIndex * 7919ULL);

Using coprime multipliers (7919 is prime; 1337 = 7 × 191) helps decorrelate seeds across the pixel and sample dimensions.

Labels

enhancement performance path-tracer cuda
