@kvark commented Sep 17, 2024

Experiment branched off #161

The idea is to fetch the GBuffers only once, when the workgroup initializes. We can place that data in workgroup memory and reuse it spatially. This would reduce VRAM traffic (and latency).
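As a rough illustration of the expected saving, here is a minimal CPU-side sketch in Rust that counts texel requests per workgroup. It is not the shader code from this PR: the 8x8 group size, the reuse radius, and the function names `direct_fetches` and `cached_fetches` are all assumptions made for the example, and the model ignores hardware caches entirely.

```rust
/// Texel requests issued by one workgroup when every thread reads its own
/// (2r+1)^2 neighborhood straight from the GBuffer textures.
/// (Hypothetical model; ignores L1/L2 caches.)
fn direct_fetches(group_size: u32, radius: u32) -> u32 {
    let window = 2 * radius + 1;
    group_size * group_size * window * window
}

/// Texel requests when the workgroup first loads its tile, plus an apron of
/// `radius` texels on each side, into workgroup memory and then reads only
/// from that copy.
fn cached_fetches(group_size: u32, radius: u32) -> u32 {
    let side = group_size + 2 * radius;
    side * side
}

fn main() {
    // Assumed values: 8x8 workgroup, spatial reuse within 2 pixels.
    let (group_size, radius) = (8, 2);
    println!("direct: {}", direct_fetches(group_size, radius));
    println!("cached: {}", cached_fetches(group_size, radius));
}
```

Under these assumptions the cached variant issues 144 requests instead of 1600. The count says nothing about occupancy or about how much of the direct traffic the hardware caches already absorb.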
In practice, it turned out to be significantly slower:
[Image: Local4-gbuffer-merge]

I suspect this could be due to:

  • the driver achieving better occupancy when shaders are smaller. Nsight confirms this to some extent, although it's not entirely straightforward, since NVIDIA hardware can have variable register occupancy during shader execution.
  • the separate GBuffer pass being more local, since it doesn't re-shuffle the groups into clusters.
  • the GBuffer pass being able to overlap VRAM access latency with RT core utilization, while the merged pass becomes more blocked on the RT cores.

Update

I can confirm that locality is the biggest factor. Here is a run with group shuffling disabled; it is much faster.
[Image: Local4a-gbuffer-tight]
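To make the locality effect concrete, the sketch below counts how many texel tiles one workgroup touches under two thread-to-pixel mappings. The `scattered` mapping is purely illustrative and is not the shuffling scheme used in this branch; the 8x8 group size, the 8x8 tile granularity, and all function names are assumptions.

```rust
use std::collections::HashSet;

const GROUP: u32 = 8; // assumed workgroup side, in threads
const TILE: u32 = 8; // assumed memory tile side, in texels

/// Tight mapping: the threads of a group cover one contiguous pixel block.
fn tight(gx: u32, gy: u32, tx: u32, ty: u32) -> (u32, u32) {
    (gx * GROUP + tx, gy * GROUP + ty)
}

/// Illustrative scattered mapping: the threads of a group are spread over
/// the whole target with a stride of `groups_per_axis` pixels.
fn scattered(gx: u32, gy: u32, tx: u32, ty: u32, groups_per_axis: u32) -> (u32, u32) {
    (tx * groups_per_axis + gx, ty * groups_per_axis + gy)
}

/// Number of distinct tiles touched by all threads of one workgroup.
fn tiles_touched(map: impl Fn(u32, u32) -> (u32, u32)) -> usize {
    let mut tiles = HashSet::new();
    for ty in 0..GROUP {
        for tx in 0..GROUP {
            let (x, y) = map(tx, ty);
            tiles.insert((x / TILE, y / TILE));
        }
    }
    tiles.len()
}

fn main() {
    let groups_per_axis = 16; // 128x128 pixel target
    let t = tiles_touched(|tx, ty| tight(3, 5, tx, ty));
    let s = tiles_touched(|tx, ty| scattered(3, 5, tx, ty, groups_per_axis));
    println!("tight: {t} tile(s), scattered: {s} tile(s)");
}
```

With these numbers the tight mapping stays inside a single tile, while the scattered one touches 64 different tiles, so a per-workgroup GBuffer fetch loses most of its spatial coherence.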

@kvark added the "type: experiment" label Sep 17, 2024
@kvark mentioned this pull request Sep 17, 2024
@kvark changed the title from "Experiment/merge gbuffer pass" to "ReSTIR merged GBuffer pass" Sep 17, 2024