The 4-slot limit is an artificial restriction in the Slang layer, not a hardware constraint. Let me break down the layers:
Hardware reality
Push constants/root constants have a fixed hardware size:
| Platform | Guaranteed minimum | Typical desktop |
|----------|--------------------|-----------------|
| Vulkan   | 128 bytes          | 256 bytes       |
| DX12     | 64 DWORDs (256 bytes) | 256 bytes    |
| Metal    | ~4 KB (function constants) | 4 KB    |
At 4 bytes per u32 index, 256 bytes gives you 64 slots. Goldy's backends already allocate 16 (MAX_PUSH_CONSTANT_INDICES), using only 64 bytes. The hardware can comfortably handle this.
Where the artificial limit comes from
The GoldyDynamicSlots struct was written for simple ping-pong demos (swap 2 buffers each frame), not for real compute pipelines:
```slang
public struct GoldyDynamicSlots {
    uint slots[4]; // Up to 4 dynamic buffer indices
};
```
The DX12 path makes it even more restrictive by packing into uint4 and using a switch statement that only handles 4 cases. This is just underdesigned.
Why a fixed array at all?
The deeper question. Push constants are the fastest way to communicate per-dispatch data to a shader — they're loaded directly into shader registers with zero indirection, no memory fetch, no cache miss. But they're small and fixed-size. The alternatives form a trade-off spectrum:
- Push constants (current): Zero latency, no indirection, hardware size cap (~64 indices practical). Optimal for the common case.
- Indirection buffer: Store an array of resource indices in a GPU buffer, pass a single push constant pointing to the offset. Unlimited slots, but adds one memory read before you can access any resource. This is how production bindless engines handle large material/draw-data tables.
- Buffer Device Address: Pass raw 64-bit GPU pointers via push constants. 8 bytes each → half as many as u32 indices in the same space. More powerful but more limited in count.
The right design for Goldy
Given Goldy's philosophy of "abstract meaning, expose cost," the current design is wrong in two ways:
First, the 4-slot shader-side limit should match the 16-slot backend limit; the mismatch is simply a bug, and raising the shader-side array to 16 is trivial.
Second, even 16 shouldn't be a hard ceiling. For use cases beyond 16, the established pattern is to pack indices into a buffer:
```slang
// Shader: one push constant slot points to an index buffer
RWStructuredBuffer<uint> indexBuffer = goldy_dyn_scattered<uint>(0);
// Then read as many indices as you need from that buffer
uint actualResourceIdx = indexBuffer[bindingNumber];
RWStructuredBuffer<MyType> resource = goldy_scattered<MyType>(actualResourceIdx);
```
This is exactly how production engines do it — push constants for the fast path, buffer indirection for unbounded bindings. Goldy could formalize this as a goldy_indirect_binding pattern.
The philosophical takeaway: there's no reason for a general-purpose GPU library to impose an arbitrary binding count limit below what the hardware supports. The limit should be the hardware's limit, clearly documented, with an escape hatch (buffer indirection) for anything beyond that.