allow >16 dynamic buffer slots #55

@koubaa

Description

The 4-slot limit is an artificial restriction in the Slang layer, not a hardware constraint. Let me break down the layers:

Hardware reality

Push constants/root constants have a fixed hardware size:

| Platform | Guaranteed minimum | Typical desktop |
|---|---|---|
| Vulkan | 128 bytes | 256 bytes |
| DX12 | 64 DWORDs (256 bytes) | 256 bytes |
| Metal | ~4 KB (function constants) | 4 KB |

At 4 bytes per u32 index, 256 bytes gives you 64 slots. Goldy's backends already allocate 16 (MAX_PUSH_CONSTANT_INDICES), using only 64 bytes. The hardware can comfortably handle this.

Where the artificial limit comes from

The GoldyDynamicSlots struct was written for simple ping-pong demos (swap 2 buffers each frame), not for real compute pipelines:

public struct GoldyDynamicSlots {
    uint slots[4];  // up to 4 dynamic buffer indices -- the artificial cap
};

The DX12 path makes it even more restrictive by packing into uint4 and using a switch statement that only handles 4 cases. This is just underdesigned.

Why a fixed array at all?

The deeper question. Push constants are the fastest way to communicate per-dispatch data to a shader — they're loaded directly into shader registers with zero indirection, no memory fetch, no cache miss. But they're small and fixed-size. The alternatives form a trade-off spectrum:

  1. Push constants (current): Zero latency, no indirection, hardware size cap (~64 indices practical). Optimal for the common case.

  2. Indirection buffer: Store an array of resource indices in a GPU buffer, pass a single push constant pointing to the offset. Unlimited slots, but adds one memory read before you can access any resource. This is how production bindless engines handle large material/draw-data tables.

  3. Buffer Device Address: Pass raw 64-bit GPU pointers via push constants. 8 bytes each → half as many as u32 indices in the same space. More powerful but more limited in count.

The right design for Goldy

Given Goldy's philosophy of "abstract meaning, expose cost," the current design is wrong in two ways:

First, the 4-slot shader-side limit should simply be raised to match the 16-slot backend limit (MAX_PUSH_CONSTANT_INDICES). That mismatch is just a bug.

Second, even 16 shouldn't be a hard ceiling. For use cases beyond 16, the established pattern is to pack indices into a buffer:

// Shader: one push constant slot points to a buffer of resource indices
uint tableSlot = goldy_dyn_scattered<uint>(0);
RWStructuredBuffer<uint> indexTable = goldy_scattered<uint>(tableSlot);
// Then read as many indices as you need from that buffer
uint actualResourceIdx = indexTable[bindingNumber];
RWStructuredBuffer<MyType> resource = goldy_scattered<MyType>(actualResourceIdx);

This is exactly how production engines do it — push constants for the fast path, buffer indirection for unbounded bindings. Goldy could formalize this as a goldy_indirect_binding pattern.

The philosophical takeaway: there's no reason for a general-purpose GPU library to impose an arbitrary binding count limit below what the hardware supports. The limit should be the hardware's limit, clearly documented, with an escape hatch (buffer indirection) for anything beyond that.
