
Improve wait_for API to return Result #285

@Vanuan

Description

Motivation

When you submit work to the GPU (like rendering a frame), it keeps running asynchronously after your CPU code returns. If the CPU immediately deletes textures or reuses buffers, you can corrupt memory or crash. The wait_for API provides the necessary synchronization by blocking until the GPU reaches a specific synchronization point, giving the CPU a safe way to know that a particular batch of work, identified by that sync point, has finished before destroying or reusing the resources it touched.

Currently, it returns a plain boolean that loses critical error information such as a timeout, device loss, or other backend-specific failures. Every failure looks the same to the caller: a false value.

Details

The CommandDevice trait defines the contract for GPU synchronization. It includes:

submit() which returns a SyncPoint representing GPU work completion
wait_for() which blocks until that work finishes or times out
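
A simplified sketch of that contract, based only on the description above (the real trait carries more associated types and methods):

```rust
/// Simplified sketch of the synchronization contract described above;
/// the actual trait has more associated types and methods.
pub trait CommandDevice {
    type CommandEncoder;
    /// Opaque token identifying a submitted batch of GPU work.
    type SyncPoint;

    /// Submit recorded work and get back a sync point for it.
    fn submit(&self, encoder: &mut Self::CommandEncoder) -> Self::SyncPoint;

    /// Block until the sync point is reached or `timeout_ms` elapses.
    /// Today this returns a bare `bool`: `true` on completion,
    /// `false` for a timeout *or any other failure*.
    fn wait_for(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> bool;
}
```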

Each backend implements this differently:

Vulkan: Uses timeline semaphores with nanosecond precision
GLES/WebGL: Uses GL sync objects with millisecond precision
Metal: Polls command buffer status in a loop

The current boolean return forces all callers to treat any failure identically, preventing proper error handling and recovery strategies.
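
One hypothetical shape for the improved contract; the error type and its variant names below are illustrative only, not a committed design:

```rust
/// Hypothetical error type for a Result-returning wait_for;
/// the variants and names are placeholders for discussion.
#[derive(Debug)]
pub enum WaitError {
    /// The timeout elapsed; the device is still functional and a retry is valid.
    Timeout,
    /// The device was lost; the caller needs a full recovery path.
    DeviceLost,
    /// Any other backend-specific failure, preserved for diagnostics.
    Other(String),
}

// The trait method could then become:
// fn wait_for(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> Result<(), WaitError>;
```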

Backend-specific implementations


Vulkan Backend Implementation

The Vulkan backend implements wait_for using timeline semaphores. It locks the queue to access the timeline semaphore, creates a wait info structure with the sync point's progress value, and calls the Vulkan driver. The implementation maps the timeout from milliseconds to nanoseconds and converts the Vulkan Result to a boolean using .is_ok(), which discards the specific error type.
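
A rough sketch of that flow, assuming a recent version of the ash bindings; queue locking and the surrounding state are elided:

```rust
use ash::vk;

// Rough sketch of the current Vulkan path, assuming recent ash bindings;
// queue locking and surrounding state are elided.
fn wait_for(device: &ash::Device, timeline: vk::Semaphore, progress: u64, timeout_ms: u32) -> bool {
    let semaphores = [timeline];
    let values = [progress];
    let wait_info = vk::SemaphoreWaitInfo::default()
        .semaphores(&semaphores)
        .values(&values);
    // Milliseconds -> nanoseconds, as described above.
    let timeout_ns = timeout_ms as u64 * 1_000_000;
    // `wait_semaphores` returns a VkResult; `.is_ok()` throws away whether the
    // failure was TIMEOUT, ERROR_DEVICE_LOST, or something else entirely.
    unsafe { device.wait_semaphores(&wait_info, timeout_ns) }.is_ok()
}
```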

The key issue is that different error conditions require different handling strategies:

TIMEOUT: The operation timed out but the device is still functional
DEVICE_LOST: The GPU was lost and needs recovery
OUT_OF_DATE: The surface is out of date (common during resize)
...

By converting all these to false, the API forces callers to treat every failure identically, limiting robust error recovery.
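
Tying this to the hypothetical WaitError sketched above, the backend could preserve those distinctions instead of flattening them (again assuming the ash vk::Result constants):

```rust
use ash::vk;

// Illustrative mapping from Vulkan results to the hypothetical WaitError
// sketched earlier; OUT_OF_DATE and other cases would land in `Other` here.
fn map_vk_result(res: vk::Result) -> Result<(), WaitError> {
    match res {
        vk::Result::SUCCESS => Ok(()),
        vk::Result::TIMEOUT => Err(WaitError::Timeout),
        vk::Result::ERROR_DEVICE_LOST => Err(WaitError::DeviceLost),
        other => Err(WaitError::Other(format!("{other:?}"))),
    }
}
```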

GLES/WebGL Backend Implementation

The GLES implementation uses OpenGL's sync objects (gl.client_wait_sync) to block until GPU work completes. It converts the millisecond timeout to nanoseconds, with special handling for WebGL's 1-second timeout limit. The function returns true only when the GPU signals completion (ALREADY_SIGNALED or CONDITION_SATISFIED) and false for timeouts or any other error conditions.

Key behavior: A zero timeout enables non-blocking polling (useful for checking if resources are available), while !0 (max u32) creates an indefinite block (used when you must wait before proceeding). The current boolean return collapses all error types into a simple success/failure, which is why there's interest in migrating to a Result type for better error handling.
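
A sketch of how the status from gl.client_wait_sync collapses into a boolean today, assuming the glow crate's GL constants; the call itself and the WebGL timeout clamping are elided:

```rust
// Sketch of interpreting the status returned by gl.client_wait_sync,
// assuming the glow crate's GL constants.
fn sync_status_to_bool(status: u32) -> bool {
    match status {
        // The GPU had already passed, or just reached, the sync point.
        glow::ALREADY_SIGNALED | glow::CONDITION_SATISFIED => true,
        // TIMEOUT_EXPIRED, WAIT_FAILED, and anything unexpected all collapse
        // into `false`, which is exactly the information loss this issue is about.
        _ => false,
    }
}
```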

Metal Backend Implementation

The Metal implementation uses a simple polling loop because Metal's command buffers don't support efficient blocking waits like Vulkan's timeline semaphores. It records the start time, then continuously checks the command buffer status. When the status is "Completed", it returns true. The key limitation is that error states are silently ignored - if the command buffer fails, the loop continues until timeout, then returns false just like a timeout condition.

The polling approach with 1ms sleeps is inefficient but necessary given Metal's API constraints. This design loses valuable error information that could help applications distinguish between timeouts, device loss, or actual command buffer errors.
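
A minimal sketch of that loop, assuming the metal crate's command buffer API:

```rust
use std::time::{Duration, Instant};

// Minimal sketch of the polling wait described above, assuming the metal
// crate; as in the current implementation, error statuses are not distinguished.
fn wait_for(cmd_buf: &metal::CommandBufferRef, timeout_ms: u32) -> bool {
    let start = Instant::now();
    loop {
        match cmd_buf.status() {
            metal::MTLCommandBufferStatus::Completed => return true,
            // An `Error` status falls through here, so a failed command buffer
            // ends up looking identical to a plain timeout.
            _ => {}
        }
        if start.elapsed() >= Duration::from_millis(timeout_ms as u64) {
            return false;
        }
        std::thread::sleep(Duration::from_millis(1));
    }
}
```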

Existing usage of wait_for


FramePacer

The FramePacer uses wait_for to enforce a strict "one frame at a time" execution model, guaranteeing that the previous frame's GPU work completes before the next frame begins and before any temporary resources are recycled.

The FramePacer maintains a sync point from the previous frame and blocks indefinitely (!0 timeout) until GPU work completes. This blocking wait happens at three critical points:

Frame start - wait_for_previous_frame() blocks until the previous frame finishes
Frame end - Called automatically after submitting the current frame
Cleanup - Ensures all GPU work is done before destroying the FramePacer

After the wait succeeds, the code safely destroys buffers and acceleration structures from the previous frame, knowing the GPU no longer accesses them. The current implementation assumes the wait always succeeds, which is why it returns void rather than handling errors - a design choice that needs reconsideration for robust error handling.
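
A condensed sketch of that pattern, written against the simplified trait above; the closure is a placeholder for the buffer and acceleration-structure destruction described in the text:

```rust
// Condensed sketch of the FramePacer's blocking wait, using the simplified
// trait above; resource destruction is represented by a placeholder closure.
fn wait_for_previous_frame<D: CommandDevice>(
    gpu: &D,
    prev: &mut Option<D::SyncPoint>,
    recycle_previous_frame_resources: impl FnOnce(),
) {
    if let Some(sp) = prev.take() {
        // Block indefinitely; the current API gives the caller no way to
        // observe a failure here, so success is simply assumed.
        let _finished = gpu.wait_for(&sp, !0);
        // Only after this point is it safe to destroy last frame's temporaries.
        recycle_previous_frame_resources();
    }
}
```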

BufferBelt

GPU operations are asynchronous - when you submit work to the GPU, it continues executing long after your CPU code returns. This creates a critical problem: how do you safely reuse GPU resources like buffers without corrupting data that's still being used? The BufferBelt solves this by tracking when the GPU finishes with each buffer chunk through sync points, enabling efficient resource recycling.

The BufferBelt maintains two pools: active buffers currently being filled, and buffers waiting for GPU completion. When allocating space:

  1. First it tries to fit your request in an active buffer - this is fastest as no GPU synchronization is needed
  2. If that fails, it searches the recycled pool for buffers the GPU has finished with
  3. The key check is gpu.wait_for(sp, 0) - a non-blocking poll that asks "is the GPU done?"
  4. If no recycled buffers are ready, it allocates a brand-new chunk from the GPU

The zero timeout is crucial - it means "check and return immediately", preventing the CPU from stalling while waiting for the GPU. This design enables high-throughput applications to continuously submit work without blocking.
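
A sketch of the recycling check, again against the simplified trait; the real BufferBelt also tracks chunk sizes and offsets:

```rust
// Sketch of the recycling poll described above, using the simplified trait;
// `B` stands in for a buffer chunk, and size/offset bookkeeping is omitted.
fn try_reclaim<D: CommandDevice, B>(gpu: &D, pending: &mut Vec<(B, D::SyncPoint)>) -> Option<B> {
    // Non-blocking poll: a zero timeout means "answer immediately".
    if let Some(idx) = pending.iter().position(|(_, sp)| gpu.wait_for(sp, 0)) {
        return Some(pending.swap_remove(idx).0);
    }
    None // the caller falls through to allocating a fresh chunk
}
```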

Texture Cleanup in EGUI

When EGUI renders UI elements, it creates GPU textures for fonts and images. These textures have a lifecycle: they're created, used for rendering, then eventually become obsolete when fonts change or images are updated. The critical problem is that the GPU might still be reading from a texture when the CPU tries to delete it, which would cause crashes or visual corruption.

The texture deletion system in EGUI works as a two-phase cleanup:

  1. Mark for deletion: When textures become obsolete, they're added to textures_to_delete with their associated GPU sync point

  2. Safe deletion check: The triage_deletions() function periodically checks if the GPU has finished using each texture by calling context.wait_for(sp, 0) with a zero timeout. This is a non-blocking poll:

    If wait_for returns true, the GPU has finished and the texture can be safely destroyed
    If wait_for returns false, the GPU is still using the texture and it must be kept for a later pass

  3. Actual deletion: Only textures whose GPU work has completed are destroyed through destroy_texture_view() and destroy_texture() calls.

This approach ensures GPU safety without blocking the rendering thread, as the zero-timeout wait never stalls execution.
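
A sketch of that two-phase cleanup against the simplified trait; T and the destroy closure stand in for the real texture handle type and the destroy_texture_view/destroy_texture calls:

```rust
// Sketch of the two-phase texture cleanup described above, using the
// simplified trait; `T` and `destroy` stand in for the real texture types
// and destruction calls.
fn triage_deletions<D: CommandDevice, T>(
    gpu: &D,
    textures_to_delete: &mut Vec<(T, D::SyncPoint)>,
    destroy: impl Fn(T),
) {
    let mut still_in_use = Vec::new();
    for (texture, sp) in textures_to_delete.drain(..) {
        if gpu.wait_for(&sp, 0) {
            destroy(texture); // GPU has finished with it
        } else {
            still_in_use.push((texture, sp)); // check again next frame
        }
    }
    *textures_to_delete = still_in_use;
}
```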

Shader Hot Reload

When shaders are hot-reloaded during development, the GPU might still be executing graphics commands that reference the old shader code. Destroying shaders while they're in use would cause undefined behavior and crashes. The renderer needs to wait for all GPU work to complete before replacing shaders, ensuring no GPU commands reference outdated resources.

The shader hot reload system blocks indefinitely until the GPU finishes processing the current frame. This happens through gpu.wait_for(sync_point, !0) where !0 represents an infinite timeout. The sync point tracks when all previously submitted GPU commands have completed.

Once the wait succeeds, the renderer joins any background shader compilation tasks and proceeds to update the shader pipelines.
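
A short sketch of that barrier against the simplified trait; joining the compilation tasks and rebuilding the pipelines are elided:

```rust
// Short sketch of the hot-reload barrier described above, using the
// simplified trait; joining compile tasks and rebuilding pipelines is elided.
fn before_shader_swap<D: CommandDevice>(gpu: &D, last_submission: &D::SyncPoint) {
    // Block until all previously submitted GPU commands complete, so nothing
    // in flight still references the pipelines that are about to be replaced.
    let _ = gpu.wait_for(last_submission, !0);
    // ...join background shader compilation tasks and update pipelines...
}
```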


P.S. This is a follow-up to #248, inspired by zed-industries/zed#43070
