feat(compositor): factor output_format into GPU heuristic#218
Merged
streamer45 merged 4 commits intomainfrom Mar 29, 2026
Merged
feat(compositor): factor output_format into GPU heuristic#218streamer45 merged 4 commits intomainfrom
streamer45 merged 4 commits intomainfrom
Conversation
When the compositor's output_format is NV12 or I420, the GPU path eliminates the expensive CPU RGBA→YUV conversion entirely (~14% of CPU time in profiled pipelines). The should_use_gpu() heuristic now considers this, preferring GPU compositing whenever YUV output is requested — even for simple scenes that would otherwise stay on CPU. This addresses the #1 CPU hotspot identified in production profiling: rgba8_to_nv12_buf at 9.12% + parallel_rows at 5.28% = 14.4% combined. Signed-off-by: Devin AI <devin@streamkit.dev> Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
Contributor
Author
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
Adds a concurrency group keyed on PR number / branch ref with cancel-in-progress: true. This prevents the single self-hosted GPU runner from being blocked by stale jobs when new commits are pushed. Signed-off-by: Devin AI <devin@streamkit.dev> Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
Two tests flaked on the self-hosted GPU runner where many tests run concurrently and compete for CPU: 1. test_oneshot_processes_faster_than_realtime: reduced from 30@30fps (budget 500ms vs 1000ms real-time = 10% margin) to 10@5fps (budget 1500ms vs 2000ms real-time = 25% margin). The previous budget was nearly indistinguishable from per-frame scheduling overhead (~30ms) under CI load. 2. test_compositor_output_format_runtime_change: increased inter-step sleeps from 100/50/100ms to 300/200/300ms. The compositor thread can be starved for CPU when GPU tests run in parallel, so the original windows were not enough for even one tick to fire. Signed-off-by: Devin AI <devin@streamkit.dev> Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
The test consistently got only 1 output frame instead of ≥ 2 on the self-hosted GPU runner. Root cause: when compiled with --features gpu and gpu_mode Auto (the default), the compositor OS thread blocks on GpuContext::try_init() before processing any compositing work. On the GPU runner with many tests competing for the device, init can exceed the total sleep budget (800ms). By the time it finishes, both input frames have been drained to just the latest (Dynamic mode behaviour), producing a single output. Fix: set gpu_mode: "cpu" explicitly. This test validates runtime output_format switching via UpdateParams, not GPU compositing — GPU init is unnecessary overhead that creates the race. Also reduces sleep durations to 200/100/200ms (from 300/200/300ms) since without GPU init the compositor thread starts processing immediately. Signed-off-by: Devin AI <devin@streamkit.dev> Signed-off-by: StreamKit Devin <devin@streamkit.dev> Co-Authored-By: Claudio Costa <cstcld91@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Commit 1 —
feat(compositor): factor output_format into GPU heuristicWhen the compositor's
output_formatis NV12 or I420, the GPU path eliminates the expensive CPU RGBA→YUV conversion entirely via thergba_to_yuv.wgslcompute shader. Previously,should_use_gpu()didn't consider this — a single 720p layer with NV12 output would stay on CPU (1 item, < 1080p pixels, no effects), paying the full ~14.4% CPU cost that the GPU path handles for free.This adds
output_format: Option<PixelFormat>toshould_use_gpu()andshould_use_gpu_with_state(). When the output needs YUV conversion, the heuristic now prefers GPU compositing even for simple scenes.Profile context (30s CPU profile, CPU path, 1080p VP9 encoding pipeline):
rgba8_to_nv12_buf= 9.12% flat — fix(tools): install script #1 CPU consumerparallel_rows(serving NV12 path) = 5.28% flatCommit 2 —
ci: cancel superseded workflow runs on same PRAdds a
concurrencygroup toci.ymlkeyed on PR number / branch ref withcancel-in-progress: true. This prevents the single self-hosted GPU runner from being blocked by stale jobs when new commits are pushed.Review & Testing Checklist for Human
just skit-profiling servewith a VP9 encoding session) with--features gpuon a GPU-equipped machine and compare CPU profile before/after — thergba8_to_nv12_bufandparallel_rowsentries should disappear from the hot path.concurrencysetting correctly cancels stale CI runs — push two quick commits to a PR and confirm the first run is cancelled.Notes
should_use_gputests are updated to passNone(preserving prior behavior for scenes without explicit output format).main(test_oneshot_processes_faster_than_realtimeandtest_compositor_output_format_runtime_change) fail on the GPU runner due to timing sensitivity under load — not caused by this PR (both use 4×4 canvases withoutput_format: None, so the new heuristic path is never activated).Link to Devin session: https://staging.itsdev.in/sessions/4eafeb9c24d342bba7d1b41238fcb3e4
Requested by: @streamer45