Introduce capture pool for cuda graph replay #3352
Looking for feedback on this approach to making Candle compatible with CUDA graphs. We've tried a couple of approaches internally, and I think this one strikes the best balance given what we have access to in cudarc.
Motivation
CUDA graph capture records the exact kernel launch parameters and any memcpy nodes created during capture. In candle's CUDA backend we were building small, temporary `Vec<usize>` argument buffers (dims/strides/etc.) and uploading them with `clone_htod` as part of capture. Those host vectors were dropped immediately after capture, and the device buffers could also be freed, leaving subsequent graph launches referencing invalid host/device memory. In practice this shows up as crashes/segfaults when replaying captured graphs.
What This Does
Adds a capture pool owned by `CudaDevice` that holds both:

- the host-side argument buffers (`Vec<usize>`) used for HtoD memcpy nodes during capture, and
- the device-side buffers (`CudaSlice<usize>`) holding those parameters.

Argument uploads are routed through `CudaDevice::capture_htod_slice(...)` when a stream is actively capturing, so the allocations outlive capture and remain valid for subsequent graph launches.

By tying the lifetime of these host/device argument buffers to the `CudaDevice` (via the capture pool), the memory referenced by captured graph nodes stays allocated after capture completes. This prevents graph replays from dereferencing freed host pointers (for captured HtoD copies) or freed device pointers (for kernel argument buffers), eliminating invalid-memory accesses across launches.

Notes / Follow-ups
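To make the lifetime argument concrete, here is a minimal sketch of the capture-pool idea in pure Rust (no CUDA): the device object owns every argument buffer allocated while a stream is capturing, so a pointer recorded by a graph node stays valid after the call site's temporaries are gone. `CapturePool`, `Device`, and `DeviceSlice` are illustrative stand-ins, not candle's or cudarc's actual types; the real change routes through cudarc's `CudaSlice<usize>`.

```rust
use std::cell::RefCell;

/// Stand-in for a device allocation (`CudaSlice<usize>` in the real code).
struct DeviceSlice(Vec<usize>);

#[derive(Default)]
struct CapturePool {
    host: Vec<Vec<usize>>,    // host buffers referenced by captured HtoD copies
    device: Vec<DeviceSlice>, // device buffers referenced by captured kernels
}

struct Device {
    capturing: bool,
    pool: RefCell<CapturePool>,
}

impl Device {
    /// Analogue of the pool-backed `capture_htod_slice(...)`: while a stream
    /// is capturing, both the host staging vector and the device slice are
    /// moved into the pool, tying their lifetime to the device rather than
    /// to the (short-lived) call site.
    fn capture_htod_slice(&self, args: Vec<usize>) -> *const usize {
        assert!(self.capturing, "only used while a stream is capturing");
        let mut pool = self.pool.borrow_mut();
        pool.device.push(DeviceSlice(args.clone()));
        pool.host.push(args);
        // Pointer the graph node records; valid for as long as `self` lives.
        pool.device.last().unwrap().0.as_ptr()
    }
}

fn main() {
    let dev = Device {
        capturing: true,
        pool: RefCell::new(CapturePool::default()),
    };
    let node_ptr = {
        let dims = vec![4usize, 8, 16]; // temporary built during capture
        dev.capture_htod_slice(dims)
    }; // the temporary's contents now live in the pool, not the call site
    // A later "replay" read is safe because the pool still owns the buffer.
    assert_eq!(unsafe { *node_ptr }, 4);
    println!("replay read: {}", unsafe { *node_ptr });
}
```

Without the pool, `dims` would be freed at the end of its block and the recorded pointer would dangle; with it, dropping the `Device` is the only thing that invalidates captured nodes, which matches when the graphs themselves stop being replayed.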
This is a functional correctness fix to make CUDA graph replay safe. If we can get cudarc to natively support arena/pooled allocation, we can probably clean up this design (see chelsea0x3b/cudarc#519).