Conversation

@zackangelo (Contributor)

Looking for feedback on this approach to begin making Candle compatible with CUDA graphs. We've tried a couple of approaches internally, and I think this one strikes the best balance with what we have access to in cudarc.

Motivation

CUDA graph capture records the exact kernel launch parameters and any memcpy nodes created during capture. In candle’s CUDA backend we were building small, temporary Vec<usize> argument buffers (dims/strides/etc.) and uploading them with clone_htod as part of capture. Those host vectors were dropped immediately after capture, and the device buffers could also be freed, leaving subsequent graph launches to reference invalid host/device memory.

In practice this will show up as crashes/segfaults when replaying captured graphs.

What This Does

  • Introduces a CUDA-graph capture pool on CudaDevice that owns both:
    • the host-side parameter vectors (Vec<usize>) used for HtoD memcpy nodes during capture, and
    • the corresponding device buffers (CudaSlice<usize>) holding those parameters.
  • Updates the CUDA backend’s parameter uploads (dims/strides/etc.) to route through CudaDevice::capture_htod_slice(...) when a stream is actively capturing, so the allocations outlive capture and remain valid for subsequent graph launches.

By tying the lifetime of these host/device argument buffers to the CudaDevice (via the capture pool), the memory referenced by captured graph nodes stays allocated after capture completes. This prevents graph replays from dereferencing freed host pointers (for captured HtoD copies) or freed device pointers (for kernel argument buffers), eliminating invalid-memory accesses across launches.
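As a rough illustration of the ownership pattern, here is a minimal sketch (not Candle's actual code): `DeviceBuf`, `CapturePool`, `Device`, `htod_copy`, and the `capturing` flag below are simplified stand-ins for cudarc's `CudaSlice<usize>`, the new capture pool, `CudaDevice`, the real HtoD copy, and the stream's capture-status check. Only the lifetime handling is the point: a capture-aware upload parks both halves of the copy in a pool owned by the device.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Placeholder for a device allocation (cudarc's `CudaSlice<usize>` in the real backend).
#[allow(dead_code)]
struct DeviceBuf {
    len: usize,
}

/// Pool that keeps capture-time parameter buffers alive for the lifetime of the device.
#[derive(Default)]
struct CapturePool {
    /// Host vectors backing HtoD memcpy nodes recorded during capture.
    host: Mutex<Vec<Vec<usize>>>,
    /// Device buffers referenced by captured kernel launches.
    device: Mutex<Vec<Arc<DeviceBuf>>>,
}

#[derive(Default)]
struct Device {
    pool: CapturePool,
    /// Stand-in for querying whether a stream is actively capturing.
    capturing: AtomicBool,
}

impl Device {
    /// Stand-in for the real (async) host-to-device copy.
    fn htod_copy(&self, params: &[usize]) -> Arc<DeviceBuf> {
        Arc::new(DeviceBuf { len: params.len() })
    }

    /// Capture-aware parameter upload: while a stream is capturing, both the
    /// host vector and the device buffer are parked in the device-owned pool,
    /// so the pointers recorded by the captured graph nodes stay valid after
    /// the caller's temporaries are dropped.
    fn capture_htod_slice(&self, params: Vec<usize>) -> Arc<DeviceBuf> {
        let buf = self.htod_copy(&params);
        if self.capturing.load(Ordering::Acquire) {
            self.pool.host.lock().unwrap().push(params);
            self.pool.device.lock().unwrap().push(Arc::clone(&buf));
        }
        buf
    }
}

fn main() {
    let dev = Device::default();
    dev.capturing.store(true, Ordering::Release);

    // Dims/strides for some op, uploaded while a stream is capturing.
    let dims_dev = dev.capture_htod_slice(vec![2, 3, 4]);
    drop(dims_dev);

    // The pooled copies (not the caller) now own the buffers the graph references.
    assert_eq!(dev.pool.host.lock().unwrap().len(), 1);
    assert_eq!(dev.pool.device.lock().unwrap().len(), 1);
}
```

The key design point is the same as described above: ownership of these argument buffers moves from per-op temporaries to the device itself, so graph replays never see freed memory.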

Notes / Follow-ups

This is a functional correctness fix to make CUDA graph replay safe. If we can get cudarc to natively support arena/pooled allocation, we can probably clean up this design (see chelsea0x3b/cudarc#519).
