Introduce capture pool for cuda graph replay #3352
Looking for feedback on this approach to making Candle compatible with CUDA graphs. We've tried a couple of approaches internally, and I think this one strikes the best balance given what we have access to in cudarc.
Motivation
CUDA graph capture records the exact kernel launch parameters and any memcpy nodes created during capture. In candle's CUDA backend we were building small, temporary `Vec<usize>` argument buffers (dims/strides/etc.) and uploading them with `clone_htod` as part of capture. Those host vectors were dropped immediately after capture, and the device buffers could also be freed, leaving subsequent graph launches referencing invalid host/device memory. In practice this shows up as crashes/segfaults when replaying captured graphs.
What This Does
Adds a capture pool owned by `CudaDevice` that holds both:

- the host-side argument buffers (`Vec<usize>`) used for HtoD memcpy nodes during capture, and
- the device-side buffers (`CudaSlice<usize>`) holding those parameters.

Argument uploads are routed through `CudaDevice::capture_htod_slice(...)` when a stream is actively capturing, so the allocations outlive capture and remain valid for subsequent graph launches.

By tying the lifetime of these host/device argument buffers to the `CudaDevice` (via the capture pool), the memory referenced by captured graph nodes stays allocated after capture completes. This prevents graph replays from dereferencing freed host pointers (for captured HtoD copies) or freed device pointers (for kernel argument buffers), eliminating invalid-memory accesses across launches.

Notes / Follow-ups
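To make the lifetime argument concrete, here is a minimal sketch of the capture-pool idea in pure Rust (no CUDA): the device object owns every argument buffer allocated while a stream is capturing, so a pointer recorded by a graph node stays valid after the call site's temporaries are gone. `CapturePool`, `Device`, and `DeviceSlice` are illustrative stand-ins, not candle's or cudarc's actual types; the real change routes through cudarc's `CudaSlice<usize>`.

```rust
use std::cell::RefCell;

/// Stand-in for a device allocation (`CudaSlice<usize>` in the real code).
struct DeviceSlice(Vec<usize>);

#[derive(Default)]
struct CapturePool {
    host: Vec<Vec<usize>>,    // host buffers referenced by captured HtoD copies
    device: Vec<DeviceSlice>, // device buffers referenced by captured kernels
}

struct Device {
    capturing: bool,
    pool: RefCell<CapturePool>,
}

impl Device {
    /// Analogue of the pool-backed `capture_htod_slice(...)`: while a stream
    /// is capturing, both the host staging vector and the device slice are
    /// moved into the pool, tying their lifetime to the device rather than
    /// to the (short-lived) call site.
    fn capture_htod_slice(&self, args: Vec<usize>) -> *const usize {
        assert!(self.capturing, "only used while a stream is capturing");
        let mut pool = self.pool.borrow_mut();
        pool.device.push(DeviceSlice(args.clone()));
        pool.host.push(args);
        // Pointer the graph node records; valid for as long as `self` lives.
        pool.device.last().unwrap().0.as_ptr()
    }
}

fn main() {
    let dev = Device {
        capturing: true,
        pool: RefCell::new(CapturePool::default()),
    };
    let node_ptr = {
        let dims = vec![4usize, 8, 16]; // temporary built during capture
        dev.capture_htod_slice(dims)
    }; // the temporary's contents now live in the pool, not the call site
    // A later "replay" read is safe because the pool still owns the buffer.
    assert_eq!(unsafe { *node_ptr }, 4);
    println!("replay read: {}", unsafe { *node_ptr });
}
```

Without the pool, `dims` would be freed at the end of its block and the recorded pointer would dangle; with it, dropping the `Device` is the only thing that invalidates captured nodes, which matches when the graphs themselves stop being replayed.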
This is a functional correctness fix to make CUDA graph replay safe. If we can get cudarc to natively support arena/pooled allocation, we can probably clean up this design (see chelsea0x3b/cudarc#519).