Motivation
For large per-sample data (MBs+), the CPU→GPU transfer in CachedDataLoader.sample_batch() becomes the dominant per-step cost. Currently, the transfer happens synchronously before each training step.
Profiling on 05_linear_regression (20K-dim observations, small network) showed ~1.8ms transfer per batch. With larger data and heavier networks (where forward+backward takes 50-100ms+), this transfer can be effectively hidden behind GPU compute.
Proposed approach
- Pre-allocate pinned buffers once per dataloader, sized to (batch_size, *sample_shape) — avoids per-batch pin_memory() allocation overhead
- Background thread copies the next batch into a pinned buffer and calls .to(device, non_blocking=True) — the DMA transfer releases the GIL, so it genuinely overlaps with GPU compute
- Double-buffering — one batch on the GPU being trained on while the next is being transferred (see the sketch after this list)
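A minimal sketch of how the three pieces could fit together, under some assumptions: batches arrive as fixed-shape float32 numpy arrays, and PinnedPrefetcher / next_batch_numpy are illustrative names, not the existing CachedDataLoader API.

```python
import threading

import torch


class PinnedPrefetcher:
    """Sketch of the proposed double-buffered prefetch.

    `next_batch_numpy` stands in for whatever produces the next CPU batch
    (e.g. the CPU side of CachedDataLoader.sample_batch); it must return
    fixed-shape float32 arrays.
    """

    def __init__(self, next_batch_numpy, batch_shape, device, dtype=torch.float32):
        self.next_batch_numpy = next_batch_numpy
        self.device = device
        # Two pinned staging buffers, allocated once per dataloader rather
        # than per batch (the allocation was the prototype's main overhead).
        self.pinned = [torch.empty(batch_shape, dtype=dtype, pin_memory=True)
                       for _ in range(2)]
        self.copy_done = [None, None]        # last H2D event per buffer
        self.copy_stream = torch.cuda.Stream(device=device)
        self.slot = 0
        self.thread = None
        self.result = None                   # (gpu_tensor, event) set by the worker
        self._launch()

    def _worker(self, slot):
        buf = self.pinned[slot]
        # Don't overwrite a pinned buffer whose previous DMA might still be in flight.
        if self.copy_done[slot] is not None:
            self.copy_done[slot].synchronize()
        # CPU-side copy into pinned memory, off the training thread.
        buf.copy_(torch.from_numpy(self.next_batch_numpy()))
        # Non-blocking H2D copy on a side stream, so the DMA can overlap
        # with compute running on the default stream.
        with torch.cuda.stream(self.copy_stream):
            gpu = buf.to(self.device, non_blocking=True)
            done = torch.cuda.Event()
            done.record(self.copy_stream)
        self.copy_done[slot] = done
        self.result = (gpu, done)

    def _launch(self):
        self.thread = threading.Thread(target=self._worker, args=(self.slot,))
        self.thread.start()
        self.slot ^= 1

    def next(self):
        # The worker ran during the previous training step, so join() should
        # return quickly; the compute stream then waits on the copy's event
        # without blocking the CPU.
        self.thread.join()
        gpu, done = self.result
        torch.cuda.current_stream(self.device).wait_event(done)
        self._launch()                       # prefetch the following batch
        return gpu
```

The per-buffer event synchronize is what makes buffer reuse safe without blocking the training thread in the common case, since each buffer's previous copy finished during an earlier step.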
Notes from prior attempt
A simpler version (per-batch pin_memory() + threading) was prototyped and benchmarked. It performed worse than synchronous transfer (7.84ms vs 7.44ms) because:
- Per-batch pinned memory allocation added ~0.4ms overhead
- The training step was too short (~3ms) for the overlap to help
- GIL contention on CPU-side numpy indexing
Pre-allocated pinned buffers would eliminate the allocation overhead, making this viable when GPU compute time is large enough to hide the transfer.
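To make the allocation point concrete, a hedged comparison of the two staging strategies (shape and names are illustrative; the tensor returned by reuse_pinned must not be consumed before its copy completes, which the double-buffering above handles):

```python
import numpy as np
import torch

device = torch.device("cuda")
batch = np.random.randn(256, 20_000).astype(np.float32)   # illustrative shape

# Prior prototype: allocates fresh pinned memory for every batch
# (the ~0.4ms/batch overhead noted above).
def per_batch_pin(batch):
    return torch.from_numpy(batch).pin_memory().to(device, non_blocking=True)

# Proposed: one pinned staging buffer allocated up front, reused every step.
staging = torch.empty(batch.shape, dtype=torch.float32, pin_memory=True)

def reuse_pinned(batch):
    staging.copy_(torch.from_numpy(batch))            # CPU copy into pinned memory
    return staging.to(device, non_blocking=True)      # async DMA, no allocation
```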
Files involved
- falcon/core/raystore.py — CachedDataLoader.sample_batch()
- falcon/contrib/stepwise_estimator.py — _train() would call enable_prefetch(device) (sketched below)
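A possible call pattern from _train(), assuming enable_prefetch(device) switches sample_batch() into the prefetching mode; everything other than enable_prefetch is a placeholder, not the existing code.

```python
# Hypothetical wiring inside stepwise_estimator._train(); only
# enable_prefetch(device) comes from this issue, the rest is illustrative.
def _train(self, dataloader, model, optimizer, device, num_steps):
    dataloader.enable_prefetch(device)        # start double-buffered prefetch
    for _ in range(num_steps):
        batch = dataloader.sample_batch()     # already on `device`; the copy
                                              # overlapped the previous step
        loss = model(batch).pow(2).mean()     # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```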