
Async CPU to GPU prefetch in CachedDataLoader for large data #20

@cweniger

Description

Motivation

For large per-sample data (MBs+), the CPU→GPU transfer in CachedDataLoader.sample_batch() becomes the dominant per-step cost. Currently, the transfer happens synchronously before each training step.

Profiling on 05_linear_regression (20K-dim observations, small network) showed ~1.8ms transfer per batch. With larger data and heavier networks (where forward+backward takes 50-100ms+), this transfer can be effectively hidden behind GPU compute.
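For reference, the synchronous baseline looks roughly like this (`loader`, `train_step`, `num_steps`, and `model` are placeholders, not Falcon names):

```python
for step in range(num_steps):
    batch = loader.sample_batch()    # assembles on CPU, then a blocking .to(device)
    loss = train_step(model, batch)  # GPU idled during the ~1.8ms copy above
```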

Proposed approach

  1. Pre-allocate pinned buffers once per dataloader, sized to (batch_size, *sample_shape) — avoids per-batch pin_memory() allocation overhead
  2. A background thread copies the next batch into a pinned buffer and calls .to(device, non_blocking=True) — the DMA transfer releases the GIL, so it genuinely overlaps with GPU compute
  3. Double-buffering — one batch is trained on the GPU while the next is transferred (see the sketch after this list)
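A minimal sketch of how the three pieces could fit together, assuming `sample_fn()` returns a fixed-shape CPU tensor. `PinnedPrefetcher`, `sample_fn`, and the dedicated copy stream (needed for the copy to actually overlap compute on the default stream) are all illustrative, not existing Falcon API:

```python
import threading

import torch


class PinnedPrefetcher:
    """Double-buffered CPU->GPU prefetcher with pre-allocated pinned buffers."""

    def __init__(self, sample_fn, batch_shape, device, dtype=torch.float32):
        self.sample_fn = sample_fn  # assumption: returns a CPU tensor of batch_shape
        self.device = device
        # Step 1: allocate two pinned staging buffers once, up front.
        self.buffers = [
            torch.empty(batch_shape, dtype=dtype, pin_memory=True) for _ in range(2)
        ]
        # Side stream so the H2D copy can overlap compute on the default stream.
        self.copy_stream = torch.cuda.Stream(device=device)
        self._idx = 0
        self._result = None
        self._launch()  # warm-up: start transferring the first batch

    def _worker(self):
        buf = self.buffers[self._idx]
        self._idx ^= 1
        buf.copy_(self.sample_fn())  # CPU-side fill; the GIL is held here
        # Step 2: issue the async DMA copy; non_blocking returns immediately.
        with torch.cuda.stream(self.copy_stream):
            gpu = buf.to(self.device, non_blocking=True)
        done = torch.cuda.Event()
        done.record(self.copy_stream)
        self._result = (gpu, done)

    def _launch(self):
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def next_batch(self):
        # Step 3: hand over the batch prefetched during the previous step and
        # immediately start transferring the one after it.
        self._thread.join()
        gpu, done = self._result
        done.synchronize()  # usually instant: the copy ran during the last step
        # Tell the caching allocator this tensor is now used on the default
        # stream, so its memory is not recycled on copy_stream too early.
        gpu.record_stream(torch.cuda.current_stream(self.device))
        self._launch()
        return gpu
```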

Notes from prior attempt

A simpler version (per-batch pin_memory() + threading) was prototyped and benchmarked; a rough reconstruction follows the list below. It performed worse than synchronous transfer (7.84ms vs 7.44ms) because:

  • Per-batch pinned memory allocation added ~0.4ms overhead
  • The training step was too short (~3ms) for the overlap to help
  • GIL contention on CPU-side numpy indexing
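To make the contrast concrete, the per-batch prototype would have looked roughly like this (hypothetical reconstruction; the actual prototype may have differed):

```python
import torch

def fetch_batch(sample_fn, device):
    # Called from a background thread once per batch in the prototype.
    cpu_batch = sample_fn()          # CPU-side numpy indexing holds the GIL
    pinned = cpu_batch.pin_memory()  # fresh pinned allocation each batch: ~0.4ms
    return pinned.to(device, non_blocking=True)
```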

Pre-allocated pinned buffers would eliminate the allocation overhead, making this viable when GPU compute time is large enough to hide the transfer.

Files involved

  • falcon/core/raystore.py: CachedDataLoader.sample_batch()
  • falcon/contrib/stepwise_estimator.py: _train() would call enable_prefetch(device) (usage sketched below)
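A possible call site in _train(), assuming enable_prefetch() allocates the pinned buffers and starts the worker while sample_batch() keeps its signature (this is the proposal, not current behavior; `train_step` and `num_steps` are placeholders):

```python
loader.enable_prefetch(device)       # one-time: pin buffers, start prefetch thread
for step in range(num_steps):
    batch = loader.sample_batch()    # returns a batch already on `device`
    loss = train_step(model, batch)  # next batch transfers while this step runs
```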
