Motivation
For large per-sample data (MBs+), the CPU→GPU transfer in CachedDataLoader.sample_batch() becomes the dominant per-step cost. Currently, the transfer happens synchronously before each training step.
Profiling on 05_linear_regression (20K-dim observations, small network) showed ~1.8ms transfer per batch. With larger data and heavier networks (where forward+backward takes 50-100ms+), this transfer can be effectively hidden behind GPU compute.
Proposed approach
- Pre-allocate pinned buffers once per dataloader, sized to (batch_size, *sample_shape) — avoids per-batch pin_memory() allocation overhead
- Background thread copies the next batch into a pinned buffer and calls .to(device, non_blocking=True) — the DMA transfer releases the GIL, so it genuinely overlaps with GPU compute
- Double-buffering — one batch on the GPU being trained on while the next is being transferred (see the sketch after this list)
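A minimal sketch of how the three pieces could fit together, under some assumptions: batches arrive as fixed-shape float32 numpy arrays, and PinnedPrefetcher / next_batch_numpy are illustrative names, not the existing CachedDataLoader API.

```python
import threading

import torch


class PinnedPrefetcher:
    """Sketch of the proposed double-buffered prefetch.

    `next_batch_numpy` stands in for whatever produces the next CPU batch
    (e.g. the CPU side of CachedDataLoader.sample_batch); it must return
    fixed-shape float32 arrays.
    """

    def __init__(self, next_batch_numpy, batch_shape, device, dtype=torch.float32):
        self.next_batch_numpy = next_batch_numpy
        self.device = device
        # Two pinned staging buffers, allocated once per dataloader rather
        # than per batch (the allocation was the prototype's main overhead).
        self.pinned = [torch.empty(batch_shape, dtype=dtype, pin_memory=True)
                       for _ in range(2)]
        self.copy_done = [None, None]        # last H2D event per buffer
        self.copy_stream = torch.cuda.Stream(device=device)
        self.slot = 0
        self.thread = None
        self.result = None                   # (gpu_tensor, event) set by the worker
        self._launch()

    def _worker(self, slot):
        buf = self.pinned[slot]
        # Don't overwrite a pinned buffer whose previous DMA might still be in flight.
        if self.copy_done[slot] is not None:
            self.copy_done[slot].synchronize()
        # CPU-side copy into pinned memory, off the training thread.
        buf.copy_(torch.from_numpy(self.next_batch_numpy()))
        # Non-blocking H2D copy on a side stream, so the DMA can overlap
        # with compute running on the default stream.
        with torch.cuda.stream(self.copy_stream):
            gpu = buf.to(self.device, non_blocking=True)
            done = torch.cuda.Event()
            done.record(self.copy_stream)
        self.copy_done[slot] = done
        self.result = (gpu, done)

    def _launch(self):
        self.thread = threading.Thread(target=self._worker, args=(self.slot,))
        self.thread.start()
        self.slot ^= 1

    def next(self):
        # The worker ran during the previous training step, so join() should
        # return quickly; the compute stream then waits on the copy's event
        # without blocking the CPU.
        self.thread.join()
        gpu, done = self.result
        torch.cuda.current_stream(self.device).wait_event(done)
        self._launch()                       # prefetch the following batch
        return gpu
```

The per-buffer event synchronize is what makes buffer reuse safe without blocking the training thread in the common case, since each buffer's previous copy finished during an earlier step.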
Notes from prior attempt
A simpler version (per-batch pin_memory() + threading) was prototyped and benchmarked. It performed worse than synchronous transfer (7.84ms vs 7.44ms) because:
- Per-batch pinned memory allocation added ~0.4ms overhead
- The training step was too short (~3ms) for the overlap to help
- GIL contention on CPU-side numpy indexing
Pre-allocated pinned buffers would eliminate the allocation overhead, making this viable when GPU compute time is large enough to hide the transfer.
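To make the allocation point concrete, a hedged comparison of the two staging strategies (shape and names are illustrative; the tensor returned by reuse_pinned must not be consumed before its copy completes, which the double-buffering above handles):

```python
import numpy as np
import torch

device = torch.device("cuda")
batch = np.random.randn(256, 20_000).astype(np.float32)   # illustrative shape

# Prior prototype: allocates fresh pinned memory for every batch
# (the ~0.4ms/batch overhead noted above).
def per_batch_pin(batch):
    return torch.from_numpy(batch).pin_memory().to(device, non_blocking=True)

# Proposed: one pinned staging buffer allocated up front, reused every step.
staging = torch.empty(batch.shape, dtype=torch.float32, pin_memory=True)

def reuse_pinned(batch):
    staging.copy_(torch.from_numpy(batch))            # CPU copy into pinned memory
    return staging.to(device, non_blocking=True)      # async DMA, no allocation
```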
Files involved
- falcon/core/raystore.py — CachedDataLoader.sample_batch()
- falcon/contrib/stepwise_estimator.py — _train() would call enable_prefetch(device) (sketched below)
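A possible call pattern from _train(), assuming enable_prefetch(device) switches sample_batch() into the prefetching mode; everything other than enable_prefetch is a placeholder, not the existing code.

```python
# Hypothetical wiring inside stepwise_estimator._train(); only
# enable_prefetch(device) comes from this issue, the rest is illustrative.
def _train(self, dataloader, model, optimizer, device, num_steps):
    dataloader.enable_prefetch(device)        # start double-buffered prefetch
    for _ in range(num_steps):
        batch = dataloader.sample_batch()     # already on `device`; the copy
                                              # overlapped the previous step
        loss = model(batch).pow(2).mean()     # placeholder objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```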