Problem
When `cache_on_device: true` is set in the estimator loop config, `CachedDataLoader` moves the entire training buffer to GPU as torch tensors. For large observation vectors (e.g., 1M bins), this silently exhausts GPU memory.
PyTorch/CUDA does not raise an OOM error in this case; instead the process stalls indefinitely while CUDA's memory allocator retries internally. The driver shows no error, and training appears hung at "Initializing model...".
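For illustration only (this is not the actual `CachedDataLoader` code), the failure mode corresponds to a blanket transfer of every buffer entry with no size check and no fallback path, roughly:

```python
import torch

# Hypothetical sketch of the problematic pattern, not the repo's implementation:
# every key, including the 1M-bin observation arrays, is pushed to the GPU at once.
def cache_buffer_on_device(buffer: dict, device: torch.device) -> dict:
    return {key: torch.as_tensor(value).to(device) for key, value in buffer.items()}
```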
Observed behavior
- `05_linear_regression` with 1024 samples × 1M bins, `cache_on_device: true`
- `nvidia-smi` shows 40209/40960 MiB used (A100-40GB), 0% GPU util
- Process stalls for 15+ minutes with no error or progress
Expected behavior
Either:
- Detect insufficient GPU memory before attempting the transfer and fall back to CPU caching with a warning (see the sketch after this list)
- Only cache small tensors (theta, logprob) on GPU; keep large observation vectors on CPU
- Raise an explicit error if the buffer exceeds a configurable GPU memory budget
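A minimal sketch of the first option, assuming a helper with hypothetical names (`maybe_cache_on_device`, `safety_margin`) that checks free GPU memory via `torch.cuda.mem_get_info` before transferring and otherwise keeps the cache on CPU with a warning:

```python
import warnings
import torch

def maybe_cache_on_device(buffer: dict, device: torch.device,
                          safety_margin: float = 0.8) -> dict:
    """Move the buffer to the device only if it fits in a fraction of free GPU memory."""
    tensors = {key: torch.as_tensor(value) for key, value in buffer.items()}
    needed = sum(t.element_size() * t.nelement() for t in tensors.values())
    free_bytes, _total = torch.cuda.mem_get_info(device)
    if needed > safety_margin * free_bytes:
        warnings.warn(
            f"Training buffer needs ~{needed / 2**30:.1f} GiB but only "
            f"{free_bytes / 2**30:.1f} GiB of GPU memory is free; caching on CPU instead."
        )
        return tensors
    return {key: t.to(device) for key, t in tensors.items()}
```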
Suggested approach
- Add a `max_gpu_cache_bytes` threshold (or per-key size check) so large observation arrays stay on CPU
- Wrap the `.to(device)` in a try/except for `torch.cuda.OutOfMemoryError` and fall back to CPU with a warning (sketched below)
- Consider making `cache_on_device` accept a list of key patterns rather than a blanket bool
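A sketch combining the first two points: a per-key size threshold plus an OOM-guarded transfer. `max_gpu_cache_bytes` is the proposed (not yet existing) config field, and `torch.cuda.OutOfMemoryError` is available in recent PyTorch releases (older versions raise a plain `RuntimeError`):

```python
import warnings
import torch

def cache_tensor(value, device: torch.device,
                 max_gpu_cache_bytes: int = 512 * 2**20):
    t = torch.as_tensor(value)
    # Per-key size check: large observation arrays never leave the CPU.
    if t.element_size() * t.nelement() > max_gpu_cache_bytes:
        return t
    try:
        return t.to(device)
    except torch.cuda.OutOfMemoryError:
        warnings.warn(f"GPU cache transfer of tensor with shape {tuple(t.shape)} "
                      "hit OOM; falling back to CPU.")
        torch.cuda.empty_cache()
        return t
```

Note that in the stall described above no exception is raised at all, so the size threshold (or the pre-transfer check sketched earlier) is what actually prevents the hang; the try/except only covers cases where the allocator fails fast.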
Context
The torch tensor cache (`CachedDataLoader`) was introduced to speed up batch sampling. CPU-side torch tensors already provide ~5x speedup over numpy. The GPU cache is an optional optimization that only makes sense when the buffer fits comfortably in GPU memory alongside the model.
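For reference, the pattern the CPU-side cache speeds up (an illustrative sketch, not the actual `CachedDataLoader` internals): after a one-time conversion, batch sampling is plain tensor indexing instead of a per-batch numpy slice plus tensor construction.

```python
import numpy as np
import torch

x_np = np.random.default_rng(0).standard_normal((1024, 10_000), dtype=np.float32)

# One-time (zero-copy) conversion to a CPU torch tensor -- the existing CPU cache.
x_cpu = torch.from_numpy(x_np)

idx = torch.randint(0, x_cpu.shape[0], (64,))
batch = x_cpu[idx]                        # stays in torch, no per-batch conversion

# Without the cache, every batch pays a numpy fancy-index plus tensor construction:
batch_slow = torch.as_tensor(x_np[idx.numpy()])
```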