@vijk777 vijk777 commented Jan 27, 2026

Summary

  • Replaces RandomChunkLoader with PipelineChunkLoader using 3-stage parallelism
  • Enables overlap of disk I/O, CPU→GPU transfer, and training via double buffering (gpu_prefetch=2)

Architecture

```
cpu_loader thread:   disk -> cpu (pinned memory) -> cpu_queue
gpu_transfer thread: cpu_queue -> gpu -> gpu_queue
main thread:         gpu_queue -> training
```
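
A minimal sketch of this shape in PyTorch (the `PipelineChunkLoader` name and `gpu_prefetch` parameter come from this PR; the chunk format, queue sizes, and loading details below are illustrative assumptions, not the actual implementation):

```python
import queue
import threading

import numpy as np
import torch


class PipelineChunkLoader:
    """3-stage pipeline: disk -> pinned CPU memory -> GPU (sketch only)."""

    def __init__(self, chunk_paths, device="cuda", gpu_prefetch=2):
        self.chunk_paths = chunk_paths
        self.device = device
        # Bounded queues give back-pressure; gpu_prefetch=2 is double buffering.
        self.cpu_queue = queue.Queue(maxsize=2)
        self.gpu_queue = queue.Queue(maxsize=gpu_prefetch)
        threading.Thread(target=self._cpu_loader, daemon=True).start()
        threading.Thread(target=self._gpu_transfer, daemon=True).start()

    def _cpu_loader(self):
        # Stage 1: read each memory-mapped chunk into pinned host memory so
        # the later device copy can be asynchronous (non_blocking=True).
        for path in self.chunk_paths:
            arr = np.load(path, mmap_mode="r")
            cpu = torch.from_numpy(np.asarray(arr)).pin_memory()
            self.cpu_queue.put(cpu)
        self.cpu_queue.put(None)  # sentinel: no more chunks

    def _gpu_transfer(self):
        # Stage 2: copy pinned tensors to the device on a side stream.
        stream = torch.cuda.Stream()
        while (cpu := self.cpu_queue.get()) is not None:
            with torch.cuda.stream(stream):
                gpu = cpu.to(self.device, non_blocking=True)
            stream.synchronize()  # tensor must be ready before handing off
            self.gpu_queue.put(gpu)
        self.gpu_queue.put(None)

    def __iter__(self):
        # Stage 3 (main thread): consume device tensors as they arrive.
        while (gpu := self.gpu_queue.get()) is not None:
            yield gpu
```

The bounded queues are the key design choice: the disk thread can run at most a couple of chunks ahead, and at most `gpu_prefetch` chunks occupy GPU memory at once.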

Profiling

  • Chrome tracing profiler for pipeline visualization (pipeline_trace.json)
  • Events: epoch, chunk, train, gpu_queue_wait, disk_load, gpu_transfer
  • Profiles only the first 5 epochs to limit file size
  • View traces at chrome://tracing or https://ui.perfetto.dev/
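
pipeline_trace.json uses Chrome's Trace Event Format, which is plain JSON. A minimal profiler in that spirit (`PipelineProfiler` and the event names come from this PR; this implementation is a guess at its shape):

```python
import json
import threading
import time
from contextlib import contextmanager


class PipelineProfiler:
    """Records 'complete' events ("ph": "X") in Chrome's Trace Event Format."""

    def __init__(self, enabled=False):
        self.enabled = enabled
        self.events = []
        self.lock = threading.Lock()

    @contextmanager
    def event(self, name):
        if not self.enabled:
            yield
            return
        start = time.perf_counter_ns() // 1000  # trace timestamps are in us
        try:
            yield
        finally:
            dur = time.perf_counter_ns() // 1000 - start
            with self.lock:
                self.events.append({
                    "name": name, "ph": "X", "ts": start, "dur": dur,
                    "pid": 0, "tid": threading.get_ident(),
                })

    def save(self, path="pipeline_trace.json"):
        with open(path, "w") as f:
            json.dump({"traceEvents": self.events}, f)
```

Wrapping each stage body in `with profiler.event("disk_load"): ...` (and likewise for the train and chunk events on the main thread, as the commits below do) produces one row of bars per thread in chrome://tracing, which is what makes the overlap visible.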

vijk777 and others added 7 commits January 27, 2026 08:59
Introduces a new chunk loader that overlaps disk I/O, CPU->GPU transfer,
and training using a 3-thread pipeline architecture:
- cpu_loader thread: disk -> cpu_queue (pinned memory)
- gpu_transfer thread: cpu_queue -> gpu_queue (device tensors)
- main thread: gpu_queue -> training

Key changes:
- Add PipelineChunkLoader class with configurable gpu_prefetch parameter
- Add comprehensive test suite including 10s timing stress test
- Update load_dataset() to use PipelineChunkLoader (wiring sketched after this commit message)
- Enable double buffering (gpu_prefetch=2) in latent_stag.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
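
The wiring might look like this (`load_dataset()` and `PipelineChunkLoader` are real names from the PR; the signature, chunk paths, and `train_on_chunk` are illustrative):

```python
def load_dataset(chunk_paths, gpu_prefetch=2):
    # gpu_prefetch=2 enables double buffering: the main thread trains on
    # one GPU-resident chunk while the next is being transferred.
    return PipelineChunkLoader(chunk_paths, device="cuda",
                               gpu_prefetch=gpu_prefetch)


for chunk in load_dataset(["chunk_000.npy", "chunk_001.npy"]):
    train_on_chunk(chunk)  # hypothetical training step; chunk is on the GPU
```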
- add PipelineProfiler class for recording pipeline events
- profile disk_load, gpu_transfer, gpu_queue_wait, and epoch events
- save trace to pipeline_trace.json for viewing in chrome://tracing
- profiler is disabled by default, enabled in latent_stag.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
adds "train" event on main thread to show when training is happening,
making it easy to visualize overlap with data loading/transfer.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
wraps each chunk's processing (get_next_chunk + train) with a "chunk"
event, making it easy to see chunk boundaries in the trace.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
profiles the first 5 epochs to capture warmup and steady-state
behavior while keeping file size manageable (~32KB vs ~650KB for
100 epochs).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
set gpu_prefetch=2 to overlap cpu->gpu transfer with training: with two
GPU-side buffers, the transfer thread fills one while the main thread
trains on the other.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
deletes:
- chunk_loader.py (replaced by pipeline_chunk_loader.py)
- chunk_loader_test.py
- test_prefetch_youtube.py
- CHUNKED_STREAMING_INTEGRATION.md (obsolete)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@vijk777 vijk777 merged commit ef1c0f0 into main Jan 27, 2026
1 check passed
@vijk777 vijk777 deleted the vj/double_buffer branch January 27, 2026 18:05
vijk777 commented Jan 27, 2026

This doesn't make a huge difference, but we now have full overlap of loading from disk, transfer to GPU memory, and training.
(screenshot: pipeline trace showing the overlap)

On the A100 machines:

  • loading a chunk (memory-mapped, copied into pinned memory), 64k x 13741 neurons (3.6 GB): ~1.5 s (~2.4 GB/s)
  • chunk transfer from pinned memory to GPU memory: ~200 ms (~18 GB/s)
  • training per chunk: ~7 s

Training time dominates.
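
A back-of-envelope check with those numbers, assuming perfect overlap:

```python
disk, transfer, train = 1.5, 0.2, 7.0   # seconds per 3.6 GB chunk (from above)
sequential = disk + transfer + train    # 8.7 s per chunk with no overlap
pipelined = max(disk, transfer, train)  # 7.0 s: the slowest stage gates
print(f"time saved per chunk: {1 - pipelined / sequential:.0%}")  # ~20%
```

So full overlap saves at most ~20% per chunk here, consistent with it not making a huge difference.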
