@vijk777 vijk777 commented Jan 27, 2026

Summary

  • Replaces RandomChunkLoader with PipelineChunkLoader using 3-stage parallelism
  • Enables overlap of disk I/O, CPU→GPU transfer, and training via double buffering (gpu_prefetch=2)

Architecture

```
cpu_loader thread:   disk -> cpu (pinned memory) -> cpu_queue
gpu_transfer thread: cpu_queue -> gpu -> gpu_queue
main thread:         gpu_queue -> training
```
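
A minimal sketch of this shape in PyTorch (the `PipelineChunkLoader` name and `gpu_prefetch` parameter come from this PR; the chunk format, queue sizes, and loading details below are illustrative assumptions, not the actual implementation):

```python
import queue
import threading

import numpy as np
import torch


class PipelineChunkLoader:
    """3-stage pipeline: disk -> pinned CPU memory -> GPU (sketch only)."""

    def __init__(self, chunk_paths, device="cuda", gpu_prefetch=2):
        self.chunk_paths = chunk_paths
        self.device = device
        # Bounded queues give back-pressure; gpu_prefetch=2 is double buffering.
        self.cpu_queue = queue.Queue(maxsize=2)
        self.gpu_queue = queue.Queue(maxsize=gpu_prefetch)
        threading.Thread(target=self._cpu_loader, daemon=True).start()
        threading.Thread(target=self._gpu_transfer, daemon=True).start()

    def _cpu_loader(self):
        # Stage 1: read each memory-mapped chunk into pinned host memory so
        # the later device copy can be asynchronous (non_blocking=True).
        for path in self.chunk_paths:
            arr = np.load(path, mmap_mode="r")
            cpu = torch.from_numpy(np.asarray(arr)).pin_memory()
            self.cpu_queue.put(cpu)
        self.cpu_queue.put(None)  # sentinel: no more chunks

    def _gpu_transfer(self):
        # Stage 2: copy pinned tensors to the device on a side stream.
        stream = torch.cuda.Stream()
        while (cpu := self.cpu_queue.get()) is not None:
            with torch.cuda.stream(stream):
                gpu = cpu.to(self.device, non_blocking=True)
            stream.synchronize()  # tensor must be ready before handing off
            self.gpu_queue.put(gpu)
        self.gpu_queue.put(None)

    def __iter__(self):
        # Stage 3 (main thread): consume device tensors as they arrive.
        while (gpu := self.gpu_queue.get()) is not None:
            yield gpu
```

The bounded queues are the key design choice: the disk thread can run at most a couple of chunks ahead, and at most `gpu_prefetch` chunks occupy GPU memory at once.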

Profiling

  • Chrome tracing profiler for pipeline visualization (pipeline_trace.json)
  • Events: epoch, chunk, train, gpu_queue_wait, disk_load, gpu_transfer
  • Profiles only the first 5 epochs to limit file size
  • View traces at chrome://tracing or https://ui.perfetto.dev/
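
pipeline_trace.json uses Chrome's Trace Event Format, which is plain JSON. A minimal profiler in that spirit (`PipelineProfiler` and the event names come from this PR; this implementation is a guess at its shape):

```python
import json
import threading
import time
from contextlib import contextmanager


class PipelineProfiler:
    """Records 'complete' events ("ph": "X") in Chrome's Trace Event Format."""

    def __init__(self, enabled=False):
        self.enabled = enabled
        self.events = []
        self.lock = threading.Lock()

    @contextmanager
    def event(self, name):
        if not self.enabled:
            yield
            return
        start = time.perf_counter_ns() // 1000  # trace timestamps are in us
        try:
            yield
        finally:
            dur = time.perf_counter_ns() // 1000 - start
            with self.lock:
                self.events.append({
                    "name": name, "ph": "X", "ts": start, "dur": dur,
                    "pid": 0, "tid": threading.get_ident(),
                })

    def save(self, path="pipeline_trace.json"):
        with open(path, "w") as f:
            json.dump({"traceEvents": self.events}, f)
```

Wrapping each stage body in `with profiler.event("disk_load"): ...` (and likewise for the train and chunk events on the main thread, as the commits below do) produces one row of bars per thread in chrome://tracing, which is what makes the overlap visible.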

vijk777 and others added 7 commits January 27, 2026 08:59
Introduces a new chunk loader that overlaps disk I/O, CPU->GPU transfer,
and training using a 3-thread pipeline architecture:
- cpu_loader thread: disk -> cpu_queue (pinned memory)
- gpu_transfer thread: cpu_queue -> gpu_queue (device tensors)
- main thread: gpu_queue -> training

Key changes:
- Add PipelineChunkLoader class with configurable gpu_prefetch parameter
- Add comprehensive test suite including 10s timing stress test
- Update load_dataset() to use PipelineChunkLoader (wiring sketched after this commit message)
- Enable double buffering (gpu_prefetch=2) in latent_stag.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
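
The wiring might look like this (`load_dataset()` and `PipelineChunkLoader` are real names from the PR; the signature, chunk paths, and `train_on_chunk` are illustrative):

```python
def load_dataset(chunk_paths, gpu_prefetch=2):
    # gpu_prefetch=2 enables double buffering: the main thread trains on
    # one GPU-resident chunk while the next is being transferred.
    return PipelineChunkLoader(chunk_paths, device="cuda",
                               gpu_prefetch=gpu_prefetch)


for chunk in load_dataset(["chunk_000.npy", "chunk_001.npy"]):
    train_on_chunk(chunk)  # hypothetical training step; chunk is on the GPU
```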
- add PipelineProfiler class for recording pipeline events
- profile disk_load, gpu_transfer, gpu_queue_wait, and epoch events
- save trace to pipeline_trace.json for viewing in chrome://tracing
- profiler is disabled by default, enabled in latent_stag.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
adds "train" event on main thread to show when training is happening,
making it easy to visualize overlap with data loading/transfer.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
wraps each chunk's processing (get_next_chunk + train) with a "chunk"
event, making it easy to see chunk boundaries in the trace.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
profiles the first 5 epochs to capture warmup and steady-state
behavior while keeping file size manageable (~32KB vs ~650KB for
100 epochs).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
set gpu_prefetch=2 to overlap cpu->gpu transfer with training: with two
GPU-side buffers, the transfer thread fills one while the main thread
trains on the other.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
deletes:
- chunk_loader.py (replaced by pipeline_chunk_loader.py)
- chunk_loader_test.py
- test_prefetch_youtube.py
- CHUNKED_STREAMING_INTEGRATION.md (obsolete)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@vijk777 vijk777 merged commit ef1c0f0 into main Jan 27, 2026
1 check passed
@vijk777 vijk777 deleted the vj/double_buffer branch January 27, 2026 18:05
vijk777 commented Jan 27, 2026

This doesn't make a huge difference, but we now have full overlap of loading from disk, transfer to GPU memory, and training.
(screenshot: pipeline trace showing the overlap)

On the A100 machines:

  • loading a chunk (memory-mapped, copied into pinned memory), 64k x 13741 neurons (3.6 GB): ~1.5 s (~2.4 GB/s)
  • chunk transfer from pinned memory to GPU memory: ~200 ms (~18 GB/s)
  • training per chunk: ~7 s

Training time dominates.
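
A back-of-envelope check with those numbers, assuming perfect overlap:

```python
disk, transfer, train = 1.5, 0.2, 7.0   # seconds per 3.6 GB chunk (from above)
sequential = disk + transfer + train    # 8.7 s per chunk with no overlap
pipelined = max(disk, transfer, train)  # 7.0 s: the slowest stage gates
print(f"time saved per chunk: {1 - pipelined / sequential:.0%}")  # ~20%
```

So full overlap saves at most ~20% per chunk here, consistent with it not making a huge difference.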
