Add single-node DDP support, distributed utils, samplers, and gradient accumulation #7
Draft
agporto wants to merge 1 commit into code-review from
Motivation
This PR adds single-node multi-GPU (DDP) training, including `DistributedSampler` support and cross-rank gather/aggregation.

Description
- Added distributed utilities to `bioencoder/core/utils.py`, including `is_distributed`, `get_rank`, `get_world_size`, `is_main_process`, `init_distributed`, and `teardown_distributed`, plus a `safe_all_gather_cat` for gathering variable-length tensors (illustrative sketches of these helpers and the related loader/training changes follow this list).
- Extended `build_loaders` to accept `distributed`, `rank`, and `world_size` and to create `DistributedSampler` instances when enabled, and updated the dataset loaders to accept provided samplers.
- Updated the training/validation helpers to take a `device` parameter and to perform distributed gathering/aggregation (`compute_embeddings`, `validation_constructive`, `validation_ce`), as well as adding mixed-precision and gradient accumulation logic in `train_epoch_constructive` and `train_epoch_ce`.
- Checkpoint loading now uses `map_location`, and the `build_model`/script callers (`lr_finder.py`, `swa.py`, `train.py`) construct and use a `device` object instead of calling `.cuda()` directly.
- Updated `train.py` with `torch.nn.parallel.DistributedDataParallel` wrapping, optional `SyncBatchNorm` conversion, per-epoch sampler epoch setting, rank-aware seeding (`set_seed` now accepts `rank_offset`), main-process-only logging/tensorboard/writes, and proper distributed teardown.
- Added a `distributed` configuration block to `bioencoder_configs/train_stage1.yml` and `train_stage2.yml`, and documented single-node multi-GPU usage in `help/03-training.md`.
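For reviewers' convenience, here is a minimal sketch of what distributed helpers with these names usually look like, assuming a `torchrun`-style launch; the actual implementations in `bioencoder/core/utils.py` may differ in signatures and edge-case handling:

```python
import os
import torch
import torch.distributed as dist

def is_distributed() -> bool:
    # True once torch.distributed has been initialised (e.g. via torchrun).
    return dist.is_available() and dist.is_initialized()

def get_rank() -> int:
    return dist.get_rank() if is_distributed() else 0

def get_world_size() -> int:
    return dist.get_world_size() if is_distributed() else 1

def is_main_process() -> bool:
    return get_rank() == 0

def init_distributed(backend: str = "nccl") -> torch.device:
    # torchrun exports RANK / LOCAL_RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    if "RANK" in os.environ and int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend=backend)
        torch.cuda.set_device(local_rank)
    if torch.cuda.is_available():
        return torch.device("cuda", local_rank)
    return torch.device("cpu")

def teardown_distributed() -> None:
    if is_distributed():
        dist.barrier()
        dist.destroy_process_group()

def safe_all_gather_cat(tensor: torch.Tensor) -> torch.Tensor:
    """Gather variable-length tensors from all ranks and concatenate along dim 0."""
    if not is_distributed():
        return tensor
    world_size = get_world_size()
    # Exchange per-rank lengths first, then pad to the max length so all_gather works.
    local_len = torch.tensor([tensor.shape[0]], device=tensor.device)
    lengths = [torch.zeros_like(local_len) for _ in range(world_size)]
    dist.all_gather(lengths, local_len)
    max_len = int(torch.stack(lengths).max())
    padded = torch.zeros((max_len, *tensor.shape[1:]), dtype=tensor.dtype, device=tensor.device)
    padded[: tensor.shape[0]] = tensor
    gathered = [torch.zeros_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    # Trim the padding back off using the true per-rank lengths.
    return torch.cat([g[: int(l)] for g, l in zip(gathered, lengths)], dim=0)
```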
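A sketch of how `build_loaders` can create `DistributedSampler` instances when distributed training is enabled; the argument names follow the description above, but the exact signature and the choice to return the train sampler are assumptions:

```python
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

def build_loaders(train_ds: Dataset, valid_ds: Dataset, batch_size: int,
                  num_workers: int = 4, distributed: bool = False,
                  rank: int = 0, world_size: int = 1):
    train_sampler = valid_sampler = None
    if distributed:
        # Each rank sees a disjoint shard; shuffling is delegated to the sampler.
        train_sampler = DistributedSampler(train_ds, num_replicas=world_size,
                                           rank=rank, shuffle=True)
        valid_sampler = DistributedSampler(valid_ds, num_replicas=world_size,
                                           rank=rank, shuffle=False)
    train_loader = DataLoader(train_ds, batch_size=batch_size,
                              shuffle=(train_sampler is None),
                              sampler=train_sampler, num_workers=num_workers,
                              pin_memory=True, drop_last=distributed)
    valid_loader = DataLoader(valid_ds, batch_size=batch_size, shuffle=False,
                              sampler=valid_sampler, num_workers=num_workers,
                              pin_memory=True)
    # The sampler is returned so the training loop can call set_epoch() each epoch.
    return train_loader, valid_loader, train_sampler
```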
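The mixed-precision and gradient-accumulation logic in `train_epoch_ce`/`train_epoch_constructive` follows the standard AMP recipe; the function below is an illustrative stand-in, not the actual implementation:

```python
import torch

def train_epoch_ce(model, loader, optimizer, criterion, device,
                   accumulation_steps: int = 1, use_amp: bool = True):
    """Illustrative CE training epoch with AMP and gradient accumulation."""
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    running_loss = 0.0
    for step, (images, targets) in enumerate(loader, start=1):
        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = criterion(model(images), targets)
        # Divide by accumulation_steps so the effective gradient matches one
        # large batch; backward() accumulates into the .grad buffers.
        scaler.scale(loss / accumulation_steps).backward()
        if step % accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
        running_loss += loss.item()
    # A trailing partial accumulation (len(loader) not divisible by
    # accumulation_steps) would need one final optimizer step in a real version.
    return running_loss / max(len(loader), 1)
```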
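Rank-aware seeding via the new `rank_offset` argument could look like the following; only the signature change is stated in this PR, the body here is an assumption:

```python
import random
import numpy as np
import torch

def set_seed(seed: int, rank_offset: int = 0) -> None:
    # Offsetting the seed by the process rank keeps runs reproducible while
    # giving each rank a distinct stream (e.g. for data augmentation).
    seed = seed + rank_offset
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```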
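Finally, a sketch of the `train.py` wiring (SyncBatchNorm conversion, DDP wrapping, per-epoch `set_epoch`, main-process-only logging, teardown), reusing the helper names from the first sketch; the real script is structured differently and also handles validation, checkpointing, etc.:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def run_training(model, train_loader, train_sampler, epochs, distributed_enabled):
    # init_distributed / is_distributed / is_main_process / teardown_distributed
    # refer to the helper sketch above; one CUDA device per rank is assumed.
    device = (init_distributed() if distributed_enabled
              else torch.device("cuda" if torch.cuda.is_available() else "cpu"))
    model = model.to(device)
    if distributed_enabled and is_distributed():
        # Keep BatchNorm statistics consistent across ranks, then wrap in DDP.
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
        model = DDP(model, device_ids=[device.index])
    for epoch in range(epochs):
        if train_sampler is not None:
            # Reshuffle each rank's shard differently every epoch.
            train_sampler.set_epoch(epoch)
        ...  # train_epoch_*, validation, checkpointing go here
        if is_main_process():
            print(f"epoch {epoch} done")  # rank-0-only logging / tensorboard / file writes
    if distributed_enabled and is_distributed():
        teardown_distributed()
```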
Testing

- Ran `bioencoder.scripts.train` in `--dry-run` mode on a single GPU; the dry-run completed successfully.
- Ran the `lr_finder` and `swa` scripts in a single-GPU environment against a small dataset as an automated smoke check; both ran and returned expected outputs.
- No `torchrun`/DDP CI job was run here, but the distributed code paths were covered by the smoke tests when `distributed.enabled` was set to `False`, and by unit-like checks for device/`map_location` handling (see the sketch below); those checks passed.
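The device/`map_location` handling exercised by those checks follows the usual PyTorch pattern of constructing an explicit `torch.device` and loading checkpoints onto it instead of calling `.cuda()`; the sketch below is illustrative, and the checkpoint key and path are hypothetical:

```python
import torch

def load_checkpoint(model: torch.nn.Module, ckpt_path: str, device: torch.device) -> torch.nn.Module:
    # map_location keeps single-GPU, CPU, and per-rank loads on the right device
    # instead of always deserialising onto cuda:0.
    state = torch.load(ckpt_path, map_location=device)
    model.load_state_dict(state["model"] if "model" in state else state)
    return model.to(device)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# model = load_checkpoint(build_model(config), "checkpoints/stage1_best.pt", device)  # illustrative path
```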