Performance optimizations to the data loading pipeline:

1. Switch `to_tensor` from float64 to float32
   - Halves memory bandwidth for all particle tensors
   - Faster GPU operations (especially on consumer GPUs with slow FP64)
2. Vectorize `batch_idx` construction in `get_batch`
   - Replace Python for-loop with `torch.repeat_interleave`
   - O(1) tensor ops instead of O(n_events) Python iterations
3. Vectorize PID one-hot encoding
   - Replace O(n_particles * n_pids) Python loops with dict lookup + scatter
   - Single-pass vectorized assignment
4. Vectorize `renumber_clusters`
   - Use `torch.unique(return_inverse=True)` instead of Python loop + index table
   - Eliminates temporary mapping tensor allocation
5. Replace debug prints with logging in `get_batch`
   - Removes per-batch `print()` calls that flush stdout on every iteration
   - Switched to the Python `logging` module (debug/warning levels)
6. Optimize `EventDatasetCollection.get_idx` with `np.searchsorted`
   - O(log n) binary search vs O(n) linear scan over dataset thresholds
7. Optimize `concat_event_collection`
   - Skip `torch.cat` for single-element batches (avoids an unnecessary copy)
8. Vectorize `add_batch_number`
   - Use `torch.cumsum` + `torch.repeat_interleave` instead of a Python loop
9. Improve DataLoader configuration
   - Enable `shuffle=True` for training (better convergence)
   - Add adaptive `prefetch_factor` based on batch_size/num_workers
   - Increase default `num_workers` from 1 to 4

Also includes test updates for the float32 change.

Co-authored-by: Gregor Kržmanc <gregor.krzmanc@cern.ch>
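The batch-index construction in item 2 (and the same pattern in item 8) can be sketched as follows. This is a minimal illustration with made-up particle counts, not the actual `get_batch` code:

```python
import torch

# Hypothetical sketch: build a per-particle batch index for a batch of
# events with varying particle counts (counts are illustrative values).
counts = torch.tensor([3, 1, 2])  # particles per event

# Python-loop version (the pattern being replaced): one tensor per event.
batch_idx_loop = torch.cat(
    [torch.full((n,), i, dtype=torch.long) for i, n in enumerate(counts.tolist())]
)

# Vectorized version: a single kernel call instead of O(n_events) iterations.
batch_idx = torch.repeat_interleave(torch.arange(len(counts)), counts)

print(batch_idx)  # tensor([0, 0, 0, 1, 2, 2])
assert torch.equal(batch_idx, batch_idx_loop)
```

The same `cumsum` + `repeat_interleave` combination generalizes to any case where per-group values must be expanded to per-element values.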
Summary
Nine targeted optimizations to the data loading and preprocessing pipeline that reduce per-batch overhead and improve GPU utilization during training.
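For a sense of the kind of change involved, the dataset-threshold lookup (item 6 below) replaces a linear scan with a binary search. A minimal sketch, assuming `thresholds` holds cumulative event counts per dataset (the variable names and boundary values here are illustrative, not the actual `EventDatasetCollection` code):

```python
import numpy as np

# Hypothetical cumulative event counts: dataset 0 holds events [0, 100),
# dataset 1 holds [100, 250), dataset 2 holds [250, 400).
thresholds = np.array([100, 250, 400])

def get_idx(global_idx):
    # O(log n) binary search over the thresholds instead of an O(n) scan.
    ds = int(np.searchsorted(thresholds, global_idx, side="right"))
    local = global_idx - (thresholds[ds - 1] if ds > 0 else 0)
    return ds, int(local)

print(get_idx(0))    # (0, 0)
print(get_idx(120))  # (1, 20)
```

With `side="right"`, an index equal to a threshold falls into the next dataset, which matches half-open `[start, end)` ranges.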
Optimizations
1. `functions_data.py` — float64 → float32 tensor construction
2. `functions_data.py` — `batch_idx` via `torch.repeat_interleave`
3. `functions_data.py` — vectorized PID one-hot encoding
4. `functions_data.py` — `renumber_clusters` via `torch.unique(return_inverse=True)`: one kernel instead of a Python loop
5. `functions_data.py` — replaced `print()` with logging
6. `dataset.py` — `EventDatasetCollection.get_idx` via `np.searchsorted`: O(log n) binary search vs O(n) linear scan
7. `functions_data.py` — `concat_event_collection`: skip `torch.cat` for single-element batches
8. `functions_data.py` — `add_batch_number` via `torch.cumsum` + `repeat_interleave` instead of a Python loop + list append
9. `train_utils.py`, `parser_args.py` — `shuffle=True` for training, adaptive `prefetch_factor`, default `num_workers` from 1 to 4

Why these matter
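The `renumber_clusters` change (item 4) relies on the `return_inverse` output of `torch.unique`, which already is a dense relabeling. A minimal sketch with made-up cluster labels, not the actual function:

```python
import torch

# Hypothetical input: arbitrary integer cluster labels that should be
# renumbered to a dense 0..k-1 range.
labels = torch.tensor([7, 7, 2, 9, 2])

# For each element, return_inverse gives the index of its value in the
# sorted unique set -- a dense renumbering in a single vectorized pass,
# with no Python loop or temporary mapping table.
_, renumbered = torch.unique(labels, return_inverse=True)

print(renumbered)  # tensor([1, 1, 0, 2, 0])
```

Note that the new labels follow the sorted order of the original values; if the original first-appearance order must be preserved, an extra remapping step is needed.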
The training loop spends significant time in CPU-side data preprocessing (`get_batch`, `concat_events`, tensor construction) between GPU forward/backward passes. These optimizations target the hottest code paths:

- `get_batch` runs on every training step: vectorizing `batch_idx` and the PID encoding and removing the print statements directly reduces wall-clock time per step
- `renumber_clusters` is called on every batch during filtering: the vectorized version avoids Python-level iteration and temporary tensor allocation

Testing