Hi, authors!
I’m trying to fine-tune your excellent Aurora model on my downstream tasks, and I’m running into a serious I/O bottleneck with ERA5 due to the dataset size: data loading dominates each training step and GPU utilization stays low. At the moment I store each training sample as a separate HDF5 file, but I suspect the per-file overhead and random reads are killing throughput. Could you share how you handled the ERA5 “huge data + I/O bottleneck” problem in your pipeline?
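For concreteness, here is a minimal sketch of my current loading pattern (file paths and variable names like `t2m` are placeholders, not my actual pipeline) — every sample drawn in random order pays a fresh file-open plus metadata read:

```python
import glob

import h5py
import numpy as np


def load_sample(path):
    """Open one per-sample HDF5 file and read all of its datasets.

    One open()/close() per sample, so random shuffling turns every
    training step into small scattered reads.
    """
    with h5py.File(path, "r") as f:
        return {name: np.asarray(f[name]) for name in f.keys()}


# Placeholder directory; in training these paths are shuffled each epoch.
paths = sorted(glob.glob("era5_samples/*.h5"))
```

Is this per-sample-file layout the wrong approach, e.g. compared to consolidating samples into larger chunked stores?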
If you have scripts/configs or pointers in the repo for preprocessing and a recommended on-disk layout, I’d really appreciate them. Thanks!
Thank you again for your work and your time.