This was recently brought to my attention. I am glad that you are able to get better performance than standard fsspec.
First, a couple of notes:
- fsspec provides several in-memory caching mechanisms. The default is "readahead", which works well for more or less sequential reads but poorly for HDF5, whose access pattern jumps around the file. "first" is often better if most of the metadata is near the start of the file
- fsspec also has a file-based cache, either whole files or partial files
- the kerchunk project can scan the metadata of HDF5 files once, store it elsewhere (e.g., in a JSON file), and provide fast, parallel reads thereafter
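
To illustrate the first point, here is a minimal sketch that replays a made-up, HDF5-like access pattern through the "readahead" and "first" caches and counts how often each one has to go back to the underlying store. The in-memory `DATA`, the `READS` pattern and the `count_fetches` helper are assumptions for illustration only, and `_fetch` is the internal method that fsspec's buffered files call, so treat this as a toy comparison rather than a benchmark.

```python
from fsspec.caching import FirstChunkCache, ReadAheadCache

FILE_SIZE = 100_000
BLOCK_SIZE = 1024
DATA = bytes(i % 251 for i in range(FILE_SIZE))  # stand-in for a remote file

# Hypothetical pattern: metadata near the start, data chunks far away,
# with repeated trips back to the metadata.
READS = [(0, 64), (50_000, 50_064), (0, 64), (80_000, 80_064), (128, 256)]


def count_fetches(cache_cls):
    """Replay READS through a cache and count calls to the backing store."""
    calls = []

    def fetcher(start, end):
        # Each call here would be one remote range request in real use.
        calls.append((start, end))
        return DATA[start:end]

    cache = cache_cls(BLOCK_SIZE, fetcher, FILE_SIZE)
    for start, end in READS:
        # Check that the cache returns exactly the requested bytes.
        assert cache._fetch(start, end) == DATA[start:end]
    return len(calls)


readahead_calls = count_fetches(ReadAheadCache)
first_calls = count_fetches(FirstChunkCache)
print("readahead:", readahead_calls, "first:", first_calls)
```

In real use you would not call the cache classes directly; passing `cache_type="first"` (and a suitable `block_size`) when opening a remote file should select the same behaviour, and the file-based caches are available through URL chaining such as `filecache::` or `blockcache::`.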
Secondly, may I suggest that you consider upstreaming this code to fsspec, so that many users can get automatic access? It could even become the default caching mechanism for HDF5 in the same way that fsspec provides a parquet module optimised to that format.