Thank you for creating Remfile - it makes opening lots of large NWB files for small metadata actually tolerable!
I've noticed that dereferencing object references seems to trigger eager reads of far more data than necessary:
import time
import h5py
import remfile
LARGE_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/f78/fe2/f78fe2a6-3dc9-4c12-a288-fbf31ce6fc1c'
SMALL_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/56c/31a/56c31a1f-a6fb-4b73-ab7d-98fb5ef9a553' # useful for testing quickly
url = LARGE_HDF5_URL
nwb = h5py.File(remfile.File(url), mode="r")
# this is an instance of <HDF5 object reference>:
object_reference = nwb['units/electrode_group'][0]
# the location `object_reference` points to (which can't be determined from the object reference itself)
url_to_actual_location = {
    LARGE_HDF5_URL: '/general/extracellular_ephys/17216703352 1-281',
    SMALL_HDF5_URL: '/general/extracellular_ephys/18005110031 1-281',
}
# 1. accessing the location directly and reading metadata is fast:
t0 = time.time()
_ = nwb[url_to_actual_location[url]].name
print(f"1. Time to get referenced object data directly: {time.time() - t0:.2f} s")
# 2. when using the object reference, a lazy accessor seems to be returned initially
# (which is fast):
t0 = time.time()
lazy_object_data = nwb[object_reference]
print(f"2. Time to get lazy object reference: {time.time() - t0:.2f} s")
# 3. when the same component is accessed, it is much slower than in 1. - suggests
# more data than necessary is being read
t0 = time.time()
reference_path = lazy_object_data.name
print(f"3. Time to get referenced object data: {time.time() - t0:.2f} s")
assert reference_path == url_to_actual_location[url]
# 4. subsequent access of a different component is fast - supporting the idea that
# more data than necessary is being read (and cached) in 3.
t0 = time.time()
second_object_reference = nwb['units/electrode_group'][-1]
second_reference_path = nwb[second_object_reference].name
print(f"4. Time to get second referenced object data: {time.time() - t0:.2f} s")
assert second_reference_path != url_to_actual_location[url]
output:
1. Time to get referenced object data directly: 0.07 s
2. Time to get lazy object reference: 0.00 s
3. Time to get referenced object data: 119.33 s
4. Time to get second referenced object data: 0.01 s
I don't think this is an issue with Remfile itself: times were similar when I followed @martindurant's suggestion of using fsspec with cache_type="first" (which made the initial opening of the file much faster than cache_type="readahead", almost as fast as Remfile). I was hoping you might have some insight into what's going on here, and a way to make it faster!
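To check the "more data than necessary" hunch directly rather than inferring it from timings, one option is to count the bytes requested at each step. Below is a minimal sketch of a counting wrapper. `CountingFile` is a hypothetical helper, not part of Remfile or h5py, and I have only exercised it against an in-memory buffer; it assumes h5py needs nothing beyond `read`, `seek`, `tell` and `close` from a read-only file-like object, which may not hold for every h5py version.

```python
import io


class CountingFile:
    """Hypothetical helper: wrap a binary file-like object and tally reads."""

    def __init__(self, raw):
        self._raw = raw
        self.n_reads = 0      # number of read() calls
        self.bytes_read = 0   # total bytes returned by read()

    def read(self, size=-1):
        data = self._raw.read(size)
        self.n_reads += 1
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def close(self):
        return self._raw.close()

    def reset(self):
        """Zero the counters, e.g. between the numbered steps of the script."""
        self.n_reads = 0
        self.bytes_read = 0


if __name__ == "__main__":
    # Offline demonstration with an in-memory buffer standing in for the remote file.
    counted = CountingFile(io.BytesIO(b"x" * 1000))
    counted.seek(100)
    counted.read(50)
    counted.read(2000)  # asks for more than the 850 bytes that remain
    print(counted.n_reads, counted.bytes_read)
```

In the script above, the idea would be to open the file as `h5py.File(CountingFile(remfile.File(url)), mode="r")`, call `reset()` before steps 1 and 3, and compare `bytes_read` afterwards. Note that this counts what h5py asks for, not what Remfile actually fetches over the network, since Remfile does its own chunking and caching underneath.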