Object references  #10

@bjhardcastle

Thank you for creating Remfile - it makes opening lots of large NWB files for small metadata actually tolerable!

I've noticed that resolving object references appears to trigger eager reads of far more data than necessary:

import time
import h5py
import remfile

LARGE_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/f78/fe2/f78fe2a6-3dc9-4c12-a288-fbf31ce6fc1c'
SMALL_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/56c/31a/56c31a1f-a6fb-4b73-ab7d-98fb5ef9a553' # useful for testing quickly

url = LARGE_HDF5_URL

nwb = h5py.File(remfile.File(url), mode="r")

# this is an instance of <HDF5 object reference>:
object_reference = nwb['units/electrode_group'][0]

# the location `object_reference` points to (which can't be determined from the object reference itself)
url_to_actual_location = {
    LARGE_HDF5_URL: '/general/extracellular_ephys/17216703352 1-281',
    SMALL_HDF5_URL: '/general/extracellular_ephys/18005110031 1-281',
}

# 1. accessing the location directly and reading metadata is fast:
t0 = time.time()
_ = nwb[url_to_actual_location[url]].name
print(f"1. Time to get referenced object data directly: {time.time() - t0:.2f} s")

# 2. when using the object reference, a lazy accessor seems to be returned initially
# (which is fast):
t0 = time.time()
lazy_object_data = nwb[object_reference]
print(f"2. Time to get lazy object reference: {time.time() - t0:.2f} s")

# 3. reading the same attribute (`.name`) through the reference is much slower
#    than in 1., which suggests more data than necessary is being read
t0 = time.time()
reference_path = lazy_object_data.name
print(f"3. Time to get referenced object data: {time.time() - t0:.2f} s")
assert reference_path == url_to_actual_location[url]

# 4. resolving a second, different reference is then fast, supporting the idea
#    that more data than necessary was read (and cached) in 3.
t0 = time.time()
second_object_reference = nwb['units/electrode_group'][-1]
second_reference_path = nwb[second_object_reference].name
print(f"4. Time to get second referenced object data: {time.time() - t0:.2f} s")
assert second_reference_path != url_to_actual_location[url]

output:

1. Time to get referenced object data directly: 0.07 s
2. Time to get lazy object reference: 0.00 s
3. Time to get referenced object data: 119.33 s
4. Time to get second referenced object data: 0.01 s

I don't think this is an issue with Remfile itself: timings were similar when I followed @martindurant's suggestion of using fsspec with cache_type="first" (although that made the initial opening of the file much faster than with cache_type="readahead", almost as fast as Remfile). I was just hoping that you might have some insight into what's going on here, and a way to make it faster!
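One way to check the "more data than necessary" hypothesis from steps 3 and 4 more directly would be to count the bytes h5py requests, rather than inferring it from wall-clock time. The following is only a minimal sketch: `CountingReader` is a hypothetical helper (not part of Remfile, h5py or fsspec), and the demo uses an in-memory `io.BytesIO` as a stand-in for the remote file so that it runs without network access.

```python
import io


class CountingReader:
    """Hypothetical wrapper: forwards calls to a binary file-like object
    and records how many reads were issued and how many bytes came back."""

    def __init__(self, raw):
        self._raw = raw
        self.read_calls = 0
        self.bytes_read = 0

    def read(self, size=-1):
        data = self._raw.read(size)
        self.read_calls += 1
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()

    def close(self):
        self._raw.close()

    def reset(self):
        """Zero the counters, e.g. just before the step being measured."""
        self.read_calls = 0
        self.bytes_read = 0


# Stand-in for a remote file (1 KiB of zeros), so the sketch runs offline.
demo = CountingReader(io.BytesIO(bytes(1024)))
demo.seek(100)
demo.read(16)
demo.read(32)
print(demo.read_calls, demo.bytes_read)  # 2 48
```

Applied to the script above, this would presumably look like `counted = CountingReader(remfile.File(url))` followed by `nwb = h5py.File(counted, mode="r")`, with `counted.reset()` called before step 3 and `counted.bytes_read` printed after it. That assumes h5py is happy with a read-only file-like object exposing only `read`, `seek`, `tell` and `close`, which I have not verified for this wrapper. Note also that the counters reflect what h5py asks for, not what is fetched over the network, since Remfile or fsspec may download larger chunks behind each request.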
