Object references 

Thank you for creating Remfile - it makes opening lots of large NWB files for small metadata actually tolerable!

I've noticed that dereferencing an object reference seems to trigger an eager read of far more data than necessary:
```python
import time
import h5py
import remfile

LARGE_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/f78/fe2/f78fe2a6-3dc9-4c12-a288-fbf31ce6fc1c'
SMALL_HDF5_URL = 'https://dandiarchive.s3.amazonaws.com/blobs/56c/31a/56c31a1f-a6fb-4b73-ab7d-98fb5ef9a553' # useful for testing quickly

url = LARGE_HDF5_URL

nwb = h5py.File(remfile.File(url), mode="r")

# this is an instance of <HDF5 object reference>:
object_reference = nwb['units/electrode_group'][0]

# the location `object_reference` points to (which can't be determined from the object reference itself)
url_to_actual_location = {
    LARGE_HDF5_URL: '/general/extracellular_ephys/17216703352 1-281',
    SMALL_HDF5_URL: '/general/extracellular_ephys/18005110031 1-281',
}

# 1. accessing the location directly and reading metadata is fast:
t0 = time.time()
_ = nwb[url_to_actual_location[url]].name
print(f"1. Time to get referenced object data directly: {time.time() - t0:.2f} s")

# 2. when using the object reference, a lazy accessor seems to be returned initially
# (which is fast):
t0 = time.time()
lazy_object_data = nwb[object_reference]
print(f"2. Time to get lazy object reference: {time.time() - t0:.2f} s")

# 3. when the same component is accessed, it is much slower than in 1. - suggests
#    more data than necessary is being read
t0 = time.time()
reference_path = lazy_object_data.name
print(f"3. Time to get referenced object data: {time.time() - t0:.2f} s")
assert reference_path == url_to_actual_location[url]

# 4. subsequent access of a different component is fast - supporting the idea that
#    more data than necessary is being read (and cached) in 3.
t0 = time.time()
second_object_reference = nwb['units/electrode_group'][-1]
second_reference_path = nwb[second_object_reference].name
print(f"4. Time to get second referenced object data: {time.time() - t0:.2f} s")
assert second_reference_path != url_to_actual_location[url]
```
Output:
```
1. Time to get referenced object data directly: 0.07 s
2. Time to get lazy object reference: 0.00 s
3. Time to get referenced object data: 119.33 s
4. Time to get second referenced object data: 0.01 s
```

I don't think this is an issue with Remfile itself: timings were similar with @martindurant's [suggestion](https://github.com/magland/remfile/issues/9) of using fsspec with `cache_type="first"` (though the initial opening of the file was much faster than with `cache_type="readahead"`, and almost as fast as Remfile). I was just hoping that you might have some insight into what's going on here, and a way to make it faster!
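In case it helps with diagnosing this, here is a rough sketch of how the "too much data is being read" hypothesis could be checked directly rather than inferred from timings. `CountingReader` is a hypothetical helper of my own (not part of Remfile, fsspec or h5py) that wraps any binary file-like object and tallies the bytes requested through `read()`. I'm assuming h5py only needs `read`, `seek` and `tell` for read-only access to a file-like object; if it needs more, those methods would have to be forwarded as well.

```python
import io


class CountingReader:
    """Hypothetical helper: wraps a binary file-like object and counts reads."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0   # total bytes returned by read()
        self.read_calls = 0   # number of read() calls

    def read(self, size=-1):
        data = self._raw.read(size)
        self.bytes_read += len(data)
        self.read_calls += 1
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)

    def tell(self):
        return self._raw.tell()


# Stand-in for a remote file, so the sketch runs without network access.
reader = CountingReader(io.BytesIO(b"x" * 1000))
reader.seek(100)
chunk = reader.read(50)
print(f"{reader.read_calls} read call(s), {reader.bytes_read} bytes")
```

With the real file, the idea would be to open `h5py.File(CountingReader(remfile.File(url)), mode="r")` and compare `bytes_read` before and after step 3. Note this counts what h5py asks for, not what is actually fetched over the network, so it would only show whether the large read originates on the HDF5 side.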
