open_mfdataarray for a large number of Febus files
#37
First of all, what a great library - thank you for it! I'm trying to index a large number of Febus files (12,550 files, each 1 GB, containing 1 minute of DAS data over 1,081 channels sampled at 4 kHz) using open_mfdataarray like this: The metadata fetching is fast (~157 it/s), which I noticed is thanks to parallelization. However, linking the data array still takes quite some time, even though it's a VirtualStack. The time needed starts at about 1.5 s/iteration at 0% but increases steadily, reaching around 24 s/iteration at just 10% progress. Am I doing something wrong? Is there a way to speed up this second part? Thanks a lot for any tips or help!
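The call itself was not captured above. A minimal sketch of what it might look like, assuming xdas's open_mfdataarray with a Febus engine (the path pattern and engine name are assumptions, not taken from the original post):

```python
import glob

# Hypothetical location of the 12,550 one-minute Febus HDF5 files.
paths = sorted(glob.glob("/data/das/*.h5"))

# Hedged sketch of the xdas call; left commented out because the
# library and the data files may not be available where this runs:
# import xdas
# da = xdas.open_mfdataarray(paths, engine="febus")
```

Sorting the glob result keeps the files in chronological order when the filenames encode timestamps, which is usually what one wants before stacking them.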
Replies: 6 comments
Thank you for reporting this issue! I might have an idea where things are getting slow. Have you tried opening one or a few files and investigating the timing information? You can either use … Nevertheless, the aggregation should be faster; I need to change a few lines of code. Let me know if this is the case for you.
Might be related to #24.
Hi @Linvill. I changed a little something in the fix/faster-interpcoord-append branch. If you want, you can try to install that branch with

pip install git+https://github.com/xdas-dev/xdas.git@fix/faster-interpcoord-append

Tell me if it makes things faster.
Hi @atrabattoni, Your changes worked - I can now index all 12,550 files. Thank you very much for the quick fix! In my data, I'm seeing gaps of 0.250112 ms instead of the expected 0.250 ms (at 4 kHz sampling rate), occurring roughly every second. I've worked around this by setting the tolerance to 0.26 ms. However, I'm also seeing 1.000250112 s gaps every other minute. I'll now have to check how we're processing the data in the first place… But yes, your tool has been extremely helpful in giving us an overview, and thanks to the indexing, we can now start working systematically on the dataset. Right now, I wonder what happens if we fetch data during a gap; I guess I will find out soon! :) Thanks again!
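Irregular intervals like the 0.250112 ms steps described above can be located by inspecting the timestamp differences directly, independently of xdas. A self-contained sketch with synthetic values (the injected 112 ns offset mirrors the reported gap, but the timestamps here are made up):

```python
import numpy as np

fs = 4_000.0           # 4 kHz sampling rate
dt_ns = int(1e9 / fs)  # nominal spacing: 250_000 ns (0.250 ms)

# Synthetic int64 nanosecond timestamps with one oversized step,
# mimicking the reported 0.250112 ms interval (values are made up).
t = np.arange(10, dtype="int64") * dt_ns
t[5:] += 112           # one interval becomes 250_112 ns

steps = np.diff(t)
irregular = np.flatnonzero(steps != dt_ns)
# steps[irregular] shows the offending interval: 250_112 ns, i.e.
# 0.250112 ms, which a 0.26 ms tolerance would still accept.
```

Running a check like this over the real timestamps would show whether the anomalies really occur once per second, as observed.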
Happy to know that it worked! I will add this feature into version 0.2.3.
@Linvill: with @ClaudioStrumia we also realized that Febus only provides timing information that is µs-accurate. When casting those to ns timestamps, it creates those yyyy-mm-ddThh:mm:ss.000000017 kinds of timestamps. We fixed this and incorporated those changes in the … Let me know if it further eases working with Febus files.
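The µs-to-ns artifact can be reproduced with plain floats: a float64 second count only carries roughly µs precision at current epochs, so scaling it to nanoseconds leaves a few spurious nanoseconds unless the result is rounded back to the nearest microsecond. A self-contained illustration (the epoch value is arbitrary, not taken from the data):

```python
# A µs-accurate time expressed as float64 seconds (arbitrary value).
t = 1_700_000_000.000001  # intended: an exact 1 µs fraction

# A naive cast to integer nanoseconds picks up float rounding residue,
# yielding fractions like .000001024 instead of the intended .000001000.
ns_naive = int(t * 1e9)

# Rounding to the nearest microsecond removes the spurious nanoseconds.
ns_fixed = (ns_naive + 500) // 1000 * 1000
```

This is the same class of residue as the .000000017 timestamps mentioned above: the information below the microsecond is not real, so snapping to the µs grid before storing ns timestamps restores clean values.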