Features of eitprocessing can be, and are, reused within a single pipeline. An example is BreathDetection (BD), which can be used by other features (EELI, TIV). We have thought of, and partly implemented, some solutions to prevent re-running breath detection. However, these come with some disadvantages.
To move forward to a system of reproducible pipelines, I'd like to propose a single workflow that works for all cases.
Run all intermediate steps yourself
A simple solution is to run intermediate steps yourself before continuing to the next step. This adds steps for the end user: they have to run BD with the proper settings themselves and pass the results to e.g. TIV. This is not conducive to easy and reproducible pipelines.
Save/load results with a predictable label
BD results are saved to a sequence with a predictable label, based on the input. E.g. BD of "filtered impedance" data gets the label "breath detection - filtered impedance" and is stored in the sequence. When BD is run, it first checks whether the result already exists. If so, it returns that instead of recalculating. This is a very simple way of caching, with some downsides. First, you have to provide the sequence when running BD so that the results can be stored. This makes the interface less approachable. Second, it does not care about any changes in the data. You might have an entirely different object with the same label, resulting in getting the cached results from a different object. Third, this adds data to the sequence object that is potentially unwanted. E.g., you may want to store relevant results to disk, but not the BD data.
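For illustration, the label-based approach boils down to something like the sketch below. A plain dict stands in for the sequence's data store; the real interface differs, and the function here is not the actual implementation.

```python
def find_breaths_with_label_cache(bd, data_store: dict, data, data_label: str):
    """Sketch of label-based caching; `data_store` is a stand-in for the sequence."""
    label = f"breath detection - {data_label}"  # predictable label based on the input
    if label in data_store:
        # result already exists: return it as-is, even if `data` has changed since
        return data_store[label]
    breaths = bd.find_breaths(data)
    data_store[label] = breaths                 # store the result for later reuse
    return breaths
```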
I propose a new (well, previously conceived, but never implemented) design.
Automatic caching of intermediate steps
I propose no results are automatically added to a sequence. My suggestion is to have end users only manually store results (sequence.data.add(some_result)). We could make this slightly easier by adding a .store() method to all objects (BD().find_breaths(some_Data).store(sequence)).
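A .store() method could be as small as the mixin sketched below. Only sequence.data.add(...) is taken from the proposal above; everything else about the interface is an assumption.

```python
class Storable:
    """Sketch of a mixin that lets any result object store itself on a sequence."""

    def store(self, sequence):
        sequence.data.add(self)  # manually store this result on the sequence
        return self              # returning self keeps chaining possible:
                                 # BD().find_breaths(some_Data).store(sequence)
```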
This means that to use the results of reused algorithms, we have to cache them. Python has several ways of automatically caching intermediate steps. The obvious built-in approach is the LRU cache from functools. You apply this cache to a method (e.g. find_breaths). The cache takes the arguments that were passed to the method, and the return values of the method, and saves them in a hash map. When running the method with the same arguments, the results are retrieved from the cache and returned.
The LRU component automatically removes values that are not often used if the cache becomes too full.
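A minimal, self-contained sketch of that mechanism (the class and parameter below are stand-ins, not the real implementation):

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)              # frozen + eq makes the instance hashable
class BreathDetectionSketch:
    minimum_duration: float = 0.5    # illustrative parameter

    @lru_cache(maxsize=32)
    def find_breaths(self, data):
        # `self` and `data` together form the cache key, so both must be hashable
        print("running breath detection")  # only printed on a cache miss
        return f"breaths in {data!r}"


bd = BreathDetectionSketch()
bd.find_breaths("filtered impedance")  # computes and caches
bd.find_breaths("filtered impedance")  # returned from the cache without recomputation
```

One caveat of functools.lru_cache on methods is that the cache holds a reference to every argument it has seen, including self, so cached objects are not garbage collected while they sit in the cache.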
This feature hinges on the inputs of the methods being hashable. For a generic method, e.g. BreathDetection().find_breaths(), the inputs are a) the BreathDetection object, and b) one or several DataContainers.
A BreathDetection object can be made hashable by freezing it (https://docs.python.org/3/library/dataclasses.html#frozen-instances), so it cannot be altered after initialization (see https://docs.python.org/3/library/dataclasses.html#module-contents under unsafe_hash). This would mean that none of these objects can be altered after creation. This is a small downside, but not insurmountable.
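In practice that would behave roughly as follows (a sketch, not the actual class):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BreathDetectionSketch:
    minimum_duration: float = 0.5    # illustrative parameter

bd = BreathDetectionSketch()
print(hash(bd))                      # works: frozen + eq generates a value-based __hash__

try:
    bd.minimum_duration = 1.0        # altering after creation is no longer possible
except FrozenInstanceError:
    print("frozen instances cannot be altered")
```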
DataContainers are not yet hashable, and making them hashable is not a trivial task. We have to think about what the hash of a DataContainer comprises. Is it purely the contained data (e.g. time and values, or intervals and values), or also aspects like the label, description, etc.? These will then have to be frozen or locked before being used in analysis (or frozen/locked as soon as they are used for analysis).
On locking and freezing
Ideally, DataContainer objects (ContinuousData, IntervalData, SparseData and EITData) don't change at all after initialization. This would make it easier to work with hashes, etc., and would force the end user to get all components together before creating the object. It would, however, remove the possibility to e.g. manually remove a single breath from some data.
Dataclasses support creating frozen objects, which removes the option to set attributes after initialization. This does not, however, protect against changing the attributes themselves, e.g. appending elements to a list. We have previously worked on freezing objects, which locks numpy arrays (so they cannot be changed), but does not protect against changing e.g. lists.
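A small demonstration of that limitation (a sketch, not eitprocessing code):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenContainer:
    values: list = field(default_factory=list)

container = FrozenContainer()
# container.values = [1, 2, 3]   # this would raise FrozenInstanceError
container.values.append(1)       # but mutating the attribute itself still works
print(container.values)          # [1]

try:
    hash(container)              # the generated __hash__ also fails on a list field
except TypeError:
    print("a mutable (unhashable) field breaks hashing as well")
```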
To make DataContainers hashable, it would make sense to:
prevent the usage of lists and prefer numpy arrays, tuples or sets
replace unhashable dicts with e.g. named tuples or config classes
find a way to properly hash numpy arrays
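Put together, a hashable container could look something like the sketch below. The field names and types are assumptions, not the actual DataContainer definition.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Parameters(NamedTuple):        # hashable replacement for a parameters dict
    sample_frequency: float
    unit: str


@dataclass(frozen=True)
class ContinuousDataSketch:
    label: str
    time: tuple                      # tuples instead of lists
    values: tuple
    parameters: Parameters


data = ContinuousDataSketch(
    label="filtered impedance",
    time=(0.0, 0.05, 0.10),
    values=(1.0, 1.2, 0.9),
    parameters=Parameters(sample_frequency=20.0, unit="a.u."),
)
print(hash(data))                    # works, because every field is hashable
```

In the real containers, time and values are numpy arrays rather than tuples, which is exactly the remaining problem discussed below.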
On hashing numpy arrays
Hashing numpy arrays is not supported by default. There are plenty of custom implementations available. However, these are not very fast for EIT-sized datasets. A solution might be:
if the object is not locked, don't return a hash
calculate the hash from the array bytes once, and store it on the object
as soon as you unlock the object, remove the stored hash
This should result in a fast way to repeatedly hash objects, thus making it possible to use them for caching results.
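A sketch of that idea, using numpy's own writeable flag as the lock; the actual lock/unlock mechanism in eitprocessing may look different.

```python
import numpy as np


class LockableArray:
    """Wraps a numpy array; hashable only while locked, with the hash cached."""

    def __init__(self, array: np.ndarray):
        self._array = array
        self._hash = None            # cached hash, computed lazily
        self.lock()

    def lock(self):
        self._array.flags.writeable = False  # numpy-level write protection

    def unlock(self):
        self._array.flags.writeable = True
        self._hash = None            # stored hash is no longer trustworthy

    def __hash__(self):
        if self._array.flags.writeable:
            raise TypeError("unlocked data cannot be hashed")
        if self._hash is None:
            # compute once from the raw bytes; repeated hashing is then cheap
            self._hash = hash(self._array.tobytes())
        return self._hash


data = LockableArray(np.arange(1_000_000, dtype=np.float32))
print(hash(data))   # first call computes and stores the hash
print(hash(data))   # subsequent calls reuse the stored value
```

Note that __eq__ would have to be defined consistently with __hash__ for dict-based caches to behave correctly.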
On saving intermediate steps
Of course, it is very useful to store intermediate steps. In my proposal, there is no easy way to do this.
It is possible to retrieve results from a cache, but we would need to write a nice interface for that.
A second option is to have each method optionally return the intermediate steps (BD().find_breaths(..., return_intermediate=True)). However, this would mess with function signatures.
A third option is to always return intermediate results (and possibly other stuff). This would result in e.g. breaths, other_stuff = BD().find_breaths(), and would get in the way of BD().find_breaths(some_Data).store(sequence), automatically passing the return value to a next function, etc.
A fourth option is to save intermediate results that were used to produce an object into the object itself. E.g. with breaths = BD().find_breaths(), breaths.derived_from or another attribute would contain intermediate steps.
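The fourth option could look something like the following. The attribute name derived_from is taken from above; everything else is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Breaths:                       # stand-in for the actual breaths result object
    breaths: tuple
    derived_from: tuple = field(default=(), compare=False)
    # compare=False keeps intermediate results out of equality and hashing,
    # so they don't interfere with caching


# find_breaths() itself would populate derived_from with whatever it computed along the way
breaths = Breaths(
    breaths=("breath 1", "breath 2"),
    derived_from=("filtered impedance", "detected peaks and valleys"),
)
print(breaths.derived_from)
```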