Features of eitprocessing can be, and are, reused within a single pipeline. An example is BreathDetection (BD), which can be used by other features (EELI, TIV). We have thought of, and partly implemented, some solutions to prevent re-running breath detection. However, these come with some disadvantages.
To move forward to a system of reproducible pipelines, I'd like to propose a single workflow that works for all cases.
Run all intermediate steps yourself
A simple solution is to run intermediate steps yourself before continuing to the next step. This adds steps for the end user: they have to run BD with the proper settings themselves and pass the results to e.g. TIV. This is not conducive to easy and reproducible pipelines.
Save/load results with a predictable label
BD results are saved to a sequence with a predictable label, based on the input. E.g. BD of "filtered impedance" data gets the label "breath detection - filtered impedance" and is stored in the sequence. When BD is run, it first checks whether the result already exists. If so, it returns that instead of recalculating. This is a very simple way of caching, with some downsides. First, you have to provide the sequence when running BD so that the results can be stored. This makes the interface less approachable. Second, it does not care about any changes in the data. You might have an entirely different object with the same label, resulting in getting the cached results from a different object. Third, this adds data to the sequence object that is potentially unwanted. E.g., you may want to store relevant results to disk, but not the BD data.
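For illustration, the label-based approach boils down to something like the sketch below. A plain dict stands in for the sequence's data store; the real interface differs, and the function here is not the actual implementation.

```python
def find_breaths_with_label_cache(bd, data_store: dict, data, data_label: str):
    """Sketch of label-based caching; `data_store` is a stand-in for the sequence."""
    label = f"breath detection - {data_label}"  # predictable label based on the input
    if label in data_store:
        # result already exists: return it as-is, even if `data` has changed since
        return data_store[label]
    breaths = bd.find_breaths(data)
    data_store[label] = breaths                 # store the result for later reuse
    return breaths
```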
I propose a new (well, previously conceived, but never implemented) design.
Automatic caching of intermediate steps
I propose no results are automatically added to a sequence. My suggestion is to have end users only manually store results (sequence.data.add(some_result)). We could make this slightly easier by adding a .store() method to all objects (BD().find_breaths(some_Data).store(sequence)).
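A .store() method could be as small as the mixin sketched below. Only sequence.data.add(...) is taken from the proposal above; everything else about the interface is an assumption.

```python
class Storable:
    """Sketch of a mixin that lets any result object store itself on a sequence."""

    def store(self, sequence):
        sequence.data.add(self)  # manually store this result on the sequence
        return self              # returning self keeps chaining possible:
                                 # BD().find_breaths(some_Data).store(sequence)
```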
This means that to use the results of reused algorithms, we have to cache them. Python has several ways of automatically caching intermediate steps. The obvious built-in approach is the LRU cache from functools. You apply this cache to a method (e.g. find_breaths). The cache takes the arguments that were passed to the method, and the return values of the method, and saves them in a hash map. When running the method with the same arguments, the results are retrieved from the cache and returned.
The LRU component automatically removes values that are not often used if the cache becomes too full.
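A minimal, self-contained sketch of that mechanism (the class and parameter below are stand-ins, not the real implementation):

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)              # frozen + eq makes the instance hashable
class BreathDetectionSketch:
    minimum_duration: float = 0.5    # illustrative parameter

    @lru_cache(maxsize=32)
    def find_breaths(self, data):
        # `self` and `data` together form the cache key, so both must be hashable
        print("running breath detection")  # only printed on a cache miss
        return f"breaths in {data!r}"


bd = BreathDetectionSketch()
bd.find_breaths("filtered impedance")  # computes and caches
bd.find_breaths("filtered impedance")  # returned from the cache without recomputation
```

One caveat of functools.lru_cache on methods is that the cache holds a reference to every argument it has seen, including self, so cached objects are not garbage collected while they sit in the cache.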
This feature hinges on the inputs of the methods being hashable. For a generic method, e.g. BreathDetection().find_breaths(), the inputs are a) the BreathDetection object, and b) one or several DataContainers.
A BreathDetection object can be made hashable by freezing it (https://docs.python.org/3/library/dataclasses.html#frozen-instances), so it cannot be altered after initialization (see https://docs.python.org/3/library/dataclasses.html#module-contents under unsafe_hash). This would mean that none of these objects can be altered after creation. This is a small downside, but not insurmountable.
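In practice that would behave roughly as follows (a sketch, not the actual class):

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class BreathDetectionSketch:
    minimum_duration: float = 0.5    # illustrative parameter

bd = BreathDetectionSketch()
print(hash(bd))                      # works: frozen + eq generates a value-based __hash__

try:
    bd.minimum_duration = 1.0        # altering after creation is no longer possible
except FrozenInstanceError:
    print("frozen instances cannot be altered")
```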
DataContainers are not yet hashable, and making them hashable is not a trivial task. We have to think about what the hash of a DataContainer comprises. Is it purely the contained data (e.g. time and values, or intervals and values), or also aspects like the label, description, etc.? These will then have to be frozen or locked before being used in analysis (or frozen/locked as soon as they are used for analysis).
On locking and freezing
Ideally, DataContainer objects (ContinuousData, IntervalData, SparseData and EITData) don't change at all after initialization. This would make it easier to work with hashes, etc., and would force the end user to get all components together before creating the object. It would, however, remove the possibility to e.g. manually remove a single breath from some data.
Dataclasses support creating frozen objects, which removes the option to set attributes after initialization. This does not, however, protect against changing the attributes themselves, e.g. appending elements to a list. We have previously worked on freezing objects, which locks numpy arrays (so they cannot be changed), but does not protect against changing e.g. lists.
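A small demonstration of that limitation (a sketch, not eitprocessing code):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenContainer:
    values: list = field(default_factory=list)

container = FrozenContainer()
# container.values = [1, 2, 3]   # this would raise FrozenInstanceError
container.values.append(1)       # but mutating the attribute itself still works
print(container.values)          # [1]

try:
    hash(container)              # the generated __hash__ also fails on a list field
except TypeError:
    print("a mutable (unhashable) field breaks hashing as well")
```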
To make DataContainers hashable, it would make sense to:
prevent the usage of lists and prefer numpy arrays, tuples or sets
replace unhashable dicts with e.g. named tuples or config classes
find a way to properly hash numpy arrays
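Put together, a hashable container could look something like the sketch below. The field names and types are assumptions, not the actual DataContainer definition.

```python
from dataclasses import dataclass
from typing import NamedTuple


class Parameters(NamedTuple):        # hashable replacement for a parameters dict
    sample_frequency: float
    unit: str


@dataclass(frozen=True)
class ContinuousDataSketch:
    label: str
    time: tuple                      # tuples instead of lists
    values: tuple
    parameters: Parameters


data = ContinuousDataSketch(
    label="filtered impedance",
    time=(0.0, 0.05, 0.10),
    values=(1.0, 1.2, 0.9),
    parameters=Parameters(sample_frequency=20.0, unit="a.u."),
)
print(hash(data))                    # works, because every field is hashable
```

In the real containers, time and values are numpy arrays rather than tuples, which is exactly the remaining problem discussed below.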
On hashing numpy arrays
Hashing numpy arrays is not supported by default. There are plenty of custom implementations available. However, these are not very fast for EIT-sized datasets. A solution might be:
if the object is not locked, don't return a hash
calculate the hash from the array bytes once, and store it on the object
as soon as you unlock the object, remove the stored hash
This should result in a fast way to repeatedly hash objects, thus making it possible to use them for caching results.
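A sketch of that idea, using numpy's own writeable flag as the lock; the actual lock/unlock mechanism in eitprocessing may look different.

```python
import numpy as np


class LockableArray:
    """Wraps a numpy array; hashable only while locked, with the hash cached."""

    def __init__(self, array: np.ndarray):
        self._array = array
        self._hash = None            # cached hash, computed lazily
        self.lock()

    def lock(self):
        self._array.flags.writeable = False  # numpy-level write protection

    def unlock(self):
        self._array.flags.writeable = True
        self._hash = None            # stored hash is no longer trustworthy

    def __hash__(self):
        if self._array.flags.writeable:
            raise TypeError("unlocked data cannot be hashed")
        if self._hash is None:
            # compute once from the raw bytes; repeated hashing is then cheap
            self._hash = hash(self._array.tobytes())
        return self._hash


data = LockableArray(np.arange(1_000_000, dtype=np.float32))
print(hash(data))   # first call computes and stores the hash
print(hash(data))   # subsequent calls reuse the stored value
```

Note that __eq__ would have to be defined consistently with __hash__ for dict-based caches to behave correctly.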
On saving intermediate steps
Of course, it is very useful to store intermediate steps. In my proposal, there is no easy way to do this.
It is possible to retrieve results from a cache, but we would need to write a nice interface for that.
A second option is to have each method optionally return the intermediate steps (BD().find_breaths(..., return_intermediate=True)). However, this would mess with function signatures.
A third option is to always return intermediate results (and possibly other stuff). This would result in e.g. breaths, other_stuff = BD().find_breaths(), and would get in the way of BD().find_breaths(some_Data).store(sequence), automatically passing the return value to a next function, etc.
A fourth option is to save intermediate results that were used to produce an object into the object itself. E.g. with breaths = BD().find_breaths(), breaths.derived_from or another attribute would contain intermediate steps.
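The fourth option could look something like the following. The attribute name derived_from is taken from above; everything else is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Breaths:                       # stand-in for the actual breaths result object
    breaths: tuple
    derived_from: tuple = field(default=(), compare=False)
    # compare=False keeps intermediate results out of equality and hashing,
    # so they don't interfere with caching


# find_breaths() itself would populate derived_from with whatever it computed along the way
breaths = Breaths(
    breaths=("breath 1", "breath 2"),
    derived_from=("filtered impedance", "detected peaks and valleys"),
)
print(breaths.derived_from)
```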