In Dask, calling the `.persist()` method on a Dask collection sends the underlying computational graph to the scheduler. The graph is then evaluated on the cluster in the background, i.e. `persist` is asynchronous and keeps the result of the computation in distributed memory.
One should be careful about when `persist` is invoked, and on which collections. For example, if the user wants to compute a truncated SVD on a very large tall-and-skinny matrix `X` that does not fit in their laptop's RAM, does it make sense to call `X = X.persist()` prior to computing the SVD? This effectively computes the underlying graph of `X` on the cluster and keeps `X` in distributed RAM. Since it does not fit in RAM, Dask will have to spill to disk, which slows things down considerably.
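To make the trade-off concrete, here is a minimal sketch of the pattern in question, using a small array so it runs locally (with the default local scheduler, `persist` materializes the chunks in local memory; with a `distributed` client it would be asynchronous and use cluster memory instead). The array size and `k=5` are illustrative, not taken from the issue:

```python
import dask.array as da

# A small tall-and-skinny array; in the scenario above, X would be
# far larger than the machine's RAM, making persist counterproductive.
X = da.random.random((10_000, 10), chunks=(1_000, 10))

# Submit the graph and keep the materialized chunks in memory.
# Later operations reuse these chunks instead of recomputing them.
X = X.persist()

# A truncated SVD on the persisted array.
u, s, v = da.linalg.svd_compressed(X, k=5)
s = s.compute()
```

If `X` exceeds available memory, skipping the `persist` call lets Dask stream chunks through the SVD without holding the whole array at once.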
At the moment, instantiating any of the SVD classes generates a `self.X` attribute that points to the `X` Dask collection. This creates a new future pointing to the data, which makes it harder to remove the data from the cluster at a later stage, since all futures pointing to the data must be deleted before distributed RAM is cleared. Would it therefore be better not to keep a reference to `X` in the SVD classes? This would mean passing `X` as an argument to the `fit()` and `transform()` methods. The array rechunking performed by these classes would need to move somewhere else (maybe to the planned `preprocessing.py` module? - see #8).
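A hypothetical sketch of the proposed API shape (class and attribute names are illustrative, shown here with NumPy rather than Dask for brevity): the estimator takes `X` only as a method argument, so once the caller drops its own reference, no attribute on the estimator keeps the data alive.

```python
import numpy as np

class TruncatedSVD:
    """Illustrative estimator that does NOT retain a self.X reference."""

    def __init__(self, k):
        self.k = k  # number of singular values/vectors to keep

    def fit(self, X):
        # X lives only for the duration of this call; no reference is
        # stored, so deleting the caller's handle frees the data.
        u, s, vt = np.linalg.svd(X, full_matrices=False)
        self.components_ = vt[: self.k]
        self.singular_values_ = s[: self.k]
        return self

    def transform(self, X):
        # Project X onto the top-k right singular vectors.
        return X @ self.components_.T

svd = TruncatedSVD(k=2).fit(np.random.rand(20, 5))
Y = svd.transform(np.random.rand(20, 5))
```

With a Dask-backed `X`, deleting the user's variable after `fit` would then be sufficient for the scheduler to release the corresponding distributed memory, since no `self.X` future remains.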