Improve memory management #10

@dsj976

Description

In Dask, calling the .persist() method on a Dask collection sends the underlying task graph to the scheduler. The graph is then evaluated on the cluster in the background, i.e. persist is asynchronous, and the results of the computation are kept in distributed memory.
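A minimal sketch of this behaviour (the array and shapes here are illustrative, not from the actual codebase):

```python
import dask.array as da

# A lazy 2D array: nothing is computed yet, only the task graph is built.
x = da.random.random((1_000, 100), chunks=(100, 100))
y = (x + 1).sum(axis=0)

# persist() submits the graph for execution and returns a new collection
# whose chunks point at the materialised results. With a distributed
# Client this happens asynchronously on the cluster; with the default
# local scheduler it blocks until the chunks are in memory.
y = y.persist()

print(y.compute().shape)  # (100,)
```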

One should be careful about when persist is invoked, and on which collections. For example, if the user wants to compute a truncated SVD of a very large tall-and-skinny matrix X that does not fit in their laptop's RAM, does it make sense to call X = X.persist() before computing the SVD? This effectively computes the underlying graph of X on the cluster and persists X in distributed RAM. Since the matrix does not fit in RAM, Dask will have to spill to disk, which slows things down considerably.
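To illustrate the alternative, a small sketch of computing the SVD while leaving X lazy (the matrix here is tiny and randomly generated purely for demonstration):

```python
import dask.array as da

# Stand-in tall-and-skinny matrix; in practice this would be far larger
# than RAM. Chunked along rows only, as da.linalg.svd requires.
X = da.random.random((10_000, 20), chunks=(1_000, 20))

# Calling X.persist() here would force the whole matrix into
# (distributed) memory first, spilling to disk if it does not fit.
# Leaving X lazy lets the SVD stream through the row blocks instead.
u, s, v = da.linalg.svd(X)

print(s.compute().shape)  # (20,)
```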

At the moment, when we instantiate any of the SVD classes, a self.X attribute is created that points to the X Dask collection. This creates a new future pointing to the data, which makes it harder to remove the data from the cluster later, since all futures pointing to the data must be deleted before distributed RAM can be cleared. Would it therefore be better not to keep a reference to X in the SVD classes? This would mean passing X as an argument to the fit() and transform() methods. The array rechunking performed by these classes would need to move somewhere else (maybe to the planned preprocessing.py module? - see #8).
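A hypothetical sketch of what that interface could look like (the class name, attributes, and k parameter are illustrative, not the project's actual API):

```python
import dask.array as da

class TruncatedSVD:
    """Sketch of the proposed interface: X is passed to fit()/transform()
    instead of being stored as self.X, so the class holds no extra
    reference (and hence no extra future) to the data on the cluster."""

    def __init__(self, k):
        self.k = k  # number of singular vectors to keep

    def fit(self, X):
        # Any rechunking X needs would happen upstream, e.g. in a
        # separate preprocessing step, not inside this class.
        u, s, v = da.linalg.svd(X)
        self.components_ = v[: self.k]
        self.singular_values_ = s[: self.k]
        return self

    def transform(self, X):
        return X @ self.components_.T

X = da.random.random((2_000, 10), chunks=(500, 10))
Z = TruncatedSVD(k=5).fit(X).transform(X)
print(Z.compute().shape)  # (2000, 5)
```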

Metadata

    Labels

    question (further information is requested)
