In Dask, calling the `.persist()` method on a Dask collection sends the underlying computational graph to the scheduler. The graph is then evaluated on the cluster in the background, i.e. `persist` is asynchronous and keeps the result of the computation in distributed memory.
One should be careful about when `persist` is invoked, and on which collections. For example, if the user wants to compute a truncated SVD on a very large tall-and-skinny matrix `X` that does not fit in their laptop's RAM, does it make sense to call `X = X.persist()` prior to computing the SVD? This effectively computes the underlying graph of `X` on the cluster and keeps `X` in distributed RAM. Since it does not fit in RAM, Dask will have to spill to disk, which slows things down considerably.
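To make the trade-off concrete, here is a minimal sketch of the pattern in question, using a small array so it runs locally (with the default local scheduler, `persist` materializes the chunks in local memory; with a `distributed` client it would be asynchronous and use cluster memory instead). The array size and `k=5` are illustrative, not taken from the issue:

```python
import dask.array as da

# A small tall-and-skinny array; in the scenario above, X would be
# far larger than the machine's RAM, making persist counterproductive.
X = da.random.random((10_000, 10), chunks=(1_000, 10))

# Submit the graph and keep the materialized chunks in memory.
# Later operations reuse these chunks instead of recomputing them.
X = X.persist()

# A truncated SVD on the persisted array.
u, s, v = da.linalg.svd_compressed(X, k=5)
s = s.compute()
```

If `X` exceeds available memory, skipping the `persist` call lets Dask stream chunks through the SVD without holding the whole array at once.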
At the moment, instantiating any of the SVD classes generates a `self.X` attribute that points to the `X` Dask collection. This creates a new future pointing to the data, which makes it harder to remove the data from the cluster at a later stage, since all futures pointing to the data must be deleted before distributed RAM is cleared. Would it therefore be better not to keep a reference to `X` in the SVD classes? This would mean passing `X` as an argument to the `fit()` and `transform()` methods. The array rechunking performed by these classes would need to move somewhere else (maybe to the planned `preprocessing.py` module? - see #8).
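A hypothetical sketch of the proposed API shape (class and attribute names are illustrative, shown here with NumPy rather than Dask for brevity): the estimator takes `X` only as a method argument, so once the caller drops its own reference, no attribute on the estimator keeps the data alive.

```python
import numpy as np

class TruncatedSVD:
    """Illustrative estimator that does NOT retain a self.X reference."""

    def __init__(self, k):
        self.k = k  # number of singular values/vectors to keep

    def fit(self, X):
        # X lives only for the duration of this call; no reference is
        # stored, so deleting the caller's handle frees the data.
        u, s, vt = np.linalg.svd(X, full_matrices=False)
        self.components_ = vt[: self.k]
        self.singular_values_ = s[: self.k]
        return self

    def transform(self, X):
        # Project X onto the top-k right singular vectors.
        return X @ self.components_.T

svd = TruncatedSVD(k=2).fit(np.random.rand(20, 5))
Y = svd.transform(np.random.rand(20, 5))
```

With a Dask-backed `X`, deleting the user's variable after `fit` would then be sufficient for the scheduler to release the corresponding distributed memory, since no `self.X` future remains.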