Move PCA and TSVD from cuml to raft#2952
Move PCA and TSVD from cuml to raft#2952aamijar wants to merge 13 commits intorapidsai:release/26.04from
Conversation
cpp/include/raft/linalg/pca.cuh
Outdated
|
|
||
| template <typename math_t, typename enum_solver = solver> | ||
| void truncCompExpVars(const raft::handle_t& handle, | ||
| math_t* in, |
There was a problem hiding this comment.
Need to use mdspan here- we've deprecated all the pointer APIs.
cpp/include/raft/linalg/pca.cuh
Outdated
| math_t* input, | ||
| math_t* components, | ||
| math_t* explained_var, | ||
| math_t* explained_var_ratio, |
There was a problem hiding this comment.
Input order should match the other (newer APIs). handle, params, input, output, free params. Also "stream" is in the handle now, and we use device_resources not raft::hande.
| /** | ||
| * @brief perform fit operation for the pca. Generates eigenvectors, explained vars, singular vals, | ||
| * etc. | ||
| * @param[in] handle: cuml handle object |
There was a problem hiding this comment.
doc mentioning cuml handle, not raft device_resources! (same for other docs too)
We will still have the same python and cpp apis in cuml too! |
|
@aamijar we will probably expose a preprocessing api through python for purposes of users who need to write scripts (for example Jinsol's new dataset gen requires PCA and it would be a circular dependency if we included cuml in cuVS) or have databases written in python. But- like I mentioned to Simon, the users are very diffeeent between the two. Same thing with kmeans- kmeans clusters is the equivalent of "lexicograph ordering" in the vector world. Pca is another way to reduce footprint of vectors without losing quality. Data science users will continue to use cuml. Vector databases will continue to use cuVS. It's important we don't duplicate code across the two... and since cuml is already using cuVS, it can continue to use the c++ api like you mentioned. |
| void truncCompExpVars(raft::resources const& handle, | ||
| math_t* in, | ||
| math_t* components, | ||
| math_t* explained_var, | ||
| math_t* explained_var_ratio, | ||
| math_t* noise_vars, | ||
| const paramsTSVD& prms, | ||
| cudaStream_t stream) |
There was a problem hiding this comment.
why do we need both the stream and the handle here? can't we use raft::resource::get_cuda_stream(handle)? Same for other functions too!
| auto stream = resource::get_cuda_stream(handle); | ||
|
|
||
| paramsPCA prms_with_dims = prms; | ||
| prms_with_dims.n_rows = static_cast<std::size_t>(input.extent(0)); | ||
| prms_with_dims.n_cols = static_cast<std::size_t>(input.extent(1)); | ||
|
|
||
| detail::pcaFit(handle, | ||
| input.data_handle(), | ||
| components.data_handle(), | ||
| explained_var.data_handle(), | ||
| explained_var_ratio.data_handle(), | ||
| singular_vals.data_handle(), | ||
| mu.data_handle(), | ||
| noise_vars.data_handle(), | ||
| prms_with_dims, | ||
| stream, | ||
| flip_signs_based_on_U); | ||
| } |
There was a problem hiding this comment.
similar to previous comment. suggesting we don't need to pass the stream separately
| struct paramsTSVD { | ||
| std::size_t n_rows = 0; | ||
| std::size_t n_cols = 0; | ||
| int gpu_id = 0; |
There was a problem hiding this comment.
are we using this gpu_id anywhere?
| void truncCompExpVars(raft::resources const& handle, | ||
| math_t* in, | ||
| math_t* components, | ||
| math_t* explained_var, | ||
| math_t* explained_var_ratio, | ||
| math_t* noise_vars, | ||
| const paramsTSVD& prms, |
There was a problem hiding this comment.
As Corey mentioned, suggesting we change the order to have handle-> params-> other stuff. Same for other functions.
| paramsPCA prms_with_dims = prms; | ||
| prms_with_dims.n_rows = static_cast<std::size_t>(input.extent(0)); | ||
| prms_with_dims.n_cols = static_cast<std::size_t>(input.extent(1)); |
There was a problem hiding this comment.
thinking if it would be better to make a new params here VS use RAFT_EXPECTS to check if we have the right rows/cols.... what do you think!?
|
Hi @jinsolp, thanks for the review! I guess my original goal was to keep as much of the existing code from cuml as possible, so that's why I am using the old pointer based apis for the detail namespace. The public API has been changed to use the mdspan, no stream usage, and correct ordering of params. That being said, we should probably redo some of the detail implementation to match the modern conventions. I'll take a look at it! |
Co-authored-by: Jinsol Park <jinsolp@nvidia.com>
@jinsolp @aamijar pointer-based APIs are okay in detail namespace. Never okay in public APIs. We do need to make sure the public APIs are ordered appropriately (consistently w/ the other public APIs). |
Required for rapidsai/cuvs#1207 and rapidsai/cuml#7802.
This PR moves
pca.cuh,tsvd.cuh, and gtests into raft.