Move PCA and TSVD from cuml to raft#2952

Open
aamijar wants to merge 13 commits into rapidsai:release/26.04 from aamijar:move-pca-from-cuml

Member

@aamijar aamijar commented Feb 13, 2026

Required for rapidsai/cuvs#1207 and rapidsai/cuml#7802.

This PR moves pca.cuh, tsvd.cuh, and gtests into raft.

@aamijar aamijar requested review from a team as code owners February 13, 2026 09:09
@aamijar aamijar self-assigned this Feb 13, 2026
@aamijar aamijar added the "non-breaking" (Non-breaking change) and "feature request" (New feature or request) labels Feb 13, 2026

template <typename math_t, typename enum_solver = solver>
void truncCompExpVars(const raft::handle_t& handle,
                      math_t* in,
Member

Need to use mdspan here- we've deprecated all the pointer APIs.

Member Author

Addressed in 7289840

math_t* input,
math_t* components,
math_t* explained_var,
math_t* explained_var_ratio,
Member

Input order should match the other (newer) APIs: handle, params, input, output, free params. Also, the stream is in the handle now, and we use device_resources, not raft::handle_t.

Member Author

Addressed in 7289840
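
The ordering convention requested above (handle, params, input, output, free params, with no separate stream argument) can be illustrated with a small host-only sketch. Everything in it is a hypothetical stand-in: `resources_sketch`, `params_sketch`, `matrix_view_sketch`, and `column_means_sketch` are not raft types or functions, and the hand-rolled view only imitates the shape of an mdspan-style argument.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT raft types.
struct resources_sketch {
  int stream_id = 0;  // stands in for the stream that the handle owns
};

struct params_sketch {
  std::size_t n_components = 1;  // number of leading columns to process
};

// Minimal non-owning, row-major 2-D view standing in for an mdspan-style argument.
template <typename T>
struct matrix_view_sketch {
  T* data;
  std::size_t rows;
  std::size_t cols;
  std::size_t extent(std::size_t dim) const { return dim == 0 ? rows : cols; }
  T& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Argument order: handle, params, input, output, free parameters.
// There is no stream argument; a real implementation would obtain it from the handle.
inline void column_means_sketch(resources_sketch const& handle,
                                params_sketch const& prms,
                                matrix_view_sketch<const float> input,
                                matrix_view_sketch<float> output,
                                float scale = 1.0f)
{
  (void)handle;  // unused in this host-only sketch
  assert(output.extent(0) == 1 && output.extent(1) == prms.n_components);
  assert(prms.n_components <= input.extent(1));
  for (std::size_t j = 0; j < prms.n_components; ++j) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < input.extent(0); ++i) { sum += input(i, j); }
    output(0, j) = scale * sum / static_cast<float>(input.extent(0));
  }
}
```

Because the views carry their own extents, the caller no longer passes row and column counts alongside raw pointers, which is one reason reviewers tend to prefer this shape for public entry points.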

Contributor

@jinsolp jinsolp left a comment

Thanks @aamijar! Just a minor comment.
Question: will this be imported in cuVS and exposed as a Python API?

/**
 * @brief perform fit operation for the pca. Generates eigenvectors, explained vars, singular vals,
 * etc.
 * @param[in] handle: cuml handle object
Contributor

The doc mentions a cuml handle, not raft device_resources! (Same for the other docs too.)

Member Author

Addressed in 9e285e6

Member Author

aamijar commented Feb 14, 2026

> Question: will this be imported in cuvs and exposed as a python API?

We will still have the same Python and C++ APIs in cuml too!
On the cuVS side, I think the plan is to expose a C++ API.

Member

cjnolet commented Feb 14, 2026

@aamijar we will probably expose a preprocessing API through Python for users who need to write scripts (for example, Jinsol's new dataset gen requires PCA, and including cuml in cuVS would create a circular dependency) or who have databases written in Python.

But, like I mentioned to Simon, the users are very different between the two. Same thing with kmeans: kmeans clustering is the equivalent of "lexicographic ordering" in the vector world. PCA is another way to reduce the footprint of vectors without losing quality.

Data science users will continue to use cuml. Vector databases will continue to use cuVS. It's important we don't duplicate code across the two... and since cuml is already using cuVS, it can continue to use the C++ API like you mentioned.

Comment on lines +29 to +36
void truncCompExpVars(raft::resources const& handle,
                      math_t* in,
                      math_t* components,
                      math_t* explained_var,
                      math_t* explained_var_ratio,
                      math_t* noise_vars,
                      const paramsTSVD& prms,
                      cudaStream_t stream)
Contributor

Why do we need both the stream and the handle here? Can't we use raft::resource::get_cuda_stream(handle)? Same for the other functions too!
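
A minimal sketch of the point being raised here, written so it compiles without CUDA or raft: when a function takes both a handle and a stream, the two can disagree, whereas deriving the stream from the handle leaves a single source of truth. The names `stream_sketch_t`, `handle_sketch`, `get_stream_sketch`, and both `work_*` functions are hypothetical; `get_stream_sketch` only plays the role that `raft::resource::get_cuda_stream(handle)` plays in the comment above.

```cpp
#include <cassert>

// Hypothetical stand-ins so the sketch compiles without CUDA or raft.
using stream_sketch_t = int;

struct handle_sketch {
  stream_sketch_t stream = 0;  // the stream owned by the handle
};

// Plays the role of raft::resource::get_cuda_stream(handle) in this sketch.
inline stream_sketch_t get_stream_sketch(handle_sketch const& handle) { return handle.stream; }

// Before: the caller supplies both, so the explicit stream may differ from the handle's.
inline stream_sketch_t work_with_explicit_stream(handle_sketch const& handle,
                                                 stream_sketch_t stream)
{
  (void)handle;
  return stream;  // the stream the work would be enqueued on
}

// After: the stream is always derived from the handle.
inline stream_sketch_t work_with_handle_only(handle_sketch const& handle)
{
  stream_sketch_t stream = get_stream_sketch(handle);
  return stream;  // the stream the work would be enqueued on
}
```

With a handle whose stream is 7, `work_with_handle_only(handle)` can only ever use stream 7, while `work_with_explicit_stream(handle, 3)` silently uses a different one.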

Comment on lines +51 to +68
  auto stream = resource::get_cuda_stream(handle);

  paramsPCA prms_with_dims = prms;
  prms_with_dims.n_rows = static_cast<std::size_t>(input.extent(0));
  prms_with_dims.n_cols = static_cast<std::size_t>(input.extent(1));

  detail::pcaFit(handle,
                 input.data_handle(),
                 components.data_handle(),
                 explained_var.data_handle(),
                 explained_var_ratio.data_handle(),
                 singular_vals.data_handle(),
                 mu.data_handle(),
                 noise_vars.data_handle(),
                 prms_with_dims,
                 stream,
                 flip_signs_based_on_U);
}
Contributor

Similar to the previous comment: I'd suggest we don't need to pass the stream separately.

struct paramsTSVD {
  std::size_t n_rows = 0;
  std::size_t n_cols = 0;
  int gpu_id = 0;
Contributor

Are we using this gpu_id anywhere?

Comment on lines +29 to +35
void truncCompExpVars(raft::resources const& handle,
                      math_t* in,
                      math_t* components,
                      math_t* explained_var,
                      math_t* explained_var_ratio,
                      math_t* noise_vars,
                      const paramsTSVD& prms,
Contributor

As Corey mentioned, I'd suggest changing the order to handle -> params -> everything else. Same for the other functions.

Comment on lines +53 to +55
paramsPCA prms_with_dims = prms;
prms_with_dims.n_rows = static_cast<std::size_t>(input.extent(0));
prms_with_dims.n_cols = static_cast<std::size_t>(input.extent(1));
Contributor

I'm wondering whether it would be better to make a new params copy here or to use RAFT_EXPECTS to check that we have the right rows/cols. What do you think?
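
The two options weighed in this comment can be sketched side by side. This is a host-only illustration under stated assumptions: `params_dims_sketch`, `extents_sketch`, `with_dims_from_input`, and `expect_dims_match` are hypothetical names, and a plain `std::invalid_argument` stands in for whatever a `RAFT_EXPECTS` check would raise in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-ins; not the PR's paramsPCA or an mdspan.
struct params_dims_sketch {
  std::size_t n_rows = 0;
  std::size_t n_cols = 0;
};

struct extents_sketch {
  std::size_t rows;
  std::size_t cols;
  std::size_t extent(std::size_t dim) const { return dim == 0 ? rows : cols; }
};

// Option A (the approach in the excerpt above): copy the params and overwrite
// the dimensions from the input extents, ignoring whatever the caller set.
inline params_dims_sketch with_dims_from_input(params_dims_sketch prms,
                                               extents_sketch const& input)
{
  prms.n_rows = input.extent(0);
  prms.n_cols = input.extent(1);
  return prms;
}

// Option B: leave the params untouched and reject a mismatch. The exception
// here is a stand-in for a RAFT_EXPECTS-style check.
inline void expect_dims_match(params_dims_sketch const& prms, extents_sketch const& input)
{
  if (prms.n_rows != input.extent(0) || prms.n_cols != input.extent(1)) {
    throw std::invalid_argument("params dimensions do not match the input extents");
  }
}
```

Option A never fails but can hide a caller mistake; option B surfaces the mistake at the cost of requiring callers to fill in dimensions that the input view already carries.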

Member Author

aamijar commented Feb 20, 2026

Hi @jinsolp, thanks for the review! My original goal was to keep as much of the existing cuml code as possible, which is why I am using the old pointer-based APIs in the detail namespace. The public API has been changed to use mdspan, take no stream argument, and order its params correctly.

That being said, we should probably redo some of the detail implementation to match the modern conventions. I'll take a look at it!

aamijar and others added 2 commits March 9, 2026 18:08
Member

cjnolet commented Mar 11, 2026

> so that's why I am using the old pointer based apis for the detail namespace. The public API has been changed to use the mdspan, no stream usage, and correct ordering of params.

@jinsolp @aamijar pointer-based APIs are okay in detail namespace. Never okay in public APIs. We do need to make sure the public APIs are ordered appropriately (consistently w/ the other public APIs).

Member Author

aamijar commented Mar 11, 2026

Thanks @jinsolp and @cjnolet, I've opened an issue as a follow-up item: #2978. Can I get a re-review if everything looks good on this PR now?

@aamijar aamijar changed the base branch from main to release/26.04 March 13, 2026 00:03