PCA preprocessor #1808

Open
aamijar wants to merge 22 commits into rapidsai:release/26.04 from aamijar:pca-preprocessor

Member

@aamijar aamijar commented Feb 16, 2026

Resolves #1207. Depends on rapidsai/raft#2952
This PR introduces cuvs::preprocessing::pca with float support. The following APIs are supported:
fit, transform, fit_transform, and inverse_transform.
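
For orientation, the four operations correspond to the textbook PCA definitions below. This is a sketch of the standard formulation, assuming an n x d input X, a vector of column means \mu, and k retained components. The symbols are illustrative, and details such as sign conventions or whitening in this PR may differ.

```latex
\begin{aligned}
\text{fit:} \quad & \mu = \tfrac{1}{n} X^\top \mathbf{1}, \qquad
  X - \mathbf{1}\mu^\top = U \Sigma V^\top, \qquad V_k = V_{:,\,1:k} \\
\text{transform:} \quad & Z = (X - \mathbf{1}\mu^\top)\, V_k \\
\text{inverse\_transform:} \quad & \hat{X} = Z V_k^\top + \mathbf{1}\mu^\top \\
\text{fit\_transform:} \quad & \text{fit followed by transform on the same } X
\end{aligned}
```

With k = d the round trip recovers X up to floating-point error; with k < d it yields the best rank-k approximation of the centered data.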

@aamijar aamijar self-assigned this Feb 16, 2026
@aamijar aamijar added the feature request, milvus, AlloyDB, and non-breaking labels Feb 16, 2026
raft::device_vector_view<float, int64_t> mu,
raft::device_scalar_view<float, int64_t> noise_vars,
bool flip_signs_based_on_U = false);

Member Author

@aamijar aamijar Feb 18, 2026

Making a note here: I don't think the existing cuml implementation can choose the number of components from a target fraction of explained variance.
For example, in sklearn we can set 0 < n_components < 1: the user specifies the fraction of explained variance to retain, and the algorithm automatically determines the n_components needed to satisfy it.

We will have to build that piece out since it doesn't exist in the current implementation.

Member

Tuning is not what's being asked for. Exposing the explained variance is what's being requested (it is used for tuning / selecting the number of components but that's something the user does, not something we need to do).
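
To illustrate the user-side selection described above: once explained variance ratios are exposed, a caller could pick the number of components with a few lines of host code. The sketch below is a hypothetical helper (`select_n_components` is not part of cuVS, RAFT, or cuML), and it assumes the ratios are already copied to the host and sorted in descending order. Tie handling at the exact threshold may differ from sklearn.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical user-side helper (an assumption for illustration, not a cuVS API).
// Returns the smallest k such that the first k explained variance ratios
// sum to at least `target`, where 0 < target < 1.
std::size_t select_n_components(const std::vector<float>& explained_variance_ratio,
                                float target)
{
  double cumulative = 0.0;  // accumulate in double to limit rounding error
  for (std::size_t k = 0; k < explained_variance_ratio.size(); ++k) {
    cumulative += explained_variance_ratio[k];
    if (cumulative >= target) { return k + 1; }
  }
  // The target was never reached (for example due to rounding): keep everything.
  return explained_variance_ratio.size();
}
```

A caller would fit once with all components, read back the ratios, call the helper, and then refit or truncate to the returned count.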

Comment on lines +29 to +31
prms.copy = config.copy;
prms.whiten = config.whiten;
return prms;
Member

Suggested change
- prms.copy = config.copy;
- prms.whiten = config.whiten;
- return prms;
+ prms.copy = config.copy;
+ prms.whiten = config.whiten;
+ prms.verbose = config.verbose;
+ return prms;

We are missing verbose here, no?

Member Author

@aamijar aamijar Mar 4, 2026

verbose was an unused parameter; removed it from pca.hpp in 99f32fc

Comment on lines +93 to +94
// prms.n_cols = params.n_col;
// prms.n_rows = params.n_row;
Member

Remove commented code

Member Author

Addressed in 074fd96

* @param[in] flip_signs_based_on_U whether to determine signs by U (true) or V.T (false)
*/
void fit(raft::resources const& handle,
params config,
Member

Small thing but important for API consistency: params config is passed by value here and in all other functions in the PR. The cuVS convention, afaik, is const params&, for example in kmeans::fit:

void fit(raft::resources const& handle,
         const cuvs::cluster::kmeans::params& params,
         raft::device_matrix_view<const float, int> X,
         std::optional<raft::device_vector_view<const float, int>> sample_weight,
         raft::device_matrix_view<float, int> centroids,
         raft::host_scalar_view<float> inertia,
         raft::host_scalar_view<int> n_iter);

This won't affect performance or correctness at all, but for consistency I would suggest changing to const params& config throughout. Applies to all 8 overloads in this header plus the detail layer.
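
As a small illustration of the suggested change, the two signature styles look like this. The `params` struct below is a hypothetical stand-in: `copy` and `whiten` appear in the PR, while `n_components` and the function names are assumptions for the sketch.

```cpp
// Hypothetical stand-in for the PCA params struct (not the real definition).
struct params {
  int n_components = 0;  // assumed field, for illustration only
  bool copy        = true;
  bool whiten      = false;
};

// Style used in the PR at the time of this comment: params copied per call.
int fit_by_value(params config) { return config.n_components; }

// Style suggested for consistency with other cuVS APIs: no copy, read-only.
int fit_by_const_ref(const params& config) { return config.n_components; }
```

Call sites do not change: both lvalues and temporaries bind to `const params&`, so switching the signature is source-compatible for callers.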

Member Author

Addressed in c7c52a7

}

protected:
void basicTest()
Member

Both basicTest and advancedTest set n_components = n_col, so the inverse transform should perfectly reconstruct the input.

Consider adding a test with n_components < n_col that verifies the reconstruction error is bounded but non-zero. This would be a useful test case: it would confirm that the dimensionality reduction is actually working, not just passing data through unchanged.
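
The property such a test would check can be sketched on the CPU with 2-D data and a single retained component. This is a hypothetical reference, not the test added to the PR and not the cuVS API: `point2`, `rank1_result`, and `rank1_pca_roundtrip` are names invented for the example, and the leading eigenvector uses the closed form for a symmetric 2x2 matrix.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct point2 { double x; double y; };

struct rank1_result {
  double reconstruction_mse;  // mean squared error after transform + inverse
  double total_variance;      // mean squared distance from the data mean
};

// Hypothetical CPU reference: keep 1 of 2 principal components, project,
// reconstruct, and report the round-trip error.
rank1_result rank1_pca_roundtrip(const std::vector<point2>& data)
{
  const double n = static_cast<double>(data.size());
  double mx = 0.0, my = 0.0;
  for (const auto& p : data) { mx += p.x; my += p.y; }
  mx /= n;
  my /= n;

  // Entries of the (unnormalized) scatter matrix [[sxx, sxy], [sxy, syy]].
  double sxx = 0.0, sxy = 0.0, syy = 0.0;
  for (const auto& p : data) {
    const double dx = p.x - mx, dy = p.y - my;
    sxx += dx * dx;
    sxy += dx * dy;
    syy += dy * dy;
  }

  // Direction of the largest-eigenvalue eigenvector of a symmetric 2x2 matrix.
  const double theta = 0.5 * std::atan2(2.0 * sxy, sxx - syy);
  const double vx = std::cos(theta), vy = std::sin(theta);

  double sq_err = 0.0;
  for (const auto& p : data) {
    const double z  = (p.x - mx) * vx + (p.y - my) * vy;  // transform
    const double rx = mx + z * vx;                        // inverse transform
    const double ry = my + z * vy;
    sq_err += (p.x - rx) * (p.x - rx) + (p.y - ry) * (p.y - ry);
  }
  return {sq_err / n, (sxx + syy) / n};
}
```

For data that does not lie on a line, the returned error should be strictly positive and strictly below the total variance; for collinear data it should be zero up to rounding, which is the pass-through case the existing tests already cover.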

Member Author

Addressed in 948882c

Member

@divyegala divyegala left a comment

Why were double instantiations needed? Where is the code intended to be used?

aamijar and others added 3 commits March 4, 2026 08:04
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
Co-authored-by: Dante Gama Dessavre <dante.gamadessavre@gmail.com>
Member Author

aamijar commented Mar 4, 2026

> Why were double instantiations needed? Where is the code intended to be used?

The cuML PCA Python interface supports double inputs. However, cuML will use the RAFT API, so cuVS does not need double instantiations if we think they are not valuable.

@divyegala
Member

> The cuML PCA Python interface supports double inputs. However, cuML will use the RAFT API, so cuVS does not need double instantiations if we think they are not valuable.

In that case, please remove double.
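
For readers unfamiliar with the term, the types a templated library routine supports are commonly fixed by explicit instantiation in a source file. The following is a generic sketch, not code from this PR: `column_mean` is a hypothetical function template standing in for a templated PCA routine.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical function template (an assumption for illustration).
template <typename T>
T column_mean(const std::vector<T>& values)
{
  T sum = T(0);
  for (const T& v : values) { sum += v; }
  return values.empty() ? T(0) : sum / static_cast<T>(values.size());
}

// Explicit instantiation definition: the float version is compiled here.
// Supporting double would take one more such line; dropping that line is
// what removing the double instantiation amounts to in this sketch.
template float column_mean<float>(const std::vector<float>&);
```

When the template body lives only in a source file, callers can link against exactly the instantiated types, so each extra type adds compile time and binary size.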

@aamijar aamijar changed the base branch from main to release/26.04 March 14, 2026 00:01
Labels

AlloyDB, feature request (New feature or request), milvus, non-breaking (Introduces a non-breaking change)

Development

Successfully merging this pull request may close these issues.

[FEA] Implement cuvs::preprocessing::pca

4 participants