SpectralClustering in cuml.accel #7804

rapids-bot[bot] merged 28 commits into rapidsai:release/26.04

Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings, or resume the review manually.
📝 Walkthrough

Adds a cuml.accel sklearn proxy for SpectralClustering, extends the Cython core with CPU–GPU interop mixins and mapping methods, and introduces integration/unit tests, upstream xfail entries, and documentation updates describing SpectralClustering behavior and limitations.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml_accel_tests/integration/test_spectral_clustering.py`:
- Around line 29-39: The test currently computes adjusted_rand_score but never
asserts it; update test_spectral_clustering_n_clusters to store the score (e.g.,
score = adjusted_rand_score(y_true, y_pred)) and add assertions: assert
y_pred.shape == y_true.shape, assert len(np.unique(y_pred)) == n_clusters (or
allow <= n_clusters if algorithm can return fewer), and when n_clusters equals
the known true cluster count assert score >= a reasonable threshold (e.g., 0.8)
to validate correctness; apply the same pattern to the other SpectralClustering
tests referenced (lines 41-52, 54-65, 67-78, 80-91, 93-105, 108-121), and add
additional cases to cover empty dataset, single-sample, high-dimensional input,
fit/predict/transform consistency, and different input types (NumPy, pandas,
cuDF) using the same assertions and comparisons against scikit-learn where
applicable.
In `@python/cuml/cuml_accel_tests/test_basic_estimators.py`:
- Around line 35-41: The test currently creates data with make_blobs and calls
SpectralClustering().fit(X) but never asserts results and leaves y_true unused;
update test_spectral_clustering to remove the unused y_true or use it for
validation, and add concrete assertions: check sc.labels_.shape ==
(X.shape[0],), assert the number of unique labels equals n_clusters
(len(np.unique(sc.labels_)) == 3), and validate numerical correctness by
comparing to sklearn.cluster.SpectralClustering (e.g., via adjusted_rand_score
between sc.labels_ and sklearn_sc.labels_); additionally add small extra
subtests for edge cases (empty array, single sample, high-dimensional data),
test fit/predict/transform consistency on the same input, and run the same
checks with different input types (NumPy, pandas, cuDF) so the test covers
required behaviors.
In `@python/cuml/cuml/accel/_wrappers/sklearn/cluster.py`:
- Line 13: The __all__ export tuple is not sorted which triggers Ruff RUF022;
update the __all__ tuple in this module so the symbol names are alphabetically
ordered (DBSCAN, KMeans, SpectralClustering) to satisfy the linter and keep
style consistent with other modules.
- Around line 49-51: Add the new SpectralClustering wrapper to the
sklearn-compatibility test by importing SpectralClustering from cuml.cluster in
test_sklearn_compatibility.py and adding the SpectralClustering class to the
test estimators list so it is included in the automatic conformance checks
(update the module import block and append/include SpectralClustering in the
estimators list used by the tests).
    def test_spectral_clustering():
        X, y_true = make_blobs(n_samples=100, centers=3, random_state=42)
        X = X.astype("float32")
        sc = SpectralClustering(
            n_clusters=3, affinity="nearest_neighbors", random_state=42
        ).fit(X)
        sc.labels_
Add assertions and remove unused y_true.
Right now the test evaluates sc.labels_ without asserting anything, so it won’t catch incorrect results and triggers Ruff warnings. Consider asserting shape/cluster count and dropping the unused variable.
✅ Suggested fix

     def test_spectral_clustering():
    -    X, y_true = make_blobs(n_samples=100, centers=3, random_state=42)
    +    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
         X = X.astype("float32")
         sc = SpectralClustering(
             n_clusters=3, affinity="nearest_neighbors", random_state=42
         ).fit(X)
    -    sc.labels_
    +    assert sc.labels_.shape == (X.shape[0],)
    +    assert len(set(sc.labels_.tolist())) == 3

🧰 Tools
🪛 Ruff (0.15.1)
[warning] 36-36: Unpacked variable y_true is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 41-41: Found useless expression. Either assign it to a variable or remove it.
(B018)
    @@ -10,7 +10,7 @@
     from cuml.accel.estimator_proxy import ProxyBase
     from cuml.internals.interop import UnsupportedOnGPU

    -__all__ = ("KMeans", "DBSCAN")
    +__all__ = ("KMeans", "DBSCAN", "SpectralClustering")
Sort __all__ to satisfy Ruff RUF022.
This keeps lint clean and consistent with existing style checks.
♻️ Suggested fix

    -__all__ = ("KMeans", "DBSCAN", "SpectralClustering")
    +__all__ = ("DBSCAN", "KMeans", "SpectralClustering")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    __all__ = ("DBSCAN", "KMeans", "SpectralClustering")
🧰 Tools
🪛 Ruff (0.15.1)
[warning] 13-13: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/cuml-accel/limitations.rst`:
- Around line 111-114: Add an "Additional notes" callout under the
SpectralClustering fallback bullet that explicitly states
sklearn.SpectralClustering defaults to affinity='rbf' which is not in the
GPU-supported set {'nearest_neighbors','precomputed'} and therefore will always
fall back to CPU; mention the contrast with sklearn.SpectralEmbedding (which
defaults to affinity='nearest_neighbors' and is GPU-accelerated) so readers
understand the practical impact on out-of-the-box sklearn usage.
- Around line 108-118: Update the SpectralClustering fallback list to include
sparse input as a CPU-fallback condition: add a bullet stating that
SpectralClustering falls back to CPU for sparse input (e.g., "If input is
sparse") alongside the existing checks for assign_labels and affinity; mirror
the wording/placement used by SpectralEmbedding's sparse input fallback to keep
consistency and ensure the new bullet appears in the same list under the
"SpectralClustering" section.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuml/cuml_accel_tests/upstream/scikit-learn/xfail-list.yaml`:
- Around line 1658-1661: The condition specifier for the xfail entry uses an
inconsistent comparator: change the condition field value from
"scikit-learn>1.5" to "scikit-learn>=1.6" so it matches the intended boundary
and the rest of the file's style; update the YAML entry that contains the
reason/tests (the block with reason "cuML SpectralClustering..." and tests list
including "sklearn.cluster.tests.test_spectral::test_affinities[42]") by
replacing the condition string accordingly.
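For reference, the corrected entry would look roughly like this; the field layout is assumed to mirror the rest of xfail-list.yaml, and the elided reason text is not reproduced:

```yaml
- reason: "cuML SpectralClustering..."
  condition: "scikit-learn>=1.6"
  tests:
    - "sklearn.cluster.tests.test_spectral::test_affinities[42]"
```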
viclafargue left a comment
Thanks @aamijar! LGTM for the most part. Please add the estimator to test_sklearn_compatibility.py for automated conformance checks.
    SpectralClustering
    ^^^^^^^^^^^^^^^^^^

    ``SpectralClustering`` will fall back to CPU in the following cases:

    - If ``assign_labels`` is not ``"kmeans"``.
    - If ``affinity`` is not ``"nearest_neighbors"`` or ``"precomputed"``.

    The following fitted attributes are currently not computed:

    - ``affinity_matrix_``
I agree with the bot comments here, if the affinity parameter is left to its default, it will not be GPU accelerated and it would be nice to document this. Also, we should document that the estimator will fall back to CPU if X is sparse (as documented for SpectralEmbedding).
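The sparse-input case is easy to demonstrate with plain scikit-learn; the dataset sizes below are illustrative, and the CPU-fallback behavior under cuml.accel is a claim from this review, not something the snippet itself exercises:

```python
import scipy.sparse as sp
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=2, random_state=0)
X_sparse = sp.csr_matrix(X)

# Plain sklearn accepts sparse input here; under cuml.accel this
# input reportedly triggers CPU fallback, which is the case the
# limitations docs should list explicitly.
labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0
).fit_predict(X_sparse)
assert labels.shape == (60,)
```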
python/cuml/cuml_accel_tests/integration/test_spectral_clustering.py (outdated, resolved)
    sklearn_model = original.as_sklearn()
    roundtrip_model = SpectralClustering.from_sklearn(sklearn_model)
    assert array_equal(original.labels_, roundtrip_model.labels_)
Why not use assert_estimator_roundtrip? It should offer more robust testing.
Co-authored-by: Victor Lafargue <viclafargue@nvidia.com>
Thanks for the review @viclafargue! I have addressed all your comments now.
        random_state=42,
    ).fit(X)
    y_pred = sc.labels_
    if n_clusters == 3:
Shouldn't we generate a new blob dataset for each number of clusters?
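The reviewer's suggestion can be sketched as follows: generate a fresh blob dataset whose true center count matches each tested `n_clusters`, so the adjusted Rand score is meaningful for every parametrization. This is plain scikit-learn; the sample counts and `cluster_std` are illustrative:

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

scores = {}
for n_clusters in (2, 3, 4):
    # Fresh dataset whose true center count matches n_clusters, so
    # the ARI comparison against y_true is well-defined.
    X, y_true = make_blobs(
        n_samples=150, centers=n_clusters, cluster_std=0.5, random_state=42
    )
    sc = SpectralClustering(
        n_clusters=n_clusters, affinity="nearest_neighbors", random_state=42
    ).fit(X)
    scores[n_clusters] = adjusted_rand_score(y_true, sc.labels_)
```

In a pytest suite this loop would typically become a `pytest.mark.parametrize` over `n_clusters`.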
/merge |
f84a29a into rapidsai:release/26.04
Resolves #7293