add cuda 13 patch for cupy memadvise to host memory as used with SAM by eordentlich · Pull Request #1001 · NVIDIA/spark-rapids-ml

eordentlich · 2025-12-23T23:13:15Z

No description provided.

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

greptile-apps · 2025-12-23T23:18:53Z

Greptile Summary

This PR implements a workaround for a CuPy 13.6 bug in the memAdvise API when using CUDA 13. The changes introduce a new helper function _memadvise_cpu in utils.py that detects the CUDA version and uses the appropriate API:

For CUDA < 13.0.0: Uses the original CuPy API (cp.cuda.runtime.memAdvise)
For CUDA >= 13.0.0: Uses low-level cuda-python bindings to work around the CuPy bug

The helper function is then used consistently across clustering.py, core.py, and umap.py to replace all direct calls to cp.cuda.runtime.memAdvise. This ensures that SAM (System Allocated Memory) continues to work correctly by pinning numpy arrays to host memory and preventing unwanted page migration to device during GPU concatenation operations.

Confidence Score: 4/5

This PR is safe to merge with minimal risk - it's a focused bugfix with appropriate version detection
The implementation is solid with proper version checking and fallback logic. The author has already addressed syntax issues through multiple iterations. The only minor concern is the lack of unit tests for the new _memadvise_cpu function, but this is a workaround for a specific CUDA 13 environment issue that may be difficult to test in CI
No files require special attention - the changes are straightforward refactoring to use a centralized helper function

Important Files Changed

Filename	Overview
python/src/spark_rapids_ml/utils.py	Adds `_memadvise_cpu` helper function to work around CuPy 13.x memAdvise API bug using cuda-python bindings
python/src/spark_rapids_ml/clustering.py	Replaces direct `cp.cuda.runtime.memAdvise` call with `_memadvise_cpu` helper function for SAM memory management
python/src/spark_rapids_ml/core.py	Replaces direct `cp.cuda.runtime.memAdvise` calls with `_memadvise_cpu` helper for dense and sparse arrays
python/src/spark_rapids_ml/umap.py	Replaces direct `cp.cuda.runtime.memAdvise` calls with `_memadvise_cpu` helper for dense and sparse arrays

Sequence Diagram

sequenceDiagram
    participant Caller as Algorithm Code<br/>(clustering/core/umap)
    participant Helper as _memadvise_cpu
    participant Version as cuda.bindings.__version__
    participant CuPy as cp.cuda.runtime.memAdvise<br/>(CUDA < 13)
    participant CudaPython as cuda.bindings.runtime<br/>(CUDA >= 13)
    
    Note over Caller: SAM enabled with numpy array
    Caller->>Helper: _memadvise_cpu(data, nbytes)
    Helper->>Version: Check cuda.bindings.__version__
    
    alt CUDA < 13.0.0
        Version-->>Helper: version < 13.0.0
        Helper->>CuPy: memAdvise(data, nbytes, 3, -1)
        Note over CuPy: Uses CuPy API<br/>(original method)
        CuPy-->>Helper: Success
    else CUDA >= 13.0.0
        Version-->>Helper: version >= 13.0.0
        Note over Helper: Workaround for CuPy 13.6 bug
        Helper->>CudaPython: Import cudaMemLocation,<br/>cudaMemLocationType,<br/>cudaMemoryAdvise
        Helper->>Helper: Create cudaMemLocation<br/>with Host type
        Helper->>CudaPython: cudaMemAdvise(data, nbytes,<br/>SetPreferredLocation, mem_location)
        Note over CudaPython: Uses low-level<br/>cuda-python bindings
        CudaPython-->>Helper: Success
    end
    
    Helper-->>Caller: Memory pinned to CPU
    Note over Caller: Prevents page migration<br/>during GPU concatenation

eordentlich · 2025-12-23T23:21:37Z

build

rishic3

LGTM, minor rewording suggestion

rishic3 · 2025-12-24T18:11:18Z

python/src/spark_rapids_ml/utils.py

+def _memadvise_cpu(data: Any, nbytes: int) -> None:
+    """
+    Advise data referenced by pointer to stay in cpu memory.
+    For use with SAM to prevent migration of host memory staged partial matrices to remain on cpu during


Suggested change

For use with SAM to prevent migration of host memory staged partial matrices to remain on cpu during

For use with SAM to prevent partial matrices staged on host from migrating to device during

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

eordentlich · 2025-12-24T18:48:48Z

build

eordentlich added 3 commits December 23, 2025 15:10

add work around for memadvise cupy bug for cuda13

d4683d4

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

syntax corrections

b3ca4bf

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

syntax

0d96a36

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

rishic3 previously approved these changes Dec 24, 2025

View reviewed changes

address comments

7cb564d

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

eordentlich dismissed rishic3’s stale review via 7cb564d December 24, 2025 18:48

rishic3 approved these changes Dec 24, 2025

View reviewed changes

eordentlich merged commit aac8af5 into NVIDIA:main Dec 24, 2025
5 checks passed

eordentlich deleted the eo_cuda13_fixes branch December 24, 2025 20:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add cuda 13 patch for cupy memadvise to host memory as used with SAM #1001

add cuda 13 patch for cupy memadvise to host memory as used with SAM #1001
eordentlich merged 4 commits intoNVIDIA:mainfrom
eordentlich:eo_cuda13_fixes

eordentlich commented Dec 23, 2025

Uh oh!

greptile-apps bot commented Dec 23, 2025 •

edited

Loading

Uh oh!

eordentlich commented Dec 23, 2025

Uh oh!

rishic3 left a comment

Uh oh!

rishic3 Dec 24, 2025

Uh oh!

eordentlich commented Dec 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	For use with SAM to prevent migration of host memory staged partial matrices to remain on cpu during
	For use with SAM to prevent partial matrices staged on host from migrating to device during

Conversation

eordentlich commented Dec 23, 2025

Uh oh!

greptile-apps bot commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

eordentlich commented Dec 23, 2025

Uh oh!

rishic3 left a comment

Choose a reason for hiding this comment

Uh oh!

rishic3 Dec 24, 2025

Choose a reason for hiding this comment

Uh oh!

eordentlich commented Dec 24, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

greptile-apps bot commented Dec 23, 2025 •

edited

Loading