[auto-merge] release/25.10 to main [skip ci] [bot] by nvauto · Pull Request #996 · NVIDIA/spark-rapids-ml

nvauto · 2025-11-13T01:06:22Z

auto-merge triggered by github actions on release/25.10 to create a PR keeping main up-to-date. If this PR is unable to be merged due to conflicts, it will remain open until manually fix.

Signed-off-by: nvauto <70000568+nvauto@users.noreply.github.com>

This is needed to avoid race condition segfaults with SAM when SAM headroom is reduced from its initial larger value during data loading to a smaller value during computations. --------- Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

nvauto · 2025-11-13T01:06:25Z

SUCCESS - auto-merge

greptile-apps · 2025-11-13T01:08:51Z

Greptile Overview

Greptile Summary

This auto-merge PR from release/25.10 to main includes a critical bug fix for memory management with RMM (RAPIDS Memory Manager). The changes address race condition segfaults that occurred when SAM (System Allocated Memory) headroom was dynamically reduced from a larger value during data loading to a smaller value during computation.

Key Changes:

Introduced global tracking of previous RMM memory resources via _old_memory_resources list to prevent premature deallocation
Added _last_sam_headroom_size to track the current SAM headroom and avoid unnecessary resource recreation
Removed the force_sam_headroom parameter from _configure_memory_resource() - now automatically detects when headroom value changes
Updated all call sites across classification, clustering, KNN, tree, and UMAP modules to remove the deprecated parameter

The fix ensures that old memory resources are retained for the lifetime of the process, allowing C++ allocations using them to safely invoke deallocation methods, preventing segfaults.

Confidence Score: 5/5

This PR is safe to merge - it fixes a critical memory management bug with a well-designed solution
The changes are well-structured with clear purpose: fixing race condition segfaults. The implementation correctly tracks memory resources, maintains backward compatibility by only changing internal behavior, and the global state management is appropriate for this use case. All call sites were consistently updated.
No files require special attention - all changes are straightforward parameter removals and internal implementation improvements

Important Files Changed

File Analysis

Filename	Score	Overview
python/src/spark_rapids_ml/utils.py	5/5	Added global tracking for old memory resources and SAM headroom size to prevent segfaults when headroom changes; removed `force_sam_headroom` parameter
python/src/spark_rapids_ml/classification.py	5/5	Removed `force_sam_headroom=True` argument from `_configure_memory_resource` call
python/src/spark_rapids_ml/clustering.py	5/5	Removed `force_sam_headroom=True` argument from two `_configure_memory_resource` calls in KMeans and DBSCAN
python/src/spark_rapids_ml/knn.py	5/5	Removed `force_sam_headroom=True` argument from `_configure_memory_resource` call
python/src/spark_rapids_ml/tree.py	5/5	Removed `force_sam_headroom=True` argument from `_configure_memory_resource` call in RandomForest estimator
python/src/spark_rapids_ml/umap.py	5/5	Removed `force_sam_headroom=True` argument from two `_configure_memory_resource` calls in supervised and unsupervised UMAP fit methods

Sequence Diagram

sequenceDiagram
    participant Estimator as ML Estimator (KMeans/UMAP/etc)
    participant ConfigMem as _configure_memory_resource()
    participant GlobalState as Global State (_old_memory_resources, _last_sam_headroom_size)
    participant RMM as RMM Memory Resource

    Note over Estimator: Data Loading Phase
    Estimator->>ConfigMem: Call with larger SAM headroom (2x)
    ConfigMem->>RMM: get_current_device_resource()
    RMM-->>ConfigMem: Current resource
    ConfigMem->>GlobalState: Append old resource to _old_memory_resources
    ConfigMem->>GlobalState: Update _last_sam_headroom_size
    ConfigMem->>RMM: set_current_device_resource(new SAM with larger headroom)
    
    Note over Estimator: Computation Phase
    Estimator->>ConfigMem: Call with smaller SAM headroom (1x)
    ConfigMem->>GlobalState: Check if sam_headroom != _last_sam_headroom_size
    Note over ConfigMem: Headroom changed, need new resource
    ConfigMem->>RMM: get_current_device_resource()
    RMM-->>ConfigMem: Current resource
    ConfigMem->>GlobalState: Append old resource to _old_memory_resources
    ConfigMem->>GlobalState: Update _last_sam_headroom_size
    ConfigMem->>RMM: set_current_device_resource(new SAM with smaller headroom)
    
    Note over GlobalState: Old resources retained for deallocation safety
    Note over GlobalState: Prevents segfaults from C++ deallocations

greptile-apps

_{6 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

nvauto and others added 2 commits November 7, 2025 06:26

Create release branch release/25.10

7e311c6

Signed-off-by: nvauto <70000568+nvauto@users.noreply.github.com>

nvauto merged commit f58767b into main Nov 13, 2025

greptile-apps bot reviewed Nov 13, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[auto-merge] release/25.10 to main [skip ci] [bot]#996

[auto-merge] release/25.10 to main [skip ci] [bot]#996
nvauto merged 2 commits intomainfrom
release/25.10

nvauto commented Nov 13, 2025

Uh oh!

nvauto commented Nov 13, 2025

Uh oh!

greptile-apps bot commented Nov 13, 2025

Uh oh!

greptile-apps bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nvauto commented Nov 13, 2025

Uh oh!

nvauto commented Nov 13, 2025

Uh oh!

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants