Skip to content

[auto-merge] release/25.10 to main [skip ci] [bot]#996

Merged
nvauto merged 2 commits intomainfrom
release/25.10
Nov 13, 2025
Merged

[auto-merge] release/25.10 to main [skip ci] [bot]#996
nvauto merged 2 commits intomainfrom
release/25.10

Conversation

@nvauto
Copy link
Collaborator

@nvauto nvauto commented Nov 13, 2025

auto-merge triggered by github actions on release/25.10 to create a PR keeping main up-to-date. If this PR is unable to be merged due to conflicts, it will remain open until manually fix.

nvauto and others added 2 commits November 7, 2025 06:26
Signed-off-by: nvauto <70000568+nvauto@users.noreply.github.com>
This is needed to avoid race condition segfaults with SAM when SAM
headroom is reduced
from its initial larger value during data loading to a smaller value
during computations.

---------

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
@nvauto nvauto merged commit f58767b into main Nov 13, 2025
@nvauto
Copy link
Collaborator Author

nvauto commented Nov 13, 2025

SUCCESS - auto-merge

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This auto-merge PR from release/25.10 to main includes a critical bug fix for memory management with RMM (RAPIDS Memory Manager). The changes address race condition segfaults that occurred when SAM (System Allocated Memory) headroom was dynamically reduced from a larger value during data loading to a smaller value during computation.

Key Changes:

  • Introduced global tracking of previous RMM memory resources via _old_memory_resources list to prevent premature deallocation
  • Added _last_sam_headroom_size to track the current SAM headroom and avoid unnecessary resource recreation
  • Removed the force_sam_headroom parameter from _configure_memory_resource() - now automatically detects when headroom value changes
  • Updated all call sites across classification, clustering, KNN, tree, and UMAP modules to remove the deprecated parameter

The fix ensures that old memory resources are retained for the lifetime of the process, allowing C++ allocations using them to safely invoke deallocation methods, preventing segfaults.

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a critical memory management bug with a well-designed solution
  • The changes are well-structured with clear purpose: fixing race condition segfaults. The implementation correctly tracks memory resources, maintains backward compatibility by only changing internal behavior, and the global state management is appropriate for this use case. All call sites were consistently updated.
  • No files require special attention - all changes are straightforward parameter removals and internal implementation improvements

Important Files Changed

File Analysis

Filename Score Overview
python/src/spark_rapids_ml/utils.py 5/5 Added global tracking for old memory resources and SAM headroom size to prevent segfaults when headroom changes; removed force_sam_headroom parameter
python/src/spark_rapids_ml/classification.py 5/5 Removed force_sam_headroom=True argument from _configure_memory_resource call
python/src/spark_rapids_ml/clustering.py 5/5 Removed force_sam_headroom=True argument from two _configure_memory_resource calls in KMeans and DBSCAN
python/src/spark_rapids_ml/knn.py 5/5 Removed force_sam_headroom=True argument from _configure_memory_resource call
python/src/spark_rapids_ml/tree.py 5/5 Removed force_sam_headroom=True argument from _configure_memory_resource call in RandomForest estimator
python/src/spark_rapids_ml/umap.py 5/5 Removed force_sam_headroom=True argument from two _configure_memory_resource calls in supervised and unsupervised UMAP fit methods

Sequence Diagram

sequenceDiagram
    participant Estimator as ML Estimator (KMeans/UMAP/etc)
    participant ConfigMem as _configure_memory_resource()
    participant GlobalState as Global State (_old_memory_resources, _last_sam_headroom_size)
    participant RMM as RMM Memory Resource

    Note over Estimator: Data Loading Phase
    Estimator->>ConfigMem: Call with larger SAM headroom (2x)
    ConfigMem->>RMM: get_current_device_resource()
    RMM-->>ConfigMem: Current resource
    ConfigMem->>GlobalState: Append old resource to _old_memory_resources
    ConfigMem->>GlobalState: Update _last_sam_headroom_size
    ConfigMem->>RMM: set_current_device_resource(new SAM with larger headroom)
    
    Note over Estimator: Computation Phase
    Estimator->>ConfigMem: Call with smaller SAM headroom (1x)
    ConfigMem->>GlobalState: Check if sam_headroom != _last_sam_headroom_size
    Note over ConfigMem: Headroom changed, need new resource
    ConfigMem->>RMM: get_current_device_resource()
    RMM-->>ConfigMem: Current resource
    ConfigMem->>GlobalState: Append old resource to _old_memory_resources
    ConfigMem->>GlobalState: Update _last_sam_headroom_size
    ConfigMem->>RMM: set_current_device_resource(new SAM with smaller headroom)
    
    Note over GlobalState: Old resources retained for deallocation safety
    Note over GlobalState: Prevents segfaults from C++ deallocations
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants