Skip to content

retain global pointers to previous default rmm memory resources #995

Merged
eordentlich merged 2 commits intoNVIDIA:release/25.10from
eordentlich:eo_fix_segv
Nov 13, 2025
Merged

retain global pointers to previous default rmm memory resources #995
eordentlich merged 2 commits intoNVIDIA:release/25.10from
eordentlich:eo_fix_segv

Conversation

@eordentlich
Copy link
Collaborator

This is needed to avoid race condition segfaults with SAM when SAM headroom is reduced
from its initial larger value during data loading to a smaller value during computations.

…changing to avoid race condition segfaults with SAM

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
@eordentlich
Copy link
Collaborator Author

build

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 12, 2025

Greptile Overview

Greptile Summary

This PR addresses race condition segfaults when SAM (System Allocated Memory) headroom is reduced during execution. The fix introduces global tracking of previous RMM memory resources and the last SAM headroom value.

Key Changes:

  • Added _old_memory_resources list to retain references to previous memory resources, preventing segfaults when C++ deallocations attempt to use old resources
  • Added _last_sam_headroom_size tracking to avoid unnecessarily recreating SAM resources when headroom hasn't changed
  • Removed force_sam_headroom parameter from _configure_memory_resource and all call sites (6 files)
  • Memory resources are now only recreated when the headroom value actually changes or the resource type differs

How it works:
The solution keeps old memory resources alive in a global list so that any pending C++ allocations can safely invoke their deallocate methods. When the SAM headroom changes (e.g., from a larger value during data loading to a smaller value during computation), the old resource is appended to the list before creating a new one. This prevents the race condition that caused segfaults.

Confidence Score: 4/5

  • This PR is safe to merge with minor style improvements suggested
  • The core logic correctly addresses the race condition segfault issue by retaining old memory resources. The headroom tracking prevents unnecessary resource recreation. However, there are style improvements already noted in previous comments: inefficient type comparison creating temporary objects, and potential for better tracking of headroom changes to avoid accumulating resources when force_sam_headroom was previously used
  • python/src/spark_rapids_ml/utils.py - review the style suggestions from previous comments about type comparison efficiency and headroom tracking

Important Files Changed

File Analysis

Filename Score Overview
python/src/spark_rapids_ml/utils.py 3/5 Added global tracking of old memory resources and _last_sam_headroom_size to prevent segfaults when SAM headroom changes; removed force_sam_headroom parameter
python/src/spark_rapids_ml/classification.py 5/5 Removed force_sam_headroom=True parameter from _configure_memory_resource call in LogisticRegression
python/src/spark_rapids_ml/clustering.py 5/5 Removed force_sam_headroom=True parameter from _configure_memory_resource calls in KMeans and DBSCANModel
python/src/spark_rapids_ml/knn.py 5/5 Removed force_sam_headroom=True parameter from _configure_memory_resource call in NearestNeighborsModel
python/src/spark_rapids_ml/tree.py 5/5 Removed force_sam_headroom=True parameter from _configure_memory_resource call in RandomForestEstimator
python/src/spark_rapids_ml/umap.py 5/5 Removed force_sam_headroom=True parameter from two _configure_memory_resource calls in UMAP fit methods

Sequence Diagram

sequenceDiagram
    participant App as ML Algorithm
    participant Config as _configure_memory_resource
    participant Global as Global State (_last_sam_headroom_size, _old_memory_resources)
    participant RMM as RMM Memory Resource

    Note over App,RMM: Initial call with larger headroom (e.g., during data loading)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=large_value)
    Config->>Global: Check _last_sam_headroom_size (None initially)
    Config->>RMM: get_current_device_resource()
    RMM-->>Config: current_mr
    Config->>Global: _old_memory_resources.append(current_mr)
    Config->>Global: _last_sam_headroom_size = large_value
    Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(large_value))
    
    Note over App,RMM: Later call with smaller headroom (e.g., during computation)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
    Config->>Global: Check _last_sam_headroom_size (large_value)
    Config->>Config: Condition: small_value != large_value (TRUE)
    Config->>RMM: get_current_device_resource()
    RMM-->>Config: current_mr (SamHeadroomMemoryResource with large_value)
    Config->>Global: _old_memory_resources.append(current_mr)
    Note over Global: Old resource retained to prevent segfaults on C++ deallocations
    Config->>Global: _last_sam_headroom_size = small_value
    Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(small_value))
    
    Note over App,RMM: Subsequent call with same headroom (no change)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
    Config->>Global: Check _last_sam_headroom_size (small_value)
    Config->>Config: Condition: small_value != small_value (FALSE) - skip resource creation
    Config-->>App: Return (no changes)
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 3 comments

Edit Code Review Agent Settings | Greptile

# invoke the corresponding deallocate methods. They will get cleaned up only when
# the process exits. This avoids a segfault in the case of creating a new
# SAM resource with a smaller headroom.
_old_memory_resources = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: potential thread safety issue if _configure_memory_resource is called concurrently from multiple threads. list.append() is atomic in CPython due to GIL, but consider using threading.Lock if multi-threaded scenarios are possible

Suggested change
_old_memory_resources = []
import threading
_old_memory_resources: List[Any] = []
_memory_resource_lock = threading.Lock()

Comment on lines 208 to 210
if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: creates a new SamHeadroomMemoryResource object just for type comparison on every call, which is wasteful. Consider comparing against the class directly

Suggested change
if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
):
if force_sam_headroom or not isinstance(rmm.mr.get_current_device_resource(),
rmm.mr.SamHeadroomMemoryResource
):

Comment on lines 207 to 213
elif sam_enabled and sam_headroom is not None:
if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
):
_old_memory_resources.append(rmm.mr.get_current_device_resource())
mr = rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
rmm.mr.set_current_device_resource(mr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style: when force_sam_headroom=True, this will append the current resource and create a new one even if the headroom value hasn't changed, potentially accumulating unnecessary resources in _old_memory_resources. Consider tracking the current headroom value and only recreating if it actually changed


# When changing default rmm memory resources we retain the old ones
# in this global array singleton to so that any (C++) allocations using them can
# invoke the corresponding deallocate methods. They will get cleaned up only when
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that cause the memory already allocated within the default RMM mem resources not to be freed?

Copy link
Collaborator

@wbo4958 wbo4958 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor question. If that's not a problem, please feel free to merge it.

wbo4958
wbo4958 previously approved these changes Nov 12, 2025
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@wbo4958 wbo4958 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@eordentlich
Copy link
Collaborator Author

build

@eordentlich eordentlich merged commit 904d298 into NVIDIA:release/25.10 Nov 13, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants