retain global pointers to previous default rmm memory resources by eordentlich · Pull Request #995 · NVIDIA/spark-rapids-ml

eordentlich · 2025-11-12T05:52:38Z

This is needed to avoid race condition segfaults with SAM when SAM headroom is reduced
from its initial larger value during data loading to a smaller value during computations.

…changing to avoid race condition segfaults with SAM Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

eordentlich · 2025-11-12T05:53:05Z

build

greptile-apps · 2025-11-12T05:56:24Z

Greptile Overview

Greptile Summary

This PR addresses race condition segfaults when SAM (System Allocated Memory) headroom is reduced during execution. The fix introduces global tracking of previous RMM memory resources and the last SAM headroom value.

Key Changes:

- Added _old_memory_resources list to retain references to previous memory resources, preventing segfaults when C++ deallocations attempt to use old resources
- Added _last_sam_headroom_size tracking to avoid unnecessarily recreating SAM resources when headroom hasn't changed
- Removed force_sam_headroom parameter from _configure_memory_resource and all call sites (6 files)
- Memory resources are now only recreated when the headroom value actually changes or the resource type differs

How it works:
The solution keeps old memory resources alive in a global list so that any pending C++ allocations can safely invoke their deallocate methods. When the SAM headroom changes (e.g., from a larger value during data loading to a smaller value during computation), the old resource is appended to the list before creating a new one. This prevents the race condition that caused segfaults.

Confidence Score: 4/5

This PR is safe to merge, with minor style improvements suggested.
The core logic correctly addresses the race condition segfault by retaining old memory resources, and the headroom tracking prevents unnecessary resource recreation. Two style issues were already noted in previous comments: an inefficient type comparison that creates temporary objects, and the potential to track headroom changes better so that resources do not accumulate when force_sam_headroom was previously used.
python/src/spark_rapids_ml/utils.py: review the style suggestions from previous comments about type comparison efficiency and headroom tracking.

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| python/src/spark_rapids_ml/utils.py | 3/5 | Added global tracking of old memory resources and `_last_sam_headroom_size` to prevent segfaults when SAM headroom changes; removed `force_sam_headroom` parameter |
| python/src/spark_rapids_ml/classification.py | 5/5 | Removed `force_sam_headroom=True` parameter from `_configure_memory_resource` call in LogisticRegression |
| python/src/spark_rapids_ml/clustering.py | 5/5 | Removed `force_sam_headroom=True` parameter from `_configure_memory_resource` calls in KMeans and DBSCANModel |
| python/src/spark_rapids_ml/knn.py | 5/5 | Removed `force_sam_headroom=True` parameter from `_configure_memory_resource` call in NearestNeighborsModel |
| python/src/spark_rapids_ml/tree.py | 5/5 | Removed `force_sam_headroom=True` parameter from `_configure_memory_resource` call in RandomForestEstimator |
| python/src/spark_rapids_ml/umap.py | 5/5 | Removed `force_sam_headroom=True` parameter from two `_configure_memory_resource` calls in UMAP fit methods |

Sequence Diagram

sequenceDiagram
    participant App as ML Algorithm
    participant Config as _configure_memory_resource
    participant Global as Global State (_last_sam_headroom_size, _old_memory_resources)
    participant RMM as RMM Memory Resource

    Note over App,RMM: Initial call with larger headroom (e.g., during data loading)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=large_value)
    Config->>Global: Check _last_sam_headroom_size (None initially)
    Config->>RMM: get_current_device_resource()
    RMM-->>Config: current_mr
    Config->>Global: _old_memory_resources.append(current_mr)
    Config->>Global: _last_sam_headroom_size = large_value
    Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(large_value))
    
    Note over App,RMM: Later call with smaller headroom (e.g., during computation)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
    Config->>Global: Check _last_sam_headroom_size (large_value)
    Config->>Config: Condition: small_value != large_value (TRUE)
    Config->>RMM: get_current_device_resource()
    RMM-->>Config: current_mr (SamHeadroomMemoryResource with large_value)
    Config->>Global: _old_memory_resources.append(current_mr)
    Note over Global: Old resource retained to prevent segfaults on C++ deallocations
    Config->>Global: _last_sam_headroom_size = small_value
    Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(small_value))
    
    Note over App,RMM: Subsequent call with same headroom (no change)
    App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
    Config->>Global: Check _last_sam_headroom_size (small_value)
    Config->>Config: Condition: small_value != small_value (FALSE) - skip resource creation
    Config-->>App: Return (no changes)

greptile-apps

1 file reviewed, 3 comments


greptile-apps · 2025-11-12T05:56:23Z

python/src/spark_rapids_ml/utils.py

+# invoke the corresponding deallocate methods.  They will get cleaned up only when
+# the process exits.  This avoids a segfault in the case of creating a new
+# SAM resource with a smaller headroom.
+_old_memory_resources = []


style: potential thread-safety issue if _configure_memory_resource is called concurrently from multiple threads. list.append() is atomic in CPython due to the GIL, but consider using threading.Lock if multi-threaded scenarios are possible.

Suggested change

-_old_memory_resources = []
+import threading
+_old_memory_resources: List[Any] = []
+_memory_resource_lock = threading.Lock()

greptile-apps · 2025-11-12T05:56:23Z

python/src/spark_rapids_ml/utils.py

        if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
            rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
        ):


style: this creates a new SamHeadroomMemoryResource object on every call just for a type comparison, which is wasteful. Consider comparing against the class directly.

Suggested change

-        if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
-            rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
-        ):
+        if force_sam_headroom or not isinstance(rmm.mr.get_current_device_resource(),
+            rmm.mr.SamHeadroomMemoryResource
+        ):
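The difference between the two checks can be seen with a small stand-in class. This is an illustrative sketch only: `FakeSamHeadroomResource` and both helper functions are hypothetical, and the instance counter exists purely to make the throwaway construction visible.

```python
class FakeSamHeadroomResource:
    """Hypothetical stand-in that counts how many instances are constructed."""

    instances_created = 0

    def __init__(self, headroom: int) -> None:
        type(self).instances_created += 1
        self.headroom = headroom


def is_sam_by_type(current: object, headroom: int) -> bool:
    # Original pattern: builds a throwaway instance just to read its type.
    return type(current) == type(FakeSamHeadroomResource(headroom=headroom))


def is_sam_by_isinstance(current: object) -> bool:
    # Suggested pattern: compares against the class, so nothing is constructed.
    return isinstance(current, FakeSamHeadroomResource)
```

One behavioral difference to keep in mind: `isinstance` also returns True for subclasses, whereas the `type(...) == type(...)` form matches the exact class only.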

greptile-apps · 2025-11-12T05:56:24Z

python/src/spark_rapids_ml/utils.py

    elif sam_enabled and sam_headroom is not None:
        if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
            rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
        ):
+            _old_memory_resources.append(rmm.mr.get_current_device_resource())
            mr = rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
            rmm.mr.set_current_device_resource(mr)


style: when force_sam_headroom=True, this appends the current resource and creates a new one even if the headroom value hasn't changed, potentially accumulating unnecessary resources in _old_memory_resources. Consider tracking the current headroom value and recreating the resource only if it actually changed.

wbo4958 · 2025-11-12T06:55:50Z

python/src/spark_rapids_ml/utils.py


+# When changing default rmm memory resources we retain the old ones
+# in this global array singleton to so that any (C++) allocations using them can
+# invoke the corresponding deallocate methods.  They will get cleaned up only when


Does that cause the memory already allocated within the default RMM mem resources not to be freed?

wbo4958

A minor question. If that's not a problem, please feel free to merge it.

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

greptile-apps

6 files reviewed, no comments


wbo4958

LGTM.

eordentlich · 2025-11-12T23:19:17Z

build

retain global pointers to previous default rmm memory resources when …

a34bf09

…changing to avoid race condition segfaults with SAM Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

greptile-apps bot reviewed Nov 12, 2025


wbo4958 reviewed Nov 12, 2025


wbo4958 previously approved these changes Nov 12, 2025


keep track of last sam headroom and update resource if changed

33254aa

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>

eordentlich dismissed wbo4958’s stale review via 33254aa November 12, 2025 20:02

greptile-apps bot reviewed Nov 12, 2025


wbo4958 approved these changes Nov 12, 2025


eordentlich merged commit 904d298 into NVIDIA:release/25.10 Nov 13, 2025
4 checks passed

eordentlich merged 2 commits into NVIDIA:release/25.10 from eordentlich:eo_fix_segv


