retain global pointers to previous default rmm memory resources #995
eordentlich merged 2 commits into NVIDIA:release/25.10 from
…changing to avoid race condition segfaults with SAM Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
build
Greptile Overview
Greptile Summary
This PR addresses race condition segfaults when SAM (System Allocated Memory) headroom is reduced during execution. The fix introduces global tracking of previous RMM memory resources and the last SAM headroom value.
Key Changes:
How it works:
Confidence Score: 4/5
Important Files Changed
File Analysis
Sequence Diagram
sequenceDiagram
participant App as ML Algorithm
participant Config as _configure_memory_resource
participant Global as Global State (_last_sam_headroom_size, _old_memory_resources)
participant RMM as RMM Memory Resource
Note over App,RMM: Initial call with larger headroom (e.g., during data loading)
App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=large_value)
Config->>Global: Check _last_sam_headroom_size (None initially)
Config->>RMM: get_current_device_resource()
RMM-->>Config: current_mr
Config->>Global: _old_memory_resources.append(current_mr)
Config->>Global: _last_sam_headroom_size = large_value
Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(large_value))
Note over App,RMM: Later call with smaller headroom (e.g., during computation)
App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
Config->>Global: Check _last_sam_headroom_size (large_value)
Config->>Config: Condition: small_value != large_value (TRUE)
Config->>RMM: get_current_device_resource()
RMM-->>Config: current_mr (SamHeadroomMemoryResource with large_value)
Config->>Global: _old_memory_resources.append(current_mr)
Note over Global: Old resource retained to prevent segfaults on C++ deallocations
Config->>Global: _last_sam_headroom_size = small_value
Config->>RMM: set_current_device_resource(SamHeadroomMemoryResource(small_value))
Note over App,RMM: Subsequent call with same headroom (no change)
App->>Config: _configure_memory_resource(sam_enabled=True, sam_headroom=small_value)
Config->>Global: Check _last_sam_headroom_size (small_value)
Config->>Config: Condition: small_value != small_value (FALSE) - skip resource creation
Config-->>App: Return (no changes)
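The flow in the diagram can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code merged in this PR: `FakeHeadroomResource`, `configure_headroom`, `current_resource`, and the module-level `_current` variable are hypothetical stand-ins for `rmm.mr.SamHeadroomMemoryResource` and the RMM getter/setter, so the snippet runs without RMM or a GPU.

```python
from typing import Any, List, Optional


class FakeHeadroomResource:
    """Hypothetical stand-in for rmm.mr.SamHeadroomMemoryResource."""

    def __init__(self, headroom: int) -> None:
        self.headroom = headroom


# Global state mirroring the names used in the diagram.
_last_sam_headroom_size: Optional[int] = None
_old_memory_resources: List[Any] = []

# Stand-in for RMM's current device resource (assumption, not the RMM API).
_current: Any = object()


def current_resource() -> Any:
    """Return the currently installed (fake) resource."""
    return _current


def configure_headroom(sam_headroom: int) -> bool:
    """Install a new resource only when the headroom value changed.

    Returns True if a new resource was installed, False otherwise.
    """
    global _last_sam_headroom_size, _current
    if sam_headroom == _last_sam_headroom_size:
        return False  # same headroom as the last call: nothing to do
    # Retain the outgoing resource so it is never destroyed while
    # allocations made through it may still be outstanding.
    _old_memory_resources.append(_current)
    _current = FakeHeadroomResource(sam_headroom)
    _last_sam_headroom_size = sam_headroom
    return True
```

Calling `configure_headroom` with a large value, then a smaller value, then the same smaller value again walks through the three cases in the diagram: install, replace and retain, and no-op.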
# invoke the corresponding deallocate methods. They will get cleaned up only when
# the process exits. This avoids a segfault in the case of creating a new
# SAM resource with a smaller headroom.
_old_memory_resources = []
style: potential thread-safety issue if _configure_memory_resource is called concurrently from multiple threads. list.append() is atomic in CPython because of the GIL, but consider using a threading.Lock if multi-threaded scenarios are possible.
- _old_memory_resources = []
+ import threading
+ _old_memory_resources: List[Any] = []
+ _memory_resource_lock = threading.Lock()
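A lock is most useful when it covers the whole check-then-swap sequence rather than only the `append`. The following is a hedged sketch of that idea, not the PR's implementation: `set_headroom`, `retained_count`, `call_from_threads`, and the tuple used as a fake resource are all hypothetical names introduced for illustration.

```python
import threading
from typing import Any, List, Optional

_old_memory_resources: List[Any] = []
_memory_resource_lock = threading.Lock()
_current_resource: Optional[Any] = None
_current_headroom: Optional[int] = None


def set_headroom(headroom: int) -> None:
    """Install a new (fake) resource if the headroom changed, under the lock."""
    global _current_resource, _current_headroom
    with _memory_resource_lock:
        if headroom == _current_headroom:
            return  # another caller already installed this headroom
        if _current_resource is not None:
            _old_memory_resources.append(_current_resource)
        # A tuple stands in for a real memory resource object (assumption).
        _current_resource = ("fake-resource", headroom)
        _current_headroom = headroom


def retained_count() -> int:
    """Number of old resources kept alive so far."""
    with _memory_resource_lock:
        return len(_old_memory_resources)


def call_from_threads(headroom: int, n_threads: int = 16) -> None:
    """Call set_headroom with the same value from several threads at once."""
    threads = [
        threading.Thread(target=set_headroom, args=(headroom,))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the comparison and the swap happen inside one critical section, concurrent callers requesting the same headroom cause exactly one swap, so at most one old resource is retained per actual change.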
python/src/spark_rapids_ml/utils.py (Outdated)
if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
    rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
):
style: this creates a new SamHeadroomMemoryResource object on every call just for the type comparison, which is wasteful. Consider comparing against the class directly.
- if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
-     rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
- ):
+ if force_sam_headroom or not isinstance(rmm.mr.get_current_device_resource(),
+     rmm.mr.SamHeadroomMemoryResource
+ ):
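The difference between the two checks can be shown with a small stand-in class. This is a sketch only: `FakeSamHeadroomResource` and both helper functions are hypothetical, and the construction counter exists purely to make the extra allocation visible.

```python
class FakeSamHeadroomResource:
    """Hypothetical stand-in for rmm.mr.SamHeadroomMemoryResource."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self, headroom: int) -> None:
        type(self).instances_created += 1
        self.headroom = headroom


def is_headroom_resource_wasteful(current: object, headroom: int) -> bool:
    # Builds a temporary resource just to read its type.
    return type(current) == type(FakeSamHeadroomResource(headroom=headroom))


def is_headroom_resource(current: object) -> bool:
    # Compares against the class itself; nothing is constructed.
    return isinstance(current, FakeSamHeadroomResource)
```

One behavioral difference to keep in mind: `isinstance` also returns True for subclasses, whereas the `type(...) == type(...)` form matches the exact class only.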
elif sam_enabled and sam_headroom is not None:
    if force_sam_headroom or not type(rmm.mr.get_current_device_resource()) == type(
        rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
    ):
        _old_memory_resources.append(rmm.mr.get_current_device_resource())
        mr = rmm.mr.SamHeadroomMemoryResource(headroom=sam_headroom)
        rmm.mr.set_current_device_resource(mr)
style: when force_sam_headroom=True, this appends the current resource and creates a new one even if the headroom value hasn't changed, potentially accumulating unnecessary resources in _old_memory_resources. Consider tracking the current headroom value and recreating the resource only if it actually changed.
# When changing default rmm memory resources we retain the old ones
# in this global array singleton so that any (C++) allocations using them can
# invoke the corresponding deallocate methods. They will get cleaned up only when
Does that cause the memory already allocated within the default RMM mem resources not to be freed?
Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
build
This is needed to avoid race condition segfaults with SAM when SAM headroom is reduced
from its initial larger value during data loading to a smaller value during computations.