Skip to content

Use the new reference-based RMM API to set RMM pools in device_resources_snmg#2972

Open
viclafargue wants to merge 2 commits intorapidsai:release/26.04from
viclafargue:refcounted-rmm-pools-for-snmg
Open

Use the new reference-based RMM API to set RMM pools in device_resources_snmg#2972
viclafargue wants to merge 2 commits intorapidsai:release/26.04from
viclafargue:refcounted-rmm-pools-for-snmg

Conversation

@viclafargue
Copy link
Contributor

@viclafargue viclafargue commented Mar 4, 2026

This PR updates the device_resources_snmg utility so that it uses the new reference-based RMM API to set RMM pools.

@viclafargue viclafargue self-assigned this Mar 4, 2026
@viclafargue viclafargue requested a review from a team as a code owner March 4, 2026 15:25
@viclafargue viclafargue added bug Something isn't working non-breaking Non-breaking change labels Mar 4, 2026
@viclafargue viclafargue changed the title Refcounted memory resources for RMM pools in SNMG Refcounted memory resources for RMM pools in device_resources_snmg Mar 4, 2026
Copy link
Contributor

@jinsolp jinsolp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @viclafargue , LGTM

Comment on lines +110 to +113
rmm::mr::get_current_device_resource_ref(),
rmm::percent_of_free_device_memory(percent_of_free_memory)));
rmm::mr::set_per_device_resource_ref(rmm::cuda_device_id{device_id},
*per_device_pools_.back());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like here is a little misunderstanding. "Refcounted" memory resource in this context unfortunately doesn't mean rmm::device_async_resource_ref shares the ownership of the rmm::mr::device_memory_resource. Here's the rmm::mr::set_per_device_resource_ref documention confirming device_resources_snmg still needs to outlive all its users. Therefore, in the current state, this PR doesn't resolve #2922 .

What happens with rmm-over-cccl overhaul is the introduction of the new concept/type for memory resources, which among other things ensures all resources are copyable (and can maintain shared refcounted state). So we will be able to store them by value inside allocated containers. For this to work, we'll need to pass the resources by value and not by reference for all allocations. But looking at the current rmm code, I don't see a way to pass an owning resource to the global device-resource map at the moment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand, the changeset is welcome in that it changes a part of raft to use a future-compatible reference-based rmm api instead of the pointer api that is being deprecated. This will help in future refactorings. Let's just adjust the PR title and description and I think it is good to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, thanks for spotting this. I will update the title and description of the PR.

@viclafargue viclafargue changed the title Refcounted memory resources for RMM pools in device_resources_snmg Use the new reference-based RMM API to set RMM pools in device_resources_snmg Mar 10, 2026
@viclafargue viclafargue changed the base branch from main to release/26.04 March 13, 2026 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Non-breaking change

Projects

Development

Successfully merging this pull request may close these issues.

3 participants