[WIP] Clean up locking in Rmm.class and SparkRapidsJni #3860

Draft

abellina wants to merge 3 commits into NVIDIA:main from abellina:clean_up_rmmspark_locking
Conversation

@abellina
Collaborator

@abellina abellina commented Oct 20, 2025

There is definitely lock contention in the spark resource adaptor and RmmSpark, and as we add more and more retry logic to spark-rapids these slowdowns are becoming more evident.

This is an attempt to clean up the locking so that we lock only where it actually makes sense and is necessary.

I have only done RmmSpark so far and will continue; hence this is a draft.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Copilot AI review requested due to automatic review settings October 20, 2025 15:53
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors locking mechanisms in RmmSpark to reduce lock contention. The primary change introduces a helper method getSra() that centralizes synchronized access to the SparkResourceAdaptor instance, then refactors all methods that previously held the Rmm.class lock throughout their execution to instead obtain a local reference via getSra() and release the lock early.

Key changes:

  • Added getSra() helper method to obtain SparkResourceAdaptor reference under lock
  • Refactored 30+ methods to use early lock release pattern instead of holding lock for entire method execution
  • Removed redundant closing braces from methods with incorrect brace nesting
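To make the early-lock-release pattern concrete, here is a hypothetical, simplified sketch. The class and method names mirror the PR description, but everything below is a standalone model, not the actual spark-rapids-jni code: the real `Rmm` and `SparkResourceAdaptor` are JNI-backed classes that are only stubbed here.

```java
// Hypothetical sketch of the refactoring described above; names follow the
// PR but the bodies are stand-ins, not the real spark-rapids-jni code.
public class RmmSparkSketch {
  // stand-in for the Rmm class whose class object serves as the lock
  static final class Rmm {}

  // stand-in for the native-backed SparkResourceAdaptor
  static final class SparkResourceAdaptor {
    void blockThreadUntilReady() { /* JNI call in the real code */ }
  }

  private static SparkResourceAdaptor sra = new SparkResourceAdaptor();

  // Hold Rmm.class only long enough to read the current reference.
  private static SparkResourceAdaptor getSra() {
    synchronized (Rmm.class) {
      return sra;
    }
  }

  // Before the PR, the entire body of methods like this ran under
  // synchronized (Rmm.class); now the lock is released before the call.
  // (Returns a boolean here purely so the sketch is easy to exercise;
  // the real method is void.)
  public static boolean blockThreadUntilReady() {
    SparkResourceAdaptor local = getSra();
    if (local == null) {
      return false; // adaptor already torn down
    }
    local.blockThreadUntilReady(); // Rmm.class is NOT held during this call
    return true;
  }

  public static void main(String[] args) {
    System.out.println(blockThreadUntilReady());
  }
}
```

The point of the pattern is that a long-running native call no longer serializes every other caller behind `Rmm.class`; only the reference read is synchronized.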


@abellina abellina marked this pull request as draft October 20, 2025 15:54
abellina and others added 2 commits October 20, 2025 08:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
// helper method to get the SparkResourceAdaptor, keeping consistency
// with the static Rmm class lock
private static SparkResourceAdaptor getSra() {
  synchronized (Rmm.class) {
    return sra;
  }
}
Collaborator


looks like we're still using synchronized here, so why extract a method called getSra()?

Collaborator Author


if you look at the diff, we stop holding the Rmm.class lock for the actual calls into sra functions; that's the key of the patch.

We have to lock to get sra. During shutdown, we could be in the middle of setting it, and the reference we get may or may not be valid. I don't see any reason not to have this tiny lock just to get the current state of sra, as contention should be minimal (nothing other than setup and teardown of the whole executor holds the lock while doing anything major).
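The shutdown interaction described here can be sketched as follows. This is a hypothetical model under the assumption that teardown clears the adaptor reference under the same `Rmm.class` lock; the method names (`shutdown`, `removeCurrentThreadAssociation`) are illustrative, not the actual spark-rapids-jni API.

```java
// Hypothetical sketch: getter and teardown share one tiny critical section,
// so callers always observe either a valid adaptor or null, never a torn one.
public class ShutdownSketch {
  static final class Rmm {}
  static final class SparkResourceAdaptor {
    void removeThreadAssociation() { /* JNI call in the real code */ }
  }

  private static SparkResourceAdaptor sra = new SparkResourceAdaptor();

  static SparkResourceAdaptor getSra() {
    synchronized (Rmm.class) {  // tiny lock: only reads the reference
      return sra;
    }
  }

  public static void shutdown() {
    synchronized (Rmm.class) {  // teardown clears under the same lock
      sra = null;
    }
  }

  // Callers null-check the local copy instead of holding the lock.
  public static boolean removeCurrentThreadAssociation() {
    SparkResourceAdaptor local = getSra();
    if (local == null) {
      return false;             // already shut down; nothing to do
    }
    local.removeThreadAssociation();
    return true;
  }
}
```

Note the residual caveat (which the later comment about read/write locks picks up): the local reference can still outlive a concurrent shutdown, so the call into it must tolerate that.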

Collaborator


got it, thanks!

@abellina
Collaborator Author

The rest of the change is coming. After talking to @revans2, the locking I did here may need to be a read/write lock instead of no lock while calling some of the functions, because Rmm itself can potentially disappear on us during tests. So this is a bit of work, but necessary if we want to test performance improvements. I am currently refactoring the thread state class to try to encapsulate its state, which should allow for finer-grained locking.
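The read/write-lock direction mentioned here could look roughly like the sketch below: callers hold the read lock for the duration of a call into the adaptor, while setup/teardown takes the write lock, so Rmm cannot disappear mid-call. This is a speculative illustration, assuming `java.util.concurrent.locks.ReentrantReadWriteLock`; `getThreadCount` is a made-up method name, not the real API.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of the read/write-lock alternative: concurrent
// readers, exclusive teardown that waits out in-flight calls.
public class RwLockSketch {
  static final class SparkResourceAdaptor {
    int threadCount() { return 1; }  // made-up stand-in for a JNI call
  }

  private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private static SparkResourceAdaptor sra = new SparkResourceAdaptor();

  public static int getThreadCount() {
    lock.readLock().lock();          // many callers may proceed concurrently
    try {
      return sra == null ? 0 : sra.threadCount();  // sra is pinned while held
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void shutdown() {
    lock.writeLock().lock();         // exclusive: blocks until readers drain
    try {
      sra = null;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

Compared with the tiny `synchronized` getter, this trades a little per-call overhead for the guarantee that teardown cannot race with a call already in flight.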

@nvauto
Collaborator

nvauto commented Nov 17, 2025

NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.

@nvauto
Collaborator

nvauto commented Jan 19, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.


4 participants