[WIP] Clean up locking in Rmm.class and SparkRapidsJni by abellina · Pull Request #3860 · NVIDIA/spark-rapids-jni

abellina · 2025-10-20T15:53:47Z

There is definitely lock contention in the Spark resource adaptor and RmmSpark, and as we add more and more retry logic to spark-rapids these slowdowns are becoming more evident.

This is an attempt to clean up the locking so that we lock only when it makes sense and is necessary.

So far I have only done RmmSpark and will continue with the rest, hence this is a draft.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Copilot

Pull Request Overview

This PR refactors locking mechanisms in RmmSpark to reduce lock contention. The primary change introduces a helper method getSra() that centralizes synchronized access to the SparkResourceAdaptor instance, then refactors all methods that previously held the Rmm.class lock throughout their execution to instead obtain a local reference via getSra() and release the lock early.

Key changes:

- Added getSra() helper method to obtain SparkResourceAdaptor reference under lock
- Refactored 30+ methods to use early lock release pattern instead of holding lock for entire method execution
- Removed redundant closing braces from methods with incorrect brace nesting
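The early lock release pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `FakeAdaptor`, `LockSketch`, `LOCK`, and the method names other than `getSra()` are hypothetical stand-ins for `SparkResourceAdaptor`, `RmmSpark`, and the `Rmm.class` monitor, and throwing `IllegalStateException` on a missing adaptor is an assumption made only for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for SparkResourceAdaptor; it only counts calls.
final class FakeAdaptor {
  private final AtomicLong calls = new AtomicLong();

  long doWork() {
    return calls.incrementAndGet();
  }
}

// Hypothetical stand-in for RmmSpark.
final class LockSketch {
  // Stand-in for the Rmm.class monitor.
  private static final Object LOCK = new Object();
  private static FakeAdaptor sra; // guarded by LOCK

  static void setUp() {
    synchronized (LOCK) {
      sra = new FakeAdaptor();
    }
  }

  static void tearDown() {
    synchronized (LOCK) {
      sra = null;
    }
  }

  // Hold the lock only long enough to read the current reference.
  private static FakeAdaptor getSra() {
    synchronized (LOCK) {
      return sra;
    }
  }

  // Before: the monitor is held for the whole call into the adaptor.
  static long callHoldingLock() {
    synchronized (LOCK) {
      if (sra == null) {
        throw new IllegalStateException("adaptor not set up");
      }
      return sra.doWork();
    }
  }

  // After: the call runs on a local snapshot, outside the monitor.
  static long callWithEarlyRelease() {
    FakeAdaptor local = getSra();
    if (local == null) {
      throw new IllegalStateException("adaptor not set up");
    }
    return local.doWork();
  }
}
```

In this sketch the second form lets other threads enter the monitor while `doWork()` runs, at the cost that `local` is only a snapshot and a concurrent `tearDown()` is no longer blocked by in-flight calls.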


src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

binmahone · 2025-10-20T16:39:06Z

src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

+  // helper method to get the SparkResourceAdaptor, keeping consistency
+  // with the static Rmm class lock
+  private static SparkResourceAdaptor getSra() {
+    synchronized (Rmm.class) {


Looks like we're still using synchronized here, so why extract a method called getSra()?

If you look at the diff, we stop holding the Rmm.class lock for the actual calls into sra functions; that's the key of the patch.

We have to lock to get sra. During shutdown, we could be in the middle of setting it, and the reference we get may or may not be valid. I don't see any reason not to have this tiny lock just to get the current state of sra, as contention should be minimal (nothing other than setup and teardown of the whole executor holds the lock while doing anything major).

got it, thanks!

abellina · 2025-10-20T22:12:30Z

The rest of the change is coming. After talking to @revans2, the locking I did here may need to be a read/write lock, rather than no lock, while calling some of the functions. This is because Rmm itself can potentially disappear on us during tests. So this is a bit of work, but it is necessary if we want to test performance improvements. I am currently refactoring the thread state class to try to encapsulate its state, which should allow for finer-grained locking.
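For reference, a read/write lock variant might look something like the sketch below. It is only an illustration under assumptions: `FakeResource` and `RwLockSketch` are hypothetical names, not classes in spark-rapids-jni, and the sketch uses `java.util.concurrent.locks.ReentrantReadWriteLock` so that ordinary calls share the read lock while setup and teardown take the write lock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical resource that must not be used after it is closed.
final class FakeResource {
  private boolean closed = false;

  long use() {
    if (closed) {
      throw new IllegalStateException("used after close");
    }
    return 42L;
  }

  void close() {
    closed = true;
  }
}

// Hypothetical holder; the static field plays the role of the adaptor.
final class RwLockSketch {
  private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();
  private static FakeResource resource; // guarded by LOCK

  // Setup takes the write lock, so it excludes all in-flight calls.
  static void setUp() {
    LOCK.writeLock().lock();
    try {
      resource = new FakeResource();
    } finally {
      LOCK.writeLock().unlock();
    }
  }

  // Teardown also takes the write lock: it waits for readers to finish
  // before closing, so no caller sees the resource disappear mid-call.
  static void tearDown() {
    LOCK.writeLock().lock();
    try {
      if (resource != null) {
        resource.close();
        resource = null;
      }
    } finally {
      LOCK.writeLock().unlock();
    }
  }

  // Ordinary calls share the read lock, so they do not block each other.
  static long call() {
    LOCK.readLock().lock();
    try {
      if (resource == null) {
        throw new IllegalStateException("resource not set up");
      }
      return resource.use();
    } finally {
      LOCK.readLock().unlock();
    }
  }
}
```

Compared with calling on an unlocked snapshot, this keeps concurrent calls from contending with each other while still guaranteeing that teardown cannot run in the middle of a call.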

nvauto · 2025-11-17T01:19:44Z

NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.

nvauto · 2026-01-19T02:12:23Z

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

Ensure the Rmm.class lock is only held when is really needed

497aa91

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

Copilot AI review requested due to automatic review settings October 20, 2025 15:53

Copilot AI reviewed Oct 20, 2025


abellina marked this pull request as draft October 20, 2025 15:54

abellina and others added 2 commits October 20, 2025 08:54

Update src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

dbf9198

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/main/java/com/nvidia/spark/rapids/jni/RmmSpark.java

2b30602

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

binmahone reviewed Oct 20, 2025

abellina wants to merge 3 commits into NVIDIA:main from abellina:clean_up_rmmspark_locking


4 participants
