[WIP] Clean up locking in Rmm.class and SparkRapidsJni #3860
abellina wants to merge 3 commits into NVIDIA:main from
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
Pull Request Overview
This PR refactors locking mechanisms in RmmSpark to reduce lock contention. The primary change introduces a helper method getSra() that centralizes synchronized access to the SparkResourceAdaptor instance, then refactors all methods that previously held the Rmm.class lock throughout their execution to instead obtain a local reference via getSra() and release the lock early.
Key changes:
- Added getSra() helper method to obtain SparkResourceAdaptor reference under lock
- Refactored 30+ methods to use early lock release pattern instead of holding lock for entire method execution
- Removed redundant closing braces from methods with incorrect brace nesting
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
// helper method to get the SparkResourceAdaptor, keeping consistency
// with the static Rmm class lock
private static SparkResourceAdaptor getSra() {
  synchronized (Rmm.class) {
Looks like we're still using synchronized here, so why extract a method called getSra()?
If you look at the diff, we stop holding the Rmm.class lock for the actual calls into sra functions; that's the key of the patch.
We have to lock to get sra. During shutdown, we could be in the middle of setting it, and the reference we get may or may not be valid. I don't see any reason not to have this tiny lock just to get the current state of sra, as contention should be minimal (nothing other than setup and teardown of the whole executor holds the lock while doing anything major).
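To make the pattern described above concrete, here is a minimal sketch of reading the reference under a short lock and then calling into it without the lock. This is not the code from the patch: `EarlyLockReleaseSketch`, `FakeAdaptor`, `LOCK`, `install`, `shutdown`, and `callAdaptor` are all hypothetical stand-ins, and a plain `Object` monitor plays the role of `Rmm.class`.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class EarlyLockReleaseSketch {
  // Hypothetical stand-in for SparkResourceAdaptor; not the real class.
  public static final class FakeAdaptor {
    private final AtomicInteger calls = new AtomicInteger();

    // Pretend this is an expensive native call; returns how often it ran.
    public int doWork() {
      return calls.incrementAndGet();
    }
  }

  // Hypothetical stand-in for the Rmm.class monitor and the shared reference.
  private static final Object LOCK = new Object();
  private static FakeAdaptor sra = null;

  // Setup and teardown are the only writers of the reference.
  public static void install(FakeAdaptor adaptor) {
    synchronized (LOCK) {
      sra = adaptor;
    }
  }

  public static void shutdown() {
    synchronized (LOCK) {
      sra = null;
    }
  }

  // Hold the lock only long enough to read the current reference.
  private static FakeAdaptor getSra() {
    synchronized (LOCK) {
      return sra;
    }
  }

  // Returns -1 when no adaptor is installed, otherwise the adaptor's result.
  public static int callAdaptor() {
    FakeAdaptor local = getSra();
    if (local == null) {
      return -1;
    }
    // The lock is no longer held here, so other threads are not blocked.
    // Caveat: shutdown() may run concurrently and leave `local` stale.
    return local.doWork();
  }

  public static void main(String[] args) {
    install(new FakeAdaptor());
    System.out.println(callAdaptor());
    shutdown();
    System.out.println(callAdaptor());
  }
}
```

The trade-off in this sketch is that the local reference can outlive a concurrent shutdown, so it only fits cases where a stale reference is harmless or is guarded some other way.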
The rest of the change is coming. After talking to @revans2, the locking I did here may need to be a read/write lock instead of no lock while calling some of the functions. This is because Rmm itself can potentially disappear on us during tests. So this is a bit of work, but necessary if we want to test performance improvements. I am currently refactoring the thread state class to try to encapsulate its state, which should allow for finer-grained locking.
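For illustration only, one way the read/write lock idea could look is sketched below using `java.util.concurrent.locks.ReentrantReadWriteLock`. It assumes callers take the read lock for the whole call so that teardown, which takes the write lock, cannot remove the adaptor mid-call. `ReadWriteLockSketch`, `FakeAdaptor`, and the method names are hypothetical and are not taken from the PR.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockSketch {
  // Hypothetical stand-in for SparkResourceAdaptor; not the real class.
  public static final class FakeAdaptor {
    private boolean closed = false;

    public int doWork() {
      if (closed) {
        throw new IllegalStateException("adaptor used after close");
      }
      return 42;
    }

    void close() {
      closed = true;
    }
  }

  private static final ReentrantReadWriteLock RW = new ReentrantReadWriteLock();
  private static FakeAdaptor sra = null;

  // Setup takes the write lock: it excludes all readers and other writers.
  public static void install(FakeAdaptor adaptor) {
    RW.writeLock().lock();
    try {
      sra = adaptor;
    } finally {
      RW.writeLock().unlock();
    }
  }

  // Teardown waits for in-flight readers, then closes and clears the adaptor.
  public static void shutdown() {
    RW.writeLock().lock();
    try {
      if (sra != null) {
        sra.close();
        sra = null;
      }
    } finally {
      RW.writeLock().unlock();
    }
  }

  // Many threads may hold the read lock at once, so calls do not serialize
  // against each other, only against install() and shutdown().
  public static int callAdaptor() {
    RW.readLock().lock();
    try {
      return sra == null ? -1 : sra.doWork();
    } finally {
      RW.readLock().unlock();
    }
  }

  public static void main(String[] args) {
    install(new FakeAdaptor());
    System.out.println(callAdaptor());
    shutdown();
    System.out.println(callAdaptor());
  }
}
```

Compared with calling the adaptor under no lock at all, this keeps the adaptor alive for the duration of each call, at the cost of readers briefly blocking while setup or teardown holds the write lock.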
NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.
NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
There is definitely lock contention in the Spark resource adaptor and RmmSpark, and as we add more and more retry logic to spark-rapids these slowdowns are becoming more evident.
This is an attempt to clean up the locking so that we lock only when it makes sense and is necessary.
I have only done RmmSpark so far and will continue with the rest, hence this is a draft.