Make full_thread_state a shared pointer #3966

Open
abellina wants to merge 7 commits into NVIDIA:main from abellina:shared_ptr_full_thread_state

Conversation

@abellina
Collaborator

Contributes to #3905

This change simply makes full_thread_state a shared pointer and updates all call sites to use -> instead of . for member access.

We do this so that passing full_thread_state around increments the pointer's reference count; even if the thread disappears, we can still hold a reference to the state object until we exit the function call.

This PR doesn't change behavior, but in the future it enables patterns like: get the thread state under a lock, unlock, and then use that object without fear that another thread will destroy it.
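The lifetime guarantee described above can be sketched in isolation. This is a minimal, hypothetical illustration, not the plugin's actual code: full_thread_state here is a stand-in with made-up fields, and read_state_after_erase is an invented helper, not a function in SparkResourceAdaptorJni.cpp.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for the plugin's per-thread state; the real
// full_thread_state has many more fields.
struct full_thread_state {
  long task_id = 0;
  std::string state_name = "THREAD_RUNNING";
};

// With value semantics, erasing the map entry destroys the state, and any
// reference held across the erase would dangle. With shared_ptr, the local
// copy keeps the object alive until the last owner releases it.
std::string read_state_after_erase(
    std::map<long, std::shared_ptr<full_thread_state>>& threads, long thread_id) {
  auto it = threads.find(thread_id);
  if (it == threads.end()) { return "missing"; }
  std::shared_ptr<full_thread_state> local = it->second;  // ref count goes to 2
  threads.erase(it);       // the map's reference is dropped; count back to 1
  return local->state_name;  // still safe to dereference
}
```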

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina
Collaborator Author

build

@greptile-apps
Contributor

greptile-apps bot commented Nov 18, 2025

Greptile Summary

  • Converted full_thread_state from value type to std::shared_ptr to enable safe sharing across threads and prevent lifetime issues
  • Fixed previously reported use-after-free bugs by adding null checks after wait loops where threads may be removed

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are mechanical (converting value type to shared_ptr), properly implemented throughout the codebase, and fix previously identified use-after-free bugs. The shared_ptr approach enables the planned future work of safely holding references outside locks.
  • No files require special attention

Important Files Changed

Filename: src/main/cpp/src/SparkResourceAdaptorJni.cpp
Overview: Converted full_thread_state from value type to shared_ptr, updated all access patterns from . to ->, and fixed use-after-free bugs by adding null checks after wait loops

Sequence Diagram

sequenceDiagram
    participant Thread as "Thread"
    participant Adaptor as "spark_resource_adaptor"
    participant StateMap as "threads map"
    participant SharedPtr as "shared_ptr<full_thread_state>"
    
    Thread->>Adaptor: "associate_thread_with_task(thread_id, task_id)"
    Adaptor->>StateMap: "threads.emplace(thread_id, ...)"
    StateMap->>SharedPtr: "make_shared<full_thread_state>(...)"
    SharedPtr-->>StateMap: "shared_ptr created"
    StateMap-->>Adaptor: "emplace successful"
    Adaptor->>SharedPtr: "thread->second->state"
    SharedPtr-->>Adaptor: "access thread state via ->"
    Adaptor-->>Thread: "association complete"
    
    Thread->>Adaptor: "do_allocate(bytes, blocking=true)"
    Adaptor->>StateMap: "threads.find(thread_id)"
    StateMap-->>Adaptor: "iterator to thread entry"
    Adaptor->>SharedPtr: "thread->second->state = THREAD_ALLOC"
    
    alt Thread needs to block
        Adaptor->>SharedPtr: "thread->second->before_block()"
        loop Wait until unblocked
            Adaptor->>SharedPtr: "thread->second->wake_condition->wait(lock)"
            Adaptor->>StateMap: "thread = threads.find(thread_id)"
            alt Thread was removed
                StateMap-->>Adaptor: "thread == threads.end()"
                Adaptor->>Adaptor: "break loop"
            else Thread still exists
                Adaptor->>SharedPtr: "check is_blocked(thread->second->state)"
            end
        end
        alt Thread still exists after wait
            Adaptor->>SharedPtr: "thread->second->after_block()"
        end
    end
    
    Adaptor->>SharedPtr: "transition(thread->second, THREAD_RUNNING)"
    SharedPtr-->>Adaptor: "state updated"
    Adaptor-->>Thread: "allocation complete"

@greptile-apps greptile-apps bot left a comment

1 file reviewed, 2 comments

@abellina
Collaborator Author

build

@greptile-apps greptile-apps bot left a comment

1 file reviewed, no comments

@abellina
Collaborator Author

build

@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@revans2 revans2 (Collaborator) left a comment

Have you measured the performance impact of this? shared_ptr has to do a lot more work than a normal pointer. Did you run into problems anywhere that needed this? Are you planning more changes that require it, like looping through a data structure where you can modify it as you loop?

@abellina
Collaborator Author

Have you measured the performance impact of this? shared_ptr has to do a lot more work than a normal pointer. Did you run into problems anywhere that needed this? Are you planning more changes that require it, like looping through a data structure where you can modify it as you loop?

The main change I was looking to make was to be able to take the lock to get state, and then return the state object without holding the lock. This implies I can add locks inside the per-thread state object to protect its internal data structures. But this is something I am planning for later, as moving state inside full_thread_state is, well, not trivial and I don't have that done.

In terms of perf, I tested it with other patches and performance in the plugin under NDS never got worse. But I can split this up and rework the follow-up patch to use raw pointers, and then we can measure adding shared_ptr only when it is absolutely necessary.
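The "take the lock to get state, then use the state without the lock" pattern described above could look roughly like the following. This is a hedged sketch only: the adaptor class, its members, and the helper names are illustrative and are not the actual SparkResourceAdaptorJni.cpp API.

```cpp
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical per-thread state; a mutex inside it could later guard its own
// fields, per the plan described in the comment above.
struct full_thread_state {
  std::mutex state_mutex;
  int retry_count = 0;
};

class adaptor {  // illustrative stand-in for spark_resource_adaptor
  std::mutex map_mutex_;
  std::unordered_map<long, std::shared_ptr<full_thread_state>> threads_;

 public:
  void add_thread(long id) {
    std::lock_guard<std::mutex> g(map_mutex_);
    threads_.emplace(id, std::make_shared<full_thread_state>());
  }

  // Copy the shared_ptr under the map lock, then release the lock. The
  // returned pointer stays valid even if another thread erases the entry.
  std::shared_ptr<full_thread_state> get_state(long id) {
    std::lock_guard<std::mutex> g(map_mutex_);
    auto it = threads_.find(id);
    return it == threads_.end() ? nullptr : it->second;
  }

  void remove_thread(long id) {
    std::lock_guard<std::mutex> g(map_mutex_);
    threads_.erase(id);
  }
};
```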

@abellina
Collaborator Author

Let's punt on this; I don't have a great answer to the performance question for this change on its own, and it's not strictly necessary for what I am trying to accomplish right now.

@abellina abellina closed this Nov 18, 2025
@abellina
Collaborator Author

Never mind, I do need shared pointers: I am storing full_thread_state in different maps in one of my patches. I'll have to reopen; let me test this patch independently.

@abellina abellina reopened this Nov 18, 2025
@greptile-apps greptile-apps bot left a comment

2 files reviewed, no comments

@abellina
Collaborator Author

abellina commented Nov 19, 2025

I ran NDS at sf3k, and I don't see significant changes. It's slightly faster at the end, but it's all in the noise:

Results: benchmark: Previous (242200.0 ms) vs Current (241000.0 ms) Diff 1200 E2E 1.00x
query1: Previous (1627.2 ms) vs Current (1670.2 ms) Diff -43 E2E 0.97x
query2: Previous (1647.4 ms) vs Current (1415.6 ms) Diff 231 E2E 1.16x
query3: Previous (381.4 ms) vs Current (442.8 ms) Diff -61 E2E 0.86x
query4: Previous (4595.2 ms) vs Current (4713.4 ms) Diff -118 E2E 0.97x
query5: Previous (2116.6 ms) vs Current (2190.8 ms) Diff -74 E2E 0.97x
query6: Previous (865.0 ms) vs Current (834.8 ms) Diff 30 E2E 1.04x
query7: Previous (2573.2 ms) vs Current (2749.8 ms) Diff -176 E2E 0.94x
query8: Previous (761.8 ms) vs Current (741.8 ms) Diff 20 E2E 1.03x
query9: Previous (1683.6 ms) vs Current (1709.4 ms) Diff -25 E2E 0.98x
query10: Previous (1528.2 ms) vs Current (1460.0 ms) Diff 68 E2E 1.05x
query11: Previous (2824.8 ms) vs Current (2946.2 ms) Diff -121 E2E 0.96x
query12: Previous (655.2 ms) vs Current (643.2 ms) Diff 12 E2E 1.02x
query13: Previous (1612.6 ms) vs Current (1565.8 ms) Diff 46 E2E 1.03x
query14_part1: Previous (5497.6 ms) vs Current (5151.6 ms) Diff 346 E2E 1.07x
query14_part2: Previous (4460.8 ms) vs Current (4506.4 ms) Diff -45 E2E 0.99x
query15: Previous (1049.6 ms) vs Current (1183.4 ms) Diff -133 E2E 0.89x
query16: Previous (3335.2 ms) vs Current (3326.8 ms) Diff 8 E2E 1.00x
query17: Previous (1649.8 ms) vs Current (1578.4 ms) Diff 71 E2E 1.05x
query18: Previous (1972.4 ms) vs Current (1997.0 ms) Diff -24 E2E 0.99x
query19: Previous (1175.4 ms) vs Current (1236.0 ms) Diff -60 E2E 0.95x
query20: Previous (520.2 ms) vs Current (514.2 ms) Diff 6 E2E 1.01x
query21: Previous (452.2 ms) vs Current (600.0 ms) Diff -147 E2E 0.75x
query22: Previous (1074.6 ms) vs Current (1032.6 ms) Diff 42 E2E 1.04x
query23_part1: Previous (5555.2 ms) vs Current (5415.0 ms) Diff 140 E2E 1.03x
query23_part2: Previous (6028.0 ms) vs Current (6012.6 ms) Diff 15 E2E 1.00x
query24_part1: Previous (6998.6 ms) vs Current (6630.6 ms) Diff 368 E2E 1.06x
query24_part2: Previous (6456.0 ms) vs Current (6545.8 ms) Diff -89 E2E 0.99x
query25: Previous (1886.6 ms) vs Current (1829.8 ms) Diff 56 E2E 1.03x
query26: Previous (757.2 ms) vs Current (720.8 ms) Diff 36 E2E 1.05x
query27: Previous (1354.6 ms) vs Current (1074.4 ms) Diff 280 E2E 1.26x
query28: Previous (4232.0 ms) vs Current (4181.0 ms) Diff 51 E2E 1.01x
query29: Previous (2514.2 ms) vs Current (2593.6 ms) Diff -79 E2E 0.97x
query30: Previous (1856.8 ms) vs Current (1830.0 ms) Diff 26 E2E 1.01x
query31: Previous (1667.4 ms) vs Current (1656.4 ms) Diff 11 E2E 1.01x
query32: Previous (894.4 ms) vs Current (938.4 ms) Diff -44 E2E 0.95x
query33: Previous (1135.0 ms) vs Current (1037.4 ms) Diff 97 E2E 1.09x
query34: Previous (1865.0 ms) vs Current (1851.8 ms) Diff 13 E2E 1.01x
query35: Previous (1758.0 ms) vs Current (1734.8 ms) Diff 23 E2E 1.01x
query36: Previous (1292.2 ms) vs Current (1438.2 ms) Diff -146 E2E 0.90x
query37: Previous (516.0 ms) vs Current (558.2 ms) Diff -42 E2E 0.92x
query38: Previous (1861.4 ms) vs Current (1923.8 ms) Diff -62 E2E 0.97x
query39_part1: Previous (1816.6 ms) vs Current (1797.2 ms) Diff 19 E2E 1.01x
query39_part2: Previous (1333.0 ms) vs Current (1147.6 ms) Diff 185 E2E 1.16x
query40: Previous (1065.6 ms) vs Current (1089.0 ms) Diff -23 E2E 0.98x
query41: Previous (287.4 ms) vs Current (319.0 ms) Diff -31 E2E 0.90x
query42: Previous (317.4 ms) vs Current (337.6 ms) Diff -20 E2E 0.94x
query43: Previous (789.6 ms) vs Current (770.0 ms) Diff 19 E2E 1.03x
query44: Previous (666.2 ms) vs Current (647.2 ms) Diff 19 E2E 1.03x
query45: Previous (1165.8 ms) vs Current (1186.6 ms) Diff -20 E2E 0.98x
query46: Previous (1415.8 ms) vs Current (1438.0 ms) Diff -22 E2E 0.98x
query47: Previous (1713.4 ms) vs Current (1593.8 ms) Diff 119 E2E 1.08x
query48: Previous (904.4 ms) vs Current (980.8 ms) Diff -76 E2E 0.92x
query49: Previous (1892.0 ms) vs Current (1784.2 ms) Diff 107 E2E 1.06x
query50: Previous (7365.6 ms) vs Current (7141.4 ms) Diff 224 E2E 1.03x
query51: Previous (1675.8 ms) vs Current (1951.0 ms) Diff -275 E2E 0.86x
query52: Previous (490.8 ms) vs Current (449.8 ms) Diff 41 E2E 1.09x
query53: Previous (632.2 ms) vs Current (657.2 ms) Diff -25 E2E 0.96x
query54: Previous (1373.2 ms) vs Current (1343.6 ms) Diff 29 E2E 1.02x
query55: Previous (408.0 ms) vs Current (370.4 ms) Diff 37 E2E 1.10x
query56: Previous (794.6 ms) vs Current (832.4 ms) Diff -37 E2E 0.95x
query57: Previous (1259.0 ms) vs Current (1353.8 ms) Diff -94 E2E 0.93x
query58: Previous (901.0 ms) vs Current (814.4 ms) Diff 86 E2E 1.11x
query59: Previous (1805.2 ms) vs Current (1847.2 ms) Diff -42 E2E 0.98x
query60: Previous (1220.0 ms) vs Current (1185.6 ms) Diff 34 E2E 1.03x
query61: Previous (1213.0 ms) vs Current (1246.6 ms) Diff -33 E2E 0.97x
query62: Previous (1129.2 ms) vs Current (1124.2 ms) Diff 5 E2E 1.00x
query63: Previous (794.4 ms) vs Current (731.0 ms) Diff 63 E2E 1.09x
query64: Previous (14201.0 ms) vs Current (14071.2 ms) Diff 129 E2E 1.01x
query65: Previous (3162.6 ms) vs Current (3127.4 ms) Diff 35 E2E 1.01x
query66: Previous (2386.2 ms) vs Current (2469.6 ms) Diff -83 E2E 0.97x
query67: Previous (17100.6 ms) vs Current (17021.8 ms) Diff 78 E2E 1.00x
query68: Previous (1173.4 ms) vs Current (1210.6 ms) Diff -37 E2E 0.97x
query69: Previous (1359.4 ms) vs Current (1469.0 ms) Diff -109 E2E 0.93x
query70: Previous (1544.2 ms) vs Current (1496.8 ms) Diff 47 E2E 1.03x
query71: Previous (3042.4 ms) vs Current (3247.4 ms) Diff -205 E2E 0.94x
query72: Previous (2612.4 ms) vs Current (2579.6 ms) Diff 32 E2E 1.01x
query73: Previous (987.4 ms) vs Current (1046.6 ms) Diff -59 E2E 0.94x
query74: Previous (2250.4 ms) vs Current (2338.0 ms) Diff -87 E2E 0.96x
query75: Previous (6662.0 ms) vs Current (6678.6 ms) Diff -16 E2E 1.00x
query76: Previous (1665.2 ms) vs Current (1502.4 ms) Diff 162 E2E 1.11x
query77: Previous (925.2 ms) vs Current (1049.4 ms) Diff -124 E2E 0.88x
query78: Previous (8079.2 ms) vs Current (8006.2 ms) Diff 73 E2E 1.01x
query79: Previous (865.0 ms) vs Current (877.4 ms) Diff -12 E2E 0.99x
query80: Previous (3891.8 ms) vs Current (3862.2 ms) Diff 29 E2E 1.01x
query81: Previous (2065.8 ms) vs Current (2007.8 ms) Diff 58 E2E 1.03x
query82: Previous (599.6 ms) vs Current (668.2 ms) Diff -68 E2E 0.90x
query83: Previous (651.4 ms) vs Current (685.8 ms) Diff -34 E2E 0.95x
query84: Previous (1176.0 ms) vs Current (988.6 ms) Diff 187 E2E 1.19x
query85: Previous (1539.2 ms) vs Current (1480.4 ms) Diff 58 E2E 1.04x
query86: Previous (894.2 ms) vs Current (923.2 ms) Diff -29 E2E 0.97x
query87: Previous (1955.6 ms) vs Current (1965.4 ms) Diff -9 E2E 1.00x
query88: Previous (3135.6 ms) vs Current (3136.4 ms) Diff 0 E2E 1.00x
query89: Previous (1088.0 ms) vs Current (967.2 ms) Diff 120 E2E 1.12x
query90: Previous (699.0 ms) vs Current (685.4 ms) Diff 13 E2E 1.02x
query91: Previous (951.0 ms) vs Current (1108.6 ms) Diff -157 E2E 0.86x
query92: Previous (539.8 ms) vs Current (507.0 ms) Diff 32 E2E 1.06x
query93: Previous (9397.4 ms) vs Current (9176.0 ms) Diff 221 E2E 1.02x
query94: Previous (3957.0 ms) vs Current (3870.4 ms) Diff 86 E2E 1.02x
query95: Previous (4760.8 ms) vs Current (4813.4 ms) Diff -52 E2E 0.99x
query96: Previous (4285.2 ms) vs Current (4331.0 ms) Diff -45 E2E 0.99x
query97: Previous (1799.0 ms) vs Current (1710.2 ms) Diff 88 E2E 1.05x
query98: Previous (1279.4 ms) vs Current (1239.6 ms) Diff 39 E2E 1.03x
query99: Previous (1477.0 ms) vs Current (1555.8 ms) Diff -78 E2E 0.95x

 * verification.
 */
-void transition(full_thread_state& state, thread_state const new_state)
+void transition(std::shared_ptr<full_thread_state> state, thread_state const new_state)
@ttnghia ttnghia (Collaborator) commented Dec 23, 2025
Nit: We can use std::shared_ptr<> const& as the parameter type. Doing so avoids the atomic ref-count operations incurred when copying the shared_ptr and reduces overhead.
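The difference this nit points at can be shown with two hypothetical overloads (illustrative names and a stand-in struct, not the actual transition function in this PR):

```cpp
#include <memory>

struct full_thread_state {  // hypothetical stand-in
  int state = 0;
};

// By value: the parameter is a new shared_ptr copy, which costs an atomic
// ref-count increment on entry and a decrement on exit.
void transition_by_value(std::shared_ptr<full_thread_state> s, int new_state) {
  s->state = new_state;
}

// By const&: no ref-count traffic, but the caller must guarantee the object
// outlives the call, which is the trade-off raised in the reply below.
void transition_by_ref(std::shared_ptr<full_thread_state> const& s, int new_state) {
  s->state = new_state;
}
```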

Collaborator

However, if the intent of this change is to allow this function to continue running while the caller's copy of the state is destroyed independently, then this can be left as-is.

 * Checkpoint all of the metrics for a thread.
 */
-void checkpoint_metrics(full_thread_state& state)
+void checkpoint_metrics(std::shared_ptr<full_thread_state> state)
Collaborator

Similar here.

 }

-bool is_thread_bufn_or_above(JNIEnv* env, full_thread_state const& state)
+bool is_thread_bufn_or_above(JNIEnv* env, std::shared_ptr<full_thread_state> state)
Collaborator

Similar here.

@nvauto
Collaborator

nvauto commented Jan 19, 2026

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.
