Description
Generated by Generative AI
No response
Operating System:
Jetson NX Orin and Jetson AGX Orin
ROS version or commit hash:
humble
RMW implementation (if applicable):
rmw_cyclonedds_cpp
RMW Configuration (if applicable):
No response
Client library (if applicable):
rclcpp
'ros2 doctor --report' output
ros2 doctor --report
<COPY OUTPUT HERE>
Steps to reproduce issue
I have not yet been able to create a shareable reproduction case. We've only observed this segfault on a few pieces of hardware, and very rarely.
However, see the details below; I've done a fair amount of debugging, and perhaps someone more knowledgeable than I am will be able to construct an example.
Expected behavior
My application runs without crashing
Actual behavior
Segfault akin to the one in #460 (comment).
This segfault occurs specifically when destroying a rclcpp service client (I haven't observed this with action clients yet).
The segfault occurs on this line:
dds_waitset_detach(ws->waitseth, x->client.sub->rdcondh);
Debugging with gdb shows x->client.sub as inaccessible memory; I've also been able to confirm that the client is destroyed just before the segfault occurs.
Application details
We are using an rclcpp multithreaded executor and creating clients on a thread that is not managed by the executor. The client is also constructed with the default callback group (which I believe is a mutually exclusive callback group). Many entities are added to the wait set, not just service clients.
My Theory
I believe I'm seeing a race condition between different places where we are detaching or reattaching entities in the wait set, depending on when rclcpp destroys the client.
rclcpp's memory strategy concept only really ensures that the DDS entities at the rclcpp layer (clients, topic publishers/subscribers, etc.) stay alive while rmw_wait is being called. Since these entities are shared pointers at this layer, and the memory strategy only temporarily takes shared ownership via weak pointers when calling rmw_wait, there are two places the client can be destroyed:
- when the executor clears and recollects entities just before rmw_wait
- or in whichever thread owns the shared pointer returned by create_client
In the first case, no segfault should occur: the client/entity is destroyed explicitly while rmw_wait is not being invoked (the multithreaded executor holds a mutex lock to ensure this). However, the second case can occur at any time.
Ultimately, this race condition is really rare, and I think this segfault only occurs under the following series of events in the rmw layer (under the previously stated conditions in the rclcpp layer):
Thread 1: rmw_wait invoked, wait set is marked as "inuse" (code)
Thread 2: client destruction started, globally cached waitsets are detached (except for the one that is "inuse", i.e. the wait set being used by thread 1) (code)
Thread 2: client handle is deleted (code)
Thread 1: rmw_wait reattaches entities due to the wait set's cache being outdated, detaching the old cached entities (including the just deleted one, causing the segfault) (code)
Additional information
Attempted fixes
The waitset has a mutex protecting the inuse variable. Instead of skipping the "inuse" waitset when cleaning waitset caches, I attempted to block on this waitset until it was no longer in use. Unfortunately, this caused a deadlock for me.
I also tried to eliminate the case where the client is destroyed in a separate thread by holding the shared pointer for longer in the memory strategy, but this is difficult to do without causing a memory leak: if done wrong, the shared pointers are held forever and the clients are never freed.