Description
Generated by Generative AI
No response
Operating System:
Jetson NX Orin and Jetson AGX Orin
ROS version or commit hash:
humble
RMW implementation (if applicable):
rmw_cyclonedds_cpp
RMW Configuration (if applicable):
No response
Client library (if applicable):
rclcpp
'ros2 doctor --report' output
ros2 doctor --report
<COPY OUTPUT HERE>
Steps to reproduce issue
I have not yet been able to create a shareable reproduction case. We've only observed this segfault on a few pieces of hardware, and very rarely.
However, see the details below; I've done a fair amount of debugging, and perhaps someone more knowledgeable than I am will be able to construct an example.
Expected behavior
My application runs without crashing
Actual behavior
Segfault akin to the one in #460 (comment).
This segfault occurs specifically when destroying a rclcpp service client (I haven't observed this with action clients yet).
The segfault occurs on this line:
dds_waitset_detach(ws->waitseth, x->client.sub->rdcondh);
Debugging with gdb shows x->client.sub as inaccessible memory; I've also been able to confirm that the client is destroyed just before the segfault occurs.
Application details
We are using an rclcpp multithreaded executor and creating clients on a thread that is not managed by the executor. The client is also constructed with the default callback group (which I believe is a mutually exclusive callback group). Many entities are added to the wait set, not just service clients.
My Theory
I believe I'm seeing a race condition between different places where we are detaching or reattaching entities in the wait set, depending on when rclcpp destroys the client.
rclcpp's memory strategy concept only really ensures that the DDS entities at the rclcpp layer (clients, topic publishers/subscribers, etc.) stay alive while rmw_wait is being called. Since these entities are shared pointers at this layer, and the memory strategy only temporarily takes shared ownership via weak pointers when calling rmw_wait, there are two places the client can be destroyed:
- when the executor clears and recollects entities just before rmw_wait
- or in whichever thread owns the shared pointer returned by create_client
In the first case, no segfault should occur: the client/entity is destroyed explicitly while rmw_wait is not being invoked (the multithreaded executor holds a mutex lock to ensure this). However, the second case can occur at any time.
Ultimately, this race condition is really rare, and I think this segfault only occurs under the following series of events in the rmw layer (under the previously stated conditions in the rclcpp layer):
Thread 1: rmw_wait invoked, wait set is marked as "inuse" (code)
Thread 2: client destruction started, globally cached waitsets are detached (except for the one that is "inuse", i.e. the wait set being used by thread 1) (code)
Thread 2: client handle is deleted (code)
Thread 1: rmw_wait reattaches entities due to the wait set's cache being outdated, detaching the old cached entities (including the just deleted one, causing the segfault) (code)
Additional information
Attempted fixes
The waitset has a mutex protecting the inuse variable. Instead of skipping the "inuse" waitset when cleaning waitset caches, I attempted to block on this waitset until it was no longer in use. Unfortunately, this caused a deadlock for me.
I also tried to eliminate the case where the client is destroyed in a separate thread by holding the shared pointer for longer in the memory strategy, but this is difficult to do without causing a memory leak: if done wrong, the shared pointers are held forever and the clients are never freed.