Skip to content

TestInteractiveMarkerClient.states fails when run with rmw_connextdds #40

@asorbini

Description

@asorbini

Test TestInteractiveMarkerClient.states from package ros-visualization/interactive_markers fails when run with rmw_connextdds (ref):

Details
[ RUN      ] TestInteractiveMarkerClient.states
[INFO] [1618402845.369215543] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.370914886] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.372802357] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402845.379741315] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402845.380105568] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.390321054] [mock_interactive_marker_server]: Mock server destroyed
[WARN] [1618402845.392697448] [test_interactive_marker_server_node]: Server not available during initialization, resetting
[INFO] [1618402845.407828611] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402845.407945940] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.418071273] [mock_interactive_marker_server]: Mock server destroyed
[INFO] [1618402845.421568472] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[WARN] [1618402845.422274371] [rcl.logging_rosout]: Publisher already registered for provided node name. If this is due to multiple nodes with the same name then all logs for that logger name will go out over the existing publisher. As soon as any node with that name is destructed it will unregister the publisher, preventing any further logs for that name from being published on the rosout topic.
[INFO] [1618402845.427342460] [mock_interactive_marker_server]: Mock server ready
[ERROR] [1618402845.427709873] [rclcpp]: Error in destruction of rcl service handle test_namespace/get_interactive_markers: the Node Handle was destructed too early. You will leak memory
[INFO] [1618402845.427803546] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402845.468389763] [mock_interactive_marker_server]: Interactive markers request received
[INFO] [1618402845.499019875] [test_interactive_marker_server_node]: Service response received for initialization
[INFO] [1618402845.509249996] [mock_interactive_marker_server]: Mock server destroyed
RTI Data Distribution Service Non-commercial license is for academic, research, evaluation and personal use only. USE FOR COMMERCIAL PURPOSES IS PROHIBITED. See RTI_LICENSE.TXT for terms. Download free tools at rti.com/ncl. License issued to Non-Commercial User license@rti.com For non-production use only.
Expires on 00-jan-00 See www.rti.com for more information.
/tmp/ws/src/ros-visualization/interactive_markers/test/interactive_markers/test_interactive_marker_client.cpp:236: Failure
Expected equality of these values:
  client_->getState()
    Which is: 2
  ClientState::STATE_IDLE
    Which is: 0
[INFO] [1618402848.515853148] [test_interactive_marker_server_node]: Connected on namespace: test_namespace
[INFO] [1618402848.522049052] [mock_interactive_marker_server]: Mock server ready
[INFO] [1618402848.522210463] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402848.562642973] [mock_interactive_marker_server]: Interactive markers request received
[INFO] [1618402848.593074129] [test_interactive_marker_server_node]: Service response received for initialization
[WARN] [1618402848.653861432] [test_interactive_marker_server_node]: Update sequence number is out of order. 1 (expected) vs. 2 (received)
[INFO] [1618402848.653926360] [test_interactive_marker_server_node]: Sending request for interactive markers
[INFO] [1618402848.664057004] [mock_interactive_marker_server]: Mock server destroyed
[WARN] [1618402849.664008944] [test_interactive_marker_server_node]: Did not receive response with interactive markers, resending request...
[INFO] [1618402849.664127018] [test_interactive_marker_server_node]: Sending request for interactive markers
[WARN] [1618402850.671448428] [test_interactive_marker_server_node]: Did not receive response with interactive markers, resending request...
[INFO] [1618402850.671568523] [test_interactive_marker_server_node]: Sending request for interactive markers
/tmp/ws/src/ros-visualization/interactive_markers/test/interactive_markers/test_interactive_marker_client.cpp:268: Failure
Expected equality of these values:
  client_->getState()
    Which is: 1
  ClientState::STATE_IDLE
    Which is: 0
[ERROR] [1618402851.674446619] [rmw_connextdds]: failed to delete DDS subscriber's entities
[ERROR] [1618402851.674483621] [rmw_connextdds]: failed to finalize node in context
[ERROR] [1618402851.674498396] [rclcpp]: Error in destruction of rcl node handle: failed to delete DDS subscriber's entities, at /tmp/ws/src/ros2/rmw_connextdds/rmw_connextdds_common/src/common/rmw_context.cpp:436, at /tmp/ws/src/ros2/rcl/rcl/src/rcl/node.c:384
[D0000|Sub(80000009)|DELETE_CONTAINED]PRESPsService_destroyAllLocalEndpointsInGroupWithCursor:!delete local reader
[D0000|Sub(80000009)|DELETE_CONTAINED]PRESPsService_destroyAllLocalEndpointsInGroup:!destroyAllLocalEndpointsInGroupWithCursor
[D0000|Sub(80000009)|DELETE_CONTAINED]DDS_Subscriber_delete_contained_entities:!delete contained entities
[  FAILED  ] TestInteractiveMarkerClient.states (6320 ms)

Both failures [1][2] are caused by an interactive_markers::InteractiveMarkerClient object not transitioning to ClientState::STATE_IDLE within a pretty generous timeout (3s). Both times, these state transitions would normally be triggered by rclcpp::Client::service_is_ready() called on member get_interactive_markers_client_ returning false [1][2]. This is expected to occur because both times the mock_server has been removed from the executor and deleted.

Unfortunately, the client never seems to "lose service", and the transitions never happen.

The key to why this is happening lies in the ERROR line:

[ERROR] [1618402845.427709873] [rclcpp]: Error in destruction of rcl service handle test_namespace/get_interactive_markers: the Node Handle was destructed too early. You will leak memory

This error is thrown by rclcpp when the node associated with an rcl_service_t has already been deleted at the time of the service's deletion [1]. Since the node handle is not valid anymore, rcl_service_fini() and rmw_destroy_service() are never called for that service.

I have been doing some digging around and I've determined that the reason for this leak lies in the following sequence of events:

  1. The rcl_service_t becomes active while the test is spinning the executor [1][2][3].
  2. A subscription is also active and is returned instead of the service [1]. The rcl_server_t remains referenced by AllocatorMemoryStrategy::service_handles_.
  3. The active subscription makes the client transition to RUNNING state [1] which causes the test to delete the mock_server. Since the rcl_service_t is still referenced, its "destructor" is not run.
  4. The test starts waiting for the client to transition to IDLE. This in turn causes the executor to finally notice that the rcl_service_t is associated with a deleted node, and finally clears its reference, causing the destructor to run, and the error to be printed [1].

Unfortunately the client will never transition to IDLE because in order to do this, its "request publisher" and "reply subscriber" would have to unmatch all endpoints from matching services [1], but this will never happen thanks to the leaked rcl_service_t (since rmw_destroy_service() was not called and the underlying DDS endpoints will never be deleted). I believe these leaked endpoints are also the reason for the finalization errors reported by Connext and rmw_connextdds at the end of the test.

Fix/Workaround

No changes are required in rmw_connextdds.

The only change required to make the test pass consistently is to modify Executor::get_next_ready_executable_from_map() in rclcpp/src/rclcpp/executor.cpp so that services are checked before subscriptions:

diff --git a/rclcpp/src/rclcpp/executor.cpp b/rclcpp/src/rclcpp/executor.cpp
index 630db50b..8b5b5f23 100644
--- a/rclcpp/src/rclcpp/executor.cpp
+++ b/rclcpp/src/rclcpp/executor.cpp
@@ -826,16 +826,16 @@ Executor::get_next_ready_executable_from_map(
     success = true;
   }
   if (!success) {
-    // Check the subscriptions to see if there are any that are ready
-    memory_strategy_->get_next_subscription(any_executable, weak_groups_to_nodes);
-    if (any_executable.subscription) {
+    // Check the services to see if there are any that are ready
+    memory_strategy_->get_next_service(any_executable, weak_groups_to_nodes);
+    if (any_executable.service) {
       success = true;
     }
   }
   if (!success) {
-    // Check the services to see if there are any that are ready
-    memory_strategy_->get_next_service(any_executable, weak_groups_to_nodes);
-    if (any_executable.service) {
+    // Check the subscriptions to see if there are any that are ready
+    memory_strategy_->get_next_subscription(any_executable, weak_groups_to_nodes);
+    if (any_executable.subscription) {
       success = true;
     }
   }

The reason why this small change fixes the leak, and the test in general, is because if the rcl_service_t is detected as active, it ends up being removed from AllocatorMemoryStrategy::service_handles_ and it can be properly finalized alongside its parent node. The subscription is still detected on the following wait, making the test progress as expected.

This is probably not The Fix, since this issue seems to outline a general memory management problem (i.e. the leak of rcl_service_t). While making the change in rclcpp will make the test pass, I suspect more problems will arise inevitably until the code is refactored so that all RMW resources are always guaranteed to be correctly finalized.

Sidenote

Turns out these errors have nothing to do with the reliability protocol's configuration, as I had suspected when I opened #26. I'm not quite sure why a very fast heartbeat rate is also able to make the test pass, but I suspect it has something to do with altering the timing and ordering in which the executor clears its associated objects.

EDIT: grammar and link to the correct related PR (#26)

Metadata

Metadata

Assignees

Labels

test-failureThe issue or PR relates to a test failure

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions