fix(gateway): synchronize with worker before notify in ResourceChangeNotifier::shutdown

bburda · bburda · commit e13f49a1bbdd · 2026-04-05T10:43:39.000+02:00
Classic lost-wakeup race between shutdown() and worker_loop():

1. worker locks queue_mutex_
2. worker checks predicate (flag=false, queue empty) -> false
3. shutdown() CAS flag -> true (NOT holding queue_mutex_)
4. shutdown() notify_one() - worker NOT yet enqueued on CV
5. worker enters wait(lock): atomic unlock+enqueue+sleep
6. worker sleeps forever; main blocks in worker_thread_.join()

Even though shutdown_flag_ uses seq_cst atomics, the worker's predicate
check and entry into wait() are not atomic with respect to the flag
modification. The notify can arrive before the worker is enqueued.

Fix: briefly acquire queue_mutex_ between setting the flag and notifying.
This guarantees the worker is either still outside its critical section
(will observe the new flag on the next lock) or already enqueued on the
CV (notify_one will wake it).
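
For reference, a minimal standalone sketch of the handshake described above.
It is not the gateway code: NotifierSketch, post() and pending_ are
hypothetical names, and the queue is reduced to a counter. Only the ordering
in shutdown() (flag store, empty lock/unlock of queue_mutex_, then
notify_one()) mirrors this change.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for ResourceChangeNotifier. Only the shutdown
// handshake is modeled; the work queue is reduced to a counter.
class NotifierSketch {
 public:
  NotifierSketch() : worker_thread_([this] { worker_loop(); }) {}
  ~NotifierSketch() { shutdown(); }

  // Hypothetical producer: publishes one unit of work under the mutex.
  void post() {
    { std::lock_guard<std::mutex> lock(queue_mutex_); ++pending_; }
    queue_cv_.notify_one();
  }

  void shutdown() {
    bool expected = false;
    if (!shutdown_flag_.compare_exchange_strong(expected, true)) {
      return;  // already shut down; a second call is a no-op in this sketch
    }
    // Empty critical section. The worker holds queue_mutex_ from its
    // predicate check until wait() releases it, so once this lock is
    // acquired the worker is either already waiting on queue_cv_ or has
    // not yet evaluated the predicate and will see the flag when it does.
    { std::lock_guard<std::mutex> sync(queue_mutex_); }
    queue_cv_.notify_one();
    if (worker_thread_.joinable()) {
      worker_thread_.join();
    }
    assert(!worker_thread_.joinable());
  }

  bool stopped() const { return !worker_thread_.joinable(); }

 private:
  void worker_loop() {
    std::unique_lock<std::mutex> lock(queue_mutex_);
    while (true) {
      // Predicate is evaluated with queue_mutex_ held.
      queue_cv_.wait(lock, [this] {
        return shutdown_flag_.load() || pending_ > 0;
      });
      if (shutdown_flag_.load()) {
        return;
      }
      pending_ = 0;  // "process" the pending work
    }
  }

  std::mutex queue_mutex_;
  std::condition_variable queue_cv_;
  std::atomic<bool> shutdown_flag_{false};
  int pending_ = 0;
  // Declared last so the members above exist before the thread starts.
  std::thread worker_thread_;
};
```

Constructing a NotifierSketch and calling shutdown() immediately, then again,
exercises the same spawn + immediate shutdown pattern as DoubleShutdownIsSafe.
An alternative with the same effect would be to store the flag while holding
queue_mutex_; the empty lock/unlock is shown here because it matches the diff.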

Manifested as a TSan-specific hang in DoubleShutdownIsSafe (worker spawn
+ immediate shutdown hits the race window).
diff --git a/src/ros2_medkit_gateway/src/resource_change_notifier.cpp b/src/ros2_medkit_gateway/src/resource_change_notifier.cpp
@@ -105,6 +105,13 @@ void ResourceChangeNotifier::shutdown() {
     });
   }
 
+  // Synchronize with worker_loop()'s predicate check. Without this, the flag
+  // store above can land between the worker's predicate evaluation and its
+  // wait() call, losing the subsequent notify_one(). Acquiring queue_mutex_
+  // here guarantees the worker is either still outside the critical section
+  // (will observe the new flag) or already enqueued on queue_cv_ (notify will
+  // wake it).
+  { std::lock_guard<std::mutex> sync(queue_mutex_); }
   queue_cv_.notify_one();
   if (worker_thread_.joinable()) {
     worker_thread_.join();