Skip to content

Conversation

@eepp
Copy link
Member

@eepp eepp commented Aug 19, 2015

No description provided.

Signed-off-by: Philippe Proulx <eeppeliteloop@gmail.com>
@jgalar jgalar added this to the LTTng 2.7 milestone Sep 1, 2015
jgalar pushed a commit that referenced this pull request Apr 23, 2021
Observed issue
==============

A dead lock is observed during the start-stop test suite for triggers.

Cause
=====

A start session action is executed by the action executor, the
`cmd_start_trace` function is called and effectively holds the
`session_list_lock.`. During `cmd_start_trace` a call to
`notification_thread_command_add_channel` is performed to inform the
notification thread of the new channel presence.

At the same time, a tracer event notification is received by the
notification thread. The actions are queued up and the sample of the
session id take place and a call to `session_lock_list` is performed and
blocks on the lock operation.

The notification thread wait on the `session_list_lock` and the
`session_list_lock` holder, the action executor, waits on the completion
of a command the be run by the notification thread: deadlock.

The backtrace:

 Thread 6 (Thread 0x7f831c8a6700 (LWP 3046458)):
 #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
 #1  0x000000000053b852 in futex (uaddr=0x7f831c8a45e0, op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:72
 #2  0x000000000053b4f9 in futex_noasync (uaddr=0x7f831c8a45e0, op=0, val=0, timeout=0x0, uaddr2=0x0, val3=0) at /home/joraj/lttng/master/install/include/urcu/futex.h:81
 #3  0x000000000053af10 in lttng_waiter_wait (waiter=0x7f831c8a45d8) at waiter.c:55
 #4  0x000000000046b0f2 in run_command_wait (handle=0xe60520, cmd=0x7f831c8a4588) at notification-thread-commands.c:49
 #5  0x000000000046b270 in notification_thread_command_add_channel (handle=0xe60520, session_name=0x7f8300006c30 "my_triggered_session", uid=1000, gid=1000, channel_name=0x7f82dc00be04 "channel0", key=1, domain=LTTNG_DOMAIN_UST, capacity=2097152) at notification-thread-commands.c:184
 #6  0x00000000004c7f65 in create_channel_per_uid (app=0x7f82d8000bf0, usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, ua_chan=0x7f82dc00bde0) at ust-app.c:3360
 #7  0x00000000004c6f98 in ust_app_channel_send (app=0x7f82d8000bf0, usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, ua_chan=0x7f82dc00bde0) at ust-app.c:3514
 #8  0x00000000004c6bde in ust_app_channel_create (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, uchan=0x7f8300005a90, app=0x7f82d8000bf0, _ua_chan=0x7f831c8a48b0) at ust-app.c:4771
 #9  0x00000000004c6968 in find_or_create_ust_app_channel (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, app=0x7f82d8000bf0, uchan=0x7f8300005a90, ua_chan=0x7f831c8a48b0) at ust-app.c:5610
 #10 0x00000000004c4f09 in ust_app_synchronize_all_channels (usess=0x7f8300000bb0, ua_sess=0x7f82dc002600, app=0x7f82d8000bf0) at ust-app.c:5820
 #11 0x00000000004b958c in ust_app_synchronize (usess=0x7f8300000bb0, app=0x7f82d8000bf0) at ust-app.c:5886
 #12 0x00000000004b8500 in ust_app_global_update (usess=0x7f8300000bb0, app=0x7f82d8000bf0) at ust-app.c:5960
 #13 0x00000000004b7ec2 in ust_app_start_trace_all (usess=0x7f8300000bb0) at ust-app.c:5520
 #14 0x0000000000444e86 in cmd_start_trace (session=0x7f8300006c30) at cmd.c:2707
 #15 0x00000000004a5af9 in action_executor_start_session_handler (executor=0x7f8314004410, work_item=0x7f8314005100, item=0x7f83140050b0) at action-executor.c:342
 #16 0x00000000004a537f in action_executor_generic_handler (executor=0x7f8314004410, work_item=0x7f8314005100, item=0x7f83140050b0) at action-executor.c:696
 #17 0x00000000004a4dbc in action_work_item_execute (executor=0x7f8314004410, work_item=0x7f8314005100) at action-executor.c:715
 #18 0x00000000004a37e6 in action_executor_thread (_data=0x7f8314004410) at action-executor.c:797
 #19 0x0000000000486193 in launch_thread (data=0x7f83140044b0) at thread.c:66
 #20 0x00007f8320b60609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #21 0x00007f8320a87293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

 Thread 5 (Thread 0x7f831d0a7700 (LWP 3046457)):
 #0  __lll_lock_wait (futex=futex@entry=0x5e1c10 <ltt_session_list>, private=0) at lowlevellock.c:52
 #1  0x00007f8320b630a3 in __GI___pthread_mutex_lock (mutex=0x5e1c10 <ltt_session_list>) at ../nptl/pthread_mutex_lock.c:80
 #2  0x00000000004378c3 in session_lock_list () at session.c:156
 #3  0x00000000004a871c in add_action_to_subitem_array (action=0x7f830001a730, subitems=0x7f83140051d0) at action-executor.c:1081
 #4  0x00000000004a8578 in add_action_to_subitem_array (action=0x7f830001a620, subitems=0x7f83140051d0) at action-executor.c:1025
 #5  0x00000000004a4922 in populate_subitem_array_from_trigger (trigger=0x7f830001a950, subitems=0x7f83140051d0) at action-executor.c:1116
 #6  0x00000000004a416e in action_executor_enqueue_trigger (executor=0x7f8314004410, trigger=0x7f830001a950, evaluation=0x7f8314005190, object_creds=0x0, client_list=0x7f8314004980) at action-executor.c:924
 #7  0x0000000000479481 in dispatch_one_event_notifier_notification (state=0x7f831d0a63e8, notification=0x7f8314005160) at notification-thread-events.c:4613
 #8  0x0000000000472324 in handle_one_event_notifier_notification (state=0x7f831d0a63e8, pipe=65, domain=LTTNG_DOMAIN_UST) at notification-thread-events.c:4702
 #9  0x0000000000472271 in handle_notification_thread_event_notification (state=0x7f831d0a63e8, pipe=65, domain=LTTNG_DOMAIN_UST) at notification-thread-events.c:4717
 #10 0x00000000004695a3 in handle_event_notification_pipe (event_source_fd=65, domain=LTTNG_DOMAIN_UST, revents=1, state=0x7f831d0a63e8) at notification-thread.c:591
 #11 0x000000000046849b in thread_notification (data=0xe60520) at notification-thread.c:727
 #12 0x0000000000486193 in launch_thread (data=0xe60610) at thread.c:66
 #13 0x00007f8320b60609 in start_thread (arg=<optimized out>) at pthread_create.c:477
 #14 0x00007f8320a87293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Solution
========

Instead of using session_find_by_name() which requires the
`session_list_lock`, we introduce `sample_session_id_by_name` that uses
a urcu backed data structure. This allows the sampling of the session
id without holding the session list lock. We accept the small window
where a session object is still accessible but concretely not valid
since the actual execution context will be validated at the moment of
execution. The execution side already handles the possibility that the
session is removed at that point or is not the same session. The
execution side acquires the session_list_lock for validation.

Known drawbacks
=========

None

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien@efficios.com>
Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I5ad2c57acc0d03d2814dda59f8ecf2d831fd961e
jgalar added a commit that referenced this pull request Mar 20, 2025
Issue observed
--------------

A number of tests in the CI fail on a crash of the session daemon
on ARM32 and s390x.

For instance, on an ARM32 board running the
tests/regression/rools/base-path/test_ust script, we observe the
following failed assertion:

  #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:47
  #1  0xf7af75cc in __pthread_kill_implementation (threadid=4075801728, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:43
  #2  0xf7af7610 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
  #3  0xf7ac6362 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
  #4  0xf7ab70ec in __GI_abort () at abort.c:79
  #5  0xf7ac189a in __assert_fail_base (fmt=0x20 <error: Cannot access memory at address 0x20>, assertion=0x595798 "res >= 0", assertion@entry=0x87 <error: Cannot access memory at address 0x87>,
      file=file@entry=0x20 <error: Cannot access memory at address 0x20>, line=line@entry=64, function=0x595740 "void urcu_ref_put(urcu_ref*, void (*)(urcu_ref*))", function@entry=0xf2efcc80 "\001") at assert.c:94
  #6  0xf7ac1916 in __GI___assert_fail (assertion=0x87 <error: Cannot access memory at address 0x87>, file=0x20 <error: Cannot access memory at address 0x20>, line=64, function=0xf2efcc80 "\001") at assert.c:103
  #7  0x00430fb6 in urcu_ref_put (ref=0xf1d0d8d0, release=0x4335c7 <session_release(urcu_ref*)>) at /root/build/usr/include/urcu/ref.h:64
  #8  0x00433a4a in session_put (session=0xf1d0c6c0) at session.cpp:1141
  #9  0x00434586 in ltt_session::_locked_session_release (session=0xf1d0c6c0) at session.cpp:1475
  #10 0x00420f5a in lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter::operator() (this=0xf2efc144, instance=0xf1d0c6c0) at ../../../src/common/meta-helpers.hpp:25
  #11 0x00420a30 in lttng::non_copyable_reference<ltt_session, lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter>::_clean_up (this=0xf2efc280) at ../../../src/common/reference.hpp:88
  #12 0x00420474 in lttng::non_copyable_reference<ltt_session, lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter>::~non_copyable_reference (this=0xf2efc280, __in_chrg=<optimized out>)
      at ../../../src/common/reference.hpp:67
  #13 0x00433868 in session_release (ref=0xf1d0d8d0) at session.cpp:1087
  #14 0x00430fc2 in urcu_ref_put (ref=0xf1d0d8d0, release=0x4335c7 <session_release(urcu_ref*)>) at /root/build/usr/include/urcu/ref.h:66
  #15 0x00433a4a in session_put (session=0xf1d0c6c0) at session.cpp:1141
  #16 0x00434586 in ltt_session::_locked_session_release (session=0xf1d0c6c0) at session.cpp:1475
  #17 0x00420f5a in lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter::operator() (this=0xf2efc314, instance=0xf1d0c6c0) at ../../../src/common/meta-helpers.hpp:25
  #18 0x00420a30 in lttng::non_copyable_reference<ltt_session, lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter>::_clean_up (this=0xf2efc344) at ../../../src/common/reference.hpp:88
  #19 0x00420474 in lttng::non_copyable_reference<ltt_session, lttng::memory::create_deleter_class<ltt_session, &ltt_session::_locked_session_release>::deleter>::~non_copyable_reference (this=0xf2efc344, __in_chrg=<optimized out>)
      at ../../../src/common/reference.hpp:67
  #20 0x00471bb6 in lttng::sessiond::rotation_thread::_handle_job_queue (this=0x621b70) at rotation-thread.cpp:558
  #21 0x00473066 in lttng::sessiond::rotation_thread::_run (this=0x621b70) at rotation-thread.cpp:792
  #22 0x004729c6 in lttng::sessiond::rotation_thread::_thread_function (this=0x621b70) at rotation-thread.cpp:711
  #23 0x004734ce in operator() (__closure=0x0, ptr=0x621b70) at rotation-thread.cpp:826
  #24 0x004734e8 in _FUN () at rotation-thread.cpp:828
  #25 0x0047906a in launch_thread (data=0x6226b0) at thread.cpp:67
  #26 0xf7af5e62 in start_thread (arg=0xc1fc5a3b) at pthread_create.c:442
  #27 0xf7b4c960 in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:74 from /lib/arm-linux-gnueabihf/libc.so.6

Cause
-----

The value used as a refcount tombstone is a negative value when
interpreted as a 32-bit 'long' value. See the comment in
session_release() for an explanation of the use of this value.

The call to urcu_put() results in the assertion failure. That is because
urcu_ref_put() contains the following check:

  long res = uatomic_sub_return(&ref->refcount, 1);
  urcu_posix_assert(res >= 0);

When using a negative refcount, the check expectedly fails.

Solution
--------

The tombstone is replaced by another arbitrary, but positive, value.

Drawbacks
---------

None. The hack remains as ugly as before, though.

Signed-off-by: Jérémie Galarneau <jeremie.galarneau@efficios.com>
Change-Id: I4871db160ee4883e63e6ecf5921b4752765d6767
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants