-
Notifications
You must be signed in to change notification settings - Fork 140
Rescue ThreadError when creating Sender Thread #307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lib/datadog/statsd/sender.rb
Outdated
| # start background thread | ||
| @sender_thread = @thread_class.new(&method(:send_loop)) | ||
| @sender_thread.name = "Statsd Sender" unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3') | ||
| rescue ThreadError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to try to distinguish the error on shutdown from other possible errors? Thread.new can fail for other reasons, such as no memory or too many processes, and it would be unfortunate to bury that in debug messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I've updated it to log the error message. Let me know if that's sufficient enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logging the message is better, but it still a debug message. It may not be enabled and it would be difficult to find if failure to start the sender thread results in loss of observability.
Would it be possible to check if the sender thread was terminated or the process is exiting and skip re-starting sender thread in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a check for sender_thread.status.nil? to check for thread termination. I am not too familiar with ruby so I'm unsure how to check if the process is exiting. I found the at_exit function but I don't think it would apply here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ruby's Thread documentation is unfortunately lacking in its descriptions of what happens when, but this program still crashes on my machine sometimes, so it looks like status and alive? are not (always) updated together:
sender = Thread.new do
sleep 600
end
task = Thread.new do
begin
sleep 600
rescue Interrupt
ensure
if not sender.alive? and not sender.status.nil?
Thread.new { puts "OK" }.join
end
end
end
sleep 0.01
task.raise InterruptThis, however, seems to work fine:
done = false
sender = Thread.new do
sleep 600
ensure
done = true
end
task = Thread.new do
begin
sleep 600
rescue Interrupt
ensure
if not sender.alive? and not done
Thread.new { puts "OK" }.join
end
end
end
sleep 0.01
task.raise InterruptVariable assignment appears to be atomic at least in MRI, but we'd need to double-check this for all implementations we support, or we'd need an explicit atomic or a mutex. Although it seems to happen in practice, I also could not find any guarantee that ensure blocks must run when process terminates the threads, so this might not be entirely valid approach either.
|
I have done some testing on this. If I understand correctly the following script reproduces the problem (on MRI): You can add an explicit Result: The "can't alloc thread" message in ThreadError is, funnily enough, used only in the case of the main thread being dead (and is, therefore, never a failure to allocate anything): https://github.com/ruby/ruby/blob/master/thread.c#L908 To detect the case of process exiting in the existing Ruby releases, it should then be sufficient to simply check for the ThreadError's message being "can't alloc thread". Of course, this is a pretty fragile way of going about it, and our code also wouldn't make sense to people reading it (since the issue is not allocation of thread). A more reliable way of detecting this situation would be performing the check quoted above for main thread status, however I am not sure how to get at it from Ruby code (since dogstatsd doesn't have any C extensions that I can currently spot). On JRuby, threads can be created after the main thread starts exiting (no exception is raised). The reproducer above produces: |
p-datadog
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that the PR will attempt to flush the messages one time but if there are multiple calls to add after the process begins to exit, the second and subsequent ones would not attempt to flush the data out synchronously (due to the @done flag check).
lib/datadog/statsd/sender.rb
Outdated
| # the parent process) and spawn a new companion thread. | ||
| if !sender_thread.alive? | ||
| @mx.synchronize { | ||
| break if @done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this added line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the intention is to not try to start again if the previous start failed.... But yeah it's probably not a bad idea to make this a lot more clearer.
lib/datadog/statsd/sender.rb
Outdated
| @sender_thread.name = "Statsd Sender" unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3') | ||
| rescue ThreadError => e | ||
| @logger.debug { "Statsd: Failed to start sender thread: #{e.message}, flushing message buffer" } if @logger | ||
| message_buffer.flush |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line can raise exceptions, that the add method previously wouldn't be raising (i.e. those that are a result of a sending attempt); are they already appropriately handled by upstream code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't completely clear to me on the first pass that the intention here is to flush the message_buffer synchronously (aka in the current thread).
Yet... if the intention is that we get there from being called either when the Sender is originally created (nothing in the buffer) or from add (in the forked process scenario), is message_buffer ever going to contain anything at all?
The control flow to this method seems to be a bit confusing, e.g. it may be worth clarifying when and how this gets called (with a comment or a refactoring?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, if forked, message_queue and message_buffer is emptied. I've removed the additional flush.
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is definitely a weird one, thanks for looking into it! I left a few notes/suggestions.
lib/datadog/statsd/sender.rb
Outdated
| @sender_thread.name = "Statsd Sender" unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3') | ||
| rescue ThreadError => e | ||
| @logger.debug { "Statsd: Failed to start sender thread: #{e.message}, flushing message buffer" } if @logger | ||
| message_buffer.flush |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wasn't completely clear to me on the first pass that the intention here is to flush the message_buffer synchronously (aka in the current thread).
Yet... if the intention is that we get there from being called either when the Sender is originally created (nothing in the buffer) or from add (in the forked process scenario), is message_buffer ever going to contain anything at all?
The control flow to this method seems to be a bit confusing, e.g. it may be worth clarifying when and how this gets called (with a comment or a refactoring?).
lib/datadog/statsd/sender.rb
Outdated
| # the parent process) and spawn a new companion thread. | ||
| if !sender_thread.alive? | ||
| @mx.synchronize { | ||
| break if @done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect the intention is to not try to start again if the previous start failed.... But yeah it's probably not a bad idea to make this a lot more clearer.
| end | ||
| end | ||
| @sender_thread.name = "Statsd Sender" unless Gem::Version.new(RUBY_VERSION) < Gem::Version.new('2.3') | ||
| rescue ThreadError => e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest adding a test for this, as it's a bit brittle to maintain and understand what led to it. It looks like it'd be easy to do so: the @thread_class is already being injected, and thus it can be replaced by something that always raises a dummy ThreadError when #new is called on it.
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few extra notes :)
lib/datadog/statsd/sender.rb
Outdated
| ensure | ||
| @mx.synchronize { @done = true } | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the intention is to handle ThreadErrors... I don't think this is doing it. Specifically, if the ThreadError is raised by Thread.new, then the ensure will never run, so @done is never going to be set, and every add will result in the attempt to recreate the thread.
spec/statsd/sender_spec.rb
Outdated
| end | ||
|
|
||
| before do | ||
| allow(thread_class).to receive(:new).and_raise(ThreadError, "ThreadError") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor:
| allow(thread_class).to receive(:new).and_raise(ThreadError, "ThreadError") | |
| expect(thread_class).to receive(:new).and_raise(ThreadError, "ThreadError") |
Using expect here adds a tiny extra level of checks against the test passing trivially, e.g. if the method does not do anything. This is useful in tests that check "it doesn't break" as otherwise they can pass for the wrong reasons.
Alternatively, you can expect on the logger receiving a message as a way of "observing" that the correct path was exercised in the test.
lib/datadog/statsd/sender.rb
Outdated
| @mx.synchronize { | ||
| # an attempt was previously made to start the sender thread but failed. | ||
| # skipping re-start | ||
| break if @done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question occurred to me on a second pass on this one: is break really the intention here, or would return be more correct?
Since we're inside a Ruby block, break only pops out of the @mx.synchronize, and then executes the rest of the method (e.g. from the if message_queue.length <= @queue_size on).
Which means we still store all messages whenever there's a failure. Is that intended? (May be worth adding a spec to clarify this -- e.g. that the queue is either empty or not empty after the failure)
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
lib/datadog/statsd/sender.rb
Outdated
| @sender_thread = @thread_class.new do | ||
| begin | ||
| send_loop | ||
| ensure | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I guess this can go back to the earlier @sender_thread = @thread_class.new(&method(:send_loop))
This pr introduces a change to handle the case in which
sender_threadfails to start and aThreadErroris returned.Addresses issue #306 authored by @mperham