Xandra exits on connection shutdown during connection checkout #390
Merged
whatyouhide merged 2 commits into whatyouhide:main on Jan 14, 2026
Conversation
Trap shutdown exits from the gen_statem checkout and surface them as `:connection_shutdown` errors, so callers don’t crash and retry logic can run.
whatyouhide (Owner) approved these changes on Jan 14, 2026 and left a comment:
Fantastic! 🎉 Amazing job on the report and root cause analysis as well, thanks @jvf.
Summary
When a `Xandra.Connection` process shuts down while a caller is blocked in `:gen_statem.call(conn_pid, {:checkout_state_for_next_request, ref}, :infinity)`, the caller exits with `:shutdown`. This propagates to application processes, causing avoidable crashes. Because this happens during the state checkout (before a query is attempted), the retry strategy is never invoked, and the caller dies instead of receiving a retryable error.
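To make the failure mode concrete, here is a minimal standalone sketch (hypothetical `BlockingServer` module, not Xandra code) of a caller crashing because the `gen_statem` it is calling shuts down:

```elixir
# Minimal reproduction sketch: a caller blocked in :gen_statem.call/3
# dies when the server it is calling shuts down.
defmodule BlockingServer do
  @behaviour :gen_statem

  def callback_mode, do: :state_functions
  def init(:ok), do: {:ok, :idle, nil}

  # Never reply, so callers stay blocked inside :gen_statem.call/3.
  def idle({:call, _from}, :checkout, data), do: {:keep_state, data}
end

{:ok, server} = :gen_statem.start(BlockingServer, :ok, [])

caller = spawn(fn -> :gen_statem.call(server, :checkout, :infinity) end)
Process.sleep(50)

# Stopping the server with reason :shutdown makes the blocked caller exit
# with {:shutdown, {:gen_statem, :call, [server, :checkout, :infinity]}}.
:gen_statem.stop(server, :shutdown, :infinity)
Process.sleep(50)

Process.alive?(caller)
#=> false
```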
Example
We experienced this issue in production. A single Cassandra node experienced a "local pause" (this is what the `FailureDetector` reports in the Cassandra logs), presumably due to I/O exhaustion (this is what our OS-level monitoring suggests). Around that time, while trying to insert data into this node, one of our processes exited because Xandra propagated the exit.

In our application logs we first saw the exit (at this point the Cassandra node had already been experiencing problems for ~40 seconds).

Roughly 25 seconds later we saw the connections to the node being re-established (only the first connection of the pool is shown here).
Where it happens
I think this is happening in `Xandra.Connection` when it uses `:gen_statem.call/3` during checkout. The current helper only traps `:noproc`.
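A sketch of what that helper plausibly looks like, inferred from its name and the behavior described here (the exact code in `Xandra.Connection` may differ):

```elixir
# Hypothetical reconstruction, not the actual Xandra source: only a :noproc
# exit from the call is trapped and converted into an error tuple.
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  :exit, {:noproc, {:gen_statem, :call, _args}} ->
    {:error, :noproc}
end
```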
If the connection exits with `:shutdown`, the caller exits and does not return `{:error, _}`.

Expected behavior
If the connection goes down during checkout, the caller should receive a normal error (e.g. `{:error, :connection_shutdown}` wrapped in `Xandra.ConnectionError`), so that the caller process does not crash and retry strategies can handle the failure.
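For illustration, a caller could then handle the failure like this (a sketch; the query, the params, and the exact error shape are assumptions based on this proposal):

```elixir
require Logger

# Hypothetical caller-side handling once the shutdown is surfaced as a
# normal error tuple instead of an exit.
case Xandra.execute(conn, "INSERT INTO events (id, payload) VALUES (?, ?)", [
       {"uuid", id},
       {"text", payload}
     ]) do
  {:ok, result} ->
    {:ok, result}

  {:error, %Xandra.ConnectionError{reason: :connection_shutdown} = error} ->
    # The process no longer crashes; we can log and let our retry logic
    # (or a configured retry strategy) handle the failure.
    Logger.warning("connection shut down during checkout: " <> Exception.message(error))
    {:error, error}
end
```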
Proposed change
Catch `:shutdown` exit reasons in the `gen_statem_call_trapping_noproc/2` helper and return a normal error tuple:
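Under the same assumptions as the reconstruction above, the patched helper could look like this:

```elixir
# Hypothetical patched helper: a :shutdown exit (plain or wrapped, as in
# {:shutdown, reason}) is turned into an error tuple, just like :noproc.
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  :exit, {:noproc, {:gen_statem, :call, _args}} ->
    {:error, :noproc}

  :exit, {:shutdown, {:gen_statem, :call, _args}} ->
    {:error, :connection_shutdown}

  :exit, {{:shutdown, _reason}, {:gen_statem, :call, _args}} ->
    {:error, :connection_shutdown}
end
```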
This keeps the error handling consistent with the existing `{:error, reason}` path and allows downstream retry strategies to kick in.
Additional context
As noted above, we experienced this in production in a multi-node cluster. Our OS-level monitoring suggests I/O exhaustion starting at 2:18:45 UTC.
In the Cassandra logs we see:
This indicates that the `FailureDetector` was not scheduled for ~45 seconds (the warning is generated when it has not been scheduled for > 5 s). So the problem was ongoing for at least ~40 seconds. We think this is the effect of the I/O exhaustion.
Next we see lots of these:
This matches the timestamp of Xandra restarting connections to the node, so we think this is the effect of Xandra closing and restarting the connections in the pool.
What I think is happening:
Xandra.Connectionreceives ais_closed_messageoris_error_messagesocket messageXandra.Cluster.PoolXandra.Cluster.Poolreacts by stopping the entire pool for that host:shutdownto all connection processes in that poolConnection reset by peermessages in Cassandra:gen_statem.callcheckout at that moment, it exits with:shutdown(as seen in our application logs)Environment