Xandra exits on connection shutdown during connection checkout #390
Merged
whatyouhide merged 2 commits into whatyouhide:main on Jan 14, 2026
Conversation
Trap shutdown exits from the gen_statem checkout and surface them as `:connection_shutdown` errors, so callers don’t crash and retry logic can run.
whatyouhide (Owner) approved these changes on Jan 14, 2026 and left a comment:
Fantastic! 🎉 Amazing job on the report and root cause analysis as well, thanks @jvf.
Summary
When a `Xandra.Connection` process shuts down while a caller is blocked in `:gen_statem.call(conn_pid, {:checkout_state_for_next_request, ref}, :infinity)`, the caller exits with `:shutdown`. This propagates to application processes, causing avoidable crashes. Because this happens during the state checkout (before a query is attempted), the retry strategy is never invoked, and the caller dies instead of receiving a retryable error.
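To make the failure mode concrete, here is a minimal standalone sketch (hypothetical `BlockingServer` module, not Xandra code) of a caller crashing because the `gen_statem` it is calling shuts down:

```elixir
# Minimal reproduction sketch: a caller blocked in :gen_statem.call/3
# dies when the server it is calling shuts down.
defmodule BlockingServer do
  @behaviour :gen_statem

  def callback_mode, do: :state_functions
  def init(:ok), do: {:ok, :idle, nil}

  # Never reply, so callers stay blocked inside :gen_statem.call/3.
  def idle({:call, _from}, :checkout, data), do: {:keep_state, data}
end

{:ok, server} = :gen_statem.start(BlockingServer, :ok, [])

caller = spawn(fn -> :gen_statem.call(server, :checkout, :infinity) end)
Process.sleep(50)

# Stopping the server with reason :shutdown makes the blocked caller exit
# with {:shutdown, {:gen_statem, :call, [server, :checkout, :infinity]}}.
:gen_statem.stop(server, :shutdown, :infinity)
Process.sleep(50)

Process.alive?(caller)
#=> false
```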
Example
We experienced this issue in production. A single Cassandra node experienced a "local pause" (this is what the `FailureDetector` reports in the Cassandra logs), presumably due to I/O exhaustion (this is what our OS-level monitoring suggests). Around that time, while trying to insert data into this node, one of our processes exited because Xandra propagated the exit.

In our application logs we first saw the exit (at this point the Cassandra node had already been experiencing problems for ~40 seconds).

Roughly 25 seconds later we saw the connections to the node being re-established (only the first connection of the pool is shown here).
Where it happens
I think this is happening in `Xandra.Connection` when it uses `:gen_statem.call/3` during checkout. The current helper only traps `:noproc`.
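A sketch of what that helper plausibly looks like, inferred from its name and the behavior described here (the exact code in `Xandra.Connection` may differ):

```elixir
# Hypothetical reconstruction, not the actual Xandra source: only a :noproc
# exit from the call is trapped and converted into an error tuple.
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  :exit, {:noproc, {:gen_statem, :call, _args}} ->
    {:error, :noproc}
end
```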
If the connection exits with `:shutdown`, the caller exits and does not return `{:error, _}`.

Expected behavior
If the connection goes down during checkout, the caller should receive a normal error (e.g. `{:error, :connection_shutdown}` wrapped in `Xandra.ConnectionError`), so that the caller process does not crash and retry strategies can handle the failure.
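For illustration, a caller could then handle the failure like this (a sketch; the query, the params, and the exact error shape are assumptions based on this proposal):

```elixir
require Logger

# Hypothetical caller-side handling once the shutdown is surfaced as a
# normal error tuple instead of an exit.
case Xandra.execute(conn, "INSERT INTO events (id, payload) VALUES (?, ?)", [
       {"uuid", id},
       {"text", payload}
     ]) do
  {:ok, result} ->
    {:ok, result}

  {:error, %Xandra.ConnectionError{reason: :connection_shutdown} = error} ->
    # The process no longer crashes; we can log and let our retry logic
    # (or a configured retry strategy) handle the failure.
    Logger.warning("connection shut down during checkout: " <> Exception.message(error))
    {:error, error}
end
```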
Proposed change
Catch `:shutdown` exit reasons in the `gen_statem_call_trapping_noproc/2` helper and return a normal error tuple:
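Under the same assumptions as the reconstruction above, the patched helper could look like this:

```elixir
# Hypothetical patched helper: a :shutdown exit (plain or wrapped, as in
# {:shutdown, reason}) is turned into an error tuple, just like :noproc.
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  :exit, {:noproc, {:gen_statem, :call, _args}} ->
    {:error, :noproc}

  :exit, {:shutdown, {:gen_statem, :call, _args}} ->
    {:error, :connection_shutdown}

  :exit, {{:shutdown, _reason}, {:gen_statem, :call, _args}} ->
    {:error, :connection_shutdown}
end
```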
This keeps the error handling consistent with the existing `{:error, reason}` path and allows downstream retry strategies to kick in.
Additional context
As noted above, we experienced this in production in a multi-node cluster. Our OS-level monitoring suggests I/O exhaustion starting at 2:18:45 UTC.
In the Cassandra logs we see:
This indicates that the `FailureDetector` was not scheduled for ~45 seconds (the warning is generated when it has not been scheduled for > 5 s). So the problem was ongoing for at least ~40 seconds. We think this is the effect of the I/O exhaustion.
Next we see lots of these:
This matches the timestamp of Xandra restarting connections to the node, so we think this is the effect of Xandra closing and restarting the connections in the pool.
What I think is happening:
Xandra.Connectionreceives ais_closed_messageoris_error_messagesocket messageXandra.Cluster.PoolXandra.Cluster.Poolreacts by stopping the entire pool for that host:shutdownto all connection processes in that poolConnection reset by peermessages in Cassandra:gen_statem.callcheckout at that moment, it exits with:shutdown(as seen in our application logs)Environment