Skip to content

Conversation

@jeremy
Copy link
Contributor

@jeremy jeremy commented Nov 24, 2025

Fixes a race condition that causes SIGSEGV when a connection is closed while another thread is mid-query.

The library isn't meant to be thread-safe, but in case of mistaken concurrent use let's raise exceptions rather than crashing. Prefer both belt and suspenders when they're cheap and available.

Adds NULL deref guards to protect against connection close while awaiting IO (rb_wait_for_single_fd without GVL). Now raises Trilogy::ConnectionClosed instead of segfaulting.

Guards:

  • read_packet: check socket and buffer before read
  • trilogy_flush_writes: check socket before write
  • current_packet_type: check buffer before access
  • Ruby wrapper flush_writes: early exit on closed socket
  • Ruby read_query_response: check socket after wait returns
  • Ruby buffer_checkout/checkin: NULL pool fallback

/cc @jhawthorn

Fixes a race condition that causes SIGSEGV when a connection is closed
while another thread is mid-query.

The library isn't meant to be thread-safe, but in case of mistaken
concurrent use let's raise exceptions rather than crashing. Prefer both
belt and suspenders when they're cheap and available.

Adds NULL deref guards to protect against connection close while
awaiting IO (rb_wait_for_single_fd without GVL). Now raises
Trilogy::ConnectionClosed instead of segfaulting.

Guards:
* read_packet: check socket and buffer before read
* trilogy_flush_writes: check socket before write
* current_packet_type: check buffer before access
* Ruby wrapper flush_writes: early exit on closed socket
* Ruby read_query_response: check socket after wait returns
* Ruby buffer_checkout/checkin: NULL pool fallback
@jhawthorn
Copy link
Collaborator

I don't think I agree with this approach. Using a connection in parallel is so dangerous that a SEGV is preferable to a silent failure (or a failure that may be retried at another level of the stack, like TRILOGY_CLOSED_CONNECTION is likely to be)

@jeremy
Copy link
Contributor Author

jeremy commented Dec 1, 2025

@jhawthorn Goal here is louder failure, not silent failure. When a developer makes the mistake of concurrent access, a Ruby exception pointing at the call site will help them fix it faster than a SEGV they may never see with a native backtrace.

(SEGV in production is often silent or lost in the noise, manifesting as a worker failure + supervisor restart rather than an exception that can be reported and acted on. You get a C backtrace and crash dump rather than a relevant Ruby backtrace, and many environments have core dumps disabled. The Ruby exception is easier to observe and act on: it leads you straight to where concurrent access originated and gets logged with relevant context.)

Good point re. retry risk. Could raise a distinct exception like Trilogy::UnsafeConcurrentAccess that's clearly fatal.

Otherwise, these are cheap NULL checks. They don't make the library thread-safe or encourage concurrent use; they just make violations debuggable rather than catastrophic. Big bonus is that they communicate that the library is fully aware of where state violations may occur and how they should be handled.

@jhawthorn
Copy link
Collaborator

I think this will provide a false sense of security that invalid concurrent access will raise cleanly. In reality there's no guarantee these NULL checks will practically guard against issues. The value could just as easily become NULL the instant after it is checked.

I would like something that better guards against concurrent access, but it should either look like the mutexes added by #226 or using atomic operations to mark when a connection is in use. Anything else is unsound.

@matthewd
Copy link
Collaborator

matthewd commented Dec 1, 2025

Yeah, I agree with @jhawthorn.

Big bonus is that they communicate that the library is fully aware of where state violations may occur and how they should be handled.

It's not though, and that's the danger: these aren't "oopsie you pushed concurrent-access a little too far, have an exception and back up a bit"; they're evidence that you have fundamentally mishandled the library in a way that has probably already been leaking your customers' data to each other in ways we have no practical way of detecting let alone reporting.

@jeremy
Copy link
Contributor Author

jeremy commented Dec 1, 2025

Agreed the preconditions and checks are inherently racy and are incomplete without synchronization. I'll withdraw until that's addressed and these can become invariant checks (indicating improper synchronization impl) rather than concurrency bomb sniffers.

@jeremy jeremy closed this Dec 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants