nickfujita: Fix: Message buffering, connection handling, and concurrent connection support #873

Open
tdrz wants to merge 16 commits into main from tdrz/nickfujita-improvements

@tdrz
Collaborator

@tdrz tdrz commented Jan 12, 2026

See #775

@tdrz tdrz requested a review from samwillis January 12, 2026 10:02
@samwillis
Collaborator

I've not had a chance yet to look in detail (I'm tied up this morning in a mob session) - but I did ask GPT5.2 for a review of the branch. It looks like it's spotted a few things.


PR 873 Review — @electric-sql/pglite-socket

Summary

This PR targets three production issues:

  • Large queries causing crashes: PostgreSQL wire-protocol messages split across TCP packets were not being reassembled.
  • ECONNRESET crashes: abrupt disconnects could trigger unhandled rejections and crash the server process.
  • Single connection limitation: connection-level locking prevented concurrent clients from connecting.

The proposed architecture (per-connection handlers + shared query-level serialization) is the right direction for PGlite’s single-threaded execution model.


What looks good

  • Query-level serialization: a shared queue that serializes access to db.execProtocolRaw(...) aligns with PGlite’s constraints while allowing multiple sockets to stay connected.
  • Connection lifecycle separation: isolating per-socket state (buffers, idle timeout, cleanup) into a handler class is a clean boundary.
  • Operational intent is sound: treating ECONNRESET as normal behavior for pooled clients is correct.
  • Config additions: maxConnections and idleTimeout are reasonable operational controls.

Key risks / correctness concerns to double-check

1) TCP fragmentation / buffering must be concurrency-safe

Because Node can emit multiple 'data' events rapidly, ensure the handler’s buffering/drain logic cannot run concurrently in a way that races on shared state (e.g. messageBuffer). A race here can reintroduce message boundary corruption for large queries.

Recommendation: ensure buffer append and buffer-drain are serialized per connection (one drain loop at a time).
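
One way to get "one drain loop at a time" is to append each chunk synchronously and chain every drain onto a per-connection promise. The sketch below is only an illustration under assumptions: `SerializedDrain`, `nextMessageLength`, and `sink` are hypothetical names, not the PR's handler API, and the length callback stands in for whatever framing logic the handler uses.

```typescript
// Hypothetical sketch: per-connection buffering with a single drain loop.
type LengthFn = (buf: Uint8Array) => number | null; // total frame length, or null if unknown yet
type MessageSink = (message: Uint8Array) => Promise<void>;

class SerializedDrain {
  private buffer: Uint8Array = new Uint8Array(0);
  private tail: Promise<void> = Promise.resolve();
  private readonly nextMessageLength: LengthFn;
  private readonly sink: MessageSink;

  constructor(nextMessageLength: LengthFn, sink: MessageSink) {
    this.nextMessageLength = nextMessageLength;
    this.sink = sink;
  }

  // Call from the socket's 'data' listener.
  onData(chunk: Uint8Array): Promise<void> {
    // Appending is synchronous, so chunks are always stored in arrival order.
    const merged = new Uint8Array(this.buffer.length + chunk.length);
    merged.set(this.buffer, 0);
    merged.set(chunk, this.buffer.length);
    this.buffer = merged;

    // Each drain starts only after the previous one has settled.
    const run = this.tail.then(() => this.drain());
    this.tail = run.catch(() => undefined); // keep the chain usable after a failure
    return run;
  }

  private async drain(): Promise<void> {
    for (;;) {
      const len = this.nextMessageLength(this.buffer);
      if (len === null || this.buffer.length < len) return; // wait for more bytes
      const message = this.buffer.slice(0, len);
      this.buffer = this.buffer.slice(len); // consume before awaiting
      await this.sink(message);
    }
  }
}
```

Because the buffer is updated before each `await`, a chunk that arrives while the sink is busy is appended to the already-trimmed buffer and picked up by the same loop.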

2) Queue draining edge cases

The queue manager should guarantee forward progress even if enqueues arrive around the time the processor finishes draining.

Recommendation: verify there is no timing window where processing flips false while there are still items queued (i.e. no “stuck until next enqueue” behavior).
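
A minimal sketch of a queue without that window, assuming a simple FIFO with a `processing` flag (`SerialQueue` is a hypothetical name, not the PR's QueryQueueManager). The point it illustrates is that the emptiness check and the reset of the flag happen in the same synchronous step, so no enqueue can slip in between them.

```typescript
// Hypothetical sketch: serial queue that cannot get "stuck until next enqueue".
class SerialQueue {
  private jobs: Array<() => Promise<void>> = [];
  private processing = false;

  enqueue<T>(run: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      // The wrapped job never rejects; it settles the caller's promise instead.
      this.jobs.push(async () => {
        try {
          resolve(await run());
        } catch (err) {
          reject(err);
        }
      });
      void this.process();
    });
  }

  private async process(): Promise<void> {
    if (this.processing) return; // another loop is already draining
    this.processing = true;
    try {
      for (let job = this.jobs.shift(); job !== undefined; job = this.jobs.shift()) {
        await job();
      }
      // No await between the final shift() returning undefined and the flag reset below.
    } finally {
      this.processing = false;
    }
  }
}
```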

3) Protocol parsing coverage

The parsing logic should handle:

  • StartupMessage (no type byte; [len:int32][protocol:int32][params...])
  • SSLRequest / CancelRequest / GSSENCRequest variants (also no type byte, but different request codes)
  • Regular frontend messages ([type:byte][len:int32][payload...])

Recommendations:

  • Treat lengths as unsigned and validate bounds (e.g. reject absurd lengths to avoid memory blow-ups).
  • Confirm that SSLRequest/CancelRequest packets are not misclassified as regular messages.
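
The distinction can be sketched roughly as follows. The request codes are the ones defined by the PostgreSQL wire protocol; everything else (`classifyFrame`, the `Frame` type, the 64 MiB cap, and tracking startup state with a boolean) is an assumption made for illustration, not code from the PR.

```typescript
// Hypothetical sketch: classify the next frame in a buffer and validate its length.
const SSL_REQUEST = 80877103;    // (1234 << 16) | 5679
const CANCEL_REQUEST = 80877102; // (1234 << 16) | 5678
const GSSENC_REQUEST = 80877104; // (1234 << 16) | 5680
const MAX_MESSAGE_BYTES = 64 * 1024 * 1024; // assumed cap, tune as needed

type FrameKind = 'startup' | 'ssl' | 'cancel' | 'gssenc' | 'regular';
type Frame =
  | { kind: 'incomplete' }
  | { kind: 'invalid'; reason: string }
  | { kind: FrameKind; length: number }; // length = total bytes of the frame

function classifyFrame(buf: Uint8Array, startupDone: boolean): Frame {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);

  if (!startupDone) {
    // Untyped packet: [len:int32][code:int32][...], len includes itself.
    if (buf.length < 8) return { kind: 'incomplete' };
    const length = view.getUint32(0); // unsigned, big-endian
    if (length < 8 || length > MAX_MESSAGE_BYTES) {
      return { kind: 'invalid', reason: `bad startup length ${length}` };
    }
    const code = view.getUint32(4);
    const kind: FrameKind =
      code === SSL_REQUEST ? 'ssl'
      : code === CANCEL_REQUEST ? 'cancel'
      : code === GSSENC_REQUEST ? 'gssenc'
      : 'startup';
    return buf.length < length ? { kind: 'incomplete' } : { kind, length };
  }

  // Regular message: [type:byte][len:int32][payload...], len excludes the type byte.
  if (buf.length < 5) return { kind: 'incomplete' };
  const length = view.getUint32(1);
  if (length < 4 || length > MAX_MESSAGE_BYTES) {
    return { kind: 'invalid', reason: `bad message length ${length}` };
  }
  const total = 1 + length;
  return buf.length < total ? { kind: 'incomplete' } : { kind: 'regular', length: total };
}
```

The caller would keep `startupDone` false after an SSLRequest or GSSENCRequest, since the client still sends its StartupMessage afterwards.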

4) maxConnections rejection behavior

If rejecting connections when maxConnections is reached, ensure the server doesn’t write arbitrary plaintext to a socket that expects Postgres protocol frames.

Recommendation: either close the socket cleanly, or send a proper Postgres ErrorResponse (more work, better UX).
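
For the second option, an ErrorResponse frame is small enough to build by hand. The sketch below assumes the wire layout from the PostgreSQL protocol docs ('E', an int32 length, then NUL-terminated fields ending with a final NUL) and uses SQLSTATE 53300 (too_many_connections); `buildErrorResponse` is a hypothetical helper, not something the PR provides.

```typescript
// Hypothetical sketch: encode a Postgres ErrorResponse for a rejected connection.
function buildErrorResponse(
  message: string,
  sqlState: string = '53300', // too_many_connections
  severity: string = 'FATAL',
): Uint8Array {
  const enc = new TextEncoder();
  const fields: Array<[string, string]> = [
    ['S', severity], // localized severity
    ['V', severity], // non-localized severity
    ['C', sqlState],
    ['M', message],
  ];
  // Each field is: one-byte code, value, NUL terminator.
  const parts = fields.map(([code, value]) => enc.encode(code + value + '\0'));
  const bodyLength = parts.reduce((n, p) => n + p.length, 0) + 1; // + final NUL

  const out = new Uint8Array(1 + 4 + bodyLength);
  out[0] = 0x45; // 'E'
  new DataView(out.buffer).setUint32(1, 4 + bodyLength); // length includes itself, not the type byte
  let offset = 5;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  // The last byte is already 0 and serves as the field-list terminator.
  return out;
}
```

A server would typically write this frame in reply to the client's StartupMessage and then end the socket; whether that ordering fits the handler in this PR is worth checking.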

5) Public API / types

If the handler/server options types changed, ensure:

  • Existing construction patterns remain valid (backwards compatibility expectations).
  • Publicly exported types don’t reference internal/private-only classes (to avoid .d.ts issues).

Testing suggestions (beyond current unit coverage)

  • Fragmentation regression test: send a single large query (>64KB) that is forced to split across multiple TCP packets and confirm the server processes it correctly.
  • Abrupt disconnect test: disconnect mid-query and confirm no unhandled rejections; server continues serving new connections.
  • Concurrency test: open N connections (e.g. 20–50), run interleaved queries, confirm serialization and no deadlocks/starvation.
  • Idle timeout test: with idleTimeout set, verify only idle sockets are closed; active sockets are unaffected.

Files of interest

  • packages/pglite-socket/src/index.ts (handler, queue, server implementation)
  • packages/pglite-socket/src/scripts/server.ts (CLI runner / operational entrypoint)
  • packages/pglite-socket/tests/*.test.ts (coverage for multiplexing/disconnect/fragmentation)

Collaborator

@samwillis samwillis left a comment

@tdrz sorry for the delay! looking really good - I was all set to approve but Opus found a missing await, that's the main thing. There are a few other things it's found too:


PR Review: Connection Multiplexer for PGLite Socket Server

Summary

This PR implements a connection multiplexer allowing multiple clients to share a single PGlite instance via a QueryQueueManager that serializes query execution while maintaining transaction isolation per connection. The approach is sound and the implementation handles the core use cases well.


What's Working Well ✅

  1. Transaction tracking is correctly integrated - The socket server uses execProtocolRaw, and I verified that isInTransaction() works correctly because transaction state is tracked at the WASM write callback level (#pglite_write), which parses CommandCompleteMessage for all protocol execution paths.

  2. Handler ID tracking - Each connection gets a unique ID for transaction attribution, enabling correct isolation.

  3. Transaction-aware queue processing - When a transaction is active, only queries from the transaction owner are processed, preventing interleaving.

  4. Cleanup on disconnect - clearQueueForHandler rejects pending queries and clearTransactionIfNeeded rolls back orphaned transactions.

  5. Good integration test coverage - Tests cover:

    • Interleaved transaction and query from different clients
    • Transaction owner disconnect/crash scenarios
    • Two independent interleaved transactions
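
The transaction-aware selection described in item 3 can be pictured as a small pure function. This is only a sketch under assumptions: `QueuedQuery` and `pickNextIndex` are hypothetical names, and the actual queue manager in the PR may structure this differently.

```typescript
// Hypothetical sketch: choose which queued query may run next.
interface QueuedQuery {
  handlerId: number;
}

function pickNextIndex(
  queue: readonly QueuedQuery[],
  inTransaction: boolean,
  ownerId: number | null,
): number {
  if (queue.length === 0) return -1;
  // Outside a transaction, plain FIFO order applies.
  if (!inTransaction || ownerId === null) return 0;
  // Inside a transaction, only the owner's queries are eligible;
  // -1 means "nothing runnable until the owner sends something or disconnects".
  return queue.findIndex((q) => q.handlerId === ownerId);
}
```

Keeping the selection free of side effects would also make the ordering rules straightforward to unit test in isolation.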

Issues Found

1. 🔴 Critical: Missing await on ROLLBACK

async clearTransactionIfNeeded(handlerId: number): Promise<void> {
  if (this.db.isInTransaction() && this.lastHandlerId === handlerId) {
    this.db.exec('ROLLBACK')  // ← Missing await!
    this.lastHandlerId = null
    await this.processQueue()
  }
}

The rollback may not complete before processQueue() is called, potentially causing the next query to execute while the rollback is still in progress. This could lead to undefined behavior or transaction state corruption.

Fix:

await this.db.exec('ROLLBACK')

2. 🟡 Medium: Potential Indefinite Blocking

When a transaction is active but the transaction owner has no queries in the queue, processQueue() breaks out of the loop:

if (i === -1) {
  query = null
}
if (!query) break

If the transaction owner is slow (e.g., user thinking, network delay), other clients' queries will sit in the queue indefinitely until the owner sends another query or disconnects.

Scenario:

  1. Client A sends BEGIN → transaction starts
  2. Client B sends SELECT 1 → queued, waiting
  3. Client A is idle for 30 seconds...
  4. Client B's query is blocked the entire time

Consider adding a warning log when queries are blocked, or a configurable timeout for blocked queries.
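
A configurable timeout could be layered on top of the queued promise without touching the queue loop itself. The sketch below assumes each queued query is represented by a promise; `withQueueTimeout`, `QueueTimeoutError`, and the `onTimeout` hook are hypothetical names for illustration.

```typescript
// Hypothetical sketch: reject a queued query that waits longer than `ms`.
class QueueTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'QueueTimeoutError';
  }
}

function withQueueTimeout<T>(
  pending: Promise<T>,
  ms: number,
  onTimeout?: () => void, // e.g. log a warning and remove the entry from the queue
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => {
      if (onTimeout) onTimeout();
      reject(new QueueTimeoutError(`query waited more than ${ms} ms in the queue`));
    }, ms);
    pending.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}
```

The timed-out entry still has to be removed from the queue (the `onTimeout` hook is one place to do that), otherwise it would run later with nobody waiting for the result.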


Test Coverage Gaps

The integration tests are good, but there are some gaps:

  1. No unit tests for QueryQueueManager - The handler tests mock it, so the actual transaction queue logic isn't unit tested in isolation.

  2. No test for slow/idle transaction owner - Would be valuable to verify behavior when the transaction owner doesn't send queries for an extended period.

  3. No explicit queue ordering verification - No test explicitly verifies that transaction owner's queries are prioritized correctly when multiple handlers have queued queries.


Minor Observations

  • The CONNECTION_QUEUE_TIMEOUT constant (line 5) appears to be exported but unused after the refactor to the new multiplexing approach.

  • Consider adding JSDoc documentation to QueryQueueManager explaining the transaction isolation strategy.


Verdict

The architecture is solid and the transaction isolation approach is correct. Please fix the missing await on the ROLLBACK before merging. The blocking concern is worth noting but acceptable for an initial implementation - it could be addressed in a follow-up if it becomes an issue in practice.

@tdrz
Collaborator Author

tdrz commented Jan 30, 2026

@samwillis Thanks, good points! Fixed the await.

The rest are all valid points, will address them in the future.

@tdrz tdrz requested a review from samwillis January 30, 2026 19:41
3 participants