feat(zero-cache): add snapshot reservation timeout #5409
Motivation
Change log cleanup is paused while any client holds a `/snapshot` reservation. Today a reservation is only released when the client finishes snapshot initialization or the `/snapshot` connection closes.

If a view syncer wedges mid snapshot initialization (hang, deadlock, stuck IO, event loop starvation), the reservation can remain open indefinitely. That pauses cleanup indefinitely, which allows `_zero.changeLog` to grow without bound.

We hit this in production (Goblins): multiple view syncer tasks held `/snapshot` reservations for hours. Cleanup stayed paused, the changelog grew roughly 10x during peak write load, and view syncer latency climbed into the 100s of ms.

Failure mode (ASCII)
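In short:

```
view syncer wedges during snapshot init
              |
              v
 /snapshot reservation never released
              |
              v
   change-log cleanup stays paused
              |
              v
 _zero.changeLog grows without bound
              |
              v
slow changelog reads + huge catch-up purges
```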
What this looked like in logs
Cleanup paused while snapshotting:
{"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"pausing change-log cleanup while <task-id> snapshots"} {"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"watermark cleanup paused for snapshot(s): <task-id>[,<task-id>...]"}No purges until reservations clear, then a large purge:
{"level":"INFO","worker":"change-streamer","component":"change-streamer","message":"Purged 98547 changes before <watermark>"}View syncer symptom (slow changelog reads):
{"level":"WARN","worker":"syncer","component":"view-syncer","class":"Statement","sql":"SELECT * FROM \"_zero.changeLog\" WHERE stateVersion > ?","method":"iterate","message":"Slow SQLite query 227.85"}Catchup draining the backlog:
{"level":"INFO","worker":"change-streamer","message":"caught up <task-id>/serving-replicator-<pid> with 162277 changes (155074 ms)"}FAQ
Doesn't websocket liveness already cover this?
There is websocket liveness (ping/pong) in `zero-cache` streams. It will terminate connections that stop responding to heartbeats.

That helps for truly dead connections, but it does not bound how long a reservation can be held when the view syncer is still alive enough to respond to heartbeats but is not making progress in snapshot init.
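For illustration (a sketch, not zero-cache code; the URL, port, and the deliberate wedge are stand-ins): a client whose event loop stays healthy keeps answering pings, so liveness checks pass, even though the snapshot work it owes never completes.

```ts
import WebSocket from 'ws';

// Illustration only: endpoint and port are placeholders.
const ws = new WebSocket('ws://replication-manager:4849/snapshot');

ws.on('open', async () => {
  // The `ws` client answers server pings with pongs automatically, so a
  // ping/pong liveness check keeps passing as long as the event loop is alive,
  // even though the "snapshot init" below never makes progress.
  await new Promise<void>(() => {}); // wedged forever
  ws.close(); // never reached, so the /snapshot reservation is never released
});
```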
Doesn't a container health check solve this?
A container health check is a great mitigation and we agree it should be used.
One important detail is that in the current `zero-cache` boot sequence, `/keepalive` is generally not served until after the worker reports ready. The view syncer performs its litestream restore before it reports ready, so a container health check that hits `/keepalive` should fail during restore (connection refused or timeout).

This means a properly configured container health check plus a reasonable `startPeriod` (upper bound on expected restore time) should prevent the specific "stuck in restore for hours" failure we saw.

However, it is still easy to end up with a long cleanup pause in practice:

- Overly generous `startPeriod` or service health check grace periods (which are "ignore failing checks" timers).
- `/keepalive` becoming available earlier in startup.

Because cleanup blocking is a high impact failure mode, relying on orchestration alone is brittle. A server-side timeout is a small, explicit safety valve.
Wouldn't the view syncer spinning down release the snapshot reservation?
Not necessarily. The reservation is released when the replication-manager observes the `/snapshot` connection closing.

A view syncer can be "draining" from a load balancer perspective while the process is still alive and the socket is still open. Separately, if a view syncer is alive enough to keep the socket open (and respond to heartbeats) but not making progress, the reservation will still be held.
Why a configurable timeout instead of a hardcoded one?
Snapshot restore plus init time varies a lot by deployment (replica size, WAL volume, bandwidth, checkpointing behavior, restore parallelism, etc.). A config option lets operators choose a value that is safely above their expected snapshot time while still bounding the worst-case cleanup pause.
Change
- New config option `litestream.snapshotReservationTimeoutMs` (flag: `--litestream-snapshot-reservation-timeout-ms`, env: `ZERO_LITESTREAM_SNAPSHOT_RESERVATION_TIMEOUT_MS`, default: `0` (disabled)).
- Each `/snapshot` reservation gets a timer. If the timer fires, the reservation is ended and the websocket is closed so cleanup can resume.
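A sketch of the shape of the change, using hypothetical names (`Reservation`, `release()`, `armReservationTimeout`) rather than the actual code paths touched by this PR:

```ts
import type WebSocket from 'ws';

// Hypothetical reservation handle; the real internals differ.
interface Reservation {
  taskID: string;
  release(): void; // ends the reservation so change-log cleanup can resume
}

// Arms a safety-valve timer for one /snapshot reservation. The returned
// function cancels the timer when snapshot init completes (or the socket
// closes) before the timeout fires.
function armReservationTimeout(
  reservation: Reservation,
  ws: WebSocket,
  timeoutMs: number, // litestream.snapshotReservationTimeoutMs; 0 = disabled
): () => void {
  if (timeoutMs <= 0) {
    return () => {}; // disabled (the default): behavior is unchanged
  }
  const timer = setTimeout(() => {
    // Timed out before snapshot init finished: end the reservation and close
    // the websocket so cleanup resumes.
    reservation.release();
    ws.close();
  }, timeoutMs);
  return () => clearTimeout(timer);
}
```

With the default of `0` the timer is never armed, so behavior is unchanged; operators who opt in would pick a value comfortably above their expected restore-plus-init time.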
Notes

- … (`updateCleanupDelay=false`).

Testing
- `npm --prefix packages/zero-cache run check-types`
- `npm --prefix packages/zero-cache run test:no-pg`