feat(zero-cache): add snapshot reservation timeout #5409
Motivation
Change log cleanup is paused while any client holds a `/snapshot` reservation. Today a reservation is only released when the client finishes snapshot initialization or the `/snapshot` connection closes.

If a view syncer wedges mid snapshot initialization (hang, deadlock, stuck IO, event loop starvation), the reservation can remain open indefinitely. That pauses cleanup indefinitely, which allows `_zero.changeLog` to grow without bound.

We hit this in production (Goblins): multiple view syncer tasks held `/snapshot` reservations for hours. Cleanup stayed paused, the changelog grew roughly 10x during peak write load, and view syncer latency climbed into the 100s of ms.

Failure mode (ASCII)
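In short:

```
view syncer wedges during snapshot init
              |
              v
 /snapshot reservation never released
              |
              v
   change-log cleanup stays paused
              |
              v
 _zero.changeLog grows without bound
              |
              v
slow changelog reads + huge catch-up purges
```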
What this looked like in logs
Cleanup paused while snapshotting:
{"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"pausing change-log cleanup while <task-id> snapshots"} {"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"watermark cleanup paused for snapshot(s): <task-id>[,<task-id>...]"}No purges until reservations clear, then a large purge:
{"level":"INFO","worker":"change-streamer","component":"change-streamer","message":"Purged 98547 changes before <watermark>"}View syncer symptom (slow changelog reads):
{"level":"WARN","worker":"syncer","component":"view-syncer","class":"Statement","sql":"SELECT * FROM \"_zero.changeLog\" WHERE stateVersion > ?","method":"iterate","message":"Slow SQLite query 227.85"}Catchup draining the backlog:
{"level":"INFO","worker":"change-streamer","message":"caught up <task-id>/serving-replicator-<pid> with 162277 changes (155074 ms)"}FAQ
Doesn't websocket liveness already cover this?
There is websocket liveness (ping/pong) in `zero-cache` streams. It will terminate connections that stop responding to heartbeats.

That helps for truly dead connections, but it does not bound how long a reservation can be held when the view syncer is still alive enough to respond to heartbeats but is not making progress in snapshot init.
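For illustration (a sketch, not zero-cache code; the URL, port, and the deliberate wedge are stand-ins): a client whose event loop stays healthy keeps answering pings, so liveness checks pass, even though the snapshot work it owes never completes.

```ts
import WebSocket from 'ws';

// Illustration only: endpoint and port are placeholders.
const ws = new WebSocket('ws://replication-manager:4849/snapshot');

ws.on('open', async () => {
  // The `ws` client answers server pings with pongs automatically, so a
  // ping/pong liveness check keeps passing as long as the event loop is alive,
  // even though the "snapshot init" below never makes progress.
  await new Promise<void>(() => {}); // wedged forever
  ws.close(); // never reached, so the /snapshot reservation is never released
});
```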
Doesn't a container health check solve this?
A container health check is a great mitigation and we agree it should be used.
One important detail is that in the current `zero-cache` boot sequence, `/keepalive` is generally not served until after the worker reports ready. The view syncer performs its litestream restore before it reports ready, so a container health check that hits `/keepalive` should fail during restore (connection refused or timeout).

This means a properly configured container health check plus a reasonable `startPeriod` (upper bound on expected restore time) should prevent the specific "stuck in restore for hours" failure we saw.

However, it is still easy to end up with a long cleanup pause in practice:

- Overly generous `startPeriod` or service health check grace periods (which are "ignore failing checks" timers).
- `/keepalive` becoming available earlier in startup.

Because cleanup blocking is a high impact failure mode, relying on orchestration alone is brittle. A server-side timeout is a small, explicit safety valve.
Wouldn't the view syncer spinning down release the snapshot reservation?
Not necessarily. The reservation is released when the replication-manager observes the `/snapshot` connection closing.

A view syncer can be "draining" from a load balancer perspective while the process is still alive and the socket is still open. Separately, if a view syncer is alive enough to keep the socket open (and respond to heartbeats) but not making progress, the reservation will still be held.
Why a configurable timeout instead of a hardcoded one?
Snapshot restore plus init time varies a lot by deployment (replica size, WAL volume, bandwidth, checkpointing behavior, restore parallelism, etc.). A config option lets operators choose a value that is safely above their expected snapshot time while still bounding the worst-case cleanup pause.
Change
- New config option `litestream.snapshotReservationTimeoutMs` (flag: `--litestream-snapshot-reservation-timeout-ms`, env: `ZERO_LITESTREAM_SNAPSHOT_RESERVATION_TIMEOUT_MS`, default: `0` (disabled)).
- Each `/snapshot` reservation gets a timer. If the timer fires, the reservation is ended and the websocket is closed so cleanup can resume.
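A sketch of the shape of the change, using hypothetical names (`Reservation`, `release()`, `armReservationTimeout`) rather than the actual code paths touched by this PR:

```ts
import type WebSocket from 'ws';

// Hypothetical reservation handle; the real internals differ.
interface Reservation {
  taskID: string;
  release(): void; // ends the reservation so change-log cleanup can resume
}

// Arms a safety-valve timer for one /snapshot reservation. The returned
// function cancels the timer when snapshot init completes (or the socket
// closes) before the timeout fires.
function armReservationTimeout(
  reservation: Reservation,
  ws: WebSocket,
  timeoutMs: number, // litestream.snapshotReservationTimeoutMs; 0 = disabled
): () => void {
  if (timeoutMs <= 0) {
    return () => {}; // disabled (the default): behavior is unchanged
  }
  const timer = setTimeout(() => {
    // Timed out before snapshot init finished: end the reservation and close
    // the websocket so cleanup resumes.
    reservation.release();
    ws.close();
  }, timeoutMs);
  return () => clearTimeout(timer);
}
```

With the default of `0` the timer is never armed, so behavior is unchanged; operators who opt in would pick a value comfortably above their expected restore-plus-init time.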
Notes

- … (`updateCleanupDelay=false`).

Testing
- `npm --prefix packages/zero-cache run check-types`
- `npm --prefix packages/zero-cache run test:no-pg`