@Karavil commented Jan 14, 2026

Motivation

Change-log cleanup is paused while any client holds a /snapshot reservation. Today a reservation is released only when the client finishes snapshot initialization or the /snapshot connection closes.

If a view syncer wedges partway through snapshot initialization (hang, deadlock, stuck IO, event-loop starvation), the reservation can remain open indefinitely. Cleanup then stays paused for just as long, which allows _zero.changeLog to grow without bound.

We hit this in production (Goblins): multiple view syncer tasks held /snapshot reservations for hours. Cleanup stayed paused, the changelog grew roughly 10x during peak write load, and view syncer latency climbed into the hundreds of milliseconds.

Failure mode (ASCII)

view syncer starts (or restarts)
     │
     ▼
opens /snapshot reservation (holds socket open during restore)
     │
     ▼
backup monitor sees active reservation(s)
     │
     ▼
watermark cleanup paused for snapshot(s): <task ids>
     │
     ▼
(no Purged ... logs for a long time)
     │
     ▼
changeLog rows accumulate
     │
     ├─► view syncer catchup batches get bigger
     ├─► SELECT * FROM "_zero.changeLog" WHERE stateVersion > ? gets slower
     └─► pipeline work increases, more resets, more SQLite contention
              │
              ▼
           latency spikes
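
In code terms, the pause is just the backup monitor skipping its purge step while any reservation is open. A minimal TypeScript sketch (hypothetical names, not the actual zero-cache backup-monitor code) of the decision that produces the logs below:

// Sketch only: one open /snapshot reservation is enough to skip purging,
// so a single wedged view syncer stalls cleanup for everyone.
interface BackupMonitorState {
  // taskID -> time (ms since epoch) the /snapshot reservation was opened
  activeReservations: Map<string, number>;
}

function maybePurgeChangeLog(
  state: BackupMonitorState,
  watermark: string,
  purgeBefore: (watermark: string) => number,
): void {
  if (state.activeReservations.size > 0) {
    const tasks = [...state.activeReservations.keys()].join(', ');
    console.info(`watermark cleanup paused for snapshot(s): ${tasks}`);
    return; // no purge; _zero.changeLog keeps accumulating rows
  }
  const purged = purgeBefore(watermark);
  console.info(`Purged ${purged} changes before ${watermark}`);
}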

What this looked like in logs

Cleanup paused while snapshotting:

{"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"pausing change-log cleanup while <task-id> snapshots"}
{"level":"INFO","worker":"change-streamer","component":"backup-monitor","message":"watermark cleanup paused for snapshot(s): <task-id>[,<task-id>...]"}

No purges until reservations clear, then a large purge:

{"level":"INFO","worker":"change-streamer","component":"change-streamer","message":"Purged 98547 changes before <watermark>"}

View syncer symptom (slow changelog reads):

{"level":"WARN","worker":"syncer","component":"view-syncer","class":"Statement","sql":"SELECT * FROM \"_zero.changeLog\" WHERE stateVersion > ?","method":"iterate","message":"Slow SQLite query 227.85"}

Catchup draining the backlog:

{"level":"INFO","worker":"change-streamer","message":"caught up <task-id>/serving-replicator-<pid> with 162277 changes (155074 ms)"}

FAQ

Doesn't websocket liveness already cover this?

There is websocket liveness (ping/pong) in zero-cache streams. It will terminate connections that stop responding to heartbeats.

That helps with truly dead connections, but it does not bound how long a reservation can be held when the view syncer is alive enough to answer heartbeats yet makes no progress through snapshot initialization.

Doesn't a container health check solve this?

A container health check is a great mitigation and we agree it should be used.

One important detail is that in the current zero-cache boot sequence, /keepalive is generally not served until after the worker reports ready. The view syncer performs its litestream restore before it reports ready, so a container health check that hits /keepalive should fail during restore (connection refused or timeout).

This means a properly configured container health check plus a reasonable startPeriod (upper bound on expected restore time) should prevent the specific "stuck in restore for hours" failure we saw.

However, it is still easy to end up with a long cleanup pause in practice:

  • Missing or overly permissive health checks.
  • Very long startPeriod or service health check grace periods (which are "ignore failing checks" timers).
  • Non-ECS deployments.
  • Future refactors that make /keepalive available earlier in startup.

Because cleanup blocking is a high-impact failure mode, relying on orchestration alone is brittle. A server-side timeout is a small, explicit safety valve.

Wouldn't the view syncer spinning down release the snapshot reservation?

Not necessarily. The reservation is released when the replication-manager observes the /snapshot connection closing.

A view syncer can be "draining" from a load balancer perspective while the process is still alive and the socket is still open. Separately, if a view syncer is alive enough to keep the socket open (and respond to heartbeats) but not making progress, the reservation will still be held.

Why a configurable timeout instead of a hardcoded one?

Snapshot restore plus init time varies widely across deployments (replica size, WAL volume, bandwidth, checkpointing behavior, restore parallelism, etc.). A config option lets operators choose a value that is safely above their expected snapshot time while still bounding the worst-case cleanup pause.

Change

  • Adds litestream.snapshotReservationTimeoutMs (flag: --litestream-snapshot-reservation-timeout-ms, env: ZERO_LITESTREAM_SNAPSHOT_RESERVATION_TIMEOUT_MS, default: 0 (disabled)).
  • When set to a positive value, each /snapshot reservation gets a timer. If the timer fires, the reservation is ended and the websocket is closed so cleanup can resume (see the sketch after this list).
  • Adds a unit test covering timeout expiry.
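
A minimal sketch of the timeout mechanism, with hypothetical names (the ws import and class shape are illustrative; the actual change is in this PR's diff and differs in detail):

// Sketch only, not the zero-cache implementation. Each /snapshot reservation
// arms a timer; if it fires before the reservation is released normally, the
// reservation is dropped and the socket is closed so cleanup can resume.
import type {WebSocket} from 'ws';

interface Reservation {
  ws: WebSocket;
  timer?: ReturnType<typeof setTimeout>;
}

class SnapshotReservations {
  readonly #reservations = new Map<string, Reservation>();

  constructor(readonly timeoutMs: number) {}

  reserve(taskID: string, ws: WebSocket): void {
    const reservation: Reservation = {ws};
    if (this.timeoutMs > 0) {
      reservation.timer = setTimeout(() => {
        // Timeout expiry is not a completed snapshot, so it is not treated
        // as one (mirrors updateCleanupDelay=false in the Notes below).
        this.release(taskID);
        ws.close(1001, 'snapshot reservation timed out');
      }, this.timeoutMs);
    }
    this.#reservations.set(taskID, reservation);
  }

  release(taskID: string): void {
    const reservation = this.#reservations.get(taskID);
    if (!reservation) {
      return;
    }
    clearTimeout(reservation.timer);
    this.#reservations.delete(taskID);
    // Once no reservations remain, the backup monitor can resume cleanup.
  }

  get hasActiveReservations(): boolean {
    return this.#reservations.size > 0;
  }
}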

Notes

  • Timeout expiry does not update the cleanup delay (it uses updateCleanupDelay=false).
  • Operators should set this comfortably above expected snapshot restore plus initialization time (see the example after this list).
  • Default is disabled (no behavior change unless configured).
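
For illustration only, assuming the env var documented above, an operator-supplied value might flow in like this (a sketch, not the actual zero-cache config plumbing):

// Illustrative only: the env var name comes from this PR; the parsing and
// logging here are hypothetical.
const raw = process.env.ZERO_LITESTREAM_SNAPSHOT_RESERVATION_TIMEOUT_MS;
const snapshotReservationTimeoutMs = raw ? Number(raw) : 0;

if (snapshotReservationTimeoutMs > 0) {
  // e.g. 1_800_000 (30 minutes) for a replica that normally restores in ~5,
  // leaving headroom while still bounding the worst-case cleanup pause.
  console.log(
    `snapshot reservations force-released after ${snapshotReservationTimeoutMs}ms`,
  );
} else {
  // Default (0): disabled, no behavior change.
  console.log('snapshot reservation timeout disabled');
}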

Testing

  • npm --prefix packages/zero-cache run check-types
  • npm --prefix packages/zero-cache run test:no-pg
