feat(update): add pre-update rollback snapshot with auto-restore on failure #358

Open
buddy0323 wants to merge 3 commits into Light-Heart-Labs:main from buddy0323:feat/update-rollback-point

@buddy0323
Contributor

Summary

  • Before every dream-update.sh update run, a rollback snapshot is written to data/backups/pre-update-<timestamp>/ containing .env, all active docker-compose overlays, per-extension config dirs (config/{litellm,n8n,openclaw,searxng}/), and .version
  • On any failure — git pull error, failed migration, or health-check timeout — the snapshot is automatically restored and services are restarted with the pre-update configuration, requiring no manual intervention
  • dream-update.sh rollback now prefers the most recent pre-update-* snapshot over general backups, and accepts a bare timestamp as a target (e.g. rollback 20260317-120000)
  • dream-update.sh status now shows rollback snapshot count, storage path, and the last recorded snapshot path
  • Snapshots are pruned automatically, retaining at most MAX_BACKUPS (default: 10)
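
The target resolution described above (a bare timestamp, or the newest pre-update-* snapshot when no argument is given) could look roughly like the sketch below. The function name resolve_snapshot and its argument order are hypothetical and are not taken from the PR diff; the sketch assumes snapshot directories are named pre-update-YYYYMMDD-HHMMSS so that sorted glob order is also chronological order.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the snapshot directory to restore from.
# $1 = rollback directory, $2 = optional bare timestamp (YYYYMMDD-HHMMSS).
resolve_snapshot() {
    local rollback_dir="$1" target="${2:-}"

    if [[ -n "$target" ]]; then
        # Accept only a bare timestamp, then map it to its snapshot directory.
        [[ "$target" =~ ^[0-9]{8}-[0-9]{6}$ ]] || return 1
        [[ -d "$rollback_dir/pre-update-$target" ]] || return 1
        echo "$rollback_dir/pre-update-$target"
        return 0
    fi

    # No argument: glob results are sorted, so the last match is the newest.
    local d latest=""
    for d in "$rollback_dir"/pre-update-*/; do
        if [[ -d "$d" ]]; then
            latest="${d%/}"
        fi
    done
    [[ -n "$latest" ]] || return 1
    echo "$latest"
}
```

Called with only the rollback directory, the sketch prints the newest snapshot path; called with a timestamp, it prints that snapshot's path or returns 1 if the timestamp is malformed or the directory does not exist.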

Files changed

  • dream-update.sh — added ROLLBACK_DIR, snapshot_pre_update(), _restore_snapshot(), wait_for_healthy(); rewired cmd_update() with 6-step rollback-aware flow; extended cmd_rollback() and cmd_status()

Test plan

  • Run dream-update.sh update on a clean install — confirm data/backups/pre-update-<timestamp>/ is created with .env, compose files, and config dirs
  • Simulate a git pull failure (e.g. network off) — confirm snapshot is restored and services stay up
  • Simulate a migration failure — confirm snapshot is restored before services restart
  • Set HEALTH_TIMEOUT=15 and break a service — confirm auto-restore triggers after timeout
  • Run dream-update.sh rollback with no argument — confirm it picks the most recent pre-update-* snapshot
  • Run dream-update.sh rollback <timestamp> — confirm it resolves the correct snapshot
  • Run dream-update.sh status — confirm rollback snapshot count and path are displayed
  • Verify oldest snapshots are pruned when count exceeds MAX_BACKUPS

Collaborator

@Lightheartdevs left a comment

Review: REQUEST CHANGES

Sound design — snapshot before update with auto-restore on failure is the right approach. But several issues need fixing before merge.

Blocking: _update_rollback swallows restore failures

_update_rollback never checks the return value of _restore_snapshot. If restore fails (permissions, corrupt snapshot), the user sees "Rollback complete" even though the system is broken. Must check return value and emit log_error with manual recovery instructions if it fails.
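
A minimal sketch of the requested check follows. The bodies of log_error, log_info, and _restore_snapshot are stand-ins written for this example (the real implementations are not shown on this page), and the recovery message wording is illustrative only.

```shell
#!/usr/bin/env bash
# Stand-in loggers; the script's real log helpers are assumed to exist.
log_error() { printf 'ERROR: %s\n' "$*" >&2; }
log_info()  { printf 'INFO: %s\n' "$*"; }

# Hypothetical stub: "restoring" succeeds only if the snapshot directory exists.
_restore_snapshot() { [[ -d "$1" ]]; }

_update_rollback() {
    local snap_dir="$1"
    # Report success only when the restore itself succeeded.
    if ! _restore_snapshot "$snap_dir"; then
        log_error "Restore from $snap_dir failed; the install may be inconsistent."
        log_error "Manual recovery: restore .env from the snapshot, then restart services."
        return 1
    fi
    log_info "Rollback complete"
}
```

With this shape, a failed restore surfaces as exit status 1 and an error on stderr instead of a misleading success message.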

Blocking: || true / 2>/dev/null violates CLAUDE.md

CLAUDE.md rule 4: "Never || true or 2>/dev/null." The PR introduces many new instances:

  • (( files_saved++ )) || true (4 times) — fix: files_saved=$(( files_saved + 1 )) avoids the exit-code issue
  • cmd_health &>/dev/null in wait_for_healthy — completely silences health output, making debugging impossible
  • Docker fallback chains with || true — use || { log_warn "docker compose v2 failed, trying v1"; docker-compose ...; } instead
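
The exit-code issue behind the first bullet can be reproduced in isolation. In bash, (( expr )) returns status 1 when the expression evaluates to 0, and a post-increment evaluates to the old value, so the first increment from 0 looks like a failure. The function names below are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Post-increment yields the OLD value (0), so (( )) returns status 1 here.
post_increment_status() {
    local n=0 status=0
    (( n++ )) || status=$?
    echo "$status"
}

# An assignment using arithmetic expansion returns status 0 regardless of value.
assignment_status() {
    local n=0
    n=$(( n + 1 ))
    echo "$?"
}

post_increment_status   # prints 1
assignment_status       # prints 0
```

Under set -e the first form would abort the script on the first increment, which is presumably why the original code appended || true.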

High: No snapshot integrity validation

A disk-full-mid-copy produces a partial snapshot that looks valid (has snapshot.json) but is missing files. _restore_snapshot only checks [[ -d "$snap_dir" ]]. Add verification: confirm snapshot.json is valid JSON and critical files (.env, .version) exist before proceeding.
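
One way the suggested verification might look is sketched below. The helper name _verify_snapshot is hypothetical, log_error is a stand-in, and the sketch assumes jq is installed; `jq empty` parses its input, prints nothing, and exits non-zero on invalid JSON.

```shell
#!/usr/bin/env bash
# Stand-in logger; the script's real log_error is assumed to exist.
log_error() { printf 'ERROR: %s\n' "$*" >&2; }

# Hypothetical helper: return 0 only if the snapshot looks complete.
_verify_snapshot() {
    local snap_dir="$1" f
    if [[ ! -d "$snap_dir" ]]; then
        log_error "Snapshot not found: $snap_dir"
        return 1
    fi
    # Critical files must be present before any install file is touched.
    for f in snapshot.json .env .version; do
        if [[ ! -f "$snap_dir/$f" ]]; then
            log_error "Snapshot is missing $f"
            return 1
        fi
    done
    # Metadata must parse as JSON.
    if ! jq empty "$snap_dir/snapshot.json"; then
        log_error "snapshot.json is not valid JSON"
        return 1
    fi
    return 0
}
```

File-presence checks catch a copy that stopped partway, but not a file that was truncated mid-write; detecting that would need sizes or checksums recorded in snapshot.json, which this sketch does not attempt.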

High: No timestamp input validation

snapshot_pre_update interpolates the timestamp directly into a path. Validate format: [[ "$timestamp" =~ ^[0-9]{8}-[0-9]{6}$ ]] || { log_error "Invalid timestamp"; return 1; }
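
Wrapped in a function, the check above rejects path-traversal input before it reaches any path. The name validate_timestamp is illustrative, and log_error is a stand-in for the script's logger.

```shell
#!/usr/bin/env bash
# Stand-in logger; the script's real log_error is assumed to exist.
log_error() { printf 'ERROR: %s\n' "$*" >&2; }

# Hypothetical wrapper around the regex from the review comment.
validate_timestamp() {
    local timestamp="$1"
    # The anchors force the whole string to be exactly YYYYMMDD-HHMMSS digits.
    [[ "$timestamp" =~ ^[0-9]{8}-[0-9]{6}$ ]] || { log_error "Invalid timestamp"; return 1; }
}
```

For example, 20260317-120000 passes, while ../../etc or 20260317-120000/.. fails because of the ^ and $ anchors.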

Medium: rm -rf in pruning needs safety guard

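If ROLLBACK_DIR is misconfigured (empty string), find searches the current working directory instead, so the rm -rf in pruning could run against paths outside the backup directory. Add: [[ -n "$ROLLBACK_DIR" && "$ROLLBACK_DIR" == */data/backups ]]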
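
A guarded pruning routine along these lines might look like the following sketch. It takes the directory and retention count as arguments rather than reading globals, which is a choice made for this example; it also assumes snapshot names of the form pre-update-YYYYMMDD-HHMMSS so that glob order is oldest first. log_error is a stand-in.

```shell
#!/usr/bin/env bash
# Stand-in logger; the script's real log_error is assumed to exist.
log_error() { printf 'ERROR: %s\n' "$*" >&2; }

# $1 = rollback directory, $2 = number of snapshots to keep (default 10).
_prune_rollback_snapshots() {
    local rollback_dir="$1" max_backups="${2:-10}"

    # Refuse to delete anything unless the path is non-empty and ends in /data/backups.
    if [[ -z "$rollback_dir" || "$rollback_dir" != */data/backups ]]; then
        log_error "Refusing to prune: unexpected rollback dir '$rollback_dir'"
        return 1
    fi

    # Collect snapshot directories; sorted glob order puts the oldest first.
    local snaps=() d
    for d in "$rollback_dir"/pre-update-*/; do
        if [[ -d "$d" ]]; then
            snaps+=("$d")
        fi
    done

    # Delete only the oldest entries beyond the retention limit.
    local excess=$(( ${#snaps[@]} - max_backups )) i
    for (( i = 0; i < excess; i++ )); do
        rm -rf -- "${snaps[i]}"
    done
    return 0
}
```

Iterating over a glob instead of parsing find output also means an empty or wrong directory matches nothing rather than falling back to the current directory.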

Medium: Function too large

snapshot_pre_update is 60+ lines (CLAUDE.md threshold: 30). Extract pruning into _prune_rollback_snapshots.

Medium: Inconsistent health checking

cmd_update uses the new wait_for_healthy but cmd_rollback still uses sleep 10; cmd_health. Should be consistent.

Low: Nested function relies on dynamic scoping

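_update_rollback reads $snap_dir and $compose_flags from the enclosing function's scope as if by closure. Bash doesn't have real closures; a nested function sees the caller's locals only through dynamic scoping. Pass them as explicit parameters.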
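
The scoping behaviour can be seen in a few lines. All function names here are invented for the demonstration; only $snap_dir comes from the PR.

```shell
#!/usr/bin/env bash
# Reads snap_dir from whichever function happens to be calling it.
inner_implicit() { echo "$snap_dir"; }

# Receives the value as a parameter, independent of the caller's locals.
inner_explicit() { local snap_dir="$1"; echo "$snap_dir"; }

outer()        { local snap_dir="/from/outer"; inner_implicit; }
other_caller() { local snap_dir="/from/other"; inner_implicit; }

outer                          # prints /from/outer
other_caller                   # prints /from/other
inner_explicit "/from/arg"     # prints /from/arg
```

The implicit version silently changes behaviour with its caller, which is the fragility the explicit-parameter form avoids.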

🤖 Reviewed with Claude Code

@buddy0323
Contributor Author

All review items addressed:

Blocking fixes

  • _update_rollback is now a top-level function with explicit <snap_dir> and <compose_flags> parameters (no dynamic scoping). Return value of _restore_snapshot is checked; on failure it prints CRITICAL + exact manual recovery steps (cp .env …, docker compose up -d) and returns 1.
  • All || true and 2>/dev/null instances removed: files_saved and attempt increments use $(( n + 1 )), cmd_health output is captured to a mktemp log (shown in full only on timeout), and docker v1/v2 fallbacks use explicit if ! docker compose …; then log_warn "…trying v1…"; docker-compose …; fi chains.
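
The health-check change described above (capture output to a temporary log, show it only on timeout) might be structured like this sketch. The cmd_health body is a stub for the example; HEALTH_INTERVAL, HEALTH_MARKER, and the 120-second default are assumptions, and only HEALTH_TIMEOUT appears in the PR text.

```shell
#!/usr/bin/env bash
# Stand-in logger; the script's real log_error is assumed to exist.
log_error() { printf 'ERROR: %s\n' "$*" >&2; }

# Hypothetical stub: "healthy" when the marker file exists.
cmd_health() {
    echo "checking services"
    [[ -f "${HEALTH_MARKER:-/nonexistent}" ]]
}

wait_for_healthy() {
    local timeout="${HEALTH_TIMEOUT:-120}" interval="${HEALTH_INTERVAL:-5}"
    local log elapsed=0
    log=$(mktemp) || return 1

    while (( elapsed < timeout )); do
        # Keep the latest health output instead of discarding it.
        if cmd_health >"$log" 2>&1; then
            rm -f -- "$log"
            return 0
        fi
        sleep "$interval"
        elapsed=$(( elapsed + interval ))
    done

    # On timeout, show the last captured output so the failure can be debugged.
    log_error "Services not healthy after ${timeout}s; last health output:"
    cat "$log" >&2
    rm -f -- "$log"
    return 1
}
```

Each attempt overwrites the log, so the output stays quiet during normal polling and only the final failing attempt is printed.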

High fixes

  • snapshot_pre_update runs jq empty snapshot.json after writing metadata; on invalid JSON it removes the partial snapshot and returns 1.
  • _restore_snapshot checks for snapshot.json existence and validity before touching any install files, and warns if .env or .version are absent.
  • Timestamp is validated with [[ "$timestamp" =~ ^[0-9]{8}-[0-9]{6}$ ]] before any directory is created.

Medium fixes

  • Pruning extracted into _prune_rollback_snapshots() with a safety guard: aborts if ROLLBACK_DIR is empty or does not end in /data/backups.
  • snapshot_pre_update is now ~40 lines (down from ~60) after the extraction.
  • cmd_rollback now calls wait_for_healthy instead of sleep 10; cmd_health, consistent with cmd_update.

Low fix

  • _update_rollback promoted to module-level; $snap_dir and $compose_flags passed as explicit positional arguments at every call site.
