
OPS-08: Backup/restore automation and DR drill playbook#663

Merged
Chris0Jeky merged 7 commits into main from ops/backup-restore-dr-playbook
Apr 1, 2026

Conversation

@Chris0Jeky
Owner

Closes #86

Summary

  • scripts/backup.sh — timestamped SQLite hot backup via sqlite3 .backup (safe with active writers); configurable retention (default 7 days); PRAGMA integrity_check on output; chmod 600 on backup files; cp fallback with explicit warning
  • scripts/restore.sh — SQLite magic-bytes + PRAGMA integrity_check verification before overwrite; automatic safety copy of live DB; post-restore integrity verification; interactive confirmation (--yes to skip)
  • scripts/backup.ps1 / restore.ps1 — PowerShell equivalents for Windows with equivalent logic and restricted ACL (owner-only) instead of chmod
  • docs/ops/DISASTER_RECOVERY_RUNBOOK.md — RTO/RPO targets, backup scheduling (cron + Task Scheduler + Docker), step-by-step restore procedure, backup verification checks, access control requirements, DR drill schedule, evidence table template, escalation path
  • docs/ops/rehearsal-scenarios/backup-restore-drill.md — three injection options (delete, destructive query, Docker volume); pass criteria table; evidence checklist
  • docs/ops/INCIDENT_REHEARSAL_CADENCE.md — registered new scenario in scenario library

RTO/RPO

  • RTO: < 30 minutes (local SQLite)
  • RPO: < 24 hours (daily backup rotation default); < 1 hour with hourly cron
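
The two RPO tiers correspond to two cron schedules; a sketch (paths and times here are illustrative, not taken from the runbook — pick one schedule, not both):

```
# Daily at 02:00 — RPO < 24 h:
0 2 * * * bash /opt/taskdeck/scripts/backup.sh --db-path /home/taskdeck/.taskdeck/taskdeck.db --output-dir /home/taskdeck/.taskdeck/backups/

# Hourly — RPO < 1 h:
0 * * * * bash /opt/taskdeck/scripts/backup.sh --db-path /home/taskdeck/.taskdeck/taskdeck.db --output-dir /home/taskdeck/.taskdeck/backups/
```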

Usage

```bash
# Backup
bash scripts/backup.sh --db-path ~/.taskdeck/taskdeck.db --output-dir ~/.taskdeck/backups/

# Restore
bash scripts/restore.sh --backup-file ~/.taskdeck/backups/taskdeck-backup-2026-04-01-120000.db
```

Test plan

  • Run bash scripts/backup.sh on a live database and confirm PRAGMA integrity_check passes on the output
  • Run bash scripts/restore.sh against the backup and confirm post-restore integrity check passes
  • Verify retention: create 9 backups, confirm only 7 remain after each run
  • Verify --yes flag skips confirmation prompt
  • Run PowerShell equivalents on Windows and confirm equivalent behavior
  • Execute backup-restore-drill rehearsal scenario per the runbook

- scripts/backup.sh: timestamped hot backup via sqlite3 .backup (safe
  with active writers); configurable retention (default 7); integrity
  check on output; chmod 600 on backup files; cp fallback with warning
- scripts/restore.sh: magic-bytes + integrity_check verification before
  overwrite; automatic safety copy; post-restore integrity verification;
  interactive confirmation (--yes to skip)
- scripts/backup.ps1 / restore.ps1: PowerShell equivalents for Windows
  with equivalent logic and ACL restriction instead of chmod
- docs/ops/DISASTER_RECOVERY_RUNBOOK.md: RTO < 30min / RPO < 24h
  targets; backup schedule and cron examples; step-by-step restore
  procedure; backup verification checks; access control requirements
  (chmod 600 / restricted ACL); DR drill schedule; evidence table
  template; escalation path
- docs/ops/rehearsal-scenarios/backup-restore-drill.md: three injection
  options (delete, destructive query, Docker volume); pass criteria;
  evidence checklist; registered in scenario library
Copilot AI review requested due to automatic review settings April 1, 2026 18:26

sqlite3 dot-commands (.backup, .restore) pass file paths as
single-quoted string literals. Reject paths containing single-quote
characters with a clear error message to prevent truncation or
misrouting of the dot-command. Both backup.sh and restore.sh now
validate --db-path, --output-dir, and --backup-file before proceeding.
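
A minimal sketch of that validation (the function name here is illustrative, not the one used in the scripts):

```shell
#!/usr/bin/env bash
# Reject any path that would break a single-quoted sqlite3 dot-command.
reject_single_quotes() {
  local flag="$1" value="$2"
  if [[ "$value" == *"'"* ]]; then
    echo "ERROR: $flag must not contain single-quote characters: $value" >&2
    return 1
  fi
}

reject_single_quotes --db-path "$HOME/.taskdeck/taskdeck.db" && echo "path ok"
reject_single_quotes --db-path "/tmp/o'brien/taskdeck.db" || echo "path rejected"
```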
@Chris0Jeky
Owner Author

Self-Review Findings

Backup hot-copy safety (active writes): Safe — backup.sh uses sqlite3 .backup, which acquires a shared lock and flushes WAL frames before copying, so it is safe with active writers. The cp fallback is explicitly labeled unsafe, with a warning recommending sqlite3 for production use.
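
The hot-backup core can be sketched as follows, assuming sqlite3 is on PATH; the Boards table is only a stand-in:

```shell
#!/usr/bin/env bash
set -euo pipefail
DB=$(mktemp --suffix=.db)
BACKUP="${DB%.db}-backup.db"
sqlite3 "$DB" "CREATE TABLE Boards(id INTEGER PRIMARY KEY, name TEXT);
               INSERT INTO Boards(name) VALUES ('demo');"
# .backup acquires a shared lock and flushes WAL frames before copying,
# so it is safe while the application is still writing.
sqlite3 "$DB" ".backup '${BACKUP}'"
chmod 600 "$BACKUP"
sqlite3 "$BACKUP" 'PRAGMA integrity_check;'   # prints: ok
```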

Restore integrity check: Fully verified before overwrite — (1) SQLite magic bytes checked via file -b or xxd, (2) PRAGMA integrity_check run on the backup file, (3) schema sanity check confirms a Boards table is present. Only then is the safety copy created and the live DB overwritten. It is not a file-exists-only check.
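
The three pre-restore checks can be sketched in isolation (assumes sqlite3 and coreutils; not the script verbatim):

```shell
#!/usr/bin/env bash
set -euo pipefail
BACKUP=$(mktemp --suffix=.db)
sqlite3 "$BACKUP" "CREATE TABLE Boards(id INTEGER PRIMARY KEY);"

# 1) Magic bytes: a SQLite file starts with "SQLite format 3" + NUL.
MAGIC=$(head -c 15 "$BACKUP")
[ "$MAGIC" = "SQLite format 3" ] || { echo "not a SQLite file" >&2; exit 1; }

# 2) Full integrity check on the backup file.
[ "$(sqlite3 "$BACKUP" 'PRAGMA integrity_check;')" = "ok" ] \
  || { echo "integrity check failed" >&2; exit 1; }

# 3) Schema sanity: the expected Boards table must exist.
COUNT=$(sqlite3 "$BACKUP" \
  "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='Boards';")
[ "$COUNT" -eq 1 ] || { echo "schema check failed" >&2; exit 1; }
echo "backup verified"
```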

Retention logic correctness: Correct — ls -1t produces newest-first order; the loop runs from index $RETAIN to end, deleting the oldest entries. The newest $RETAIN backups are preserved. The glob taskdeck-backup-*.db is narrow enough to avoid touching unrelated files.
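
A runnable sketch of that retention pass (stand-in files with forced mtimes, not the shipped script):

```shell
#!/usr/bin/env bash
set -euo pipefail
OUTPUT_DIR=$(mktemp -d)
RETAIN=7
# Simulate 9 existing backups with distinct mtimes (i=9 is newest).
for i in 1 2 3 4 5 6 7 8 9; do
  touch -d "2026-04-01 12:0$i" "$OUTPUT_DIR/taskdeck-backup-2026-04-01-120${i}00.db"
done
ALL_BACKUPS=()
while IFS= read -r f; do ALL_BACKUPS+=("$f"); done \
  < <(ls -1t "$OUTPUT_DIR"/taskdeck-backup-*.db 2>/dev/null)
# ls -1t is newest-first, so indices >= RETAIN hold the oldest entries.
for ((i = RETAIN; i < ${#ALL_BACKUPS[@]}; i++)); do
  rm -f "${ALL_BACKUPS[$i]}"
done
REMAINING=$(ls -1 "$OUTPUT_DIR"/taskdeck-backup-*.db | wc -l)
echo "remaining=$REMAINING"
```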

Path injection safety: Was a concern — sqlite3 .backup and .restore embed file paths as single-quoted string literals inside dot-commands. A path containing a single quote would break the command. Fixed in commit aeb41a52: both backup.sh and restore.sh now reject --db-path, --output-dir, and --backup-file values containing single-quote characters with a clear error message.

RTO/RPO realism: Realistic for a SQLite local-first app. RTO < 30 min is achievable: the restore script itself completes in seconds; the bulk of the window is API restart + WAL checkpoint. The runbook notes Docker/hosted instances may need up to 60 min. RPO < 24h under daily cron; < 1h with hourly scheduling — both documented.

Drill template usability: Actionable — the evidence table has Date, Operator, Backup Age, Backup File, Restore Duration, integrity_check result, Row Count Match, Pass/Fail, and Notes columns. The backup-restore-drill.md scenario has three injection options, a pass criteria table, and a concrete evidence checklist. The scenario is registered in INCIDENT_REHEARSAL_CADENCE.md.

Critical issues: None remaining. Path injection fix pushed; all other checks passed initial review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request establishes a disaster recovery framework for Taskdeck, including a comprehensive runbook, rehearsal scenarios, and automation scripts for SQLite backups and restores on both Linux/macOS and Windows. The review feedback identifies opportunities to harden connection string parsing against multi-parameter strings, ensure Bash compatibility for macOS users by avoiding Bash 4.0 features, and prevent potential memory issues when validating large database files in PowerShell.

if [[ -z "$DB_PATH" ]]; then
# Try to extract from the ConnectionStrings env var (e.g. "Data Source=/app/data/taskdeck.db")
if [[ -n "${ConnectionStrings__DefaultConnection:-}" ]]; then
DB_PATH="${ConnectionStrings__DefaultConnection#*=}"


medium

The connection string parsing is fragile. Using ${VAR#*=} assumes the path is the only parameter and follows the first =. If the connection string contains multiple parameters (e.g., Data Source=taskdeck.db;Version=3), this will result in an invalid file path.

Suggested change
DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
DB_PATH=$(echo "${ConnectionStrings__DefaultConnection}" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
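
The suggested extraction can be sanity-checked against a multi-parameter connection string:

```shell
#!/usr/bin/env bash
set -euo pipefail
ConnectionStrings__DefaultConnection="Data Source=/app/data/taskdeck.db;Version=3"
# Extract only the Data Source value, stopping at the first semicolon.
DB_PATH=$(echo "${ConnectionStrings__DefaultConnection}" \
  | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')
echo "$DB_PATH"   # /app/data/taskdeck.db
```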

Comment on lines +161 to +163
echo "ERROR: backup integrity check failed: $INTEGRITY" >&2
rm -f "$BACKUP_FILE"
exit 1


medium

mapfile (or readarray) was introduced in Bash 4.0. Default installations of macOS still use Bash 3.2, where this command is unavailable. Using a while loop with read provides better compatibility for local-first deployments.

Suggested change (replace the `mapfile` call with a Bash 3.2–compatible while-read loop):

ALL_BACKUPS=()
while IFS= read -r line; do
  ALL_BACKUPS+=("$line")
done < <(ls -1t "${OUTPUT_DIR}/taskdeck-backup-"*.db 2>/dev/null)

# ---------------------------------------------------------------------------
if [[ -z "$DB_PATH" ]]; then
if [[ -n "${ConnectionStrings__DefaultConnection:-}" ]]; then
DB_PATH="${ConnectionStrings__DefaultConnection#*=}"


medium

The connection string parsing is fragile. Using ${VAR#*=} assumes the path is the only parameter and follows the first =. If the connection string contains multiple parameters (e.g., Data Source=taskdeck.db;Version=3), this will result in an invalid file path.

Suggested change
DB_PATH="${ConnectionStrings__DefaultConnection#*=}"
DB_PATH=$(echo "${ConnectionStrings__DefaultConnection}" | sed -n 's/.*Data Source=\([^;]*\).*/\1/p')

$csEnv = $env:ConnectionStrings__DefaultConnection
if (-not [string]::IsNullOrWhiteSpace($csEnv)) {
# Parse "Data Source=/path/to/taskdeck.db"
$DbPath = ($csEnv -replace '^(?i)data\s+source\s*=\s*', '').Trim()

medium

The connection string parsing using -replace is fragile and will fail if the connection string contains multiple parameters. It's safer to split the string and extract the specific key.

        $DbPath = ($csEnv -split ';' | Where-Object { $_ -match 'Data Source=' } | ForEach-Object { $_ -replace '.*Data Source=', '' }).Trim()

if ([string]::IsNullOrWhiteSpace($DbPath)) {
$csEnv = $env:ConnectionStrings__DefaultConnection
if (-not [string]::IsNullOrWhiteSpace($csEnv)) {
$DbPath = ($csEnv -replace '^(?i)data\s+source\s*=\s*', '').Trim()

medium

The connection string parsing using -replace is fragile and will fail if the connection string contains multiple parameters. It's safer to split the string and extract the specific key.

        $DbPath = ($csEnv -split ';' | Where-Object { $_ -match 'Data Source=' } | ForEach-Object { $_ -replace '.*Data Source=', '' }).Trim()


# Check SQLite magic bytes (first 16 bytes = "SQLite format 3\0")
$SqliteMagic = [System.Text.Encoding]::ASCII.GetBytes("SQLite format 3")
$FileBytes = [System.IO.File]::ReadAllBytes($BackupFile)

medium

[System.IO.File]::ReadAllBytes($BackupFile) reads the entire file into memory. For large database backups, this can lead to high memory consumption or OutOfMemoryException. Since only the first 16 bytes are needed for the magic check, it's more efficient to read only the required header.

    $FileBytes = Get-Content -Path $BackupFile -Encoding Byte -TotalCount 16

Contributor

Copilot AI left a comment


Pull request overview

Adds cross-platform (bash + PowerShell) automation and documentation to support SQLite backup/restore workflows and to formalize a DR drill playbook for Taskdeck’s local-first database.

Changes:

  • Introduces backup/restore scripts for Linux/macOS/WSL and Windows, including integrity checks and permission hardening.
  • Adds a Disaster Recovery runbook documenting RTO/RPO targets, scheduling guidance, verification steps, and escalation.
  • Adds a rehearsal scenario for running a backup/restore DR drill and registers it in the rehearsal cadence index.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.

Show a summary per file

| File | Description |
| --- | --- |
| scripts/backup.sh | Hot-backup + retention automation with integrity check and chmod hardening. |
| scripts/restore.sh | Restore automation with pre/post integrity checks, schema sanity checks, and safety copy creation. |
| scripts/backup.ps1 | Windows backup equivalent with retention and owner-only ACL hardening. |
| scripts/restore.ps1 | Windows restore equivalent with validation, safety copy, and owner-only ACL hardening. |
| docs/ops/DISASTER_RECOVERY_RUNBOOK.md | DR runbook describing backup/restore procedure, scheduling, verification, and evidence requirements. |
| docs/ops/rehearsal-scenarios/backup-restore-drill.md | Step-by-step DR drill scenario including injections, recovery steps, and pass criteria. |
| docs/ops/INCIDENT_REHEARSAL_CADENCE.md | Registers the new backup/restore drill scenario in the scenario library list. |


Comment on lines +32 to +36
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

DEFAULT_DB_PATH="${HOME}/.taskdeck/taskdeck.db"
DEFAULT_OUTPUT_DIR="${HOME}/.taskdeck/backups"
DEFAULT_RETAIN=7

Copilot AI Apr 1, 2026


SCRIPT_DIR is assigned but never used, which adds dead code/noise. Remove it or use it (e.g., to resolve repo-relative defaults) so the script stays minimal and easier to maintain.

Comment on lines +124 to +129
# Restrict backup directory permissions (owner read/write only)
chmod 700 "$OUTPUT_DIR" 2>/dev/null || true

# ---------------------------------------------------------------------------
# Build backup filename
# ---------------------------------------------------------------------------

Copilot AI Apr 1, 2026


The sqlite3 dot-command is built with single-quoted interpolation (.backup '${BACKUP_FILE}'). If the output path contains a single quote, the backup will fail; worse, a crafted path could inject additional sqlite3 commands. Prefer passing a safely escaped path (e.g., escape single quotes for sqlite3) and avoid constructing dot-commands that can be interpreted as multiple commands.

Comment on lines +124 to +127
exit 1
fi
if [[ "$DB_PATH" == *"'"* ]]; then
echo "ERROR: --db-path must not contain single-quote characters: $DB_PATH" >&2

Copilot AI Apr 1, 2026


The header check comment says the first 16 bytes must match SQLite format 3\0, but the script reads 15 bytes and compares against the hex for SQLite format 3 (no null terminator). Either read/compare 16 bytes including the trailing 00, or update the comment/expected value so the check and documentation align.
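
For reference, the distinction can be seen directly: the full 16-byte header is the 15 ASCII bytes of "SQLite format 3" plus a trailing NUL. A quick sketch (assumes sqlite3 and od are available):

```shell
#!/usr/bin/env bash
set -euo pipefail
DB=$(mktemp --suffix=.db)
sqlite3 "$DB" "CREATE TABLE t(x);"
# Dump the first 16 bytes as hex: 15 text bytes followed by 00.
HEX16=$(head -c 16 "$DB" | od -An -tx1 | tr -d ' \n')
echo "$HEX16"   # 53514c69746520666f726d6174203300
```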

Comment on lines +31 to +36
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# ---------------------------------------------------------------------------
# Defaults
# ---------------------------------------------------------------------------
DEFAULT_DB_PATH="${HOME}/.taskdeck/taskdeck.db"

Copilot AI Apr 1, 2026


SCRIPT_DIR is assigned but never used. Please remove it to reduce dead code, or use it to resolve repo-relative paths if that was the intent.

Comment on lines +210 to +216
[[ "$CONFIRM" =~ ^[Yy]$ ]] || { echo "Aborted."; exit 1; }
fi

# ---------------------------------------------------------------------------
# Step 3: Create safety copy of the current live database
# ---------------------------------------------------------------------------
mkdir -p "$SAFETY_DIR"

Copilot AI Apr 1, 2026


The sqlite3 dot-command string uses single-quoted interpolation (.backup '${SAFETY_FILE}'). A safety-copy path containing a single quote will break the command; additionally, constructing dot-commands from user-controlled paths can allow sqlite3 command injection. Escape single quotes for sqlite3 (or otherwise sanitize/encode the filename) before embedding it into a dot-command.

Comment on lines +89 to +96
```powershell
# Create a daily backup task
$action = New-ScheduledTaskAction -Execute "pwsh.exe" `
-Argument "-NonInteractive -File C:\taskdeck\scripts\backup.ps1 -Yes"
$trigger = New-ScheduledTaskTrigger -Daily -At "02:00"
Register-ScheduledTask -TaskName "Taskdeck-Daily-Backup" `
-Action $action -Trigger $trigger -RunLevel Highest
```

Copilot AI Apr 1, 2026


The Task Scheduler example passes -Yes to backup.ps1, but scripts/backup.ps1 does not define a Yes parameter. This will cause scheduled backups to fail. Remove -Yes from the example (or add a corresponding switch parameter in the script if intended).

Comment on lines +82 to +87
# 4. Restore via the restore script (exec into a temp container)
docker run --rm \
-v taskdeck_taskdeck-db:/data \
-v "$(pwd):/repo" \
alpine:3 \
sh -c "cp /data/backups/taskdeck-backup-<timestamp>.db /data/taskdeck.db"

Copilot AI Apr 1, 2026


In Option C, Step 4 says "Restore via the restore script" but the command shown does a raw cp into the volume. This bypasses the restore script’s validations (magic bytes/integrity check) and safety copy behavior. Either update the text to say this is a manual copy (and why it's acceptable), or change the instructions to use scripts/restore.sh inside a container that has sqlite3 available.

Suggested change
# 4. Restore via the restore script (exec into a temp container)
docker run --rm \
-v taskdeck_taskdeck-db:/data \
-v "$(pwd):/repo" \
alpine:3 \
sh -c "cp /data/backups/taskdeck-backup-<timestamp>.db /data/taskdeck.db"
# 4. Restore via the restore script (exec into a temp container with sqlite3 available)
docker run --rm \
-v taskdeck_taskdeck-db:/data \
-v "$(pwd):/repo" \
alpine:3 \
sh -c "apk add --no-cache bash sqlite && bash /repo/scripts/restore.sh --db-path /data/taskdeck.db --backup-file /data/backups/taskdeck-backup-<timestamp>.db"

Comment on lines +104 to +112
$Sqlite3 = Get-Command sqlite3.exe -ErrorAction SilentlyContinue

if ($Sqlite3) {
# sqlite3 .backup is a hot backup: copies pages under a shared lock,
# flushing WAL frames first. Safe with active readers and writers.
& $Sqlite3.Source $DbPath ".backup '$BackupFile'"
if ($LASTEXITCODE -ne 0) {
Write-Error "sqlite3 .backup failed (exit code $LASTEXITCODE)."
}

Copilot AI Apr 1, 2026


The sqlite3 dot-command is constructed with single-quoted interpolation (.backup '$BackupFile'). If the path contains a single quote, sqlite3 parsing will fail; additionally, embedding unsanitized input into dot-commands can allow command injection. Escape/sanitize the filename before constructing the .backup command.

Comment on lines +85 to +91
# Check SQLite magic bytes (first 16 bytes = "SQLite format 3\0")
$SqliteMagic = [System.Text.Encoding]::ASCII.GetBytes("SQLite format 3")
$FileBytes = [System.IO.File]::ReadAllBytes($BackupFile)

if ($FileBytes.Length -lt 16) {
Write-Error "Backup file is too small to be a valid SQLite database."
}

Copilot AI Apr 1, 2026


Comment says the first 16 bytes are SQLite format 3\0, but the code compares only the 15-byte ASCII string SQLite format 3 (and never validates the trailing null byte). Either include the null terminator in the expected bytes, or update the comment/length check to match what’s being validated.

Comment on lines +156 to +163
# Quick integrity check on the backup
# ---------------------------------------------------------------------------
if command -v sqlite3 &>/dev/null; then
INTEGRITY="$(sqlite3 "$BACKUP_FILE" 'PRAGMA integrity_check;' 2>&1)"
if [[ "$INTEGRITY" != "ok" ]]; then
echo "ERROR: backup integrity check failed: $INTEGRITY" >&2
rm -f "$BACKUP_FILE"
exit 1

Copilot AI Apr 1, 2026


Retention comment says backups are sorted oldest-first, but ls -1t sorts newest-first (and the deletion loop relies on newest-first ordering). Update the comment to match the actual ordering to avoid confusion when maintaining the retention logic.

- Fix fragile connection string parsing: use sed/split to extract
  Data Source from multi-parameter connection strings in all 4 scripts
- Replace mapfile with while-read loop for macOS Bash 3.2 compatibility
- Remove dead SCRIPT_DIR variables from backup.sh and restore.sh
- Add sqlite3 single-quote escaping as defense-in-depth against
  command injection in all .backup/.restore dot-commands
- Fix header byte count comment: code reads 15 bytes (text only),
  not 16 (which includes the null terminator)
- Replace ReadAllBytes with stream read of first 16 bytes in
  restore.ps1 to avoid loading entire file into memory
- Fix retention comment: ls -t sorts newest-first, not oldest-first
- Remove nonexistent -Yes flag from Task Scheduler example in runbook
- Update Option C Docker restore to use restore.sh instead of raw cp
Owner Author

@Chris0Jeky Chris0Jeky left a comment


Adversarial Review — PR #663 (backup/restore scripts + DR runbook)

The two fix-up commits addressed the bot findings well (connection string parsing, mapfile compatibility, ReadAllBytes, single-quote escaping, dead SCRIPT_DIR, comment corrections, Docker Option C). Good follow-through. The scripts are generally well-structured with solid defensive checks.

Below are the remaining findings I could confirm through code analysis and local shell testing.


1. [HIGH] Restore scripts do not clean up stale WAL/SHM files — risk of silent corruption

Files: scripts/restore.sh, scripts/restore.ps1

When the live database is in WAL mode (which EF Core enables by default), the database directory contains taskdeck.db-wal and taskdeck.db-shm companion files. The restore scripts replace the main .db file (via sqlite3 .restore or cp) but leave any existing -wal and -shm files untouched.

On the next application startup, SQLite will discover the old WAL file and attempt to replay it against the newly restored database. Since the WAL contains page references from the pre-restore database, this can silently corrupt the restored data or cause an immediate SQLITE_CORRUPT error.

Fix: Before overwriting DB_PATH, delete (or rename alongside the safety copy) the companion files:

```bash
# In restore.sh, before the restore step:
rm -f "${DB_PATH}-wal" "${DB_PATH}-shm"
```

```powershell
# In restore.ps1, before the restore step:
Remove-Item "${DbPath}-wal" -Force -ErrorAction SilentlyContinue
Remove-Item "${DbPath}-shm" -Force -ErrorAction SilentlyContinue
```

This is the most important finding. Without it, a restore of a WAL-mode database followed by an API restart can produce data corruption that defeats the purpose of the backup.


2. [MEDIUM] cp fallback in backup.sh does not copy WAL file — backup may be incomplete

File: scripts/backup.sh (and backup.ps1 Copy-Item fallback)

When sqlite3 is not available, the script falls back to cp "$DB_PATH" "$BACKUP_FILE". If the database is in WAL mode, the main .db file alone is an incomplete snapshot — committed transactions in the -wal file are not included, and the backup will appear to have rolled back those transactions.

The script already warns that cp is unsafe with active writers. Consider strengthening the warning to mention WAL explicitly, or copying the -wal file alongside the main database in the fallback path (with a caveat that it is still not crash-safe). This would at least give the operator a complete set of files to work with.

Not a blocker since the warning is already present, but worth a one-line clarification.
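
A sketch of the strengthened fallback (an assumption about one possible fix, not the shipped script; stand-in files substitute for a real database):

```shell
#!/usr/bin/env bash
set -euo pipefail
DB_PATH=$(mktemp --suffix=.db)
BACKUP_FILE="${DB_PATH%.db}-backup.db"
printf 'fake main db' > "$DB_PATH"
printf 'fake wal'     > "${DB_PATH}-wal"

cp "$DB_PATH" "$BACKUP_FILE"
# Copy the WAL file too, so committed-but-unchekpointed transactions
# are at least present; warn that the pair is still not crash-safe.
if [[ -f "${DB_PATH}-wal" ]]; then
  cp "${DB_PATH}-wal" "${BACKUP_FILE}-wal"
  echo "WARNING: WAL-mode database copied without sqlite3;" \
       "the snapshot may be inconsistent if writers were active." >&2
fi
```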


3. [MEDIUM] restore.sh magic-byte validation logic: file is checked before using the computed xxd result

File: scripts/restore.sh, lines ~131-157

The script computes MAGIC_ACTUAL via dd | xxd unconditionally, then checks if command -v file first. If file is available, MAGIC_ACTUAL is computed but never used. More importantly, the file command is less precise — it checks heuristics, not exact bytes. A file that happens to contain "SQLite" in the right place but is not a valid SQLite database could pass the file check.

Consider reversing the priority: prefer the exact xxd magic-byte check when available, fall back to file only when xxd is not present. Or just skip the file fallback entirely since xxd is more common in production environments (it ships with vim which is nearly universal on Linux).


4. [LOW] No disk-space pre-check — partial backup files on full disk

Files: scripts/backup.sh, scripts/backup.ps1

If the disk fills during sqlite3 .backup, the command will fail and leave a partial .db file at the backup path. The subsequent PRAGMA integrity_check will catch it and delete the partial file, which is good. However, the partial file itself may consume the remaining scarce disk space, making recovery harder. A pre-flight check (df on Linux, Get-PSDrive on Windows) comparing available space to the source database size would give an early, actionable error message instead of a cryptic sqlite3 failure.

Not a blocker — the integrity check is a good safety net — but a nice-to-have for operational clarity.
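
One way such a pre-flight check could look (a sketch, not the shipped script; uses a tiny stand-in file):

```shell
#!/usr/bin/env bash
set -euo pipefail
DB_PATH=$(mktemp --suffix=.db)
OUTPUT_DIR=$(dirname "$DB_PATH")
printf 'x%.0s' {1..4096} > "$DB_PATH"   # 4 KiB stand-in database

DB_KB=$(du -k "$DB_PATH" | cut -f1)
FREE_KB=$(df -Pk "$OUTPUT_DIR" | awk 'NR==2 {print $4}')
# Fail early, with an actionable message, instead of letting sqlite3
# die partway through the backup on a full disk.
if (( FREE_KB < DB_KB )); then
  echo "ERROR: only ${FREE_KB} KiB free in $OUTPUT_DIR," \
       "but the database is ${DB_KB} KiB" >&2
  exit 1
fi
echo "space ok (${FREE_KB} KiB free, need ${DB_KB} KiB)"
```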


5. [LOW] Sub-second re-runs silently overwrite the previous backup

File: scripts/backup.sh, scripts/backup.ps1

The backup filename uses second-granularity timestamps (%Y-%m-%d-%H%M%S). If the script is invoked twice within the same second (e.g., automated retry on transient failure), the second run overwrites the first backup without warning. Consider appending a random suffix or checking for filename collision, or simply documenting this as a known limitation.
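
One possible collision guard (an assumption, not the shipped script): append a numeric suffix when the timestamped name already exists.

```shell
#!/usr/bin/env bash
set -euo pipefail
OUTPUT_DIR=$(mktemp -d)
STAMP="2026-04-01-120000"
touch "${OUTPUT_DIR}/taskdeck-backup-${STAMP}.db"   # simulate an earlier run

BACKUP_FILE="${OUTPUT_DIR}/taskdeck-backup-${STAMP}.db"
n=1
while [[ -e "$BACKUP_FILE" ]]; do
  BACKUP_FILE="${OUTPUT_DIR}/taskdeck-backup-${STAMP}-${n}.db"
  n=$((n + 1))
done
echo "$BACKUP_FILE"   # picks ...-120000-1.db instead of overwriting
```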


Summary

The PR is in good shape after the fix-up commits. Finding #1 (WAL/SHM cleanup on restore) is the only one I would call a blocker — it is a real data corruption vector for the default EF Core SQLite configuration. The rest are improvements that could ship in a follow-up.

EF Core uses WAL mode by default. Both restore scripts replaced the
main .db file but left stale -wal and -shm companion files in place.
On next API startup the stale WAL would be replayed against the
restored database, risking silent data corruption.

Delete ${DB_PATH}-wal and ${DB_PATH}-shm (bash) and their PowerShell
equivalents immediately before the sqlite3 .restore / cp step.
The cp fallback (used when sqlite3 is unavailable) only copied the
main .db file, producing an incomplete snapshot for WAL-mode databases.
Now also copies the -wal file if it exists and warns that the backup
may be incomplete for WAL-mode databases.
@Chris0Jeky Chris0Jeky merged commit 4b6467a into main Apr 1, 2026
22 checks passed
@Chris0Jeky Chris0Jeky deleted the ops/backup-restore-dr-playbook branch April 1, 2026 20:20
@github-project-automation github-project-automation bot moved this from Pending to Done in Taskdeck Execution Apr 1, 2026