
feat: add installer checkpoint/resume system #311

Open
boffin-dmytro wants to merge 6 commits into Light-Heart-Labs:main from boffin-dmytro:feat/installer-checkpoint-resume

@boffin-dmytro
Contributor

Summary

Adds checkpoint/resume capability to the installer so users can resume from the last successful phase if installation is interrupted.

Changes

  • Add lib/checkpoint.sh with checkpoint management functions
  • Save checkpoint after each successful phase
  • Detect previous installation on startup
  • Prompt user to resume or start fresh
  • Skip completed phases when resuming
  • Clear checkpoint after successful installation
  • Checkpoint expires after 24 hours (prevents stale state)

Checkpoint Features

  • Saves phase number, timestamp, install dir, version
  • Interactive prompt to resume or start fresh
  • Validates checkpoint age (<24 hours)
  • Works with --force flag (bypasses resume)
  • Works with --dry-run (no checkpoints saved)
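The 24-hour age check could look roughly like the sketch below. The function name `checkpoint_is_fresh` and the optional "current time" argument are illustrative assumptions, not code from this PR; the only detail taken from the description is the 24-hour limit.

```bash
# Maximum checkpoint age in seconds (24 hours, per the PR description).
CHECKPOINT_MAX_AGE=$((24 * 60 * 60))

# Hypothetical helper: returns 0 if the checkpoint is younger than the limit.
# $1 = checkpoint timestamp in epoch seconds
# $2 = current time in epoch seconds (optional, defaults to now)
checkpoint_is_fresh() {
    local saved="$1"
    local now="${2:-$(date +%s)}"
    local age=$(( now - saved ))
    # Reject timestamps from the future as well as stale ones.
    (( age >= 0 && age < CHECKPOINT_MAX_AGE ))
}
```

Passing the current time as a parameter keeps the check testable without touching the system clock.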

User Experience

Previous installation detected (stopped at phase 5)
Resume from phase 5? [Y/n]

If the user resumes, phases 1-5 are skipped and installation continues from phase 6.

Impact

  • Users can safely Ctrl+C during long downloads
  • Network failures don't require starting over
  • Saves time on multi-GB model downloads (GGUF, FLUX)
  • Improves reliability of installation process
  • Especially helpful for slow connections and low-end hardware

Testing

  • Bash syntax validated
  • Checkpoint file format is simple key=value pairs
  • Safe to delete checkpoint file manually if needed
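For illustration, a key=value checkpoint file could be written and read as in the sketch below. The key names (`phase`, `timestamp`, `install_dir`, `version`), the helper names, and the default path are assumptions made for this example; the actual lib/checkpoint.sh is not shown in this thread.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical location, used only by this sketch.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-$(mktemp -d)/install-checkpoint}"

# Write one key=value pair per line.
# $1 = last completed phase, $2 = install dir, $3 = installer version
checkpoint_write() {
    printf 'phase=%s\ntimestamp=%s\ninstall_dir=%s\nversion=%s\n' \
        "$1" "$(date +%s)" "$2" "$3" > "$CHECKPOINT_FILE"
}

# Print the value stored for one key; return 1 if the key is absent.
# The file is parsed rather than sourced, so a corrupt or hand-edited
# checkpoint cannot execute arbitrary code.
checkpoint_get() {
    local key="$1" k v
    while IFS='=' read -r k v; do
        if [[ "$k" == "$key" ]]; then
            printf '%s\n' "$v"
            return 0
        fi
    done < "$CHECKPOINT_FILE"
    return 1
}
```

Because each line is self-describing, deleting or inspecting the file by hand stays straightforward.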

Total LOC: ~120 lines (70 checkpoint lib + 50 integration)

Collaborator

@Lightheartdevs Lightheartdevs left a comment

RESUME_PHASE=$(checkpoint_should_resume) runs checkpoint_should_resume in a command-substitution subshell, so the read prompt inside it can never see user input. Fix the subshell capture issue. Also, several early phases (02-detection, 03-features) are not idempotent: resuming from phase 4 after detection ran is safe, but the design should document which phases are safe to skip.

@boffin-dmytro
Contributor Author

Thanks for the detailed feedback! You're correct - the read prompt inside checkpoint_should_resume() couldn't receive user input when called in a command substitution subshell.

Changes:

  • Split checkpoint_should_resume() into two functions:
    • checkpoint_prompt_resume() - prompts user and returns 0/1 (must be called in parent shell)
    • checkpoint_load() - returns phase number (can be called in subshell)
  • Updated install-core.sh to call checkpoint_prompt_resume() directly, then checkpoint_load() in command substitution
  • Added </dev/tty to the read command to ensure input works correctly
  • Added idempotency documentation to checkpoint.sh header explaining which phases are safe to resume from (phases 05+ are generally idempotent; phases 01-04 perform detection and may produce different results if system state changed)

The prompt now works correctly and users can interact with the resume dialog.
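The split described above might look something like the following sketch. The function names come from this comment, but the bodies, the `phase=` key, and the optional input-source argument are assumptions (the argument exists only so the prompt can be exercised without a terminal).

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prompt in the parent shell; the exit status carries the answer (0 = resume).
# $1 = where to read the answer from (assumed parameter, defaults to /dev/tty)
checkpoint_prompt_resume() {
    local input="${1:-/dev/tty}" answer=""
    # Print the prompt on stderr so it is visible even if stdout is captured.
    printf 'Resume previous installation? [Y/n] ' >&2
    read -r answer < "$input" || true
    # Empty answer (just Enter) or anything starting with y/Y means resume.
    [[ -z "$answer" || "$answer" =~ ^[Yy] ]]
}

# Output only, no prompt: safe to call inside $(...).
# $1 = checkpoint file containing a phase=N line (format assumed)
checkpoint_load() {
    sed -n 's/^phase=//p' "$1"
}

# Typical call site (prompt in the parent shell, then command substitution):
#   if checkpoint_prompt_resume; then
#       RESUME_PHASE=$(checkpoint_load "$CHECKPOINT_FILE")
#   fi
```

Keeping the interactive part out of the command substitution means only the non-interactive function has its stdout captured.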

@Lightheartdevs
Collaborator

The checkpoint/resume concept is great for the installer. Key issue: the checkpoint file path uses INSTALL_DIR, which isn't set until phase 6. Phases 1-5 would write checkpoints to an undefined path. Fix: use a temp location for early phases, then migrate the checkpoint file once INSTALL_DIR is established.

Collaborator

@Lightheartdevs Lightheartdevs left a comment

Review: REQUEST CHANGES — Two P0 bugs would crash on first use

Good concept — resuming after failed multi-GB downloads is a real pain point. But the implementation has critical issues.

P0: Checkpoint file writes to nonexistent directory

CHECKPOINT_FILE="${INSTALL_DIR}/.install-checkpoint" is evaluated at source time, before phases run. Phase 06 (06-directories) is where INSTALL_DIR is actually created. Calling checkpoint_save for phases 01-05 will fail under set -euo pipefail because cat > targets a path inside a nonexistent directory.

Fix: Use /tmp/dream-server-checkpoint for early phases, migrate to INSTALL_DIR after phase 06.

P0: Resumed installs skip variable-setting phases

Resuming from phase 6 skips phases 02 (detection) and 03 (features), which set GPU_VENDOR, GPU_VRAM, TIER, ENABLE_* flags. Later phases depend on these. Under set -euo pipefail, unset variables crash immediately.

Fix (simple): Always re-run phases 01-04 (they are fast detection/validation). Only allow resume from phase 05+.

Fix (complete): Serialize all exported variables into the checkpoint file and restore on resume.
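A minimal sketch of the simple fix: decide per phase whether to run, always re-running the fast detection phases. `should_run_phase` and `FIRST_RESUMABLE_PHASE` are hypothetical names, and phase numbers are assumed to be plain integers (a zero-padded value such as 08 would be rejected by bash arithmetic as invalid octal).

```bash
# Phases below this number always re-run, per the review suggestion.
FIRST_RESUMABLE_PHASE=5

# Hypothetical helper: returns 0 if the phase should run, 1 if it can be skipped.
# $1 = phase number (plain integer)
# $2 = last completed phase from the checkpoint (0 = no checkpoint)
should_run_phase() {
    local phase="$1" last_done="$2"
    # Detection/validation phases set variables later phases depend on.
    (( phase < FIRST_RESUMABLE_PHASE )) && return 0
    # From the first resumable phase on, skip work that already completed.
    (( phase > last_done ))
}
```

With a checkpoint at phase 7, this would run phases 1-4 again, skip 5-7, and continue from 8.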

P1: Off-by-one in resume prompt

last_phase is the last completed phase. The prompt says "stopped at phase $last_phase, Resume from phase $last_phase?" but the user actually resumes from $last_phase + 1. Should say: "Completed through phase $last_phase. Resume from phase $((last_phase + 1))?"
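As a small sketch, the corrected wording can be built from the last completed phase like this. `checkpoint_prompt_text` is a hypothetical helper name; only the message text is taken from the suggestion above.

```bash
# Hypothetical helper: build the resume prompt from the last *completed* phase.
# $1 = last completed phase (plain integer)
checkpoint_prompt_text() {
    local last_phase="$1"
    printf 'Completed through phase %d. Resume from phase %d? [Y/n] ' \
        "$last_phase" "$(( last_phase + 1 ))"
}
```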

P1: Non-atomic checkpoint writes

cat > "$CHECKPOINT_FILE" is not atomic. If killed mid-write (the exact scenario this feature handles), the file is truncated. Write to ${CHECKPOINT_FILE}.tmp, then mv into place: within a single filesystem, mv is an atomic rename on POSIX.
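A sketch of the write-then-rename pattern, assuming a hypothetical helper named `atomic_write` that takes the content on stdin. The temp file sits next to the target, so both are on the same filesystem and the final mv is a rename.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: write stdin to $1 so that readers see either the old
# file or the complete new one, never a partially written file.
atomic_write() {
    local target="$1"
    local tmp="${target}.tmp"
    cat > "$tmp"             # an interrupted write only damages the temp file
    mv -f "$tmp" "$target"   # rename within one filesystem is atomic
}
```

If the process is killed during `cat`, a stale `.tmp` file may be left behind, but the previous checkpoint stays intact.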

P2: Missing set -euo pipefail in checkpoint.sh

Per CLAUDE.md conventions, every bash file must declare this explicitly.

P2: checkpoint_save failure crashes the installer

If checkpoint_save fails (e.g., directory issue), the entire installer crashes, and the failure is attributed to infrastructure code rather than the actual phase. Use: checkpoint_save 1 || warn "checkpoint save failed (non-fatal)"
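To show why the `||` matters under set -euo pipefail, here is a self-contained sketch. The `warn` and `checkpoint_save` bodies are stand-ins (the real ones are not shown in this thread), and `checkpoint_save_nonfatal` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for the installer's warn helper.
warn() { printf 'WARN: %s\n' "$*" >&2; }

# Stand-in checkpoint_save: fails when the target directory does not exist.
checkpoint_save() {
    printf 'phase=%s\n' "$1" > "$CHECKPOINT_FILE"
}

# Hypothetical wrapper: a failed save is logged but never aborts the installer,
# because a command on the left of || is exempt from set -e.
checkpoint_save_nonfatal() {
    checkpoint_save "$1" 2>/dev/null || warn "checkpoint save failed (non-fatal)"
}

CHECKPOINT_FILE="/nonexistent-dir/.install-checkpoint"
checkpoint_save_nonfatal 1
echo "phase 1 done, installer continues"
```

Without the `|| warn ...`, the failed redirection would terminate the script before the final echo.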

P2: Non-interactive mode silently always resumes

When INTERACTIVE != "true", the function skips the prompt and returns 0 (resume). CI/automation probably wants clean installs. Consider defaulting to "start fresh" in non-interactive mode.
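One way the "start fresh by default" behaviour could be sketched is shown below. `checkpoint_want_resume` is a hypothetical name, and the input-source argument is an assumption added so the function can run without a terminal; only the INTERACTIVE variable comes from the review.

```bash
# Hypothetical helper: returns 0 to resume, 1 to start fresh.
# $1 = where to read the answer from (assumed parameter, defaults to /dev/tty)
checkpoint_want_resume() {
    # Non-interactive runs (CI, automation) never resume implicitly.
    if [[ "${INTERACTIVE:-false}" != "true" ]]; then
        return 1
    fi
    local answer=""
    read -r -p 'Resume? [Y/n] ' answer < "${1:-/dev/tty}" || true
    # Empty answer or y/Y means resume; anything else starts fresh.
    [[ -z "$answer" || "$answer" =~ ^[Yy] ]]
}
```

Treating an unset INTERACTIVE the same as "false" keeps the safe default when the variable was never exported.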

🤖 Reviewed with Claude Code

@boffin-dmytro
Contributor Author

Thanks for the detailed review! I've addressed the checkpoint path issue and added several improvements.

Main Fix

The checkpoint system now uses a two-stage approach:

  • Phases 1-5: Writes to ~/.cache/dream-server-install-checkpoint (temp location, since INSTALL_DIR doesn't exist yet)
  • Phase 6: Migrates checkpoint to ${INSTALL_DIR}/.install-checkpoint after creating the directory
  • Phases 7+: Writes directly to final location
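The two-stage approach above can be sketched roughly as follows. The two paths are the ones named in this comment; the function names `checkpoint_path` and `checkpoint_migrate` and their bodies are assumptions, not the PR's actual code.

```bash
# Temp location used before INSTALL_DIR exists (path from the comment above).
TEMP_CHECKPOINT="${HOME}/.cache/dream-server-install-checkpoint"

# Hypothetical helper: print where the checkpoint should live right now.
checkpoint_path() {
    if [[ -n "${INSTALL_DIR:-}" && -d "$INSTALL_DIR" ]]; then
        printf '%s\n' "${INSTALL_DIR}/.install-checkpoint"
    else
        printf '%s\n' "$TEMP_CHECKPOINT"
    fi
}

# Hypothetical helper: call in phase 6, after INSTALL_DIR has been created
# and before the phase 6 checkpoint is saved.
checkpoint_migrate() {
    local final="${INSTALL_DIR}/.install-checkpoint"
    if [[ -f "$TEMP_CHECKPOINT" ]]; then
        mv -f "$TEMP_CHECKPOINT" "$final"
    fi
}
```

Resolving the path at call time, rather than once when the library is sourced, avoids freezing a location that does not exist yet.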

Additional Improvements

While fixing this, I also addressed some edge cases:

  • Migration happens before saving phase 6 (prevents writing to wrong location)
  • --dry-run mode skips checkpoint writes entirely
  • --force flag clears any existing checkpoint before starting
  • Checkpoint validation rejects mismatched INSTALL_DIR paths
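The INSTALL_DIR consistency check could be as small as the sketch below. The function name and the `install_dir=` key are assumptions for illustration; the PR's actual validation code is not shown in this thread.

```bash
# Hypothetical helper: returns 0 only if the checkpoint was written for the
# current INSTALL_DIR.
# $1 = checkpoint file containing an install_dir=PATH line (key name assumed)
checkpoint_matches_install_dir() {
    local saved
    saved=$(sed -n 's/^install_dir=//p' "$1")
    # A checkpoint without the key is treated as a mismatch.
    [[ -n "$saved" && "$saved" == "${INSTALL_DIR:-}" ]]
}
```

A mismatch would typically mean the user re-ran the installer with a different target directory, so the old checkpoint should not be trusted.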

Testing

Added comprehensive test coverage:

  • test-checkpoint-migration.sh - 7 test cases covering temp/final locations, migration, staleness
  • test-checkpoint-phase6-order.sh - Verifies migration order
  • test-checkpoint-install-dir-validation.sh - Validates path consistency

All tests passing ✓

Ready for another look when you have time.
