feat: add installer checkpoint/resume system#311
feat: add installer checkpoint/resume system#311boffin-dmytro wants to merge 6 commits intoLight-Heart-Labs:mainfrom
Conversation
Lightheartdevs
left a comment
There was a problem hiding this comment.
checkpoint_should_resume uses RESUME_PHASE=$(checkpoint_should_resume) which captures in a subshell — the read prompt inside can never see user input. Also, several early phases (02-detection, 03-features) are not idempotent: resuming from phase 4 after detection ran is safe, but the design should document which phases are safe to skip. Fix the subshell capture issue.
|
Thanks for the detailed feedback! You're correct - the Changes:
The prompt now works correctly and users can interact with the resume dialog. |
|
The checkpoint/resume concept is great for the installer. Key issue: \ uses \ which isn't set until phase 6 (). Phases 1-5 would write checkpoints to an undefined path. Fix: use a temp location for early phases, then migrate the checkpoint file once INSTALL_DIR is established. |
Lightheartdevs
left a comment
There was a problem hiding this comment.
Review: REQUEST CHANGES — Two P0 bugs would crash on first use
Good concept — resuming after failed multi-GB downloads is a real pain point. But the implementation has critical issues.
P0: Checkpoint file writes to nonexistent directory
CHECKPOINT_FILE="${INSTALL_DIR}/.install-checkpoint" is evaluated at source time, before phases run. Phase 06 (06-directories) is where INSTALL_DIR is actually created. Calling checkpoint_save for phases 01-05 will fail under set -euo pipefail because cat > targets a path inside a nonexistent directory.
Fix: Use /tmp/dream-server-checkpoint for early phases, migrate to INSTALL_DIR after phase 06.
P0: Resumed installs skip variable-setting phases
Resuming from phase 6 skips phases 02 (detection) and 03 (features), which set GPU_VENDOR, GPU_VRAM, TIER, ENABLE_* flags. Later phases depend on these. Under set -euo pipefail, unset variables crash immediately.
Fix (simple): Always re-run phases 01-04 (they are fast detection/validation). Only allow resume from phase 05+.
Fix (complete): Serialize all exported variables into the checkpoint file and restore on resume.
P1: Off-by-one in resume prompt
last_phase is the last completed phase. The prompt says "stopped at phase $last_phase, Resume from phase $last_phase?" but the user actually resumes from $last_phase + 1. Should say: "Completed through phase
P1: Non-atomic checkpoint writes
cat > "$CHECKPOINT_FILE" is not atomic. If killed mid-write (the exact scenario this feature handles), the file is truncated. Write to ${CHECKPOINT_FILE}.tmp then mv into place — mv is atomic on POSIX.
P2: Missing set -euo pipefail in checkpoint.sh
Per CLAUDE.md conventions, every bash file must declare this explicitly.
P2: checkpoint_save failure crashes the installer
If checkpoint_save fails (e.g., directory issue), the entire installer crashes attributed to infrastructure code, not the actual phase. Use: checkpoint_save 1 || warn "checkpoint save failed (non-fatal)"
P2: Non-interactive mode silently always resumes
When INTERACTIVE != "true", the function skips the prompt and returns 0 (resume). CI/automation probably wants clean installs. Consider defaulting to "start fresh" in non-interactive mode.
🤖 Reviewed with Claude Code
|
Thanks for the detailed review! I've addressed the checkpoint path issue and added several improvements. Main FixThe checkpoint system now uses a two-stage approach:
Additional ImprovementsWhile fixing this, I also addressed some edge cases:
TestingAdded comprehensive test coverage:
All tests passing ✓ Ready for another look when you have time. |
Summary
Adds checkpoint/resume capability to installer so users can resume from the last successful phase if installation is interrupted.
Changes
lib/checkpoint.shwith checkpoint management functionsCheckpoint Features
--forceflag (bypasses resume)--dry-run(no checkpoints saved)User Experience
If user resumes, phases 1-4 are skipped and installation continues from phase 6.
Impact
Testing
Total LOC: ~120 lines (70 checkpoint lib + 50 integration)