Skip to content

Infra fix processes declare success too early: single-curl snapshot check passes against pre-fix server state, masking bugs that only manifest on next system cycle #123

@rogelsm

Description

@rogelsm

Problem

The shell-script-editing infra fix process library uses a single one-shot HTTP check (curl) immediately after applying a fix to a shell script in order to verify the fix worked. This snapshot verification is fundamentally flawed: it interrogates the server that was already running before the fix was applied, not the server as it will behave under the fixed code on its next scheduled execution cycle.

The process reports COMPLETED / FIXED, but the fix may be broken and will only reveal itself when the next system cycle fires (e.g., a launchd health monitor running every 5 minutes).


Root Cause

The verification step performs a single curl request immediately after patching the script. Because the server process is still alive from the pre-patch run, the HTTP check succeeds regardless of whether the patched code is correct. The check proves only that the server is still up, not that the fix is safe to run.

Key gap: there is no dry-run / static analysis of the patched script before application, no wait-and-re-check across at least one full monitor cycle, and no post-cycle survival check after the patched code has actually executed end-to-end.


Evidence

Real-world reproduction in the trip-planner project:

  1. The fix-stale-sync babysitter process added a call to check_sync_freshness() on the healthy-server code path inside the health-monitor shell script.
  2. check_sync_freshness() internally calls sync_project(), which executes rm -rf $DEV_DIR (atomic directory swap) while the server is running and serving from that directory.
  3. The single-curl snapshot check ran immediately after the patch was written. The server was still alive from its pre-patch state — check passed, process reported COMPLETED.
  4. Five minutes later, launchd fired the health monitor. check_sync_freshness() executed, rm -rf $DEV_DIR deleted the live server directory, and the server crashed.
  5. This failure loop repeated across 5 consecutive runs, each declaring success, none surviving the next launchd cycle.

Expected Behavior

  • After applying a fix to a shell script, the verification step must wait for at least one full monitor/cycle period to elapse and then confirm the server is still healthy.
  • Ideally, a static analysis or dry-run pass is performed on the patched script before it is applied, catching destructive operations (e.g., rm -rf inside a live-server code path) before they can execute.
  • The process must not report COMPLETED until the server has survived a full cycle under the new code.

Actual Behavior

  • A single curl request is issued immediately after patching.
  • The request succeeds because the server is still alive from before the fix.
  • The process reports COMPLETED / FIXED.
  • On the next system cycle, the patched code executes, crashes the server, and the failure is never attributed to the fix.

Repro Steps

  1. Set up a project with a launchd-managed health monitor shell script that runs every 5 minutes.
  2. Use the infra fix process to patch the script, injecting a function call on the healthy-server code path.
  3. Ensure the injected function contains a destructive operation that would crash the server (e.g., rm -rf of the server's working directory).
  4. Observe that the single-curl verification immediately after patching reports success.
  5. Wait 5 minutes (one launchd cycle).
  6. Observe the server has crashed; the fix declared success is actually broken.

Environment

Field Value
babysitter 0.0.187
Platform macOS darwin
Node 22.18.0
Scheduler launchd (KeepAlive=true)
Monitor interval 5 minutes

Related Issues


Proposed Solution

  1. Static pre-flight analysis: Before applying any patch to a shell script, parse the diff for dangerous patterns (rm -rf, rmdir, directory reassignment) in code paths that execute while the server is live. Abort with an actionable error if found.
  2. Post-cycle survival check: After applying the fix, wait for at least one full monitor cycle period (or a configurable grace window) before issuing the verification check. The process must not advance to COMPLETED until the check passes after the patched code has had the opportunity to execute.
  3. Multi-point verification: Replace the single curl snapshot with a minimum of two checks — one immediately (baseline) and one after the next system cycle — and require both to pass.
  4. Explicit completion gate: Gate the COMPLETED / FIXED status behind the multi-point verification result, making it impossible to report success before the post-cycle check resolves.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions