Infra fix processes declare success too early: single-curl snapshot check passes against pre-fix server state, masking bugs that only manifest on next system cycle

## Problem

The shell-script-editing infra fix process library uses a single one-shot HTTP check (`curl`) immediately after applying a fix to a shell script in order to verify the fix worked. This snapshot verification is fundamentally flawed: it interrogates the server that was already running **before** the fix was applied, not the server as it will behave under the **fixed** code on its next scheduled execution cycle.

The process reports `COMPLETED` / `FIXED`, but the fix may be broken and will only reveal itself when the next system cycle fires (e.g., a launchd health monitor running every 5 minutes).

---

## Root Cause

The verification step performs a single `curl` request immediately after patching the script. Because the server process is still alive from the pre-patch run, the HTTP check succeeds regardless of whether the patched code is correct. The check proves only that the server **is still up**, not that the **fix is safe to run**.

Key gap: there is no **dry-run / static analysis** of the patched script before application, no **wait-and-re-check** across at least one full monitor cycle, and no **post-cycle survival check** after the patched code has actually executed end-to-end.

---

## Evidence

Real-world reproduction in the **trip-planner** project:

1. The `fix-stale-sync` babysitter process added a call to `check_sync_freshness()` on the healthy-server code path inside the health-monitor shell script.
2. `check_sync_freshness()` internally calls `sync_project()`, which executes `rm -rf $DEV_DIR` (atomic directory swap) **while the server is running and serving from that directory**.
3. The single-`curl` snapshot check ran immediately after the patch was written. The server was still alive from its pre-patch state — check passed, process reported `COMPLETED`.
4. Five minutes later, launchd fired the health monitor. `check_sync_freshness()` executed, `rm -rf $DEV_DIR` deleted the live server directory, and the server crashed.
5. This failure loop repeated across **5 consecutive runs**, each declaring success, none surviving the next launchd cycle.

---

## Expected Behavior

- After applying a fix to a shell script, the verification step must wait for **at least one full monitor/cycle period** to elapse and then confirm the server is still healthy.
- Ideally, a **static analysis or dry-run** pass is performed on the patched script before it is applied, catching destructive operations (e.g., `rm -rf` inside a live-server code path) before they can execute.
- The process must not report `COMPLETED` until the server has survived a full cycle under the new code.

---

## Actual Behavior

- A single `curl` request is issued immediately after patching.
- The request succeeds because the server is still alive from before the fix.
- The process reports `COMPLETED` / `FIXED`.
- On the next system cycle, the patched code executes, crashes the server, and the failure is never attributed to the fix.

---

## Repro Steps

1. Set up a project with a launchd-managed health monitor shell script that runs every 5 minutes.
2. Use the infra fix process to patch the script, injecting a function call on the healthy-server code path.
3. Ensure the injected function contains a destructive operation that would crash the server (e.g., `rm -rf` of the server's working directory).
4. Observe that the single-`curl` verification immediately after patching reports success.
5. Wait 5 minutes (one launchd cycle).
6. Observe the server has crashed; the fix declared success is actually broken.

---

## Environment

| Field | Value |
|---|---|
| babysitter | 0.0.187 |
| Platform | macOS darwin |
| Node | 22.18.0 |
| Scheduler | launchd (KeepAlive=true) |
| Monitor interval | 5 minutes |

---

## Related Issues

- #82 — Content-aware verification (verifying the right thing after a change)
- #70 — Missing pre-flight health check before applying fixes
- #113 — Deterministic quality gates for fix processes

---

## Proposed Solution

1. **Static pre-flight analysis**: Before applying any patch to a shell script, parse the diff for dangerous patterns (`rm -rf`, `rmdir`, directory reassignment) in code paths that execute while the server is live. Abort with an actionable error if found.
2. **Post-cycle survival check**: After applying the fix, wait for at least one full monitor cycle period (or a configurable grace window) before issuing the verification check. The process must not advance to `COMPLETED` until the check passes **after** the patched code has had the opportunity to execute.
3. **Multi-point verification**: Replace the single `curl` snapshot with a minimum of two checks — one immediately (baseline) and one after the next system cycle — and require both to pass.
4. **Explicit completion gate**: Gate the `COMPLETED` / `FIXED` status behind the multi-point verification result, making it impossible to report success before the post-cycle check resolves.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Infra fix processes declare success too early: single-curl snapshot check passes against pre-fix server state, masking bugs that only manifest on next system cycle #123

Problem

Root Cause

Evidence

Expected Behavior

Actual Behavior

Repro Steps

Environment

Related Issues

Proposed Solution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Field	Value
babysitter	0.0.187
Platform	macOS darwin
Node	22.18.0
Scheduler	launchd (KeepAlive=true)
Monitor interval	5 minutes

Infra fix processes declare success too early: single-curl snapshot check passes against pre-fix server state, masking bugs that only manifest on next system cycle #123

Description

Problem

Root Cause

Evidence

Expected Behavior

Actual Behavior

Repro Steps

Environment

Related Issues

Proposed Solution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions