6 changes: 6 additions & 0 deletions .claude/skills/mesh-builder/SKILL.md
@@ -167,6 +167,12 @@ agents:

**Propagation:** Upstream agents must include the key in their completion message frontmatter for downstream agents to receive it. The consumer maps frontmatter fields to `payload` automatically.

**Reliability front-matter fields** (used by core agent for recovery, not in mesh configs):
- `recover: true` — triggers DLQ recovery for the target mesh
- `rewind-to: <state>` — override recovery session with checkpoint from named FSM state
- `session-id: <id>` — resume a specific SDK session
- `resume-mesh: true` — preserve mesh state instead of clearing on new entry
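A hypothetical recovery message using these fields (the state name and wording below are illustrative, not taken from an actual mesh):

```
---
recover: true
rewind-to: build
---
Recover the target mesh from its checkpoint at the 'build' FSM state.
```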

```
User message: feature: auth → prebuild gets "/know:prebuild auth"
Prebuild msg: feature: auth → builder gets "/know:build auth"
2 changes: 2 additions & 0 deletions .gitignore
@@ -28,3 +28,5 @@ meshes/*
!meshes/structured-thinking
!meshes/narrative-engine/
!meshes/narrative-engine-v2/
!meshes/reliability-test/
!meshes/reliability-fsm/
147 changes: 147 additions & 0 deletions docs/HUMAN_REVIEW.md
@@ -0,0 +1,147 @@
# Human Review Gates — Reliability

Every reliability feature includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures.

> **The system does the work. The human makes the decisions.**

- Retries within limits → automatic (but visible)
- Recovery, replay, escalation → always human-approved
- Failures → always surfaced with context and options
- No silent state changes that affect mesh behavior

See [reliability.md](./reliability.md) for feature details.

---

## Nine 1 — Basic Error Handling

### Worker Retries
- When retries exhaust → DLQ entry created → core presents failure to user
- User decides: retry with variation, recover from checkpoint, or drop

### Injection Poll Loop
- Stale entries (>5min) are dropped but remain available via `tx inbox`
- If file-based fallback activates, user sees pending messages on next interaction

### Routing Correction
- When routing retries exhaust → escalated to user with full attempt history
- User sees which targets were tried and picks correct one

### Usage Policy Errors
- Human chooses: retry, skip, modify prompt, or abort
- Full diagnostic context (triggering prompt, recent history) included in ask-human message

### Recovery Handler
- First 2 recovery requests: automatic guidance with FSM state and valid routes
- 3rd+ request in 60s: escalated to human — agent is repeatedly stuck

---

## Nine 2 — Validation & Protocol Enforcement

### Parity Gate
- Violations → reminder injected to agent
- If unresolved after reminder → surfaced to user with pending asks list

### Identity Gate
- Kill events → logged with full reason (agent ID, expected vs actual `from:` field)
- User can audit identity violations via logs

### Mesh Validator
- Validation errors → block mesh load, user sees what's wrong and how to fix it
- Warnings → logged but don't block (user can review in logs)

### Manifest Validator
- Validation failures → surfaced to user with missing/invalid paths and responsible agents

### Bash Guard
- 1-2 violations → error response with allowed paths shown to agent
- 3+ violations → worker killed, logged for forensics
- User can audit bash guard events in logs

---

## Nine 2.5 — Self-Healing & Auto-Recovery

### Nudge Detector
- Nudges are logged and visible in `tx spy`
- Max 1 nudge per agent prevents recovery loops

### Deadlock Breaker
- Shallow cycles (depth ≤ 3) → auto-broken, logged
- Deep cycles (depth 5+) → escalated to human with cycle visualization (A→B→C→A)
- User decides which agent's ask to drop
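A simplified sketch of the detection and decision step (hypothetical function names; the real breaker walks the live ask graph, and this sketch treats every cycle deeper than 3 as escalation-worthy):

```typescript
// Follow the wait-for chain from one blocked agent; if we revisit an
// agent, the slice from that point is the deadlock cycle (e.g. A→B→C→A).
function findCycle(
  waitsFor: Record<string, string | undefined>,
  start: string,
): string[] | null {
  const seen: string[] = [];
  let cur: string | undefined = start;
  while (cur !== undefined) {
    const i = seen.indexOf(cur);
    if (i !== -1) return seen.slice(i); // cycle found
    seen.push(cur);
    cur = waitsFor[cur];
  }
  return null; // chain terminates: no deadlock
}

// Shallow cycles are auto-broken; deeper ones go to the human
// with a visualization of the cycle.
function deadlockAction(cycle: string[]): "auto-break" | "escalate" {
  return cycle.length <= 3 ? "auto-break" : "escalate";
}
```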

### Stale Message Cleaner
- Stale messages archived with reason — no silent deletion
- User can audit via `tx spy` and review archived messages

### Quality Iteration Loops
- Max iterations hit → presents feedback history to user
- User decides: retry, accept current output, or drop
- Each iteration's feedback is visible for review

---

## Nine 3 — Monitoring, Circuit Breaking, DLQ

### Circuit Breaker
- Circuit open → agent skipped, logged with failure count
- Half-open test spawn → user can monitor via `tx mesh health`
- Circuits don't auto-close silently — health dashboard shows state

### Heartbeat Monitor
- Warn threshold → logged warning (no action)
- Stale threshold → logged stale warning
- Dead threshold → **worker killed**, failure recorded, routed to DLQ
- All events visible in `tx mesh health` with silence duration

### Dead Letter Queue
- **Recovery always requires human review** (except crash recovery on restart)
- Core agent diagnoses, presents options (resume vs rewind vs drop), gets explicit confirmation
- Available checkpoints shown before any recovery action
- `tx mesh dlq` shows all pending entries with recovery mode and context

### Checkpoint Log & Rewind-To
- Checkpoint notification: core can surface "Mesh X completed 'build' — checkpoint saved"
- Before rewind-to replay, core presents: which checkpoint, what replays, what's discarded
- Post-replay: result presented for user approval before mesh continues
- **Replay never happens without user choosing a checkpoint**

### SLI Tracker
- Threshold alerts: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout."
- Periodic health summary available via `tx mesh health`
- SLI data always visible — never hidden from user

### Safe Mode
- **Escalation beyond cautious requires user confirmation** when surfaced via core
- Auto-escalation (if enabled) is logged with reason and SLI data
- **Never auto-de-escalates** — human must clear via `resetMesh()` or `resetAll()`
- Core presents: "SLI recovered to 98%. Clear restricted mode for mesh X?"

---

## Roadmap — Nine 4

### Retry-With-Variation
- First failure: core reports "Agent X failed (model_error). Retrying with variation: [description]. Retry 1/3."
- Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context")
- Exhausted retries: core presents full retry history with variations tried — user decides next step
- New variation strategies require review before taking effect

### Output Schema Validation
- Validation failure: core reports "Agent X output failed validation: missing required field 'summary'."
- Before retry with validation feedback, core presents: "Ask agent X to fix? Validation errors: [list]. Or drop?"
- Schema changes in mesh config: core surfaces impact on existing agents
- Partial pass: core shows what passed/failed, user decides accept partial, retry, or drop

### Critical/Non-Critical Agent Classification
- On mesh load, core can surface classifications: "critical=[planner, builder], non-critical=[linter]"
- Non-critical failure: "Agent 'linter' failed (timeout). Mesh continues. Output from this step missing."
- Repeated non-critical failures: "Agent 'linter' failed 5 times. Promote to critical or disable?"
- Critical failures always stop the mesh and present recovery options

### Aggregate Observability Dashboard
- Anomaly alerts: "Mesh X failure rate spiked from 2% to 15% in last hour. Category: model_error."
- Cost review gate: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?"
- Dashboard is passive — all actions from insights go through standard human review
115 changes: 115 additions & 0 deletions docs/guardrails.md
@@ -358,3 +358,118 @@ max_turns:
warning: true
limit: 50
```

## Reliability (Four Nines)

The reliability module (`src/reliability/`) provides four-nines (99.99%) patterns inspired by Karpathy's "March of Nines". Each additional nine demands a fundamentally new approach:

| Nine | Target | TX Mechanism |
|------|--------|-------------|
| 1 (90%) | Basic error handling | Logging, guardrails, FSM validation |
| 2 (99%) | Message recovery | Dead Letter Queue, retry with backoff |
| 3 (99.9%) | Failure isolation | Circuit breakers, heartbeat monitoring |
| 4 (99.99%) | Proactive safety | SLI tracking, safe mode, failure taxonomy |

### Configuration

Add to `.ai/tx/data/config.yaml`:

```yaml
reliability:
circuitBreaker:
failureThreshold: 3 # Failures before circuit opens
cooldownMs: 60000 # Wait before probe request
windowMs: 300000 # Failure counting window (5 min)
heartbeat:
warnMs: 60000 # Silence before warning (1 min)
staleMs: 120000 # Silence before stale (2 min)
deadMs: 300000 # Silence before dead (5 min)
checkIntervalMs: 15000 # Check interval (15s)
safeMode:
defaultLevel: normal # normal | cautious | restricted | lockdown
autoEscalate: false # Auto-escalate based on SLI
cautiousThreshold: 0.95 # SLI rate triggering cautious mode
restrictedThreshold: 0.90 # SLI rate triggering restricted mode
lockdownThreshold: 0.80 # SLI rate triggering lockdown
dlq:
maxRetries: 3 # Max retries before DLQ
sli:
retentionMs: 604800000 # SLI data retention (7 days)
```

### Dead Letter Queue (DLQ)

Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually; the DLQ itself does not schedule retries or implement backoff.

- Integration with worker retries (fixed-delay; DLQ only after retries exhausted)
- Failure reason tracking for taxonomy
- Replay capability for manual recovery
- Stats available via `reliability.dlq.getStats()`
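A minimal in-memory sketch of the park-and-replay flow (illustrative names and signatures; the actual `dead-letter-queue.ts` persists entries in SQLite and exposes a richer API):

```typescript
// Sketch only: parks a message once worker retries exhaust, and hands
// it back on manual replay. Retry scheduling stays with the caller.
type DlqEntry = {
  mesh: string;
  agent: string;
  message: string;
  attempts: number;
  reason: string; // failure taxonomy category, e.g. "timeout"
};

class DeadLetterQueue {
  private entries: DlqEntry[] = [];
  constructor(private maxRetries = 3) {}

  // Called after each failed delivery; parks the message once retries exhaust.
  recordFailure(e: Omit<DlqEntry, "attempts">, attempts: number): boolean {
    if (attempts < this.maxRetries) return false; // caller retries again
    this.entries.push({ ...e, attempts });
    return true; // parked — awaits human-approved replay
  }

  // Manual recovery: remove the entry and hand it back for redelivery.
  replay(mesh: string): DlqEntry | undefined {
    const i = this.entries.findIndex((e) => e.mesh === mesh);
    return i === -1 ? undefined : this.entries.splice(i, 1)[0];
  }

  getStats() {
    return { pending: this.entries.length };
  }
}
```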

### Circuit Breaker

Prevents cascading failures when an agent repeatedly fails. Three states:

| State | Behavior |
|-------|----------|
| **Closed** | Normal — requests pass through |
| **Open** | Failures exceeded threshold — requests fail immediately |
| **Half-Open** | After cooldown — single probe request allowed |

Applied per-agent (`mesh/agent`). Resets on mesh completion.
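The state cycle can be sketched as follows (a simplified model with hypothetical names; the real breaker also counts failures within a sliding `windowMs`, which this sketch omits):

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(
    private failureThreshold = 3,
    private cooldownMs = 60_000,
    private now: () => number = Date.now, // injectable clock for testing
  ) {}

  state(): BreakerState {
    if (this.failures < this.failureThreshold) return "closed";
    // After the cooldown elapses, a single probe request is allowed.
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordFailure(): void {
    this.failures++;
    // (Re)open the circuit; a failed half-open probe restarts the cooldown.
    if (this.failures >= this.failureThreshold) this.openedAt = this.now();
  }

  recordSuccess(): void {
    this.failures = 0; // a successful probe closes the circuit
  }
}
```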

### Heartbeat Monitor

Detects stalled/hung workers by monitoring output timestamps:

| Level | Default | Action |
|-------|---------|--------|
| Warn | 60s silence | Log warning |
| Stale | 120s silence | Inject nudge to worker |
| Dead | 300s silence | Record failure, trigger circuit breaker |
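The threshold logic amounts to classifying the silence duration (an illustrative sketch using the default config values above; the real monitor polls on `checkIntervalMs` and performs the listed actions on level transitions):

```typescript
type HeartbeatLevel = "ok" | "warn" | "stale" | "dead";

// Map silence duration to a heartbeat level; thresholds are the
// documented defaults and would normally come from config.
function classifySilence(
  silenceMs: number,
  cfg = { warnMs: 60_000, staleMs: 120_000, deadMs: 300_000 },
): HeartbeatLevel {
  if (silenceMs >= cfg.deadMs) return "dead";   // kill worker, record failure
  if (silenceMs >= cfg.staleMs) return "stale"; // inject nudge
  if (silenceMs >= cfg.warnMs) return "warn";   // log warning only
  return "ok";
}
```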

### SLI Tracker

Tracks success rates, latencies, and failure categories per mesh:

- **Success rate**: Per-mesh and per-agent (target: 99.99%)
- **MTTR**: Mean time to recovery (failure → next success)
- **Failure taxonomy**: Categorized failures for targeted fixes
- **Nines level**: Human-readable "99.9% (3 nines)" display

Failure categories: `model_error`, `routing_error`, `timeout`, `guardrail_kill`, `crash`, `stuck`, `policy_violation`, `gate_failure`, `circuit_open`, `unknown`
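The nines display can be sketched as counting threshold crossings (hypothetical helper names; the real tracker derives the success rate from recorded per-mesh successes and failures before formatting it):

```typescript
// Count how many "nines" a success rate clears: 0.9 → 1, 0.99 → 2, …
// The small epsilon absorbs floating-point error near each threshold.
function ninesLevel(rate: number): number {
  let nines = 0;
  let threshold = 0.9;
  while (nines < 6 && rate >= threshold - 1e-12) {
    nines++;
    threshold = 1 - (1 - threshold) / 10; // 0.9 → 0.99 → 0.999 → …
  }
  return nines;
}

// Human-readable form, e.g. "99.9% (3 nines)".
function ninesLabel(rate: number): string {
  const n = ninesLevel(rate);
  return `${+(rate * 100).toFixed(2)}% (${n} nine${n === 1 ? "" : "s"})`;
}
```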

### Safe Mode

Treat autonomy as a knob, not a switch. Four levels:

| Level | Tools Disabled | Actions Blocked |
|-------|---------------|-----------------|
| **normal** | None | None |
| **cautious** | None | Destructive bash, git push, file delete |
| **restricted** | Write, Edit, Bash | All writes, all bash, git operations |
| **lockdown** | All tools | All operations (stops agent execution) |

Safe mode can be:
- Set manually per-mesh or globally
- Auto-escalated based on SLI thresholds (when `autoEscalate: true`)
- Escalated automatically but never de-escalated; a human must clear the level
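The escalate-only behavior can be sketched as a monotonic level transition (hypothetical helper names; the thresholds mirror the config defaults shown earlier):

```typescript
const LEVELS = ["normal", "cautious", "restricted", "lockdown"] as const;
type Level = (typeof LEVELS)[number];

// Map an SLI success rate to the level the thresholds would demand.
function levelForSli(
  sli: number,
  cfg = { cautiousThreshold: 0.95, restrictedThreshold: 0.9, lockdownThreshold: 0.8 },
): Level {
  if (sli < cfg.lockdownThreshold) return "lockdown";
  if (sli < cfg.restrictedThreshold) return "restricted";
  if (sli < cfg.cautiousThreshold) return "cautious";
  return "normal";
}

// Transitions are monotonic: the computed level can only raise the
// current one. De-escalation requires an explicit human reset.
function nextLevel(current: Level, sli: number): Level {
  const suggested = levelForSli(sli);
  return LEVELS.indexOf(suggested) > LEVELS.indexOf(current) ? suggested : current;
}
```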

### Test Meshes

Two meshes for testing reliability features:

- **`reliability-test`**: Simple 3-agent linear mesh (planner → worker → checker) with tight guardrails
- **`reliability-fsm`**: FSM-based mesh with gate scripts, iteration tracking, and state transitions

### Implementation

| File | Role |
|------|------|
| `src/reliability/index.ts` | Module exports |
| `src/reliability/reliability-manager.ts` | Central coordinator (single integration point) |
| `src/reliability/dead-letter-queue.ts` | DLQ with SQLite persistence |
| `src/reliability/circuit-breaker.ts` | Per-agent circuit breaker |
| `src/reliability/heartbeat-monitor.ts` | Stalled worker detection |
| `src/reliability/sli-tracker.ts` | SLI measurement and nines calculation |
| `src/reliability/safe-mode.ts` | Gradual autonomy control |