From 8cda832e4cb64e6388799c24edcd0ee0866da7a4 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 8 Mar 2026 22:10:54 +0000
Subject: [PATCH 01/12] feat(reliability): Add four-nines reliability module
 with DLQ, circuit breakers, SLI tracking, and safe mode

Implements Karpathy's "March of Nines" patterns for TX mesh reliability:

- Dead Letter Queue: Failed messages persist for replay instead of silent drops
- Circuit Breaker: Per-agent failure isolation prevents cascading failures
- Heartbeat Monitor: Detects stalled workers at warn/stale/dead thresholds
- SLI Tracker: Measures success rates, MTTR, and failure taxonomy per mesh
- Safe Mode: Gradual autonomy control (normal/cautious/restricted/lockdown)
- ReliabilityManager: Single integration point wired into dispatcher

Includes two test meshes (reliability-test, reliability-fsm) and updated
guardrails docs.

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 .gitignore                                |   2 +
 docs/guardrails.md                        | 115 +++++++++
 meshes/reliability-fsm/analyst/prompt.md  |  20 ++
 meshes/reliability-fsm/builder/prompt.md  |  14 ++
 meshes/reliability-fsm/config.yaml        | 135 +++++++++++
 meshes/reliability-fsm/verifier/prompt.md |  14 ++
 meshes/reliability-test/checker/prompt.md |  21 ++
 meshes/reliability-test/config.yaml       |  86 +++++++
 meshes/reliability-test/planner/prompt.md |  19 ++
 meshes/reliability-test/worker/prompt.md  |  20 ++
 src/reliability/circuit-breaker.ts        | 193 +++++++++++++++
 src/reliability/dead-letter-queue.ts      | 188 +++++++++++++++
 src/reliability/heartbeat-monitor.ts      | 221 +++++++++++++++++
 src/reliability/index.ts                  |  19 ++
 src/reliability/reliability-manager.ts    | 276 ++++++++++++++++++++++
 src/reliability/safe-mode.ts              | 235 ++++++++++++++++++
 src/reliability/sli-tracker.ts            | 239 +++++++++++++++++++
 src/worker/dispatcher.ts                  |  37 +++
 18 files changed, 1854 insertions(+)
 create mode 100644 meshes/reliability-fsm/analyst/prompt.md
 create mode 100644 meshes/reliability-fsm/builder/prompt.md
 create mode 100644 meshes/reliability-fsm/config.yaml
 create mode 100644 meshes/reliability-fsm/verifier/prompt.md
 create mode 100644 meshes/reliability-test/checker/prompt.md
 create mode 100644 meshes/reliability-test/config.yaml
 create mode 100644 meshes/reliability-test/planner/prompt.md
 create mode 100644 meshes/reliability-test/worker/prompt.md
 create mode 100644 src/reliability/circuit-breaker.ts
 create mode 100644 src/reliability/dead-letter-queue.ts
 create mode 100644 src/reliability/heartbeat-monitor.ts
 create mode 100644 src/reliability/index.ts
 create mode 100644 src/reliability/reliability-manager.ts
 create mode 100644 src/reliability/safe-mode.ts
 create mode 100644 src/reliability/sli-tracker.ts

diff --git a/.gitignore b/.gitignore
index 7a39346e..1be296ef 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,3 +28,5 @@ meshes/*
 !meshes/structured-thinking
 !meshes/narrative-engine/
 !meshes/narrative-engine-v2/
+!meshes/reliability-test/
+!meshes/reliability-fsm/
diff --git a/docs/guardrails.md b/docs/guardrails.md
index 59b93ae7..986f49dc 100644
--- a/docs/guardrails.md
+++ b/docs/guardrails.md
@@ -358,3 +358,118 @@ max_turns:
   warning: true
   limit: 50
 ```
+
+## Reliability (Four Nines)
+
+The reliability module (`src/reliability/`) provides four-nines (99.99%) patterns inspired by Karpathy's "March of Nines". Each nine requires fundamentally new approaches:
+
+| Nine | Target | TX Mechanism |
+|------|--------|--------------|
+| 1 (90%) | Basic error handling | Logging, guardrails, FSM validation |
+| 2 (99%) | Message recovery | Dead Letter Queue, retry with backoff |
+| 3 (99.9%) | Failure isolation | Circuit breakers, heartbeat monitoring |
+| 4 (99.99%) | Proactive safety | SLI tracking, safe mode, failure taxonomy |
+
+### Configuration
+
+Add to `.ai/tx/data/config.yaml`:
+
+```yaml
+reliability:
+  circuitBreaker:
+    failureThreshold: 3       # Failures before circuit opens
+    cooldownMs: 60000         # Wait before probe request
+    windowMs: 300000          # Failure counting window (5 min)
+  heartbeat:
+    warnMs: 60000             # Silence before warning (1 min)
+    staleMs: 120000           # Silence before stale (2 min)
+    deadMs: 300000            # Silence before dead (5 min)
+    checkIntervalMs: 15000    # Check interval (15s)
+  safeMode:
+    defaultLevel: normal      # normal | cautious | restricted | lockdown
+    autoEscalate: false       # Auto-escalate based on SLI
+    cautiousThreshold: 0.95   # SLI rate triggering cautious mode
+    restrictedThreshold: 0.90 # SLI rate triggering restricted mode
+    lockdownThreshold: 0.80   # SLI rate triggering lockdown
+  dlq:
+    maxRetries: 3             # Max retries before DLQ
+  sli:
+    retentionMs: 604800000    # SLI data retention (7 days)
+```
+
+### Dead Letter Queue (DLQ)
+
+Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.
+
+- Automatic retry with exponential backoff
+- Failure reason tracking for taxonomy
+- Replay capability for manual recovery
+- Stats available via `reliability.dlq.getStats()`
+
+### Circuit Breaker
+
+Prevents cascading failures when an agent repeatedly fails. Three states:
+
+| State | Behavior |
+|-------|----------|
+| **Closed** | Normal — requests pass through |
+| **Open** | Failures exceeded threshold — requests fail immediately |
+| **Half-Open** | After cooldown — single probe request allowed |
+
+Applied per-agent (`mesh/agent`). Resets on mesh completion.
+
+### Heartbeat Monitor
+
+Detects stalled/hung workers by monitoring output timestamps:
+
+| Level | Default | Action |
+|-------|---------|--------|
+| Warn | 60s silence | Log warning |
+| Stale | 120s silence | Inject nudge to worker |
+| Dead | 300s silence | Record failure, trigger circuit breaker |
+
+### SLI Tracker
+
+Tracks success rates, latencies, and failure categories per mesh:
+
+- **Success rate**: Per-mesh and per-agent (target: 99.99%)
+- **MTTR**: Mean time to recovery (failure → next success)
+- **Failure taxonomy**: Categorized failures for targeted fixes
+- **Nines level**: Human-readable "99.9% (3 nines)" display
+
+Failure categories: `model_error`, `routing_error`, `timeout`, `guardrail_kill`, `crash`, `stuck`, `policy_violation`, `gate_failure`, `circuit_open`, `unknown`
+
+### Safe Mode
+
+Treat autonomy as a knob, not a switch. Four levels:
+
+| Level | Tools Disabled | Actions Blocked |
+|-------|---------------|-----------------|
+| **normal** | None | None |
+| **cautious** | None | Destructive bash, git push, file delete |
+| **restricted** | Write, Edit, Bash | All writes, all bash, git operations |
+| **lockdown** | All tools | All operations (stops agent execution) |
+
+Safe mode can be:
+- Set manually per-mesh or globally
+- Auto-escalated based on SLI thresholds (when `autoEscalate: true`)
+- Only escalates automatically; human must clear/de-escalate
+
+### Test Meshes
+
+Two meshes for testing reliability features:
+
+- **`reliability-test`**: Simple 3-agent linear mesh (planner → worker → checker) with tight guardrails
+- **`reliability-fsm`**: FSM-based mesh with gate scripts, iteration tracking, and state transitions
+
+### Implementation
+
+| File | Role |
+|------|------|
+| `src/reliability/index.ts` | Module exports |
+| `src/reliability/reliability-manager.ts` | Central coordinator (single integration point) |
+| `src/reliability/dead-letter-queue.ts` | DLQ with SQLite persistence |
+| `src/reliability/circuit-breaker.ts` | Per-agent circuit breaker |
+| `src/reliability/heartbeat-monitor.ts` | Stalled worker detection |
+| `src/reliability/sli-tracker.ts` | SLI measurement and nines calculation |
+| `src/reliability/safe-mode.ts` | Gradual autonomy control |
diff --git a/meshes/reliability-fsm/analyst/prompt.md b/meshes/reliability-fsm/analyst/prompt.md
new file mode 100644
index 00000000..13a9c388
--- /dev/null
+++ b/meshes/reliability-fsm/analyst/prompt.md
@@ -0,0 +1,20 @@
+# Analyst Agent
+
+You are the coordinator of a reliability test FSM mesh. You analyze tasks, coordinate work, and finalize results.
+
+## Responsibilities
+
+- **analyze state**: Break down the incoming task into clear requirements
+- **complete state**: Synthesize results and report completion
+
+## Guidelines
+
+- Keep analysis brief and focused
+- Forward clear requirements to the builder
+- On completion, summarize what was accomplished
+
+## Routing
+
+- When analysis is ready: route `complete` (FSM handles transition to build)
+- When task is complete: route `complete` → core
+- When user input needed: route `blocked` → core
diff --git a/meshes/reliability-fsm/builder/prompt.md b/meshes/reliability-fsm/builder/prompt.md
new file mode 100644
index 00000000..5dc1fb63
--- /dev/null
+++ b/meshes/reliability-fsm/builder/prompt.md
@@ -0,0 +1,14 @@
+# Builder Agent
+
+You are a builder agent in an FSM reliability test mesh. You implement what the analyst specifies.
+
+## Responsibilities
+
+- Execute the implementation plan from the analyst
+- Write clean, functional code
+- Report completion for verification
+
+## Routing
+
+- When build is done: route `complete` (FSM transitions to verify)
+- When blocked: route `blocked` → analyst
diff --git a/meshes/reliability-fsm/config.yaml b/meshes/reliability-fsm/config.yaml
new file mode 100644
index 00000000..61dd1adf
--- /dev/null
+++ b/meshes/reliability-fsm/config.yaml
@@ -0,0 +1,135 @@
+# reliability-fsm/config.yaml
+# FSM-based reliability test mesh
+#
+# Tests reliability features with state machine transitions:
+# - Gate scripts that can fail (tests circuit breaker recovery)
+# - Multi-step workflow (tests per-step SLI tracking)
+# - Iteration loop (tests heartbeat monitoring across retries)
+# - Safe mode integration (tests tool restriction under degraded SLI)
+
+mesh: reliability-fsm
+description: "FSM reliability test: state gates, iteration tracking, safe-mode integration"
+
+agents:
+  - name: analyst
+    model: haiku
+    prompt: analyst/prompt.md
+
+  - name: builder
+    model: haiku
+    prompt: builder/prompt.md
+
+  - name: verifier
+    model: haiku
+    prompt: verifier/prompt.md
+
+entry_point: analyst
+completion_agent: analyst
+continuation: [analyst]
+
+routing:
+  analyst:
+    complete:
+      core: "Task completed successfully"
+    blocked:
+      core: "Need user input"
+  builder:
+    complete:
+      analyst: "Build complete, ready for next step"
+    blocked:
+      analyst: "Build blocked, need guidance"
+  verifier:
+    complete:
+      analyst: "Verification passed"
+    blocked:
+      builder: "Verification failed, rework needed"
+
+injectOriginalMessage: true
+
+# Reliability-specific guardrails
+guardrails:
+  max_messages:
+    limit: 30
+    strict: true
+    warning: true
+  max_turns:
+    limit: 20
+    strict: false
+    warning: true
+
+# FSM: analyze → build → verify → complete (with retry loop)
+fsm:
+  initial: analyze
+
+  context:
+    iteration: 0
+    max_iterations: 3
+    build_attempts: 0
+
+  states:
+    analyze:
+      agents: [analyst]
+      exit:
+        set:
+          iteration: "0"
+        when:
+          - condition: "true"
+            target: build
+        default: build
+
+    build:
+      agents: [builder]
+      entry:
+        gates:
+          builder:
+            - build-ready
+      exit:
+        run: increment-build
+        when:
+          - condition: "true"
+            target: verify
+        default: verify
+
+    verify:
+      agents: [verifier]
+      entry:
+        gates:
+          verifier:
+            - verify-ready
+      exit:
+        when:
+          - condition: "build_attempts >= max_iterations"
+            target: complete
+          - condition: "true"
+            target: complete
+        default: complete
+
+    complete:
+      agents: [analyst]
+
+  scripts:
+    build-ready: |
+      echo "Build gate: checking readiness"
+      exit 0
+
+    verify-ready: |
+      echo "Verify gate: checking build artifacts"
+      exit 0
+
+    increment-build: |
+      echo "Build iteration incremented"
+      exit 0
+
+workspace:
+  path: ".ai/output/{task-id}/"
+
+playbook_notes: |
+  FSM reliability test mesh — exercises:
+
+  1. Gate scripts at state entry (tests gate failure → circuit breaker)
+  2. Iteration counting (tests SLI per-step tracking)
+  3. Multi-agent handoff (tests heartbeat during transitions)
+  4. Build retry loop (tests recovery patterns)
+
+  The analyze→build→verify→complete flow mirrors real dev workflows
+  while being lightweight enough for reliability testing.
diff --git a/meshes/reliability-fsm/verifier/prompt.md b/meshes/reliability-fsm/verifier/prompt.md
new file mode 100644
index 00000000..979d7afd
--- /dev/null
+++ b/meshes/reliability-fsm/verifier/prompt.md
@@ -0,0 +1,14 @@
+# Verifier Agent
+
+You are a verification agent in an FSM reliability test mesh. You validate the builder's output.
+
+## Responsibilities
+
+- Check the builder's implementation against requirements
+- Verify correctness and completeness
+- Approve or reject with specific feedback
+
+## Routing
+
+- When verification passes: route `complete` (FSM transitions to complete)
+- When rework needed: route `blocked` → builder
diff --git a/meshes/reliability-test/checker/prompt.md b/meshes/reliability-test/checker/prompt.md
new file mode 100644
index 00000000..ec582087
--- /dev/null
+++ b/meshes/reliability-test/checker/prompt.md
@@ -0,0 +1,21 @@
+# Checker Agent
+
+You are a checker agent in a reliability test mesh. Your job is to verify the worker's output.
+
+## Responsibilities
+
+1. Review the implementation from the worker
+2. Verify it meets the original task requirements
+3. Check for obvious errors or omissions
+4. Approve or send back for rework
+
+## Verification Checklist
+
+- [ ] Code compiles/runs without errors
+- [ ] Meets the requirements from the plan
+- [ ] No obvious bugs or missing edge cases
+
+## Routing
+
+- When all checks pass: route `complete` → core (task done)
+- When rework needed: route `blocked` → worker
diff --git a/meshes/reliability-test/config.yaml b/meshes/reliability-test/config.yaml
new file mode 100644
index 00000000..584b4233
--- /dev/null
+++ b/meshes/reliability-test/config.yaml
@@ -0,0 +1,86 @@
+# reliability-test/config.yaml
+# Test mesh for validating four-nines reliability features
+#
+# Exercises: circuit breakers, heartbeat monitoring, SLI tracking,
+# dead letter queue, safe mode, and failure recovery.
+#
+# This mesh has an intentionally fragile agent (chaos-agent) that may
+# produce routing errors or slow output to test reliability detection.
+
+mesh: reliability-test
+description: "Test mesh for four-nines reliability features: circuit breakers, heartbeat, SLI, DLQ, safe mode"
+
+agents:
+  - name: planner
+    model: haiku
+    prompt: planner/prompt.md
+
+  - name: worker
+    model: haiku
+    prompt: worker/prompt.md
+
+  - name: checker
+    model: haiku
+    prompt: checker/prompt.md
+
+entry_point: planner
+completion_agent: checker
+
+routing:
+  planner:
+    complete:
+      worker: "Plan ready, execute implementation"
+    blocked:
+      core: "Need clarification from user"
+
+  worker:
+    complete:
+      checker: "Implementation done, verify results"
+    blocked:
+      planner: "Need to revise plan"
+
+  checker:
+    complete:
+      core: "All checks passed, task complete"
+    blocked:
+      worker: "Checks failed, rework needed"
+
+# Reliability-specific guardrails for testing
+guardrails:
+  max_messages:
+    strict: true
+    warning: true
+    limit: 20
+  max_turns:
+    strict: false
+    warning: true
+    limit: 15
+  routing_error:
+    strict: false
+    warning: true
+    max_retries: 2
+
+# Workspace for output
+workspace:
+  path: ".ai/output/{task-id}/"
+
+lifecycle:
+  post:
+    - commit:auto
+
+playbook_notes: |
+  Reliability test mesh for exercising four-nines patterns:
+
+  1. PLANNER: Breaks down the task into steps
+  2. WORKER: Executes the implementation
+  3. CHECKER: Validates the output
+
+  This mesh is configured with tight guardrails to exercise:
+  - Circuit breaker trips on repeated failures
+  - Heartbeat detection on stalled agents
+  - SLI tracking for success/failure rates
+  - DLQ routing for undeliverable messages
+  - Safe mode escalation when SLI drops
+
+  Run with: tx msg "Implement a simple hello world function"
+  Monitor with: tx status (shows reliability metrics)
diff --git a/meshes/reliability-test/planner/prompt.md b/meshes/reliability-test/planner/prompt.md
new file mode 100644
index 00000000..e0c9d914
--- /dev/null
+++ b/meshes/reliability-test/planner/prompt.md
@@ -0,0 +1,19 @@
+# Planner Agent
+
+You are a planning agent in a reliability test mesh. Your job is to break tasks into clear, actionable steps.
+
+## Responsibilities
+
+1. Analyze the incoming task
+2. Break it into 2-3 concrete implementation steps
+3. Forward the plan to the worker agent
+
+## Output Format
+
+Write a clear plan with numbered steps. Each step should be specific and actionable.
+Keep plans simple — this is a reliability test, not a complex project.
+
+## Routing
+
+- When plan is ready: route `complete` → worker
+- When you need human input: route `blocked` → core
diff --git a/meshes/reliability-test/worker/prompt.md b/meshes/reliability-test/worker/prompt.md
new file mode 100644
index 00000000..a1e85127
--- /dev/null
+++ b/meshes/reliability-test/worker/prompt.md
@@ -0,0 +1,20 @@
+# Worker Agent
+
+You are a worker agent in a reliability test mesh. Your job is to execute the plan from the planner.
+
+## Responsibilities
+
+1. Read the plan from the planner
+2. Execute each step (write code, create files, etc.)
+3. Forward completed work to the checker
+
+## Guidelines
+
+- Follow the plan step by step
+- Write clean, working code
+- Report any issues back to the planner
+
+## Routing
+
+- When implementation is done: route `complete` → checker
+- When plan needs revision: route `blocked` → planner
diff --git a/src/reliability/circuit-breaker.ts b/src/reliability/circuit-breaker.ts
new file mode 100644
index 00000000..14a19a5c
--- /dev/null
+++ b/src/reliability/circuit-breaker.ts
@@ -0,0 +1,193 @@
+/**
+ * CircuitBreaker - Prevent cascading failures in mesh execution
+ *
+ * Nine 3 pattern: When an agent or model repeatedly fails, stop
+ * sending it work and fail fast instead of wasting tokens.
+ *
+ * States:
+ * - CLOSED: Normal operation, requests pass through
+ * - OPEN: Failures exceeded threshold, requests fail immediately
+ * - HALF_OPEN: After cooldown, allow one probe request
+ *
+ * Applied per agent (mesh/agent) to isolate failures.
+ */
+
+import { log } from '../shared/logger.ts';
+
+export interface CircuitBreakerConfig {
+  /** Number of failures before opening circuit (default: 3) */
+  failureThreshold: number;
+  /** Time in ms before trying again after opening (default: 60000) */
+  cooldownMs: number;
+  /** Time window for counting failures in ms (default: 300000 = 5 min) */
+  windowMs: number;
+}
+
+export type CircuitBreakerState = 'closed' | 'open' | 'half_open';
+
+interface CircuitState {
+  state: CircuitBreakerState;
+  failures: number;
+  lastFailureAt: number;
+  openedAt: number;
+  successesSinceHalfOpen: number;
+}
+
+const DEFAULT_CONFIG: CircuitBreakerConfig = {
+  failureThreshold: 3,
+  cooldownMs: 60_000,
+  windowMs: 300_000,
+};
+
+export class CircuitBreaker {
+  private circuits: Map<string, CircuitState> = new Map();
+  private config: CircuitBreakerConfig;
+
+  constructor(config?: Partial<CircuitBreakerConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Check if a request should be allowed through
+   */
+  canExecute(agentId: string): boolean {
+    const circuit = this.circuits.get(agentId);
+    if (!circuit) return true;
+
+    switch (circuit.state) {
+      case 'closed':
+        return true;
+
+      case 'open': {
+        const elapsed = Date.now() - circuit.openedAt;
+        if (elapsed >= this.config.cooldownMs) {
+          // Transition to half-open
+          circuit.state = 'half_open';
+          circuit.successesSinceHalfOpen = 0;
+          log.info('circuit-breaker', 'Circuit half-open (probe allowed)', {
+            agentId,
+            cooldownMs: this.config.cooldownMs,
+          });
+          return true;
+        }
+        return false;
+      }
+
+      case 'half_open':
+        // Allow single probe request
+        return true;
+    }
+  }
+
+  /**
+   * Record a successful execution
+   */
+  recordSuccess(agentId: string): void {
+    const circuit = this.circuits.get(agentId);
+    if (!circuit) return;
+
+    if (circuit.state === 'half_open') {
+      // Close the circuit on success in half-open state
+      circuit.state = 'closed';
+      circuit.failures = 0;
+      log.info('circuit-breaker', 'Circuit closed (recovery successful)', { agentId });
+    }
+  }
+
+  /**
+   * Record a failed execution
+   */
+  recordFailure(agentId: string, reason: string): void {
+    const now = Date.now();
+    let circuit = this.circuits.get(agentId);
+
+    if (!circuit) {
+      circuit = {
+        state: 'closed',
+        failures: 0,
+        lastFailureAt: 0,
+        openedAt: 0,
+        successesSinceHalfOpen: 0,
+      };
+      this.circuits.set(agentId, circuit);
+    }
+
+    // Reset failure count if outside window
+    if (now - circuit.lastFailureAt > this.config.windowMs) {
+      circuit.failures = 0;
+    }
+
+    circuit.failures++;
+    circuit.lastFailureAt = now;
+
+    log.warn('circuit-breaker', 'Failure recorded', {
+      agentId,
+      failures: circuit.failures,
+      threshold: this.config.failureThreshold,
+      reason,
+    });
+
+    if (circuit.state === 'half_open') {
+      // Failed during probe - reopen
+      circuit.state = 'open';
+      circuit.openedAt = now;
+      log.error('circuit-breaker', 'Circuit reopened (probe failed)', { agentId, reason });
+      return;
+    }
+
+    if (circuit.failures >= this.config.failureThreshold) {
+      circuit.state = 'open';
+      circuit.openedAt = now;
+      log.error('circuit-breaker', 'Circuit opened (threshold exceeded)', {
+        agentId,
+        failures: circuit.failures,
+        threshold: this.config.failureThreshold,
+        reason,
+      });
+    }
+  }
+
+  /**
+   * Get the current state of a circuit
+   */
+  getState(agentId: string): CircuitBreakerState {
+    return this.circuits.get(agentId)?.state || 'closed';
+  }
+
+  /**
+   * Get all circuit states (for status display)
+   */
+  getAllStates(): Map<string, { state: CircuitBreakerState; failures: number }> {
+    const result = new Map<string, { state: CircuitBreakerState; failures: number }>();
+    for (const [id, circuit] of this.circuits) {
+      result.set(id, { state: circuit.state, failures: circuit.failures });
+    }
+    return result;
+  }
+
+  /**
+   * Reset a specific circuit (manual recovery)
+   */
+  reset(agentId: string): void {
+    this.circuits.delete(agentId);
+    log.info('circuit-breaker', 'Circuit manually reset', { agentId });
+  }
+
+  /**
+   * Reset all circuits (e.g., on mesh restart)
+   */
+  resetAll(): void {
+    this.circuits.clear();
+  }
+
+  /**
+   * Reset circuits for a specific mesh
+   */
+  resetForMesh(meshName: string): void {
+    for (const agentId of this.circuits.keys()) {
+      if (agentId.startsWith(`${meshName}/`)) {
+        this.circuits.delete(agentId);
+      }
+    }
+  }
+}
diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts
new file mode 100644
index 00000000..98b6ec06
--- /dev/null
+++ b/src/reliability/dead-letter-queue.ts
@@ -0,0 +1,188 @@
+/**
+ * DeadLetterQueue - Messages that failed delivery after max retries
+ *
+ * Nine 2 pattern: Instead of silently dropping failed messages,
+ * route them to a DLQ for inspection and manual replay.
+ *
+ * Features:
+ * - Automatic retry with exponential backoff (up to maxRetries)
+ * - DLQ storage in SQLite for persistence across restarts
+ * - Replay capability for manual recovery
+ * - Failure reason tracking for taxonomy
+ */
+
+import type Database from 'better-sqlite3';
+import { log } from '../shared/logger.ts';
+
+export interface DLQEntry {
+  id: number;
+  from_agent: string;
+  to_agent: string;
+  type: string;
+  payload: string;
+  source_file: string | null;
+  failure_reason: string;
+  retry_count: number;
+  max_retries: number;
+  first_failed_at: number;
+  last_failed_at: number;
+  replayed_at: number | null;
+}
+
+export interface DLQStats {
+  total: number;
+  pending: number;  // Not yet replayed
+  replayed: number; // Successfully replayed
+  byReason: Record<string, number>;
+  byAgent: Record<string, number>;
+}
+
+export class DeadLetterQueue {
+  private db: Database.Database;
+  private maxRetries: number;
+
+  constructor(db: Database.Database, maxRetries = 3) {
+    this.db = db;
+    this.maxRetries = maxRetries;
+    this.ensureSchema();
+  }
+
+  private ensureSchema(): void {
+    this.db.exec(`
+      CREATE TABLE IF NOT EXISTS dead_letter_queue (
+        id INTEGER PRIMARY KEY AUTOINCREMENT,
+        from_agent TEXT NOT NULL,
+        to_agent TEXT NOT NULL,
+        type TEXT NOT NULL,
+        payload TEXT NOT NULL,
+        source_file TEXT,
+        failure_reason TEXT NOT NULL,
+        retry_count INTEGER DEFAULT 0,
+        max_retries INTEGER NOT NULL,
+        first_failed_at INTEGER NOT NULL,
+        last_failed_at INTEGER NOT NULL,
+        replayed_at INTEGER
+      );
+      CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(to_agent, replayed_at);
+      CREATE INDEX IF NOT EXISTS idx_dlq_reason ON dead_letter_queue(failure_reason);
+    `);
+  }
+
+  /**
+   * Add a failed message to the DLQ
+   */
+  add(entry: {
+    from_agent: string;
+    to_agent: string;
+    type: string;
+    payload: Record<string, unknown>;
+    source_file?: string;
+    failure_reason: string;
+    retry_count?: number;
+  }): number {
+    const now = Date.now();
+    const result = this.db.prepare(`
+      INSERT INTO dead_letter_queue
+        (from_agent, to_agent, type, payload, source_file, failure_reason,
+         retry_count, max_retries, first_failed_at, last_failed_at)
+      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+    `).run(
+      entry.from_agent,
+      entry.to_agent,
+      entry.type,
+      JSON.stringify(entry.payload),
+      entry.source_file || null,
+      entry.failure_reason,
+      entry.retry_count || 0,
+      this.maxRetries,
+      now,
+      now
+    );
+
+    log.warn('dlq', 'Message added to dead letter queue', {
+      id: result.lastInsertRowid,
+      from: entry.from_agent,
+      to: entry.to_agent,
+      reason: entry.failure_reason,
+      retries: entry.retry_count || 0,
+    });
+
+    return result.lastInsertRowid as number;
+  }
+
+  /**
+   * Get all unreplayed DLQ entries
+   */
+  getPending(): DLQEntry[] {
+    return this.db.prepare(`
+      SELECT * FROM dead_letter_queue
+      WHERE replayed_at IS NULL
+      ORDER BY last_failed_at DESC
+    `).all() as DLQEntry[];
+  }
+
+  /**
+   * Get DLQ entries for a specific agent
+   */
+  getForAgent(agentId: string): DLQEntry[] {
+    return this.db.prepare(`
+      SELECT * FROM dead_letter_queue
+      WHERE to_agent = ? AND replayed_at IS NULL
+      ORDER BY last_failed_at DESC
+    `).all(agentId) as DLQEntry[];
+  }
+
+  /**
+   * Mark a DLQ entry as replayed
+   */
+  markReplayed(id: number): void {
+    this.db.prepare(`
+      UPDATE dead_letter_queue SET replayed_at = ? WHERE id = ?
+    `).run(Date.now(), id);
+
+    log.info('dlq', 'DLQ entry replayed', { id });
+  }
+
+  /**
+   * Get DLQ statistics
+   */
+  getStats(): DLQStats {
+    const total = (this.db.prepare(
+      'SELECT COUNT(*) as c FROM dead_letter_queue'
+    ).get() as { c: number }).c;
+
+    const pending = (this.db.prepare(
+      'SELECT COUNT(*) as c FROM dead_letter_queue WHERE replayed_at IS NULL'
+    ).get() as { c: number }).c;
+
+    const byReasonRows = this.db.prepare(`
+      SELECT failure_reason, COUNT(*) as c FROM dead_letter_queue
+      WHERE replayed_at IS NULL GROUP BY failure_reason
+    `).all() as Array<{ failure_reason: string; c: number }>;
+
+    const byAgentRows = this.db.prepare(`
+      SELECT to_agent, COUNT(*) as c FROM dead_letter_queue
+      WHERE replayed_at IS NULL GROUP BY to_agent
+    `).all() as Array<{ to_agent: string; c: number }>;
+
+    const byReason: Record<string, number> = {};
+    for (const row of byReasonRows) byReason[row.failure_reason] = row.c;
+
+    const byAgent: Record<string, number> = {};
+    for (const row of byAgentRows) byAgent[row.to_agent] = row.c;
+
+    return { total, pending, replayed: total - pending, byReason, byAgent };
+  }
+
+  /**
+   * Clear old replayed entries (garbage collection)
+   */
+  clearReplayed(olderThanMs = 24 * 60 * 60 * 1000): number {
+    const cutoff = Date.now() - olderThanMs;
+    const result = this.db.prepare(`
+      DELETE FROM dead_letter_queue
+      WHERE replayed_at IS NOT NULL AND replayed_at < ?
+    `).run(cutoff);
+    return result.changes;
+  }
+}
diff --git a/src/reliability/heartbeat-monitor.ts b/src/reliability/heartbeat-monitor.ts
new file mode 100644
index 00000000..8bfde400
--- /dev/null
+++ b/src/reliability/heartbeat-monitor.ts
@@ -0,0 +1,221 @@
+/**
+ * HeartbeatMonitor - Detect stalled/hung workers
+ *
+ * Nine 3 pattern: Workers that stop producing output are likely stuck.
+ * Monitor last output timestamps and escalate when stale.
+ *
+ * Stale detection levels:
+ * 1. Warning (60s no output): Log, could be thinking
+ * 2. Stale (120s no output): Inject nudge to worker
+ * 3. Dead (300s no output): Kill worker, route to DLQ
+ */
+
+import { log } from '../shared/logger.ts';
+
+export interface HeartbeatConfig {
+  /** Warn threshold in ms (default: 60000 = 1 min) */
+  warnMs: number;
+  /** Stale threshold in ms (default: 120000 = 2 min) */
+  staleMs: number;
+  /** Dead threshold in ms (default: 300000 = 5 min) */
+  deadMs: number;
+  /** Check interval in ms (default: 15000 = 15s) */
+  checkIntervalMs: number;
+}
+
+export type HealthStatus = 'healthy' | 'warn' | 'stale' | 'dead';
+
+export interface AgentHealth {
+  agentId: string;
+  status: HealthStatus;
+  lastOutputAt: number;
+  silenceMs: number;
+  startedAt: number;
+}
+
+const DEFAULT_CONFIG: HeartbeatConfig = {
+  warnMs: 60_000,
+  staleMs: 120_000,
+  deadMs: 300_000,
+  checkIntervalMs: 15_000,
+};
+
+type HeartbeatCallback = (health: AgentHealth) => void;
+
+export class HeartbeatMonitor {
+  private agents: Map<string, { lastOutputAt: number; startedAt: number }> = new Map();
+  private config: HeartbeatConfig;
+  private checkInterval: NodeJS.Timeout | null = null;
+  private onStale: HeartbeatCallback | null = null;
+  private onDead: HeartbeatCallback | null = null;
+  private onWarn: HeartbeatCallback | null = null;
+  /** Track which agents have already been notified at each level to avoid spam */
+  private notified: Map<string, HealthStatus> = new Map();
+
+  constructor(config?: Partial<HeartbeatConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Register callbacks for health state changes
+   */
+  on(event: 'warn' | 'stale' | 'dead', callback: HeartbeatCallback): void {
+    switch (event) {
+      case 'warn': this.onWarn = callback; break;
+      case 'stale': this.onStale = callback; break;
+      case 'dead': this.onDead = callback; break;
+    }
+  }
+
+  /**
+   * Register an agent for monitoring
+   */
+  register(agentId: string): void {
+    const now = Date.now();
+    this.agents.set(agentId, { lastOutputAt: now, startedAt: now });
+    this.notified.delete(agentId);
+  }
+
+  /**
+   * Record output from an agent (heartbeat)
+   */
+  heartbeat(agentId: string): void {
+    const entry = this.agents.get(agentId);
+    if (entry) {
+      entry.lastOutputAt = Date.now();
+      // Reset notification level on activity
+      this.notified.delete(agentId);
+    }
+  }
+
+  /**
+   * Unregister an agent (worker completed/killed)
+   */
+  unregister(agentId: string): void {
+    this.agents.delete(agentId);
+    this.notified.delete(agentId);
+  }
+
+  /**
+   * Start periodic health checks
+   */
+  start(): void {
+    if (this.checkInterval) return;
+
+    this.checkInterval = setInterval(() => {
+      this.checkAll();
+    }, this.config.checkIntervalMs);
+
+    log.debug('heartbeat', 'Monitor started', {
+      checkIntervalMs: this.config.checkIntervalMs,
+    });
+  }
+
+  /**
+   * Stop periodic health checks
+   */
+  stop(): void {
+    if (this.checkInterval) {
+      clearInterval(this.checkInterval);
+      this.checkInterval = null;
+    }
+  }
+
+  /**
+   * Check health of all registered agents
+   */
+  checkAll(): AgentHealth[] {
+    const now = Date.now();
+    const results: AgentHealth[] = [];
+
+    for (const [agentId, entry] of this.agents) {
+      const silenceMs = now - entry.lastOutputAt;
+      const health = this.classify(agentId, silenceMs, entry);
+      results.push(health);
+
+      // Fire callbacks only on state escalation (don't re-notify at same level)
+      const prevLevel = this.notified.get(agentId);
+      if (health.status !== 'healthy' && health.status !== prevLevel) {
+        this.notified.set(agentId, health.status);
+        this.fireCallback(health);
+      }
+    }
+
+    return results;
+  }
+
+  /**
+   * Get health for a specific agent
+   */
+  getHealth(agentId: string): AgentHealth | null {
+    const entry = this.agents.get(agentId);
+    if (!entry) return null;
+    const silenceMs = Date.now() - entry.lastOutputAt;
+    return this.classify(agentId, silenceMs, entry);
+  }
+
+  private classify(
+    agentId: string,
+    silenceMs: number,
+    entry: { lastOutputAt: number; startedAt: number }
+  ): AgentHealth {
+    let status: HealthStatus = 'healthy';
+    if (silenceMs >= this.config.deadMs) status = 'dead';
+    else if (silenceMs >= this.config.staleMs) status = 'stale';
+    else if (silenceMs >= this.config.warnMs) status = 'warn';
+
+    return {
+      agentId,
+      status,
+      lastOutputAt: entry.lastOutputAt,
+      silenceMs,
+      startedAt: entry.startedAt,
+    };
+  }
+
+  private fireCallback(health: AgentHealth): void {
+    switch (health.status) {
+      case 'warn':
+        log.warn('heartbeat', 'Agent quiet', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onWarn?.(health);
+        break;
+      case 'stale':
+        log.warn('heartbeat', 'Agent stale', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onStale?.(health);
+        break;
+      case 'dead':
+        log.error('heartbeat', 'Agent presumed dead', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onDead?.(health);
+        break;
+    }
+  }
+
+  /**
+   * Clear all monitoring state (mesh reset)
+   */
+  clear(): void {
+    this.agents.clear();
+    this.notified.clear();
+  }
+
+  /**
+   * Clear monitoring for a specific mesh
+   */
+  clearForMesh(meshName: string): void {
+    for (const agentId of this.agents.keys()) {
+      if (agentId.startsWith(`${meshName}/`)) {
+        this.agents.delete(agentId);
+        this.notified.delete(agentId);
+      }
+    }
+  }
+}
diff --git a/src/reliability/index.ts b/src/reliability/index.ts
new file mode 100644
index 00000000..b1989b53
--- /dev/null
+++ b/src/reliability/index.ts
@@ -0,0 +1,19 @@
+/**
+ * Reliability Module - March of Nines
+ *
+ * Implements four-nines (99.99%) reliability patterns for TX mesh execution:
+ *
+ * Nine 1 (90%): Basic error handling, logging ✓ (existing)
+ * Nine 2 (99%): Dead letter queue, message retry, idempotency
+ * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces
+ * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks
+ *
+ * Reference: Karpathy's "March of Nines" - each nine requires new approaches,
+ * not just more of what got you the previous nine.
+ */ + +export { DeadLetterQueue, type DLQEntry, type DLQStats } from './dead-letter-queue.ts'; +export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; +export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; +export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; +export { SafeMode, type SafeModeConfig, type SafeModeState } from './safe-mode.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts new file mode 100644 index 00000000..012177bf --- /dev/null +++ b/src/reliability/reliability-manager.ts @@ -0,0 +1,276 @@ +/** + * ReliabilityManager - Central coordinator for all reliability features + * + * Provides a single integration point for the dispatcher to wire up: + * - Dead letter queue (failed message recovery) + * - Circuit breakers (cascading failure prevention) + * - Heartbeat monitoring (stalled worker detection) + * - SLI tracking (reliability measurement) + * - Safe mode (gradual autonomy control) + * + * Usage in dispatcher.start(): + * this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + * this.reliability.start(); + * + * Wire events: + * // On worker complete + * this.reliability.recordSuccess(meshName, agentId, durationMs); + * // On worker error + * this.reliability.recordFailure(meshName, agentId, 'crash', error.message); + * // On worker output (heartbeat) + * this.reliability.heartbeat(agentId); + */ + +import type Database from 'better-sqlite3'; +import { DeadLetterQueue, type DLQStats } from './dead-letter-queue.ts'; +import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; +import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; +import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; +import { SafeMode, type SafeModeLevel, type SafeModeState } from 
'./safe-mode.ts'; +import { log } from '../shared/logger.ts'; +import fs from 'node:fs'; +import path from 'node:path'; +import YAML from 'yaml'; + +export interface ReliabilityConfig { + circuitBreaker?: { + failureThreshold?: number; + cooldownMs?: number; + windowMs?: number; + }; + heartbeat?: { + warnMs?: number; + staleMs?: number; + deadMs?: number; + checkIntervalMs?: number; + }; + safeMode?: { + defaultLevel?: SafeModeLevel; + autoEscalate?: boolean; + cautiousThreshold?: number; + restrictedThreshold?: number; + lockdownThreshold?: number; + }; + dlq?: { + maxRetries?: number; + }; + sli?: { + retentionMs?: number; + }; +} + +export interface ReliabilityStatus { + sli: SLISnapshot; + dlq: DLQStats; + safeMode: SafeModeState; + circuitBreakers: Array<{ agentId: string; state: CircuitBreakerState; failures: number }>; + agentHealth: AgentHealth[]; +} + +export class ReliabilityManager { + readonly dlq: DeadLetterQueue; + readonly circuitBreaker: CircuitBreaker; + readonly heartbeat: HeartbeatMonitor; + readonly sli: SLITracker; + readonly safeMode: SafeMode; + private workDir: string; + + constructor(db: Database.Database, workDir: string, config?: ReliabilityConfig) { + this.workDir = workDir; + + // Load config from config.yaml if exists + const fileConfig = this.loadConfigFromFile(workDir); + const merged = { ...fileConfig, ...config }; + + this.dlq = new DeadLetterQueue(db, merged.dlq?.maxRetries); + this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker); + this.heartbeat = new HeartbeatMonitor(merged.heartbeat); + this.sli = new SLITracker(merged.sli); + this.safeMode = new SafeMode(merged.safeMode); + + // Wire heartbeat callbacks + this.heartbeat.on('stale', (health) => { + log.warn('reliability', `Agent stale: ${health.agentId}`, { + silenceMs: health.silenceMs, + }); + }); + + this.heartbeat.on('dead', (health) => { + this.recordFailure( + health.agentId.split('/')[0], + health.agentId, + 'stuck', + `No output for 
${Math.round(health.silenceMs / 1000)}s` + ); + }); + + log.info('reliability', 'ReliabilityManager initialized', { + dlqMaxRetries: merged.dlq?.maxRetries || 3, + cbThreshold: merged.circuitBreaker?.failureThreshold || 3, + safeModeDefault: merged.safeMode?.defaultLevel || 'normal', + autoEscalate: merged.safeMode?.autoEscalate || false, + }); + } + + /** + * Load reliability config from .ai/tx/data/config.yaml + */ + private loadConfigFromFile(workDir: string): ReliabilityConfig { + const configPath = path.join(workDir, '.ai', 'tx', 'data', 'config.yaml'); + if (!fs.existsSync(configPath)) return {}; + + try { + const content = YAML.parse(fs.readFileSync(configPath, 'utf-8')); + return content?.reliability || {}; + } catch { + return {}; + } + } + + /** + * Start monitoring (heartbeat timer) + */ + start(): void { + this.heartbeat.start(); + log.info('reliability', 'Monitoring started'); + } + + /** + * Stop monitoring + */ + stop(): void { + this.heartbeat.stop(); + } + + // ============================================================ + // Integration API (called by dispatcher) + // ============================================================ + + /** + * Check if an agent can execute (circuit breaker + safe mode) + * Returns { allowed, reason } — dispatcher should skip spawn if !allowed + */ + canSpawn(meshName: string, agentId: string): { allowed: boolean; reason?: string } { + // Circuit breaker check + if (!this.circuitBreaker.canExecute(agentId)) { + this.sli.recordFailure(meshName, agentId, 'circuit_open', 'Circuit breaker is open'); + return { allowed: false, reason: `Circuit breaker OPEN for ${agentId}` }; + } + + // Safe mode check + const safeLevel = this.safeMode.getLevel(meshName); + if (safeLevel === 'lockdown') { + return { allowed: false, reason: `Safe mode LOCKDOWN for mesh ${meshName}` }; + } + + return { allowed: true }; + } + + /** + * Register agent for heartbeat monitoring (call on spawn) + */ + registerAgent(agentId: string): void { + 
    this.heartbeat.register(agentId);
+  }
+
+  /**
+   * Record heartbeat (call on worker output)
+   */
+  recordHeartbeat(agentId: string): void {
+    this.heartbeat.heartbeat(agentId);
+  }
+
+  /**
+   * Record successful completion
+   */
+  recordSuccess(meshName: string, agentId: string, durationMs?: number): void {
+    this.sli.recordSuccess(meshName, agentId, durationMs);
+    this.circuitBreaker.recordSuccess(agentId);
+    this.heartbeat.unregister(agentId);
+  }
+
+  /**
+   * Record failure
+   */
+  recordFailure(
+    meshName: string,
+    agentId: string,
+    category: FailureCategory,
+    reason?: string
+  ): void {
+    this.sli.recordFailure(meshName, agentId, category, reason);
+    this.circuitBreaker.recordFailure(agentId, reason || category);
+    this.heartbeat.unregister(agentId);
+
+    // Auto-evaluate safe mode after each failure
+    const snapshot = this.sli.getSnapshot(300_000); // 5 min window
+    this.safeMode.evaluateSLI(snapshot.successRate, meshName);
+  }
+
+  /**
+   * Route a failed message to DLQ
+   */
+  deadLetter(msg: {
+    from_agent: string;
+    to_agent: string;
+    type: string;
+    payload: Record<string, unknown>;
+    source_file?: string;
+  }, reason: string, retryCount?: number): void {
+    this.dlq.add({
+      from_agent: msg.from_agent,
+      to_agent: msg.to_agent,
+      type: msg.type,
+      payload: msg.payload,
+      source_file: msg.source_file,
+      failure_reason: reason,
+      retry_count: retryCount,
+    });
+  }
+
+  /**
+   * Clean up for a mesh (call on mesh complete)
+   */
+  cleanupMesh(meshName: string): void {
+    this.circuitBreaker.resetForMesh(meshName);
+    this.heartbeat.clearForMesh(meshName);
+  }
+
+  // ============================================================
+  // Status API (for CLI / monitoring)
+  // ============================================================
+
+  /**
+   * Get comprehensive reliability status
+   */
+  getStatus(windowMs?: number): ReliabilityStatus {
+    const cbStates = this.circuitBreaker.getAllStates();
+    const circuitBreakers: Array<{ agentId: string; state: CircuitBreakerState; failures: number }> =
[]; + for (const [agentId, info] of cbStates) { + circuitBreakers.push({ agentId, ...info }); + } + + return { + sli: this.sli.getSnapshot(windowMs), + dlq: this.dlq.getStats(), + safeMode: this.safeMode.getState(), + circuitBreakers, + agentHealth: this.heartbeat.checkAll(), + }; + } + + /** + * Write status to log file for monitoring + */ + logStatus(): void { + const status = this.getStatus(300_000); // 5 min window + log.info('reliability', 'Status snapshot', { + ninesLevel: status.sli.ninesLevel, + successRate: status.sli.successRate, + totalEvents: status.sli.totalEvents, + dlqPending: status.dlq.pending, + safeModeLevel: status.safeMode.level, + openCircuits: status.circuitBreakers.filter(cb => cb.state === 'open').length, + }); + } +} diff --git a/src/reliability/safe-mode.ts b/src/reliability/safe-mode.ts new file mode 100644 index 00000000..66bed950 --- /dev/null +++ b/src/reliability/safe-mode.ts @@ -0,0 +1,235 @@ +/** + * SafeMode - Gradual autonomy toggle for mesh execution + * + * Nine 4 pattern: Treat autonomy as a knob, not a switch. + * When reliability drops below SLI thresholds, automatically + * restrict agent capabilities to prevent further damage. + * + * Levels: + * - normal: Full autonomy (all tools, all actions) + * - cautious: Disable risky tools (Bash write ops), require confirmation + * - restricted: Read-only mode, no file writes, no bash commands + * - lockdown: Stop all agent execution, alert human + * + * Safe mode can be triggered manually or automatically via SLI thresholds. 
+ */
+
+import { log } from '../shared/logger.ts';
+
+export type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';
+
+export interface SafeModeConfig {
+  /** Default safe mode level (default: 'normal') */
+  defaultLevel: SafeModeLevel;
+  /** SLI threshold for auto-escalation to cautious (default: 0.95) */
+  cautiousThreshold: number;
+  /** SLI threshold for auto-escalation to restricted (default: 0.90) */
+  restrictedThreshold: number;
+  /** SLI threshold for auto-escalation to lockdown (default: 0.80) */
+  lockdownThreshold: number;
+  /** Enable auto-escalation based on SLI (default: false) */
+  autoEscalate: boolean;
+}
+
+export interface SafeModeState {
+  level: SafeModeLevel;
+  reason: string;
+  changedAt: number;
+  autoEscalated: boolean;
+  /** Tools disabled at this level */
+  disabledTools: string[];
+  /** Actions blocked at this level */
+  blockedActions: string[];
+}
+
+const DEFAULT_CONFIG: SafeModeConfig = {
+  defaultLevel: 'normal',
+  cautiousThreshold: 0.95,
+  restrictedThreshold: 0.90,
+  lockdownThreshold: 0.80,
+  autoEscalate: false,
+};
+
+/** Tools restricted at each level */
+const TOOL_RESTRICTIONS: Record<SafeModeLevel, string[]> = {
+  normal: [],
+  cautious: [], // No tool blocks, but Bash writes require confirmation via guardrails
+  restricted: ['Write', 'Edit', 'NotebookEdit', 'Bash'],
+  lockdown: ['Write', 'Edit', 'NotebookEdit', 'Bash', 'Glob', 'Grep', 'Read'],
+};
+
+/** Actions blocked at each level */
+const ACTION_RESTRICTIONS: Record<SafeModeLevel, string[]> = {
+  normal: [],
+  cautious: ['destructive_bash', 'git_push', 'file_delete'],
+  restricted: ['all_writes', 'all_bash', 'git_operations'],
+  lockdown: ['all_operations'],
+};
+
+export class SafeMode {
+  private config: SafeModeConfig;
+  private levels: Map<string, SafeModeLevel> = new Map(); // per-mesh
+  private globalLevel: SafeModeLevel;
+  private changeHistory: Array<{ meshName: string | null; from: SafeModeLevel; to: SafeModeLevel; reason: string; at: number }> = [];
+
+  constructor(config?: Partial<SafeModeConfig>) {
+    this.config = {
...DEFAULT_CONFIG, ...config }; + this.globalLevel = this.config.defaultLevel; + } + + /** + * Get effective safe mode level for a mesh + * Mesh-specific overrides take priority over global + */ + getLevel(meshName?: string): SafeModeLevel { + if (meshName && this.levels.has(meshName)) { + return this.levels.get(meshName)!; + } + return this.globalLevel; + } + + /** + * Get full state for display/API + */ + getState(meshName?: string): SafeModeState { + const level = this.getLevel(meshName); + const lastChange = this.changeHistory.filter( + h => h.meshName === (meshName || null) + ).pop(); + + return { + level, + reason: lastChange?.reason || 'default', + changedAt: lastChange?.at || 0, + autoEscalated: lastChange?.reason.startsWith('auto:') || false, + disabledTools: TOOL_RESTRICTIONS[level], + blockedActions: ACTION_RESTRICTIONS[level], + }; + } + + /** + * Set safe mode level for a specific mesh + */ + setLevel(meshName: string, level: SafeModeLevel, reason: string): void { + const prev = this.levels.get(meshName) || this.globalLevel; + this.levels.set(meshName, level); + + this.changeHistory.push({ + meshName, + from: prev, + to: level, + reason, + at: Date.now(), + }); + + log.info('safe-mode', `Level changed: ${prev} → ${level}`, { + meshName, + reason, + }); + } + + /** + * Set global safe mode level + */ + setGlobalLevel(level: SafeModeLevel, reason: string): void { + const prev = this.globalLevel; + this.globalLevel = level; + + this.changeHistory.push({ + meshName: null, + from: prev, + to: level, + reason, + at: Date.now(), + }); + + log.info('safe-mode', `Global level changed: ${prev} → ${level}`, { reason }); + } + + /** + * Check if a tool is allowed at the current safe mode level + */ + isToolAllowed(toolName: string, meshName?: string): boolean { + const level = this.getLevel(meshName); + return !TOOL_RESTRICTIONS[level].includes(toolName); + } + + /** + * Check if an action is allowed + */ + isActionAllowed(action: string, meshName?: string): 
boolean {
+    const level = this.getLevel(meshName);
+    const blocked = ACTION_RESTRICTIONS[level];
+    return !blocked.includes(action) && !blocked.includes('all_operations');
+  }
+
+  /**
+   * Auto-evaluate safe mode based on current SLI success rate
+   * Only acts if autoEscalate is enabled
+   */
+  evaluateSLI(successRate: number, meshName?: string): SafeModeLevel {
+    if (!this.config.autoEscalate) {
+      return this.getLevel(meshName);
+    }
+
+    let targetLevel: SafeModeLevel = 'normal';
+    let reason = '';
+
+    if (successRate < this.config.lockdownThreshold) {
+      targetLevel = 'lockdown';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.lockdownThreshold * 100}% lockdown threshold`;
+    } else if (successRate < this.config.restrictedThreshold) {
+      targetLevel = 'restricted';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.restrictedThreshold * 100}% restricted threshold`;
+    } else if (successRate < this.config.cautiousThreshold) {
+      targetLevel = 'cautious';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.cautiousThreshold * 100}% cautious threshold`;
+    }
+
+    const currentLevel = this.getLevel(meshName);
+    // Only escalate, never auto-de-escalate (human must clear)
+    if (this.severity(targetLevel) > this.severity(currentLevel)) {
+      if (meshName) {
+        this.setLevel(meshName, targetLevel, reason);
+      } else {
+        this.setGlobalLevel(targetLevel, reason);
+      }
+    }
+
+    return this.getLevel(meshName);
+  }
+
+  private severity(level: SafeModeLevel): number {
+    const map: Record<SafeModeLevel, number> = {
+      normal: 0,
+      cautious: 1,
+      restricted: 2,
+      lockdown: 3,
+    };
+    return map[level];
+  }
+
+  /**
+   * Get change history (for forensics)
+   */
+  getHistory(): typeof this.changeHistory {
+    return [...this.changeHistory];
+  }
+
+  /**
+   * Reset safe mode for a mesh (manual recovery)
+   */
+  resetMesh(meshName: string): void {
+    this.levels.delete(meshName);
+    log.info('safe-mode', 'Mesh safe mode reset', { meshName });
+  }
+
+  /**
+   * Reset
all safe mode state + */ + resetAll(): void { + this.levels.clear(); + this.globalLevel = this.config.defaultLevel; + this.changeHistory = []; + } +} diff --git a/src/reliability/sli-tracker.ts b/src/reliability/sli-tracker.ts new file mode 100644 index 00000000..0a802462 --- /dev/null +++ b/src/reliability/sli-tracker.ts @@ -0,0 +1,239 @@ +/** + * SLITracker - Service Level Indicator tracking for mesh reliability + * + * Nine 4 pattern: You can't improve what you can't measure. + * Track success rates, latencies, and failure categories per mesh. + * + * Tracks: + * - Message delivery success rate (target: 99.99%) + * - Worker completion rate + * - Mean time to recovery (MTTR) + * - Failure taxonomy (categorized failures for targeted fixes) + * - Per-step success rate (for multi-step workflows) + */ + +import { log } from '../shared/logger.ts'; + +export type FailureCategory = + | 'model_error' // API/model failure + | 'routing_error' // Message sent to wrong/missing agent + | 'timeout' // Worker exceeded time limit + | 'guardrail_kill' // Killed by guardrail enforcement + | 'crash' // Unexpected process crash + | 'stuck' // Agent stopped producing output + | 'policy_violation' // Usage policy error + | 'gate_failure' // FSM gate/script failure + | 'circuit_open' // Circuit breaker prevented execution + | 'unknown'; // Uncategorized + +export interface SLIConfig { + /** How long to retain data in ms (default: 7 days) */ + retentionMs: number; + /** Bucketing interval for rate calculations (default: 60000 = 1 min) */ + bucketMs: number; +} + +interface EventRecord { + timestamp: number; + meshName: string; + agentId: string; + success: boolean; + durationMs?: number; + category?: FailureCategory; + reason?: string; +} + +export interface SLISnapshot { + /** Overall success rate (0-1) */ + successRate: number; + /** Total events tracked */ + totalEvents: number; + /** Total successes */ + totalSuccesses: number; + /** Total failures */ + totalFailures: number; + /** 
Mean time to recovery in ms (avg time from failure to next success for same agent) */
+  mttrMs: number | null;
+  /** Failure breakdown by category */
+  failuresByCategory: Record<string, number>;
+  /** Per-mesh success rates */
+  byMesh: Record<string, { success: number; total: number; rate: number }>;
+  /** Per-agent success rates */
+  byAgent: Record<string, { success: number; total: number; rate: number }>;
+  /** Current nines level (e.g., "99.9%") */
+  ninesLevel: string;
+  /** Window start timestamp */
+  windowStart: number;
+  /** Window end timestamp */
+  windowEnd: number;
+}
+
+const DEFAULT_CONFIG: SLIConfig = {
+  retentionMs: 7 * 24 * 60 * 60 * 1000, // 7 days
+  bucketMs: 60_000,
+};
+
+export class SLITracker {
+  private events: EventRecord[] = [];
+  private config: SLIConfig;
+  private lastFailureByAgent: Map<string, number> = new Map();
+  private mttrSamples: number[] = [];
+
+  constructor(config?: Partial<SLIConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Record a successful operation
+   */
+  recordSuccess(meshName: string, agentId: string, durationMs?: number): void {
+    const now = Date.now();
+    this.events.push({
+      timestamp: now,
+      meshName,
+      agentId,
+      success: true,
+      durationMs,
+    });
+
+    // MTTR: if this agent had a recent failure, record recovery time
+    const lastFailure = this.lastFailureByAgent.get(agentId);
+    if (lastFailure) {
+      this.mttrSamples.push(now - lastFailure);
+      this.lastFailureByAgent.delete(agentId);
+    }
+
+    this.gc();
+  }
+
+  /**
+   * Record a failed operation
+   */
+  recordFailure(
+    meshName: string,
+    agentId: string,
+    category: FailureCategory,
+    reason?: string
+  ): void {
+    const now = Date.now();
+    this.events.push({
+      timestamp: now,
+      meshName,
+      agentId,
+      success: false,
+      category,
+      reason,
+    });
+
+    this.lastFailureByAgent.set(agentId, now);
+
+    log.warn('sli', 'Failure recorded', {
+      meshName,
+      agentId,
+      category,
+      reason,
+    });
+
+    this.gc();
+  }
+
+  /**
+   * Get SLI snapshot for a time window
+   */
+  getSnapshot(windowMs?: number): SLISnapshot {
+    const now = Date.now();
+    const windowStart = windowMs ? now - windowMs : 0;
+    const events = this.events.filter(e => e.timestamp >= windowStart);
+
+    const totalEvents = events.length;
+    const totalSuccesses = events.filter(e => e.success).length;
+    const totalFailures = totalEvents - totalSuccesses;
+    const successRate = totalEvents > 0 ? totalSuccesses / totalEvents : 1;
+
+    // Failure breakdown
+    const failuresByCategory: Record<string, number> = {};
+    for (const e of events) {
+      if (!e.success && e.category) {
+        failuresByCategory[e.category] = (failuresByCategory[e.category] || 0) + 1;
+      }
+    }
+
+    // Per-mesh rates
+    const byMesh: Record<string, { success: number; total: number; rate: number }> = {};
+    for (const e of events) {
+      if (!byMesh[e.meshName]) {
+        byMesh[e.meshName] = { success: 0, total: 0, rate: 0 };
+      }
+      byMesh[e.meshName].total++;
+      if (e.success) byMesh[e.meshName].success++;
+    }
+    for (const mesh of Object.values(byMesh)) {
+      mesh.rate = mesh.total > 0 ? mesh.success / mesh.total : 1;
+    }
+
+    // Per-agent rates
+    const byAgent: Record<string, { success: number; total: number; rate: number }> = {};
+    for (const e of events) {
+      if (!byAgent[e.agentId]) {
+        byAgent[e.agentId] = { success: 0, total: 0, rate: 0 };
+      }
+      byAgent[e.agentId].total++;
+      if (e.success) byAgent[e.agentId].success++;
+    }
+    for (const agent of Object.values(byAgent)) {
+      agent.rate = agent.total > 0 ? agent.success / agent.total : 1;
+    }
+
+    // MTTR
+    const mttrMs = this.mttrSamples.length > 0
+      ?
this.mttrSamples.reduce((a, b) => a + b, 0) / this.mttrSamples.length + : null; + + return { + successRate, + totalEvents, + totalSuccesses, + totalFailures, + mttrMs, + failuresByCategory, + byMesh, + byAgent, + ninesLevel: this.calculateNines(successRate), + windowStart, + windowEnd: now, + }; + } + + /** + * Calculate human-readable nines level + */ + private calculateNines(rate: number): string { + if (rate >= 0.9999) return '99.99% (4 nines)'; + if (rate >= 0.999) return '99.9% (3 nines)'; + if (rate >= 0.99) return '99% (2 nines)'; + if (rate >= 0.9) return '90% (1 nine)'; + return `${(rate * 100).toFixed(1)}% (< 1 nine)`; + } + + /** + * Garbage collect old events + */ + private gc(): void { + const cutoff = Date.now() - this.config.retentionMs; + this.events = this.events.filter(e => e.timestamp >= cutoff); + + // Also clean MTTR samples (keep last 100) + if (this.mttrSamples.length > 100) { + this.mttrSamples = this.mttrSamples.slice(-100); + } + } + + /** + * Reset all tracking (e.g., fresh start) + */ + reset(): void { + this.events = []; + this.lastFailureByAgent.clear(); + this.mttrSamples = []; + } +} diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index 5a119a31..f2d6d578 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -46,6 +46,7 @@ import { GuardrailConfig } from './guardrail-config.ts'; import { buildPathContext, validateAgentArtifacts, findWriters, resolveManifestVariables } from './manifest-validator.ts'; import { SystemMessageWriter } from '../core/system-message-writer.ts'; import { NudgeDetector } from './nudge-detector.ts'; +import { ReliabilityManager } from '../reliability/reliability-manager.ts'; import YAML from 'yaml'; /** @@ -351,6 +352,8 @@ export class WorkerDispatcher extends EventEmitter { systemWriter!: SystemMessageWriter; // Auto-nudge recovery for stalled routes private nudgeDetector?: NudgeDetector; + // Reliability: circuit breakers, heartbeat, SLI, DLQ, safe-mode + reliability?: 
ReliabilityManager; constructor(config: DispatcherConfig, queue: MessageQueue) { super(); @@ -1185,6 +1188,10 @@ export class WorkerDispatcher extends EventEmitter { const nudgeConfig = this.guardrails.getNudgeConfig?.() ?? {}; this.nudgeDetector = new NudgeDetector(this.systemWriter, this.queue, nudgeConfig); + // Initialize reliability manager (circuit breakers, heartbeat, SLI, DLQ, safe-mode) + this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + this.reliability.start(); + // Subscribe to consumer events for event-driven dispatch if (consumer) { this.boundMessageHandler = (event: { agentId: string }) => { @@ -3999,6 +4006,19 @@ You are working in an isolated git worktree for feature: **${hookContext.feature const worker = new SdkRunner(runnerConfig, this.queue); workerRef.current = worker; // Populate ref for write-gate kill callback + // Reliability: register agent for heartbeat monitoring + circuit breaker check + if (this.reliability) { + const spawnCheck = this.reliability.canSpawn(meshName!, agentId); + if (!spawnCheck.allowed) { + log.warn('dispatcher', `Spawn blocked by reliability`, { + agentId, reason: spawnCheck.reason, + }); + log.activity('reliability:blocked', agentId, spawnCheck.reason || 'blocked'); + return; + } + this.reliability.registerAgent(agentId); + } + // Parity gate: emit session-start for consumer to clear stale pending asks this.emit('session-start', { agentId }); @@ -4036,6 +4056,8 @@ You are working in an isolated git worktree for feature: **${hookContext.feature result.worker.lastOutputAt = Date.now(); } } + // Reliability: heartbeat on output + this.reliability?.recordHeartbeat(agentId); this.emit('worker:output', data); }); @@ -4745,6 +4767,10 @@ You are working in an isolated git worktree for feature: **${hookContext.feature : undefined, }); + // Reliability: record successful completion + const durationMs = Date.now() - (activeWorker?.startedAt || Date.now()); + 
this.reliability?.recordSuccess(meshName!, agentId, durationMs); + // OAOM: Check queue for next message this.processNextQueuedMessage(agentId); }); @@ -4894,6 +4920,13 @@ You are working in an isolated git worktree for feature: **${hookContext.feature } this.emit('worker:error', { ...data, workerId: errorWorkerId, transitionName: 'error' }); + + // Reliability: record failure with categorization + const category = data.error?.includes('usage policy') ? 'policy_violation' + : data.error?.includes('timeout') ? 'timeout' + : data.error?.includes('overloaded') ? 'model_error' + : 'crash'; + this.reliability?.recordFailure(meshName!, agentId, category as any, data.error); }); // Add worker to active workers with unique workerId for parallel execution @@ -5870,6 +5903,10 @@ ${output} // Cancel any pending nudge timers for this mesh this.nudgeDetector?.cancelForMesh(meshName); + // Reliability: cleanup mesh-level state, log status + this.reliability?.cleanupMesh(meshName); + this.reliability?.logStatus(); + // Find session by meshName (delegates to MetricsAggregator) const result = this.metricsAggregator.findSessionByMeshName(meshName); From 8fd11cc2ce4eee453e66247bfb18d65eae0f0787 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 8 Mar 2026 23:57:38 +0000 Subject: [PATCH 02/12] feat(reliability): Add DLQ replay via SystemMessageWriter and circuit breaker checkpointing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - DLQ: replayOne(), replayAll(), replayForAgent() — re-injects failed messages back into the live system through SystemMessageWriter with [DLQ REPLAY] prefix and original failure context - Circuit Breaker: SQLite checkpointing — persists open/half_open circuit states to circuit_breaker_checkpoints table, restores on restart so agents that were failing before a crash stay circuit-broken - HeartbeatMonitor: Fix NodeJS.Timeout type to ReturnType - ReliabilityManager: Expose replayDLQ(), replayDLQEntry(), 
replayDLQForAgent() and pass DB to CircuitBreaker constructor for persistence

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 src/reliability/circuit-breaker.ts     | 100 ++++++++++++++++++++++++-
 src/reliability/dead-letter-queue.ts   |  88 +++++++++++++++++++++-
 src/reliability/heartbeat-monitor.ts   |   2 +-
 src/reliability/index.ts               |   3 +-
 src/reliability/reliability-manager.ts |  31 +++++++-
 5 files changed, 218 insertions(+), 6 deletions(-)

diff --git a/src/reliability/circuit-breaker.ts b/src/reliability/circuit-breaker.ts
index 14a19a5c..6b27f370 100644
--- a/src/reliability/circuit-breaker.ts
+++ b/src/reliability/circuit-breaker.ts
@@ -10,8 +10,10 @@
  *   - HALF_OPEN: After cooldown, allow one probe request
  *
  * Applied per agent (mesh/agent) to isolate failures.
+ * Checkpoints to SQLite for persistence across restarts.
  */

+import type Database from 'better-sqlite3';
 import { log } from '../shared/logger.ts';

 export interface CircuitBreakerConfig {
@@ -42,9 +44,99 @@ const DEFAULT_CONFIG: CircuitBreakerConfig = {
 export class CircuitBreaker {
   private circuits: Map<string, { state: CircuitBreakerState; failures: number; lastFailureAt: number; openedAt: number; successesSinceHalfOpen: number }> = new Map();
   private config: CircuitBreakerConfig;
+  private db: Database.Database | null = null;

-  constructor(config?: Partial<CircuitBreakerConfig>) {
+  constructor(config?: Partial<CircuitBreakerConfig>, db?: Database.Database) {
     this.config = { ...DEFAULT_CONFIG, ...config };
+
+    if (db) {
+      this.db = db;
+      this.ensureSchema();
+      this.restoreCheckpoint();
+    }
+  }
+
+  /**
+   * Create checkpoint table if it doesn't exist
+   */
+  private ensureSchema(): void {
+    this.db?.exec(`
+      CREATE TABLE IF NOT EXISTS circuit_breaker_checkpoints (
+        agent_id TEXT PRIMARY KEY,
+        state TEXT NOT NULL,
+        failures INTEGER NOT NULL,
+        last_failure_at INTEGER NOT NULL,
+        opened_at INTEGER NOT NULL,
+        updated_at INTEGER NOT NULL
+      );
+    `);
+  }
+
+  /**
+   * Restore circuit states from SQLite checkpoint on startup
+   */
+  private restoreCheckpoint(): void {
+    if (!this.db) return;
+
+    const rows = this.db.prepare(
+      'SELECT * FROM circuit_breaker_checkpoints'
+
).all() as Array<{ + agent_id: string; + state: CircuitBreakerState; + failures: number; + last_failure_at: number; + opened_at: number; + }>; + + for (const row of rows) { + // Only restore non-closed circuits (closed is default) + if (row.state !== 'closed') { + this.circuits.set(row.agent_id, { + state: row.state, + failures: row.failures, + lastFailureAt: row.last_failure_at, + openedAt: row.opened_at, + successesSinceHalfOpen: 0, + }); + } + } + + if (rows.length > 0) { + const nonClosed = rows.filter(r => r.state !== 'closed').length; + log.info('circuit-breaker', 'Restored checkpoints', { + total: rows.length, + nonClosed, + }); + } + } + + /** + * Persist current circuit state to SQLite + */ + private checkpoint(agentId: string): void { + if (!this.db) return; + + const circuit = this.circuits.get(agentId); + if (!circuit || circuit.state === 'closed') { + // Remove checkpoint for closed circuits (default state) + this.db.prepare( + 'DELETE FROM circuit_breaker_checkpoints WHERE agent_id = ?' + ).run(agentId); + return; + } + + this.db.prepare(` + INSERT OR REPLACE INTO circuit_breaker_checkpoints + (agent_id, state, failures, last_failure_at, opened_at, updated_at) + VALUES (?, ?, ?, ?, ?, ?) 
+ `).run( + agentId, + circuit.state, + circuit.failures, + circuit.lastFailureAt, + circuit.openedAt, + Date.now() + ); } /** @@ -91,6 +183,7 @@ export class CircuitBreaker { circuit.state = 'closed'; circuit.failures = 0; log.info('circuit-breaker', 'Circuit closed (recovery successful)', { agentId }); + this.checkpoint(agentId); } } @@ -132,6 +225,7 @@ export class CircuitBreaker { circuit.state = 'open'; circuit.openedAt = now; log.error('circuit-breaker', 'Circuit reopened (probe failed)', { agentId, reason }); + this.checkpoint(agentId); return; } @@ -144,6 +238,7 @@ export class CircuitBreaker { threshold: this.config.failureThreshold, reason, }); + this.checkpoint(agentId); } } @@ -170,6 +265,7 @@ export class CircuitBreaker { */ reset(agentId: string): void { this.circuits.delete(agentId); + this.checkpoint(agentId); log.info('circuit-breaker', 'Circuit manually reset', { agentId }); } @@ -178,6 +274,7 @@ export class CircuitBreaker { */ resetAll(): void { this.circuits.clear(); + this.db?.prepare('DELETE FROM circuit_breaker_checkpoints').run(); } /** @@ -187,6 +284,7 @@ export class CircuitBreaker { for (const agentId of this.circuits.keys()) { if (agentId.startsWith(`${meshName}/`)) { this.circuits.delete(agentId); + this.checkpoint(agentId); } } } diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts index 98b6ec06..494a3003 100644 --- a/src/reliability/dead-letter-queue.ts +++ b/src/reliability/dead-letter-queue.ts @@ -7,12 +7,13 @@ * Features: * - Automatic retry with exponential backoff (up to maxRetries) * - DLQ storage in SQLite for persistence across restarts - * - Replay capability for manual recovery + * - Replay via SystemMessageWriter (re-injects into live system) * - Failure reason tracking for taxonomy */ import type Database from 'better-sqlite3'; import { log } from '../shared/logger.ts'; +import type { SystemMessageWriter } from '../core/system-message-writer.ts'; export interface DLQEntry { id: number; 
@@ -37,6 +38,12 @@ export interface DLQStats {
   byAgent: Record<string, number>;
 }

+export interface ReplayResult {
+  id: number;
+  success: boolean;
+  error?: string;
+}
+
 export class DeadLetterQueue {
   private db: Database.Database;
   private maxRetries: number;
@@ -143,6 +150,85 @@ export class DeadLetterQueue {
     log.info('dlq', 'DLQ entry replayed', { id });
   }

+  /**
+   * Replay a single DLQ entry via SystemMessageWriter.
+   * Re-injects the message into the live system with a [DLQ REPLAY] prefix.
+   */
+  replayOne(id: number, writer: SystemMessageWriter): ReplayResult {
+    const entry = this.db.prepare(
+      'SELECT * FROM dead_letter_queue WHERE id = ? AND replayed_at IS NULL'
+    ).get(id) as DLQEntry | undefined;
+
+    if (!entry) {
+      return { id, success: false, error: 'Entry not found or already replayed' };
+    }
+
+    try {
+      const payload = JSON.parse(entry.payload) as Record<string, unknown>;
+      const headline = (payload.headline as string) || 'DLQ Replay';
+      const body = (payload.body as string) || JSON.stringify(payload, null, 2);
+
+      writer.write({
+        to: entry.to_agent,
+        from: entry.from_agent,
+        type: entry.type,
+        headline: `[DLQ REPLAY] ${headline}`,
+        body: `> Replayed from dead letter queue (DLQ #${entry.id})\n> Original failure: ${entry.failure_reason}\n> Failed at: ${new Date(entry.first_failed_at).toISOString()}\n> Retries: ${entry.retry_count}/${entry.max_retries}\n\n${body}`,
+        msgId: `dlq-replay-${entry.id}-${Date.now()}`,
+      });
+
+      this.markReplayed(id);
+
+      log.info('dlq', 'Message replayed via SystemMessageWriter', {
+        id,
+        to: entry.to_agent,
+        from: entry.from_agent,
+        originalReason: entry.failure_reason,
+      });
+
+      return { id, success: true };
+    } catch (err) {
+      const error = (err as Error).message;
+      log.error('dlq', 'Replay failed', { id, error });
+      return { id, success: false, error };
+    }
+  }
+
+  /**
+   * Replay all pending DLQ entries via SystemMessageWriter.
+   * Returns results for each entry.
+ */ + replayAll(writer: SystemMessageWriter): ReplayResult[] { + const pending = this.getPending(); + const results: ReplayResult[] = []; + + for (const entry of pending) { + results.push(this.replayOne(entry.id, writer)); + } + + log.info('dlq', 'Bulk replay complete', { + total: pending.length, + succeeded: results.filter(r => r.success).length, + failed: results.filter(r => !r.success).length, + }); + + return results; + } + + /** + * Replay all pending DLQ entries for a specific agent. + */ + replayForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { + const entries = this.getForAgent(agentId); + const results: ReplayResult[] = []; + + for (const entry of entries) { + results.push(this.replayOne(entry.id, writer)); + } + + return results; + } + /** * Get DLQ statistics */ diff --git a/src/reliability/heartbeat-monitor.ts b/src/reliability/heartbeat-monitor.ts index 8bfde400..19e4d7ad 100644 --- a/src/reliability/heartbeat-monitor.ts +++ b/src/reliability/heartbeat-monitor.ts @@ -45,7 +45,7 @@ type HeartbeatCallback = (health: AgentHealth) => void; export class HeartbeatMonitor { private agents: Map<string, AgentHealth> = new Map(); private config: HeartbeatConfig; - private checkInterval: NodeJS.Timeout | null = null; + private checkInterval: ReturnType<typeof setInterval> | null = null; private onStale: HeartbeatCallback | null = null; private onDead: HeartbeatCallback | null = null; private onWarn: HeartbeatCallback | null = null; diff --git a/src/reliability/index.ts b/src/reliability/index.ts index b1989b53..796b0c14 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -12,7 +12,8 @@ * not just more of what got you the previous nine.
*/ -export { DeadLetterQueue, type DLQEntry, type DLQStats } from './dead-letter-queue.ts'; +export { DeadLetterQueue, type DLQEntry, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 012177bf..069103de 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -22,11 +22,12 @@ */ import type Database from 'better-sqlite3'; -import { DeadLetterQueue, type DLQStats } from './dead-letter-queue.ts'; +import { DeadLetterQueue, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; +import type { SystemMessageWriter } from '../core/system-message-writer.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -83,7 +84,7 @@ export class ReliabilityManager { const merged = { ...fileConfig, ...config }; this.dlq = new DeadLetterQueue(db, merged.dlq?.maxRetries); - this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker); + this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker, db); this.heartbeat = new HeartbeatMonitor(merged.heartbeat); this.sli = new SLITracker(merged.sli); this.safeMode = new 
SafeMode(merged.safeMode); @@ -236,6 +237,32 @@ export class ReliabilityManager { this.heartbeat.clearForMesh(meshName); } + // ============================================================ + // DLQ Replay API + // ============================================================ + + /** + * Replay all pending DLQ entries via SystemMessageWriter. + * Re-injects failed messages back into the live system. + */ + replayDLQ(writer: SystemMessageWriter): ReplayResult[] { + return this.dlq.replayAll(writer); + } + + /** + * Replay a single DLQ entry by ID. + */ + replayDLQEntry(id: number, writer: SystemMessageWriter): ReplayResult { + return this.dlq.replayOne(id, writer); + } + + /** + * Replay all DLQ entries for a specific agent. + */ + replayDLQForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { + return this.dlq.replayForAgent(agentId, writer); + } + // ============================================================ // Status API (for CLI / monitoring) // ============================================================ From ce10e2c08634a83c282f23ebbb3e8036fa1f5fa8 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 17:28:56 +0000 Subject: [PATCH 03/12] feat(reliability): Session-aware DLQ recovery instead of raw message replay Replace naive message replay with session-aware recovery that preserves conversation history. DLQ now captures sessionId at failure time and uses RecoveryMode (session_resume/requeue/manual) to determine the right recovery strategy. When a worker crashes mid-work with an active session, recovery resumes the SDK session instead of replaying a raw message. 
- Rewrite DLQ schema with session_id, recovery_mode, failure_category - Update ReliabilityManager with session-aware deadLetter() and recover*() APIs - Wire dispatcher error handler to capture sessionId and route exhausted retries to DLQ with full session context - Export RecoveryMode, RecoveryResult, FailureContext types from index https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- src/reliability/dead-letter-queue.ts | 274 +++++++++++++++---------- src/reliability/index.ts | 10 +- src/reliability/reliability-manager.ts | 185 ++++++++++++++--- src/worker/dispatcher.ts | 28 ++- 4 files changed, 345 insertions(+), 152 deletions(-) diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts index 494a3003..94a4c2df 100644 --- a/src/reliability/dead-letter-queue.ts +++ b/src/reliability/dead-letter-queue.ts @@ -1,46 +1,71 @@ /** - * DeadLetterQueue - Messages that failed delivery after max retries + * DeadLetterQueue - Session-aware failure recovery * - * Nine 2 pattern: Instead of silently dropping failed messages, - * route them to a DLQ for inspection and manual replay. + * Nine 2 pattern: Instead of silently dropping failed work, + * capture the session context at failure time and enable recovery + * via session resume (not raw message replay). * - * Features: - * - Automatic retry with exponential backoff (up to maxRetries) - * - DLQ storage in SQLite for persistence across restarts - * - Replay via SystemMessageWriter (re-injects into live system) - * - Failure reason tracking for taxonomy + * Two recovery modes: + * 1. Session resume: Agent crashed mid-work → resume with sessionId + * (preserves full conversation history + tool state) + * 2. Message re-queue: Message undeliverable → re-queue to dispatcher + * (for circuit-open or routing failures where no session exists) + * + * The key insight: replaying a raw message loses all conversation context. + * Session resume picks up exactly where the agent left off. 
*/ import type Database from 'better-sqlite3'; import { log } from '../shared/logger.ts'; -import type { SystemMessageWriter } from '../core/system-message-writer.ts'; + +/** + * Recovery mode determines how to restore failed work + */ +export type RecoveryMode = + | 'session_resume' // Crashed mid-work: resume via sessionId + | 'requeue' // Undeliverable: re-insert into message queue + | 'manual'; // Needs human intervention export interface DLQEntry { id: number; + agent_id: string; // The agent that failed (mesh/agent) + mesh_name: string; + recovery_mode: RecoveryMode; + session_id: string | null; // For session_resume: SDK session to resume + /** Original message context (for requeue mode) */ from_agent: string; to_agent: string; type: string; - payload: string; + payload: string; // JSON-serialized original payload source_file: string | null; + /** Failure context */ failure_reason: string; + failure_category: string; // SLI failure category retry_count: number; max_retries: number; + /** Worker state at failure time */ + messages_sent: number; // How many messages worker sent before failing + output_snapshot: string | null; // Last output (truncated) for diagnostics + /** Timestamps */ first_failed_at: number; last_failed_at: number; - replayed_at: number | null; + recovered_at: number | null; } export interface DLQStats { total: number; - pending: number; // Not yet replayed - replayed: number; // Successfully replayed + pending: number; // Not yet recovered + recovered: number; // Successfully recovered byReason: Record<string, number>; byAgent: Record<string, number>; + byMode: Record<RecoveryMode, number>; } -export interface ReplayResult { +export interface RecoveryResult { id: number; success: boolean; + mode: RecoveryMode; + sessionId?: string; error?: string; } @@ -58,72 +83,137 @@ export class DeadLetterQueue { this.db.exec(` CREATE TABLE IF NOT EXISTS dead_letter_queue ( id INTEGER PRIMARY KEY AUTOINCREMENT, + agent_id TEXT NOT NULL, + mesh_name TEXT NOT NULL, + recovery_mode TEXT NOT NULL DEFAULT
'requeue', + session_id TEXT, from_agent TEXT NOT NULL, to_agent TEXT NOT NULL, type TEXT NOT NULL, payload TEXT NOT NULL, source_file TEXT, failure_reason TEXT NOT NULL, + failure_category TEXT NOT NULL DEFAULT 'unknown', retry_count INTEGER DEFAULT 0, max_retries INTEGER NOT NULL, + messages_sent INTEGER DEFAULT 0, + output_snapshot TEXT, first_failed_at INTEGER NOT NULL, last_failed_at INTEGER NOT NULL, - replayed_at INTEGER + recovered_at INTEGER ); - CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(to_agent, replayed_at); - CREATE INDEX IF NOT EXISTS idx_dlq_reason ON dead_letter_queue(failure_reason); + CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(agent_id, recovered_at); + CREATE INDEX IF NOT EXISTS idx_dlq_mesh ON dead_letter_queue(mesh_name, recovered_at); + CREATE INDEX IF NOT EXISTS idx_dlq_mode ON dead_letter_queue(recovery_mode, recovered_at); `); } /** - * Add a failed message to the DLQ + * Add a failed operation to the DLQ with full session context. 
+ * + * The recovery_mode is determined by what state existed at failure: + * - session_resume: Agent had an active sessionId → can resume + * - requeue: No session (e.g., failed before starting, or routing error) + * - manual: Repeated failures, needs human decision */ add(entry: { + agent_id: string; + mesh_name: string; + session_id?: string; from_agent: string; to_agent: string; type: string; payload: Record<string, unknown>; source_file?: string; failure_reason: string; + failure_category: string; retry_count?: number; + messages_sent?: number; + output_snapshot?: string; }): number { const now = Date.now(); + const retryCount = entry.retry_count || 0; + + // Determine recovery mode from context + let mode: RecoveryMode; + if (retryCount >= this.maxRetries) { + mode = 'manual'; // Exhausted retries + } else if (entry.session_id) { + mode = 'session_resume'; // Has session → can resume + } else { + mode = 'requeue'; // No session → re-inject message + } + const result = this.db.prepare(` INSERT INTO dead_letter_queue - (from_agent, to_agent, type, payload, source_file, failure_reason, - retry_count, max_retries, first_failed_at, last_failed_at) - VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + (agent_id, mesh_name, recovery_mode, session_id, + from_agent, to_agent, type, payload, source_file, + failure_reason, failure_category, retry_count, max_retries, + messages_sent, output_snapshot, + first_failed_at, last_failed_at) + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`).run( + entry.agent_id, + entry.mesh_name, + mode, + entry.session_id || null, entry.from_agent, entry.to_agent, entry.type, JSON.stringify(entry.payload), entry.source_file || null, entry.failure_reason, - entry.retry_count || 0, + entry.failure_category, + retryCount, this.maxRetries, + entry.messages_sent || 0, + entry.output_snapshot?.slice(0, 2000) || null, // Truncate snapshot now, now ); - log.warn('dlq', 'Message added to dead letter queue', { + log.warn('dlq', 'Added to dead letter queue', { id: result.lastInsertRowid, - from: entry.from_agent, - to: entry.to_agent, + agent: entry.agent_id, + mode, + sessionId: entry.session_id?.slice(0, 8), reason: entry.failure_reason, - retries: entry.retry_count || 0, + category: entry.failure_category, + retries: retryCount, }); return result.lastInsertRowid as number; } /** - * Get all unreplayed DLQ entries + * Get all unrecovered DLQ entries */ getPending(): DLQEntry[] { return this.db.prepare(` SELECT * FROM dead_letter_queue - WHERE replayed_at IS NULL + WHERE recovered_at IS NULL + ORDER BY last_failed_at DESC + `).all() as DLQEntry[]; + } + + /** + * Get DLQ entries that can be auto-recovered (session_resume or requeue) + */ + getRecoverable(): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE recovered_at IS NULL AND recovery_mode != 'manual' + ORDER BY last_failed_at ASC + `).all() as DLQEntry[]; + } + + /** + * Get entries requiring manual intervention + */ + getManual(): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE recovered_at IS NULL AND recovery_mode = 'manual' ORDER BY last_failed_at DESC `).all() as DLQEntry[]; } @@ -134,99 +224,55 @@ export class DeadLetterQueue { getForAgent(agentId: string): DLQEntry[] { return this.db.prepare(` SELECT * FROM dead_letter_queue - WHERE to_agent = ? AND replayed_at IS NULL + WHERE agent_id = ? 
AND recovered_at IS NULL ORDER BY last_failed_at DESC `).all(agentId) as DLQEntry[]; } /** - * Mark a DLQ entry as replayed + * Get DLQ entries for a specific mesh */ - markReplayed(id: number): void { - this.db.prepare(` - UPDATE dead_letter_queue SET replayed_at = ? WHERE id = ? - `).run(Date.now(), id); - - log.info('dlq', 'DLQ entry replayed', { id }); + getForMesh(meshName: string): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE mesh_name = ? AND recovered_at IS NULL + ORDER BY last_failed_at DESC + `).all(meshName) as DLQEntry[]; } /** - * Replay a single DLQ entry via SystemMessageWriter. - * Re-injects the message into the live system with a [DLQ REPLAY] prefix. + * Get a single entry by ID */ - replayOne(id: number, writer: SystemMessageWriter): ReplayResult { - const entry = this.db.prepare( - 'SELECT * FROM dead_letter_queue WHERE id = ? AND replayed_at IS NULL' + getById(id: number): DLQEntry | undefined { + return this.db.prepare( + 'SELECT * FROM dead_letter_queue WHERE id = ?' 
).get(id) as DLQEntry | undefined; - - if (!entry) { - return { id, success: false, error: 'Entry not found or already replayed' }; - } - - try { - const payload = JSON.parse(entry.payload) as Record; - const headline = (payload.headline as string) || 'DLQ Replay'; - const body = (payload.body as string) || JSON.stringify(payload, null, 2); - - writer.write({ - to: entry.to_agent, - from: entry.from_agent, - type: entry.type, - headline: `[DLQ REPLAY] ${headline}`, - body: `> Replayed from dead letter queue (DLQ #${entry.id})\n> Original failure: ${entry.failure_reason}\n> Failed at: ${new Date(entry.first_failed_at).toISOString()}\n> Retries: ${entry.retry_count}/${entry.max_retries}\n\n${body}`, - msgId: `dlq-replay-${entry.id}-${Date.now()}`, - }); - - this.markReplayed(id); - - log.info('dlq', 'Message replayed via SystemMessageWriter', { - id, - to: entry.to_agent, - from: entry.from_agent, - originalReason: entry.failure_reason, - }); - - return { id, success: true }; - } catch (err) { - const error = (err as Error).message; - log.error('dlq', 'Replay failed', { id, error }); - return { id, success: false, error }; - } } /** - * Replay all pending DLQ entries via SystemMessageWriter. - * Returns results for each entry. + * Mark a DLQ entry as recovered */ - replayAll(writer: SystemMessageWriter): ReplayResult[] { - const pending = this.getPending(); - const results: ReplayResult[] = []; - - for (const entry of pending) { - results.push(this.replayOne(entry.id, writer)); - } - - log.info('dlq', 'Bulk replay complete', { - total: pending.length, - succeeded: results.filter(r => r.success).length, - failed: results.filter(r => !r.success).length, - }); + markRecovered(id: number): void { + this.db.prepare(` + UPDATE dead_letter_queue SET recovered_at = ? WHERE id = ? + `).run(Date.now(), id); - return results; + log.info('dlq', 'Entry recovered', { id }); } /** - * Replay all pending DLQ entries for a specific agent. 
+ * Escalate a requeue entry to manual (e.g., after failed recovery attempt) */ - replayForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { - const entries = this.getForAgent(agentId); - const results: ReplayResult[] = []; - - for (const entry of entries) { - results.push(this.replayOne(entry.id, writer)); - } + escalateToManual(id: number, reason: string): void { + const now = Date.now(); + this.db.prepare(` + UPDATE dead_letter_queue + SET recovery_mode = 'manual', failure_reason = ?, last_failed_at = ?, + retry_count = retry_count + 1 + WHERE id = ? + `).run(`${reason} (escalated from auto-recovery)`, now, id); - return results; + log.warn('dlq', 'Entry escalated to manual recovery', { id, reason }); } /** @@ -238,36 +284,44 @@ export class DeadLetterQueue { ).get() as { c: number }).c; const pending = (this.db.prepare( - 'SELECT COUNT(*) as c FROM dead_letter_queue WHERE replayed_at IS NULL' + 'SELECT COUNT(*) as c FROM dead_letter_queue WHERE recovered_at IS NULL' ).get() as { c: number }).c; const byReasonRows = this.db.prepare(` SELECT failure_reason, COUNT(*) as c FROM dead_letter_queue - WHERE replayed_at IS NULL GROUP BY failure_reason + WHERE recovered_at IS NULL GROUP BY failure_reason `).all() as Array<{ failure_reason: string; c: number }>; const byAgentRows = this.db.prepare(` - SELECT to_agent, COUNT(*) as c FROM dead_letter_queue - WHERE replayed_at IS NULL GROUP BY to_agent - `).all() as Array<{ to_agent: string; c: number }>; + SELECT agent_id, COUNT(*) as c FROM dead_letter_queue + WHERE recovered_at IS NULL GROUP BY agent_id + `).all() as Array<{ agent_id: string; c: number }>; + + const byModeRows = this.db.prepare(` + SELECT recovery_mode, COUNT(*) as c FROM dead_letter_queue + WHERE recovered_at IS NULL GROUP BY recovery_mode + `).all() as Array<{ recovery_mode: RecoveryMode; c: number }>; const byReason: Record<string, number> = {}; for (const row of byReasonRows) byReason[row.failure_reason] = row.c; const byAgent: Record<string, number> = {}; - for (const row of byAgentRows) byAgent[row.to_agent] = row.c; + for (const row of byAgentRows) byAgent[row.agent_id] = row.c; + + const byMode: Record<RecoveryMode, number> = { session_resume: 0, requeue: 0, manual: 0 }; + for (const row of byModeRows) byMode[row.recovery_mode] = row.c; - return { total, pending, replayed: total - pending, byReason, byAgent }; + return { total, pending, recovered: total - pending, byReason, byAgent, byMode }; } /** - * Clear old replayed entries (garbage collection) + * Clear old recovered entries (garbage collection) */ - clearReplayed(olderThanMs = 24 * 60 * 60 * 1000): number { + clearRecovered(olderThanMs = 24 * 60 * 60 * 1000): number { const cutoff = Date.now() - olderThanMs; const result = this.db.prepare(` DELETE FROM dead_letter_queue - WHERE replayed_at IS NOT NULL AND replayed_at < ? + WHERE recovered_at IS NOT NULL AND recovered_at < ? `).run(cutoff); return result.changes; } diff --git a/src/reliability/index.ts b/src/reliability/index.ts index 796b0c14..718ecc3d 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -4,16 +4,16 @@ * Implements four-nines (99.99%) reliability patterns for TX mesh execution: * * Nine 1 (90%): Basic error handling, logging ✓ (existing) - * Nine 2 (99%): Dead letter queue, message retry, idempotency + * Nine 2 (99%): Dead letter queue, session-aware recovery, idempotency * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks * - * Reference: Karpathy's "March of Nines" - each nine requires new approaches, - * not just more of what got you the previous nine. + * Key insight: Recovery via session resume (not raw message replay) preserves + * full conversation history and tool state.
Reference: Karpathy's "March of Nines" */ -export { DeadLetterQueue, type DLQEntry, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; -export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus } from './reliability-manager.ts'; +export { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type SessionResumeHandler, type RequeueHandler } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 069103de..73e50b31 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -2,7 +2,7 @@ * ReliabilityManager - Central coordinator for all reliability features * * Provides a single integration point for the dispatcher to wire up: - * - Dead letter queue (failed message recovery) + * - Dead letter queue (session-aware failure recovery) * - Circuit breakers (cascading failure prevention) * - Heartbeat monitoring (stalled worker detection) * - SLI tracking (reliability measurement) @@ -15,19 +15,18 @@ * Wire events: * // On worker complete * this.reliability.recordSuccess(meshName, agentId, durationMs); - * // On worker error - * this.reliability.recordFailure(meshName, agentId, 'crash', error.message); + * // On worker error (with session context for DLQ) + * this.reliability.recordFailure(meshName, agentId, 'crash', error.message, { sessionId, messagesSent }); * // On worker output (heartbeat) * this.reliability.heartbeat(agentId); */ import type Database from 
'better-sqlite3'; -import { DeadLetterQueue, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; +import { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; -import type { SystemMessageWriter } from '../core/system-message-writer.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -68,6 +67,28 @@ export interface ReliabilityStatus { agentHealth: AgentHealth[]; } +/** Context captured at failure time for session-aware DLQ */ +export interface FailureContext { + sessionId?: string | null; + messagesSent?: number; + outputSnapshot?: string; + sourceFile?: string; + fromAgent?: string; + toAgent?: string; + msgType?: string; + payload?: Record<string, unknown>; +} + +/** Callback for session resume recovery */ +export type SessionResumeHandler = ( + agentId: string, + sessionId: string, + meshName: string +) => Promise<{ success: boolean; error?: string }>; + +/** Callback for message requeue recovery */ +export type RequeueHandler = (entry: DLQEntry) => { success: boolean; error?: string }; + export class ReliabilityManager { readonly dlq: DeadLetterQueue; readonly circuitBreaker: CircuitBreaker; @@ -191,13 +212,18 @@ export class ReliabilityManager { } /** - * Record failure + * Record failure with optional session context for DLQ routing. + * + * When failureCtx includes a sessionId, the DLQ entry is marked for + * session_resume recovery (picks up exactly where the agent left off). + * Without a sessionId, it falls back to message requeue.
*/ recordFailure( meshName: string, agentId: string, category: FailureCategory, - reason?: string + reason?: string, + failureCtx?: FailureContext ): void { this.sli.recordFailure(meshName, agentId, category, reason); this.circuitBreaker.recordFailure(agentId, reason || category); @@ -209,23 +235,33 @@ export class ReliabilityManager { } /** - * Route a failed message to DLQ + * Route a failed operation to the DLQ with full session context. + * + * The DLQ auto-determines recovery mode: + * - session_resume: sessionId present → can resume conversation + * - requeue: no session → re-inject message into queue + * - manual: retries exhausted → needs human intervention */ - deadLetter(msg: { - from_agent: string; - to_agent: string; - type: string; - payload: Record; - source_file?: string; - }, reason: string, retryCount?: number): void { + deadLetter( + meshName: string, + agentId: string, + category: FailureCategory, + reason: string, + ctx?: FailureContext + ): void { this.dlq.add({ - from_agent: msg.from_agent, - to_agent: msg.to_agent, - type: msg.type, - payload: msg.payload, - source_file: msg.source_file, + agent_id: agentId, + mesh_name: meshName, + session_id: ctx?.sessionId || undefined, + from_agent: ctx?.fromAgent || agentId, + to_agent: ctx?.toAgent || agentId, + type: ctx?.msgType || 'task', + payload: ctx?.payload || {}, + source_file: ctx?.sourceFile, failure_reason: reason, - retry_count: retryCount, + failure_category: category, + messages_sent: ctx?.messagesSent, + output_snapshot: ctx?.outputSnapshot, }); } @@ -238,29 +274,114 @@ export class ReliabilityManager { } // ============================================================ - // DLQ Replay API + // Session-Aware Recovery API // ============================================================ /** - * Replay all pending DLQ entries via SystemMessageWriter. - * Re-injects failed messages back into the live system. + * Recover all auto-recoverable DLQ entries. 
+ * + * For session_resume entries: calls sessionResumeHandler to resume + * the SDK session where it left off (preserves conversation history). + * + * For requeue entries: calls requeueHandler to re-inject the message + * into the queue for fresh dispatch. */ - replayDLQ(writer: SystemMessageWriter): ReplayResult[] { - return this.dlq.replayAll(writer); + async recoverAll( + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult[]> { + const entries = this.dlq.getRecoverable(); + const results: RecoveryResult[] = []; + + for (const entry of entries) { + const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + results.push(result); + } + + return results; } /** - * Replay a single DLQ entry by ID. + * Recover DLQ entries for a specific mesh. */ - replayDLQEntry(id: number, writer: SystemMessageWriter): ReplayResult { - return this.dlq.replayOne(id, writer); + async recoverForMesh( + meshName: string, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult[]> { + const entries = this.dlq.getForMesh(meshName); + const results: RecoveryResult[] = []; + + for (const entry of entries) { + if (entry.recovery_mode === 'manual') continue; + const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + results.push(result); + } + + return results; } /** - * Replay all DLQ entries for a specific agent. + * Recover a single DLQ entry by ID.
*/ - replayDLQForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { - return this.dlq.replayForAgent(agentId, writer); + async recoverById( + id: number, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult> { + const entry = this.dlq.getById(id); + if (!entry) { + return { id, success: false, mode: 'manual', error: 'DLQ entry not found' }; + } + return this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + } + + /** + * Recover a single DLQ entry using the appropriate recovery mode. + */ + private async recoverEntry( + entry: DLQEntry, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult> { + if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Resume the SDK session — preserves full conversation history + try { + const result = await sessionResumeHandler(entry.agent_id, entry.session_id, entry.mesh_name); + if (result.success) { + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via session resume', { + id: entry.id, + agent: entry.agent_id, + sessionId: entry.session_id.slice(0, 8), + }); + return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + } else { + // Session resume failed — escalate to manual + this.dlq.escalateToManual(entry.id, result.error || 'Session resume failed'); + return { id: entry.id, success: false, mode: 'session_resume', error: result.error }; + } + } catch (err) { + this.dlq.escalateToManual(entry.id, (err as Error).message); + return { id: entry.id, success: false, mode: 'session_resume', error: (err as Error).message }; + } + } else if (entry.recovery_mode === 'requeue') { + // Re-inject message into the queue + const result = requeueHandler(entry); + if (result.success) { + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via requeue', { + id: entry.id, + agent: entry.agent_id, + }); + return { id: entry.id,
success: true, mode: 'requeue' }; + } else { + this.dlq.escalateToManual(entry.id, result.error || 'Requeue failed'); + return { id: entry.id, success: false, mode: 'requeue', error: result.error }; + } + } else { + return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; + } } // ============================================================ diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index f2d6d578..bc8bf5ea 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -4886,6 +4886,12 @@ You are working in an isolated git worktree for feature: **${hookContext.feature await machine.error(data.error); + // Reliability: categorize failure + const category = data.error?.includes('usage policy') ? 'policy_violation' + : data.error?.includes('timeout') ? 'timeout' + : data.error?.includes('overloaded') ? 'model_error' + : 'crash'; + // Check if we can retry const canRetry = await machine.canTransition('retry', { status: 'initializing', @@ -4915,17 +4921,29 @@ You are working in an isolated git worktree for feature: **${hookContext.feature }, 1000); } else { log.error('dispatcher', `Worker exhausted retries`, { agentId, workerId: errorWorkerId }); + + // Reliability: route to DLQ with session context for recovery + if (this.reliability) { + const sessionId = activeWorker?.runner.getSessionId() || undefined; + const msgsSent = activeWorker?.messagesSent?.length || 0; + this.reliability.deadLetter(meshName!, agentId, category, data.error || 'Unknown error', { + sessionId, + messagesSent: msgsSent, + fromAgent: nextMsg?.from_agent, + toAgent: agentId, + msgType: nextMsg?.type, + payload: nextMsg?.payload as Record<string, unknown>, + sourceFile: nextMsg?.source_file, + }); + } + + // Cleanup using consolidated helper this.cleanupWorker(agentId, errorWorkerId); } this.emit('worker:error', { ...data, workerId: errorWorkerId, transitionName: 'error' }); - // Reliability: record failure with categorization - const category =
data.error?.includes('usage policy') ? 'policy_violation' - : data.error?.includes('timeout') ? 'timeout' - : data.error?.includes('overloaded') ? 'model_error' - : 'crash'; + // Reliability: record failure this.reliability?.recordFailure(meshName!, agentId, category as any, data.error); }); From 962acfe4f9d4a5bdd35efdb18f9e5fb5e95e8ce2 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 17:56:29 +0000 Subject: [PATCH 04/12] feat(cli): Add tx mesh health and tx mesh dlq commands MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surface reliability internals via CLI: - tx mesh health [mesh] — SLI nines dashboard with success rate, MTTR, failure categories, circuit breaker states, safe mode level, agent health, and DLQ summary. Per-mesh/per-agent breakdown when mesh name provided. - tx mesh dlq [mesh] — List pending dead letter queue entries with recovery mode (session_resume/requeue/manual), failure context, retry counts, and session hints. - tx mesh dlq clear — Garbage collect recovered DLQ entries. Both support --json for programmatic consumption. 
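
As a usage sketch (the mesh name `my-mesh` is a placeholder; only the subcommands and the `--json` flag described above come from this patch):

```sh
# SLI nines dashboard for one mesh, machine-readable
tx mesh health my-mesh --json

# List pending DLQ entries with recovery modes and retry counts
tx mesh dlq my-mesh

# Garbage-collect entries that have already been recovered
tx mesh dlq clear
```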
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- src/cli/mesh.ts | 214 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 214 insertions(+) diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index 1e89f8f9..c746ab82 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -20,6 +20,9 @@ * tx mesh ideal Ideal execution stages from routing + manifest * tx mesh drain Drain all pending messages (mark delivered, unblock queue) * tx mesh dump [mesh] Chronological dump of all events (logs, messages, sessions, prompts) + * tx mesh health [mesh] Reliability dashboard (SLI nines, circuit breakers, safe mode) + * tx mesh dlq [mesh] Dead letter queue entries with recovery modes + * tx mesh dlq clear Clear recovered DLQ entries */ import { MessageQueue, FSMPersistence } from '../queue/index.ts'; @@ -28,6 +31,8 @@ import { HeadlessRunner } from '../worker/headless-runner.ts'; import { SessionStore } from '../session/index.ts'; import { validateMesh } from './validate-mesh.ts'; import { MeshValidator } from '../worker/mesh-validator.ts'; +import { ReliabilityManager } from '../reliability/reliability-manager.ts'; +import { DeadLetterQueue } from '../reliability/dead-letter-queue.ts'; import { log } from '../shared/logger.ts'; import { chalk } from '../shared/colors.ts'; import { formatTimeAgo } from '../shared/time.ts'; @@ -3435,6 +3440,204 @@ async function meshDump(meshName: string | undefined, flags: MeshFlags): Promise /** * Print usage help */ +/** + * Show reliability health: SLI nines, circuit breakers, safe mode, heartbeat + */ +async function meshHealth(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.
Run a mesh first.')); + return; + } + + const queue = new MessageQueue(queuePath); + const reliability = new ReliabilityManager(queue.getDb(), cwd); + const status = reliability.getStatus(300_000); // 5 min window + + if (flags.json) { + console.log(JSON.stringify(status, null, 2)); + return; + } + + // Header + const nines = status.sli.ninesLevel; + const rate = (status.sli.successRate * 100).toFixed(2); + const ninesColor = status.sli.successRate >= 0.9999 ? chalk.green + : status.sli.successRate >= 0.999 ? chalk.cyan + : status.sli.successRate >= 0.99 ? chalk.yellow + : chalk.red; + + console.log(); + console.log(chalk.bold('Reliability Health')); + console.log(chalk.dim('─'.repeat(50))); + + // SLI + console.log(` Nines: ${ninesColor(nines)} (${rate}% success)`); + console.log(` Events: ${status.sli.totalEvents} total ${chalk.green(String(status.sli.totalSuccesses))} ok ${chalk.red(String(status.sli.totalFailures))} fail`); + if (status.sli.mttrMs != null) { + console.log(` MTTR: ${(status.sli.mttrMs / 1000).toFixed(1)}s`); + } + + // Failure categories + const cats = status.sli.failuresByCategory; + if (Object.keys(cats).length > 0) { + console.log(` Failures: ${Object.entries(cats).map(([k, v]) => `${k}=${v}`).join(' ')}`); + } + + // Safe mode + const safeLevelColor = status.safeMode.level === 'normal' ? chalk.green + : status.safeMode.level === 'cautious' ? chalk.yellow + : status.safeMode.level === 'restricted' ? chalk.red + : chalk.bgRed; + console.log(` Safe mode: ${safeLevelColor(status.safeMode.level)}${status.safeMode.autoEscalated ? 
chalk.dim(' (auto)') : ''}`); + + // Circuit breakers + const open = status.circuitBreakers.filter(cb => cb.state === 'open'); + const halfOpen = status.circuitBreakers.filter(cb => cb.state === 'half_open'); + if (open.length > 0 || halfOpen.length > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(' Circuit Breakers')); + for (const cb of open) { + console.log(` ${chalk.red('OPEN')} ${cb.agentId} (${cb.failures} failures)`); + } + for (const cb of halfOpen) { + console.log(` ${chalk.yellow('HALF_OPEN')} ${cb.agentId} (${cb.failures} failures)`); + } + } + + // Agent health + const unhealthy = status.agentHealth.filter(h => h.status !== 'healthy'); + if (unhealthy.length > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(' Agent Health')); + for (const h of unhealthy) { + const statusColor = h.status === 'dead' ? chalk.red : h.status === 'stale' ? chalk.yellow : chalk.dim; + console.log(` ${statusColor(h.status.padEnd(8))} ${h.agentId} silent ${(h.silenceMs / 1000).toFixed(0)}s`); + } + } + + // DLQ summary + if (status.dlq.pending > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(` DLQ: ${chalk.red(String(status.dlq.pending) + ' pending')} ${chalk.dim(String(status.dlq.recovered) + ' recovered')}`); + const modes = status.dlq.byMode; + if (modes.session_resume > 0) console.log(` ${chalk.cyan(String(modes.session_resume))} session_resume`); + if (modes.requeue > 0) console.log(` ${chalk.yellow(String(modes.requeue))} requeue`); + if (modes.manual > 0) console.log(` ${chalk.red(String(modes.manual))} manual`); + console.log(` ${chalk.dim('Use')} tx mesh dlq ${chalk.dim('for details')}`); + } else { + console.log(` DLQ: ${chalk.green('clean')}`); + } + + // Per-mesh breakdown if requested + if (meshName) { + const meshSLI = status.sli.byMesh[meshName]; + if (meshSLI) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(` Mesh: ${meshName}`)); + console.log(` Rate: ${(meshSLI.rate * 
100).toFixed(1)}% (${meshSLI.success}/${meshSLI.total})`); + } + // Per-agent within mesh + const agentEntries = Object.entries(status.sli.byAgent) + .filter(([id]) => id.startsWith(`${meshName}/`)); + for (const [id, data] of agentEntries) { + const agentName = id.split('/').slice(1).join('/'); + const rateColor = data.rate >= 0.99 ? chalk.green : data.rate >= 0.9 ? chalk.yellow : chalk.red; + console.log(` ${agentName.padEnd(20)} ${rateColor((data.rate * 100).toFixed(0) + '%')} (${data.success}/${data.total})`); + } + } + + console.log(); +} + +/** + * Show and manage dead letter queue entries + * + * tx mesh dlq List pending DLQ entries + * tx mesh dlq <mesh> List DLQ entries for a mesh + * tx mesh dlq clear Clear recovered entries + */ +async function meshDLQ(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.')); + return; + } + + const queue = new MessageQueue(queuePath); + const dlq = new DeadLetterQueue(queue.getDb()); + + // Special action: clear recovered + if (meshName === 'clear') { + const cleared = dlq.clearRecovered(); + console.log(cleared > 0 + ? chalk.green(`Cleared ${cleared} recovered DLQ entries.`) + : chalk.dim('No recovered entries to clear.')); + return; + } + + const entries = meshName ?
dlq.getForMesh(meshName) : dlq.getPending(); + const stats = dlq.getStats(); + + if (flags.json) { + console.log(JSON.stringify({ stats, entries }, null, 2)); + return; + } + + console.log(); + console.log(chalk.bold('Dead Letter Queue')); + console.log(chalk.dim('─'.repeat(70))); + console.log(` Total: ${stats.total} Pending: ${chalk.red(String(stats.pending))} Recovered: ${chalk.green(String(stats.recovered))}`); + + if (entries.length === 0) { + console.log(chalk.green('\n No pending DLQ entries.')); + console.log(); + return; + } + + console.log(); + + for (const entry of entries) { + const modeColor = entry.recovery_mode === 'session_resume' ? chalk.cyan + : entry.recovery_mode === 'requeue' ? chalk.yellow + : chalk.red; + const age = formatTimeAgo(entry.first_failed_at); + const sessionHint = entry.session_id ? chalk.dim(` sid:${entry.session_id.slice(0, 8)}`) : ''; + + console.log(` ${chalk.dim('#' + entry.id)} ${modeColor(entry.recovery_mode.padEnd(16))} ${chalk.bold(entry.agent_id)}`); + console.log(` ${chalk.dim('mesh:')} ${entry.mesh_name} ${chalk.dim('category:')} ${entry.failure_category} ${chalk.dim('retries:')} ${entry.retry_count}/${entry.max_retries}${sessionHint}`); + console.log(` ${chalk.dim('reason:')} ${entry.failure_reason.slice(0, 80)}`); + if (entry.messages_sent > 0) { + console.log(` ${chalk.dim('msgs sent:')} ${entry.messages_sent} before failure`); + } + console.log(` ${chalk.dim('failed:')} ${age}${entry.recovered_at ? 
chalk.green(' recovered') : ''}`); + console.log(); + } + + // Recovery hints + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); + const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); + const manual = entries.filter(e => e.recovery_mode === 'manual'); + + if (resumable.length > 0 || requeueable.length > 0 || manual.length > 0) { + console.log(chalk.dim('─'.repeat(70))); + if (resumable.length > 0) { + console.log(` ${chalk.cyan(String(resumable.length))} can resume session (conversation preserved)`); + } + if (requeueable.length > 0) { + console.log(` ${chalk.yellow(String(requeueable.length))} can be requeued (fresh dispatch)`); + } + if (manual.length > 0) { + console.log(` ${chalk.red(String(manual.length))} need manual intervention`); + } + console.log(); + } +} + function printUsage(): void { console.log(` ${chalk.bold('Usage:')} tx mesh [mesh] [options] @@ -3456,6 +3659,9 @@ ${chalk.bold('Actions:')} ${chalk.cyan('guardrails')} [mesh] Show guardrail violations from activity logs ${chalk.cyan('ideal')} Ideal execution stages from routing + manifest ${chalk.cyan('dump')} [mesh] Chronological dump of all events (logs, msgs, sessions, prompts) + ${chalk.cyan('health')} [mesh] Reliability dashboard (SLI nines, circuits, safe mode, DLQ) + ${chalk.cyan('dlq')} [mesh] Dead letter queue (pending failures, recovery modes) + ${chalk.cyan('dlq clear')} Clear recovered DLQ entries ${chalk.bold('Options:')} ${chalk.dim('--json')} Output as JSON @@ -3647,6 +3853,14 @@ export async function mesh(args: string[]): Promise { await meshDump(meshName, flags); break; + case 'health': + await meshHealth(meshName, flags); + break; + + case 'dlq': + await meshDLQ(meshName, flags); + break; + default: printUsage(); } From 96bc2a0f8a5f5a11b106d1095862414754d27699 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 23:52:38 +0000 Subject: [PATCH 05/12] feat(reliability): Wire all features end-to-end with CLI and docs MIME-Version: 
1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Every reliability feature now has actuation, not just observation: Heartbeat dead → kill: ReliabilityManager.bindDispatcher() receives killAgent callback. When heartbeat fires 'dead', it kills the stuck worker via AbortController.abort() and records the failure. DLQ recovery (3 trigger paths): 1. Automatic on startup — dispatcher calls recoverAll() 2. CLI — tx mesh recover sends SIGUSR2 to dispatcher 3. Front-matter — message with recover: true triggers recovery Session resume: writes message with session-id front-matter so dispatcher spawns worker resuming the SDK conversation. Requeue: re-injects original message via SystemMessageWriter. Safe mode enforcement: createSafeModeHook() returns a PreToolUse hook (same pattern as write-gate) that blocks Write/Edit/Bash at restricted+ levels. Hook is registered per-agent at spawn time. SIGUSR2 dlq-recover control signal in start.ts. tx mesh recover CLI with SIGUSR2 + message fallback. Test mesh config with tight thresholds for quick testing. docs/reliability.md — complete guide for all features. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 242 +++++++++++++++++++++ meshes/reliability-test/config.yaml | 56 ++--- src/cli/mesh.ts | 100 +++++++++ src/cli/start.ts | 10 + src/reliability/index.ts | 8 +- src/reliability/reliability-manager.ts | 277 ++++++++++++++++--------- src/worker/dispatcher.ts | 57 +++++ 7 files changed, 615 insertions(+), 135 deletions(-) create mode 100644 docs/reliability.md diff --git a/docs/reliability.md b/docs/reliability.md new file mode 100644 index 00000000..9e62dfd0 --- /dev/null +++ b/docs/reliability.md @@ -0,0 +1,242 @@ +# Reliability — Four Nines + +TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. 
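To make the ladder concrete, here is a small illustrative sketch (not part of the module; the function name is hypothetical) of how a raw success rate maps to a nines level. The thresholds mirror the ones `tx mesh health` uses when coloring its output:

```typescript
// Illustrative only: map a raw success rate onto the "nines" ladder.
// Thresholds match the health dashboard's coloring (99.99 / 99.9 / 99 / 90).
function ninesLevel(successRate: number): string {
  if (successRate >= 0.9999) return 'four nines';
  if (successRate >= 0.999) return 'three nines';
  if (successRate >= 0.99) return 'two nines';
  if (successRate >= 0.9) return 'one nine';
  return 'below one nine';
}

// e.g. a mesh at a 99.9% success rate sits at three nines
console.log(ninesLevel(0.999)); // 'three nines'
```

Each step up the ladder demands roughly a tenfold reduction in failures, which is why each nine needs new machinery rather than more of the same.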
+ +## Quick Start + +```bash +# View reliability dashboard +tx mesh health + +# View per-mesh reliability +tx mesh health reliability-test + +# View dead letter queue +tx mesh dlq + +# Recover failed work +tx mesh recover reliability-test +``` + +## Configuration + +Set reliability thresholds in `.ai/tx/data/config.yaml`: + +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 +``` + +## Features + +### 1. Circuit Breaker + +**What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. + +**States**: `closed` (normal) → `open` (blocked) → `half_open` (testing) + +**How it works**: +- Each agent has an independent circuit +- After `failureThreshold` consecutive failures, circuit opens +- While open, `canSpawn()` returns false — dispatcher skips that agent +- After `cooldownMs`, circuit moves to half_open — allows one test spawn +- Success closes the circuit; failure re-opens it + +**State persists to SQLite** — survives restarts. + +**Observe it**: +```bash +tx mesh health # Shows open/half_open circuits +tx spy # Watch for reliability:blocked activity +``` + +### 2. Heartbeat Monitor + +**What it does**: Detects stuck workers and kills them. 
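The silence-threshold check can be sketched as follows (illustrative only; `classifySilence` and its config shape are hypothetical, but the default values match the `heartbeat` configuration shown above):

```typescript
type AgentHealthStatus = 'healthy' | 'warn' | 'stale' | 'dead';

// Illustrative sketch: classify an agent by how long it has been silent.
// Defaults mirror the documented config: warn 60s, stale 120s, dead 300s.
function classifySilence(
  silenceMs: number,
  cfg = { warnMs: 60_000, staleMs: 120_000, deadMs: 300_000 },
): AgentHealthStatus {
  if (silenceMs >= cfg.deadMs) return 'dead'; // the monitor kills the worker here
  if (silenceMs >= cfg.staleMs) return 'stale';
  if (silenceMs >= cfg.warnMs) return 'warn';
  return 'healthy';
}
```

In the real monitor a background timer runs this kind of check per agent; every worker output event resets the silence clock.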
+ +**Thresholds**: `warn` → `stale` → `dead` + +**How it works**: +- On spawn, agent is registered with the heartbeat monitor +- Every worker output event records a heartbeat +- A background timer checks silence intervals +- At `warnMs`: logs a warning +- At `staleMs`: logs a stale warning +- At `deadMs`: **kills the worker** via `AbortController.abort()`, records failure, routes to DLQ + +**Observe it**: +```bash +tx mesh health # Shows unhealthy agents with silence duration +tx logs --component reliability # Heartbeat kill events +``` + +### 3. Dead Letter Queue (DLQ) + +**What it does**: Captures failed work with enough context to recover it. + +**Recovery modes**: +- `session_resume`: Agent had an active SDK session → recovery spawns a new worker with `session-id` front-matter, resuming the conversation where it left off. **Conversation history preserved.** +- `requeue`: No session existed → original message is re-injected into the queue for fresh dispatch. +- `manual`: Retries exhausted → needs human decision. + +**How entries are created**: +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion + +**How recovery works**: + +1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. + +2. **CLI**: `tx mesh recover ` sends a SIGUSR2 signal to the running dispatcher, triggering recovery for that mesh's DLQ entries. + +3. **Front-matter message**: An agent (or core) can write a message with `recover: true` to trigger DLQ recovery: + ```markdown + --- + to: reliability-test/planner + from: core/core + type: task + recover: true + --- + Recover failed work. + ``` + +4. 
**Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. + +**Observe it**: +```bash +tx mesh dlq # List pending entries with recovery mode +tx mesh dlq my-mesh # Filter by mesh +tx mesh dlq --json # Machine-readable output +tx mesh dlq clear # GC recovered entries +``` + +### 4. SLI Tracker + +**What it does**: Measures success rate, failure categories, MTTR, and nines level. + +**Metrics tracked**: +- Success rate (per-mesh, per-agent, overall) +- Nines level (90%, 99%, 99.9%, 99.99%) +- Mean Time To Recovery (MTTR) +- Failure taxonomy: `crash`, `timeout`, `model_error`, `policy_violation`, `circuit_open`, `stuck` + +**How it works**: +- `recordSuccess()` on worker completion, `recordFailure()` on worker error +- In-memory with configurable retention window +- Feeds safe mode auto-escalation + +**Observe it**: +```bash +tx mesh health # Nines level, MTTR, failure breakdown +tx mesh health my-mesh # Per-agent success rates +tx mesh health --json # Full snapshot +``` + +### 5. Safe Mode + +**What it does**: Restricts agent capabilities when reliability drops. 
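Under the documented default thresholds, the auto-escalation rule can be sketched like this (illustrative only; `levelForSLI` is a hypothetical name, not the actual implementation):

```typescript
type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';

// Illustrative sketch: pick the most restrictive level whose threshold
// the current SLI has fallen below. Defaults: 0.95 / 0.90 / 0.80.
function levelForSLI(
  sli: number,
  t = { cautious: 0.95, restricted: 0.90, lockdown: 0.80 },
): SafeModeLevel {
  if (sli < t.lockdown) return 'lockdown';
  if (sli < t.restricted) return 'restricted';
  if (sli < t.cautious) return 'cautious';
  return 'normal';
}
```

A real caller would ratchet: take the max of the current level and the computed one, since escalation is one-way until a human clears it.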
+ +**Levels**: +| Level | Tool restrictions | Trigger | +|-------|------------------|---------| +| `normal` | None | Default | +| `cautious` | None (action-level blocks only) | SLI < cautiousThreshold | +| `restricted` | Write, Edit, NotebookEdit, Bash blocked | SLI < restrictedThreshold | +| `lockdown` | All tools blocked, spawns blocked | SLI < lockdownThreshold | + +**How it works**: +- After every failure, SLI is evaluated against thresholds +- If `autoEscalate: true` and SLI drops below a threshold, safe mode escalates +- **Only escalates, never auto-de-escalates** — human must clear it +- At `restricted`+: a PreToolUse hook blocks Write/Edit/Bash calls +- At `lockdown`: `canSpawn()` blocks all new workers for that mesh + +**Enforcement**: Safe mode hook is registered as a PreToolUse hook alongside write-gate and identity-gate. When an agent tries to use a blocked tool, it gets a rejection message explaining the restriction. + +**Observe it**: +```bash +tx mesh health # Shows current safe mode level +tx spy # Watch safe-mode:blocked activity events +``` + +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + +## Front-Matter Options + +Agents can interact with reliability features via message front-matter: + +| Field | Value | Effect | +|-------|-------|--------| +| `recover` | `true` | Triggers DLQ recovery for the target mesh | +| `session-id` | SDK session ID | Spawns worker resuming that session | +| `resume-mesh` | `true` | Preserves mesh state instead of 
clearing on entry | + +## CLI Reference + +| Command | Description | +|---------|-------------| +| `tx mesh health [mesh]` | Reliability dashboard (SLI, circuits, safe mode, DLQ) | +| `tx mesh health --json` | Machine-readable health output | +| `tx mesh dlq [mesh]` | List dead letter queue entries | +| `tx mesh dlq clear` | Clear recovered DLQ entries | +| `tx mesh recover ` | Trigger DLQ recovery via running dispatcher | +| `tx mesh recover --all` | Recover all pending DLQ entries | + +## Architecture + +``` + ┌──────────────────────┐ + │ ReliabilityManager │ + │ │ + │ ┌─ SLI Tracker │ + │ ├─ Circuit Breaker │ ← SQLite persisted + │ ├─ Heartbeat Monitor│ ← kills via bindings + │ ├─ Dead Letter Queue│ ← SQLite persisted + │ └─ Safe Mode │ ← PreToolUse hook + │ │ + │ bindDispatcher({ │ + │ killAgent, │ ← WorkerLifecycle.killForAgent + │ requeueMessage, │ ← SystemMessageWriter.write + │ }) │ + └──────────┬───────────┘ + │ + ┌────────────────┼────────────────┐ + │ │ │ + ┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐ + │ canSpawn() │ │recordFail │ │ heartbeat │ + │ safe mode │ │ + DLQ │ │ dead→kill │ + │ + circuit │ │ + SLI │ │ + DLQ │ + └────────────┘ └───────────┘ └───────────┘ +``` diff --git a/meshes/reliability-test/config.yaml b/meshes/reliability-test/config.yaml index 584b4233..7f7fedc6 100644 --- a/meshes/reliability-test/config.yaml +++ b/meshes/reliability-test/config.yaml @@ -1,14 +1,11 @@ # reliability-test/config.yaml # Test mesh for validating four-nines reliability features # -# Exercises: circuit breakers, heartbeat monitoring, SLI tracking, -# dead letter queue, safe mode, and failure recovery. -# -# This mesh has an intentionally fragile agent (chaos-agent) that may -# produce routing errors or slow output to test reliability detection. +# Uses tight thresholds so reliability events trigger quickly during testing. +# See docs/reliability.md for how to exercise each feature. 
mesh: reliability-test -description: "Test mesh for four-nines reliability features: circuit breakers, heartbeat, SLI, DLQ, safe mode" +description: "Reliability test mesh: circuit breakers, heartbeat, SLI, DLQ, safe mode" agents: - name: planner @@ -45,42 +42,31 @@ routing: blocked: worker: "Checks failed, rework needed" -# Reliability-specific guardrails for testing +# Tight guardrails to trigger reliability events quickly guardrails: max_messages: strict: true warning: true - limit: 20 + limit: 10 max_turns: strict: false warning: true - limit: 15 - routing_error: - strict: false - warning: true - max_retries: 2 + limit: 8 + +# Reliability config — tight thresholds for testing +reliability: + circuitBreaker: + failureThreshold: 2 # Opens after just 2 failures (default: 3) + cooldownMs: 15000 # 15s cooldown (default: 30s) + heartbeat: + warnMs: 30000 # Warn after 30s silence (default: 60s) + staleMs: 60000 # Stale after 60s (default: 120s) + deadMs: 120000 # Dead after 120s (default: 300s) + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.80 # Cautious when <80% success + restrictedThreshold: 0.50 # Restricted when <50% + lockdownThreshold: 0.25 # Lockdown when <25% -# Workspace for output workspace: path: ".ai/output/{task-id}/" - -lifecycle: - post: - - commit:auto - -playbook_notes: | - Reliability test mesh for exercising four-nines patterns: - - 1. PLANNER: Breaks down the task into steps - 2. WORKER: Executes the implementation - 3. 
CHECKER: Validates the output - - This mesh is configured with tight guardrails to exercise: - - Circuit breaker trips on repeated failures - - Heartbeat detection on stalled agents - - SLI tracking for success/failure rates - - DLQ routing for undeliverable messages - - Safe mode escalation when SLI drops - - Run with: tx msg "Implement a simple hello world function" - Monitor with: tx status (shows reliability metrics) diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index c746ab82..93360739 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -3638,6 +3638,100 @@ async function meshDLQ(meshName: string | undefined, flags: MeshFlags): Promise< } +/** + * Trigger DLQ recovery for a mesh via the running dispatcher. + * Uses SIGUSR2 control signal (same pattern as tx mesh kill). + * + * If the dispatcher is not running, falls back to writing a recovery + * message directly so it's picked up on next start. + * + * tx mesh recover <mesh> Recover DLQ entries for a mesh + * tx mesh recover --all Recover all DLQ entries + */ +async function meshRecover(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + const dataDir = path.join(cwd, '.ai/tx/data'); + const pidFile = path.join(dataDir, '.pid'); + const controlFile = path.join(dataDir, 'control.json'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.')); + return; + } + + // Check what's in the DLQ first + const queue = new MessageQueue(queuePath); + const dlq = new DeadLetterQueue(queue.getDb()); + + const entries = meshName && !flags.all + ?
dlq.getForMesh(meshName).filter(e => e.recovery_mode !== 'manual') + : dlq.getRecoverable(); + + if (entries.length === 0) { + console.log(chalk.green('No recoverable DLQ entries.')); + return; + } + + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); + const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); + console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue`); + + // Try SIGUSR2 to running dispatcher + if (fs.existsSync(pidFile)) { + const pid = parseInt(fs.readFileSync(pidFile, 'utf-8').trim(), 10); + if (!isNaN(pid)) { + const target = meshName && !flags.all ? meshName : '_all'; + fs.writeFileSync(controlFile, JSON.stringify({ action: 'dlq-recover', mesh: target })); + + try { + process.kill(pid, 'SIGUSR2'); + + // Wait for ACK + for (let i = 0; i < 50; i++) { + if (!fs.existsSync(controlFile)) { + console.log(chalk.green(`Recovery triggered successfully.`)); + return; + } + await new Promise(r => setTimeout(r, 100)); + } + console.log(chalk.yellow('Timeout waiting for dispatcher. 
Entries will be recovered on next start.')); + if (fs.existsSync(controlFile)) fs.unlinkSync(controlFile); + return; + } catch { + // Process not running — fall through to message-based recovery + if (fs.existsSync(controlFile)) fs.unlinkSync(controlFile); + } + } + } + + // Fallback: write a recovery message so next start picks it up + // Use SystemMessageWriter pattern — write directly to msgs dir + if (meshName) { + const msgsDir = path.join(cwd, '.ai/tx/msgs'); + if (!fs.existsSync(msgsDir)) fs.mkdirSync(msgsDir, { recursive: true }); + + // Look up entry point from mesh config + const meshDir = path.join(cwd, 'meshes', meshName); + let entryPoint = 'worker'; + const configPath = path.join(meshDir, 'config.yaml'); + if (fs.existsSync(configPath)) { + try { + const cfg = YAML.parse(fs.readFileSync(configPath, 'utf-8')); + entryPoint = cfg.entry_point || cfg.agents?.[0]?.name || 'worker'; + } catch { /* use default */ } + } + + const timestamp = Date.now(); + const filename = `${timestamp}-task-system-dlq-recovery--${meshName}-${entryPoint}-recover.md`; + const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; + fs.writeFileSync(path.join(msgsDir, filename), content); + console.log(chalk.cyan(`Recovery message written. Will be processed on next tx start.`)); + } else { + console.log(chalk.yellow('Cannot write fallback recovery without mesh name. 
Use: tx mesh recover ')); + } +} + function printUsage(): void { console.log(` ${chalk.bold('Usage:')} tx mesh [mesh] [options] @@ -3662,6 +3756,8 @@ ${chalk.bold('Actions:')} ${chalk.cyan('health')} [mesh] Reliability dashboard (SLI nines, circuits, safe mode, DLQ) ${chalk.cyan('dlq')} [mesh] Dead letter queue (pending failures, recovery modes) ${chalk.cyan('dlq clear')} Clear recovered DLQ entries + ${chalk.cyan('recover')} Trigger DLQ recovery (session resume or requeue) + ${chalk.cyan('recover')} --all Recover all pending DLQ entries ${chalk.bold('Options:')} ${chalk.dim('--json')} Output as JSON @@ -3861,6 +3957,10 @@ export async function mesh(args: string[]): Promise { await meshDLQ(meshName, flags); break; + case 'recover': + await meshRecover(meshName, flags); + break; + default: printUsage(); } diff --git a/src/cli/start.ts b/src/cli/start.ts index 3906c4d0..2dda9920 100644 --- a/src/cli/start.ts +++ b/src/cli/start.ts @@ -271,6 +271,16 @@ export async function start(workDir?: string, options?: StartOptions): Promise r.success).length; + log.info('start', 'SIGUSR2: DLQ recovery', { + mesh: ctrl.mesh, attempted: results.length, succeeded, + }); + } } // Delete control file as ACK diff --git a/src/reliability/index.ts b/src/reliability/index.ts index 718ecc3d..ee6861fb 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -3,17 +3,17 @@ * * Implements four-nines (99.99%) reliability patterns for TX mesh execution: * - * Nine 1 (90%): Basic error handling, logging ✓ (existing) + * Nine 1 (90%): Basic error handling, logging (existing) * Nine 2 (99%): Dead letter queue, session-aware recovery, idempotency - * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces - * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks + * Nine 3 (99.9%): Circuit breakers, heartbeat detection + kill, structured traces + * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode enforcement * * Key insight: Recovery 
via session resume (not raw message replay) preserves * full conversation history and tool state. Reference: Karpathy's "March of Nines" */ export { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; -export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type SessionResumeHandler, type RequeueHandler } from './reliability-manager.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type DispatcherBindings } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 73e50b31..4537b6fe 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -1,24 +1,21 @@ /** * ReliabilityManager - Central coordinator for all reliability features * - * Provides a single integration point for the dispatcher to wire up: + * Provides a single integration point for the dispatcher: * - Dead letter queue (session-aware failure recovery) * - Circuit breakers (cascading failure prevention) - * - Heartbeat monitoring (stalled worker detection) + * - Heartbeat monitoring (stalled worker detection + kill) * - SLI tracking (reliability measurement) - * - Safe mode (gradual autonomy control) + * - Safe mode (gradual autonomy control via PreToolUse hooks) * * Usage in dispatcher.start(): - * this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + * this.reliability = new ReliabilityManager(db, workDir); + * this.reliability.bindDispatcher({ + * killAgent: (agentId, reason) => 
this.workerLifecycle.killForAgent(agentId, reason), + * requeueMessage: (from, to, type, payload) => this.systemWriter.write({...}), + * getActiveSessionId: (agentId) => worker?.runner.getSessionId(), + * }); * this.reliability.start(); - * - * Wire events: - * // On worker complete - * this.reliability.recordSuccess(meshName, agentId, durationMs); - * // On worker error (with session context for DLQ) - * this.reliability.recordFailure(meshName, agentId, 'crash', error.message, { sessionId, messagesSent }); - * // On worker output (heartbeat) - * this.reliability.heartbeat(agentId); */ import type Database from 'better-sqlite3'; @@ -79,15 +76,17 @@ export interface FailureContext { payload?: Record; } -/** Callback for session resume recovery */ -export type SessionResumeHandler = ( - agentId: string, - sessionId: string, - meshName: string -) => Promise<{ success: boolean; error?: string }>; - -/** Callback for message requeue recovery */ -export type RequeueHandler = (entry: DLQEntry) => { success: boolean; error?: string }; +/** + * Dispatcher callbacks — these let the reliability manager + * take real action (kill workers, requeue messages) without + * importing the dispatcher directly. 
+ */ +export interface DispatcherBindings { + /** Kill all workers for an agent, returns count killed */ + killAgent: (agentId: string, reason: string) => number; + /** Write a message into the queue (for requeue recovery) */ + requeueMessage: (from: string, to: string, type: string, payload: Record<string, unknown>, extraFrontmatter?: Record<string, unknown>) => void; +} export class ReliabilityManager { readonly dlq: DeadLetterQueue; @@ -96,6 +95,7 @@ readonly circuitBreaker: CircuitBreaker; readonly heartbeat: HeartbeatMonitor; readonly sli: SLITracker; readonly safeMode: SafeMode; private workDir: string; + private bindings?: DispatcherBindings; constructor(db: Database.Database, workDir: string, config?: ReliabilityConfig) { this.workDir = workDir; @@ -110,7 +110,23 @@ this.sli = new SLITracker(merged.sli); this.safeMode = new SafeMode(merged.safeMode); - // Wire heartbeat callbacks + log.info('reliability', 'ReliabilityManager initialized', { + dlqMaxRetries: merged.dlq?.maxRetries || 3, + cbThreshold: merged.circuitBreaker?.failureThreshold || 3, + safeModeDefault: merged.safeMode?.defaultLevel || 'normal', + autoEscalate: merged.safeMode?.autoEscalate || false, + }); + } + + /** + * Bind dispatcher actions. Must be called before start(). + * This gives the reliability manager the ability to actually + * kill stuck workers and requeue messages — not just observe.
+ */ + bindDispatcher(bindings: DispatcherBindings): void { + this.bindings = bindings; + + // Now that we can kill, wire the heartbeat dead callback this.heartbeat.on('stale', (health) => { log.warn('reliability', `Agent stale: ${health.agentId}`, { silenceMs: health.silenceMs, @@ -118,20 +134,25 @@ export class ReliabilityManager { }); this.heartbeat.on('dead', (health) => { - this.recordFailure( - health.agentId.split('/')[0], - health.agentId, - 'stuck', - `No output for ${Math.round(health.silenceMs / 1000)}s` - ); - }); + const meshName = health.agentId.split('/')[0]; - log.info('reliability', 'ReliabilityManager initialized', { - dlqMaxRetries: merged.dlq?.maxRetries || 3, - cbThreshold: merged.circuitBreaker?.failureThreshold || 3, - safeModeDefault: merged.safeMode?.defaultLevel || 'normal', - autoEscalate: merged.safeMode?.autoEscalate || false, + // Record failure (updates SLI, circuit breaker, safe mode) + this.recordFailure(meshName, health.agentId, 'stuck', + `No output for ${Math.round(health.silenceMs / 1000)}s`); + + // Kill the stuck worker + const killed = this.bindings!.killAgent(health.agentId, `heartbeat dead: ${Math.round(health.silenceMs / 1000)}s silent`); + log.warn('reliability', `Killed stuck agent`, { + agentId: health.agentId, + silenceMs: health.silenceMs, + workersKilled: killed, + }); + + log.activity('reliability:heartbeat-kill', health.agentId, + `Killed after ${Math.round(health.silenceMs / 1000)}s silence`); }); + + log.info('reliability', 'Dispatcher bindings attached'); } /** @@ -213,17 +234,12 @@ export class ReliabilityManager { /** * Record failure with optional session context for DLQ routing. - * - * When failureCtx includes a sessionId, the DLQ entry is marked for - * session_resume recovery (picks up exactly where the agent left off). - * Without a sessionId, it falls back to message requeue. 
*/ recordFailure( meshName: string, agentId: string, category: FailureCategory, reason?: string, - failureCtx?: FailureContext ): void { this.sli.recordFailure(meshName, agentId, category, reason); this.circuitBreaker.recordFailure(agentId, reason || category); @@ -265,6 +281,46 @@ export class ReliabilityManager { }); } + /** + * Create a PreToolUse hook that enforces safe mode tool restrictions. + * Returns a hook object compatible with the dispatcher's chaos hooks. + * + * At 'restricted' level: blocks Write, Edit, NotebookEdit, Bash + * At 'lockdown' level: blocks everything (spawn already blocked) + * At 'cautious' level: allows all tools (restrictions are action-level) + */ + createSafeModeHook(meshName: string, agentId: string): { matcher: string; hooks: Array<(input: unknown) => { decision: string; reason?: string }> } | null { + const level = this.safeMode.getLevel(meshName); + if (level === 'normal') return null; + + const state = this.safeMode.getState(meshName); + const disabledTools = state.disabledTools; + if (disabledTools.length === 0) return null; + + return { + matcher: '*', // Check all tools + hooks: [(input: unknown) => { + const toolInput = input as { tool_name?: string }; + const toolName = toolInput?.tool_name || ''; + + if (disabledTools.includes(toolName)) { + log.warn('safe-mode', `Blocked tool ${toolName}`, { + agentId, + meshName, + level, + }); + log.activity('safe-mode:blocked', agentId, `${toolName} blocked at ${level} level`); + + return { + decision: 'block', + reason: `Safe mode ${level}: ${toolName} is disabled. 
Current restrictions: ${disabledTools.join(', ')}`, + }; + } + return { decision: 'allow' }; + }], + }; + } + /** * Clean up for a mesh (call on mesh complete) */ @@ -274,28 +330,33 @@ export class ReliabilityManager { } // ============================================================ - // Session-Aware Recovery API + // DLQ Recovery — triggered by CLI or front-matter message // ============================================================ /** * Recover all auto-recoverable DLQ entries. * - * For session_resume entries: calls sessionResumeHandler to resume - * the SDK session where it left off (preserves conversation history). + * For session_resume: writes a new message to the target agent + * with session-id front-matter so the dispatcher spawns with resume. + * + * For requeue: re-injects the original message into the queue. * - * For requeue entries: calls requeueHandler to re-inject the message - * into the queue for fresh dispatch. + * Requires bindings — call bindDispatcher() first. */ - async recoverAll( - sessionResumeHandler: SessionResumeHandler, - requeueHandler: RequeueHandler - ): Promise { + recoverAll(): RecoveryResult[] { + if (!this.bindings) { + log.error('reliability', 'Cannot recover: no dispatcher bindings'); + return []; + } + const entries = this.dlq.getRecoverable(); + if (entries.length === 0) return []; + + log.info('reliability', `Recovering ${entries.length} DLQ entries`); const results: RecoveryResult[] = []; for (const entry of entries) { - const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); - results.push(result); + results.push(this.recoverEntry(entry)); } return results; @@ -304,18 +365,15 @@ export class ReliabilityManager { /** * Recover DLQ entries for a specific mesh. 
 */
-  async recoverForMesh(
-    meshName: string,
-    sessionResumeHandler: SessionResumeHandler,
-    requeueHandler: RequeueHandler
-  ): Promise<RecoveryResult[]> {
+  recoverForMesh(meshName: string): RecoveryResult[] {
+    if (!this.bindings) return [];
+
     const entries = this.dlq.getForMesh(meshName);
     const results: RecoveryResult[] = [];

     for (const entry of entries) {
       if (entry.recovery_mode === 'manual') continue;
-      const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler);
-      results.push(result);
+      results.push(this.recoverEntry(entry));
     }

     return results;
@@ -324,63 +382,90 @@
   /**
    * Recover a single DLQ entry by ID.
    */
-  async recoverById(
-    id: number,
-    sessionResumeHandler: SessionResumeHandler,
-    requeueHandler: RequeueHandler
-  ): Promise<RecoveryResult> {
+  recoverById(id: number): RecoveryResult {
+    if (!this.bindings) {
+      return { id, success: false, mode: 'manual', error: 'No dispatcher bindings' };
+    }
+
     const entry = this.dlq.getById(id);
     if (!entry) {
       return { id, success: false, mode: 'manual', error: 'DLQ entry not found' };
     }

-    return this.recoverEntry(entry, sessionResumeHandler, requeueHandler);
+    return this.recoverEntry(entry);
   }

   /**
    * Recover a single DLQ entry using the appropriate recovery mode.
+   *
+   * session_resume: Write a message to the agent with session-id in
+   * front-matter. The dispatcher's existing session-id handling spawns
+   * a new worker that resumes the SDK conversation.
+   *
+   * requeue: Re-inject the original message from→to with its payload.
*/ - private async recoverEntry( - entry: DLQEntry, - sessionResumeHandler: SessionResumeHandler, - requeueHandler: RequeueHandler - ): Promise { - if (entry.recovery_mode === 'session_resume' && entry.session_id) { - // Resume the SDK session — preserves full conversation history - try { - const result = await sessionResumeHandler(entry.agent_id, entry.session_id, entry.mesh_name); - if (result.success) { - this.dlq.markRecovered(entry.id); - log.info('reliability', 'DLQ entry recovered via session resume', { - id: entry.id, - agent: entry.agent_id, - sessionId: entry.session_id.slice(0, 8), - }); - return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; - } else { - // Session resume failed — escalate to manual - this.dlq.escalateToManual(entry.id, result.error || 'Session resume failed'); - return { id: entry.id, success: false, mode: 'session_resume', error: result.error }; + private recoverEntry(entry: DLQEntry): RecoveryResult { + try { + if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Write a recovery message with session-id front-matter + // The dispatcher already handles session-id: spawns worker resuming that session + this.bindings!.requeueMessage( + 'system/dlq-recovery', + entry.agent_id, + 'task', + { + headline: `DLQ recovery: resuming session ${entry.session_id.slice(0, 8)}`, + body: `Resuming failed work. 
Original failure: ${entry.failure_reason}`, + 'resume-mesh': 'true', + }, + { 'session-id': entry.session_id } + ); + + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via session resume', { + id: entry.id, + agent: entry.agent_id, + sessionId: entry.session_id.slice(0, 8), + }); + log.activity('reliability:recovered', entry.agent_id, + `Session resume (sid:${entry.session_id.slice(0, 8)})`); + + return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + + } else if (entry.recovery_mode === 'requeue') { + // Re-inject the original message + let payload: Record; + try { + payload = JSON.parse(entry.payload); + } catch { + payload = { body: entry.payload }; } - } catch (err) { - this.dlq.escalateToManual(entry.id, (err as Error).message); - return { id: entry.id, success: false, mode: 'session_resume', error: (err as Error).message }; - } - } else if (entry.recovery_mode === 'requeue') { - // Re-inject message into the queue - const result = requeueHandler(entry); - if (result.success) { + + this.bindings!.requeueMessage( + entry.from_agent, + entry.to_agent, + entry.type, + { ...payload, headline: payload.headline || `DLQ requeue: ${entry.failure_reason.slice(0, 50)}` }, + ); + this.dlq.markRecovered(entry.id); log.info('reliability', 'DLQ entry recovered via requeue', { id: entry.id, agent: entry.agent_id, + from: entry.from_agent, + to: entry.to_agent, }); + log.activity('reliability:recovered', entry.agent_id, 'Message requeued'); + return { id: entry.id, success: true, mode: 'requeue' }; + } else { - this.dlq.escalateToManual(entry.id, result.error || 'Requeue failed'); - return { id: entry.id, success: false, mode: 'requeue', error: result.error }; + return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; } - } else { - return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; + } catch (err) { + const msg = (err as 
Error).message; + this.dlq.escalateToManual(entry.id, msg); + log.error('reliability', 'Recovery failed', { id: entry.id, error: msg }); + return { id: entry.id, success: false, mode: entry.recovery_mode, error: msg }; } } diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index bc8bf5ea..5c3820c2 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -1190,8 +1190,35 @@ export class WorkerDispatcher extends EventEmitter { // Initialize reliability manager (circuit breakers, heartbeat, SLI, DLQ, safe-mode) this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + this.reliability.bindDispatcher({ + killAgent: (agentId: string, reason: string) => { + return this.workerLifecycle.killForAgent(agentId, reason); + }, + requeueMessage: (from: string, to: string, type: string, payload: Record, extraFrontmatter?: Record) => { + this.systemWriter.write({ + from, + to, + type, + headline: (payload.headline as string) || 'DLQ recovery', + body: (payload.body as string) || '', + extraFrontmatter: { ...extraFrontmatter, ...Object.fromEntries( + Object.entries(payload).filter(([k]) => !['headline', 'body'].includes(k)).map(([k, v]) => [k, String(v)]) + )}, + }); + }, + }); this.reliability.start(); + // Recover any pending DLQ entries from previous crash + const dlqRecovery = this.reliability.recoverAll(); + if (dlqRecovery.length > 0) { + log.info('dispatcher', 'DLQ startup recovery', { + attempted: dlqRecovery.length, + succeeded: dlqRecovery.filter(r => r.success).length, + failed: dlqRecovery.filter(r => !r.success).length, + }); + } + // Subscribe to consumer events for event-driven dispatch if (consumer) { this.boundMessageHandler = (event: { agentId: string }) => { @@ -1558,6 +1585,24 @@ export class WorkerDispatcher extends EventEmitter { } } + // DLQ RECOVERY: recover front-matter triggers DLQ recovery for this mesh + // Core agent or CLI can send: `recover: true` to trigger auto-recovery + if 
(pendingMessage?.payload?.['recover'] === true || pendingMessage?.payload?.['recover'] === 'true') { + if (this.reliability) { + const results = this.reliability.recoverForMesh(meshName); + const succeeded = results.filter(r => r.success).length; + log.info('dispatcher', 'DLQ recovery triggered by front-matter', { + meshName, attempted: results.length, succeeded, + }); + + // Consume the recover message — its purpose is fulfilled + this.queue.pollOne(agentId); + + // If entries were recovered, they'll flow through as new messages + if (results.length > 0) return; + } + } + // NEW MESH RUN: Clear stale state when task arrives at entry point // This handles crashed/abandoned meshes that never sent task-complete to core const entryPoint = meshConfig.entry_point || 'worker'; @@ -3679,6 +3724,18 @@ Please advise the agent or check mesh configuration.`; }); } + // Safe mode gate: block tools based on current safe mode level + if (this.reliability) { + const safeModeHook = this.reliability.createSafeModeHook(meshName!, agentId); + if (safeModeHook) { + preToolUseHooks.push(safeModeHook); + log.info('safe-mode', 'Safe mode hook enabled', { + agentId, + level: this.reliability.safeMode.getLevel(meshName!), + }); + } + } + // Orchestrator gate: restrict Write to msgs dir only if (agent.orchestrator) { const msgsDir = this.config.msgsDir; From f1d1334e3bf36ed42441184816bf58391bfa1e8a Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 10 Mar 2026 00:12:39 +0000 Subject: [PATCH 06/12] feat(reliability): Checkpoint log + rewind-to recovery MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the ability to rewind recovery to any FSM state checkpoint, not just the crash point. Core agent can now say "rewind-to: build" to skip failed work and resume from a known-good state. Checkpoint log (SQLite): Saves session IDs at every FSM state transition in the dispatcher's onWorkerComplete handler. Key: mesh_name + state_name → session_id. 
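The mesh_name + state_name → session_id key scheme above can be sketched with a plain in-memory structure before reaching for SQLite. This is an illustrative stand-in only (CheckpointEntry and InMemoryCheckpointLog are invented names, not part of the patch); the real CheckpointLog persists rows and orders by created_at:

```typescript
// Hypothetical in-memory stand-in for the SQLite-backed checkpoint log.
// Key scheme: mesh_name + state_name -> session_id, most recent entry wins.
interface CheckpointEntry {
  meshName: string;
  stateName: string;
  sessionId: string;
  savedAt: number; // monotonic counter standing in for created_at
}

class InMemoryCheckpointLog {
  private entries: CheckpointEntry[] = [];
  private tick = 0;

  // Called at every FSM state transition.
  save(meshName: string, stateName: string, sessionId: string): void {
    this.entries.push({ meshName, stateName, sessionId, savedAt: this.tick++ });
  }

  // What `rewind-to: <state>` resolves against: newest checkpoint for mesh+state.
  lookup(meshName: string, stateName: string): CheckpointEntry | undefined {
    return this.entries
      .filter(c => c.meshName === meshName && c.stateName === stateName)
      .sort((a, b) => b.savedAt - a.savedAt)[0];
  }
}

const cpLog = new InMemoryCheckpointLog();
cpLog.save('my-mesh', 'build', 'sess-old');
cpLog.save('my-mesh', 'build', 'sess-new'); // later pass through the same state
console.log(cpLog.lookup('my-mesh', 'build')?.sessionId); // prints "sess-new"
```

Keeping every row instead of upserting is the same design choice the patch makes: history is retained for GC and audit, and recency is resolved at lookup time.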
Lookup, list, GC, and clear operations.

rewind-to front-matter: recover: true + rewind-to: <state> on a message
looks up the checkpoint for that state and uses its session ID instead
of the DLQ entry's crash-point session.

Three trigger paths:
1. CLI: tx mesh recover <mesh> --rewind-to=build
2. Message: recover: true + rewind-to: build front-matter
3. SIGUSR2: {"action":"dlq-recover","mesh":"x","rewindTo":"build"}

tx mesh recover <mesh> now shows available checkpoints before recovering.

Core prompt updated with Reliability & Recovery section teaching the
agent how to use recover, rewind-to, and check health.

mesh-builder skill updated with reliability front-matter fields.
docs/reliability.md updated with checkpoint log docs.

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 .claude/skills/mesh-builder/SKILL.md   |   6 +
 docs/reliability.md                    |  60 +++++++++-
 src/cli/mesh.ts                        |  32 ++++-
 src/cli/start.ts                       |   2 +-
 src/prompt/core.ts                     |  59 +++++++++
 src/reliability/checkpoint-log.ts      | 160 +++++++++++++++++++++++++
 src/reliability/index.ts               |   1 +
 src/reliability/reliability-manager.ts |  60 ++++++++--
 src/worker/dispatcher.ts               |  23 +++-
 9 files changed, 386 insertions(+), 17 deletions(-)
 create mode 100644 src/reliability/checkpoint-log.ts

diff --git a/.claude/skills/mesh-builder/SKILL.md b/.claude/skills/mesh-builder/SKILL.md
index 8f08ebcf..313dd902 100644
--- a/.claude/skills/mesh-builder/SKILL.md
+++ b/.claude/skills/mesh-builder/SKILL.md
@@ -167,6 +167,12 @@ agents:

 **Propagation:** Upstream agents must include the key in their completion message frontmatter for downstream agents to receive it. The consumer maps frontmatter fields to `payload` automatically.
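Once the consumer has mapped front-matter into `payload`, flag values may arrive as strings (`recover: "true"`) rather than booleans, as the dispatcher's own `=== true || === 'true'` check suggests. A minimal sketch of normalizing the recovery fields (readRecoveryDirectives is an invented helper for illustration, not the dispatcher's actual parser):

```typescript
// Invented helper: normalize string-valued reliability front-matter
// fields from a message payload into typed recovery directives.
interface RecoveryDirectives {
  recover: boolean;
  rewindTo?: string;
  sessionId?: string;
  resumeMesh: boolean;
}

function readRecoveryDirectives(payload: Record<string, unknown>): RecoveryDirectives {
  // Front-matter booleans can arrive as true or the string 'true'.
  const truthy = (v: unknown) => v === true || v === 'true';
  return {
    recover: truthy(payload['recover']),
    rewindTo: typeof payload['rewind-to'] === 'string' ? payload['rewind-to'] : undefined,
    sessionId: typeof payload['session-id'] === 'string' ? payload['session-id'] : undefined,
    resumeMesh: truthy(payload['resume-mesh']),
  };
}

const d = readRecoveryDirectives({ recover: 'true', 'rewind-to': 'build' });
console.log(d.recover, d.rewindTo); // true build
```

Centralizing the string-vs-boolean check in one place avoids repeating the `=== true || === 'true'` pattern at every call site.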
+**Reliability front-matter fields** (used by core agent for recovery, not in mesh configs): +- `recover: true` — triggers DLQ recovery for the target mesh +- `rewind-to: ` — override recovery session with checkpoint from named FSM state +- `session-id: ` — resume a specific SDK session +- `resume-mesh: true` — preserve mesh state instead of clearing on new entry + ``` User message: feature: auth → prebuild gets "/know:prebuild auth" Prebuild msg: feature: auth → builder gets "/know:build auth" diff --git a/docs/reliability.md b/docs/reliability.md index 9e62dfd0..94f082c1 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -123,6 +123,61 @@ tx mesh dlq --json # Machine-readable output tx mesh dlq clear # GC recovered entries ``` +### Checkpoint Log & Rewind-To + +**What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. + +**How checkpoints are saved**: +- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Checkpoint key: `mesh_name + state_name` → `session_id` +- Multiple checkpoints per state are kept (most recent wins on lookup) + +**How rewind-to works**: + +When recovering from the DLQ, you can specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. + +``` +FSM: analyze → build → verify → complete + ↑ ✗ (crashed here) + └── rewind-to: build (resumes from here) +``` + +**Three ways to trigger rewind-to**: + +1. **CLI**: + ```bash + tx mesh recover my-mesh --rewind-to=build + ``` + +2. **Front-matter message** (core agent): + ```markdown + --- + to: my-mesh/worker + from: core/core + recover: true + rewind-to: build + --- + The verify step went wrong. Rewind to after build completed. + ``` + +3. 
**SIGUSR2 control signal** (programmatic): + ```json + {"action": "dlq-recover", "mesh": "my-mesh", "rewindTo": "build"} + ``` + +**Viewing available checkpoints**: +```bash +tx mesh recover my-mesh # Lists checkpoints before recovering +``` +Output: +``` +Available checkpoints (use --rewind-to=): + analyze sid:a1b2c3d4 agent:my-mesh/analyst 2026-03-10 14:30:00 + build sid:e5f6g7h8 agent:my-mesh/builder 2026-03-10 14:31:15 +``` + +**When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). + ### 4. SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -200,6 +255,7 @@ Agents can interact with reliability features via message front-matter: | Field | Value | Effect | |-------|-------|--------| | `recover` | `true` | Triggers DLQ recovery for the target mesh | +| `rewind-to` | FSM state name | Override recovery session with checkpoint from this state | | `session-id` | SDK session ID | Spawns worker resuming that session | | `resume-mesh` | `true` | Preserves mesh state instead of clearing on entry | @@ -211,7 +267,8 @@ Agents can interact with reliability features via message front-matter: | `tx mesh health --json` | Machine-readable health output | | `tx mesh dlq [mesh]` | List dead letter queue entries | | `tx mesh dlq clear` | Clear recovered DLQ entries | -| `tx mesh recover ` | Trigger DLQ recovery via running dispatcher | +| `tx mesh recover ` | Trigger DLQ recovery (shows checkpoints first) | +| `tx mesh recover --rewind-to=` | Recover rewinding to a specific FSM state | | `tx mesh recover --all` | Recover all pending DLQ entries | ## Architecture @@ -224,6 +281,7 @@ Agents can interact with reliability features via message front-matter: │ ├─ Circuit Breaker │ ← SQLite persisted │ ├─ Heartbeat Monitor│ ← kills via bindings │ ├─ Dead Letter Queue│ ← SQLite persisted + │ ├─ Checkpoint Log │ ← SQLite, rewind-to │ └─ Safe Mode │ ← PreToolUse 
hook │ │ │ bindDispatcher({ │ diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index 93360739..8517c030 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -52,6 +52,7 @@ interface MeshFlags { next?: boolean; all?: boolean; verbose?: boolean; + rewindTo?: string; } /** @@ -73,6 +74,14 @@ function parseFlags(args: string[]): MeshFlags { flags.all = true; } else if (arg === '--verbose') { flags.verbose = true; + } else if (arg.startsWith('--rewind-to=')) { + flags.rewindTo = arg.split('=')[1]; + } else if (arg === '--rewind-to') { + // Next arg will be picked up as a positional, but we handle it here + const idx = args.indexOf(arg); + if (idx < args.length - 1 && !args[idx + 1].startsWith('-')) { + flags.rewindTo = args[idx + 1]; + } } } @@ -3673,16 +3682,32 @@ async function meshRecover(meshName: string | undefined, flags: MeshFlags): Prom return; } + // Show available checkpoints for rewind-to + if (meshName) { + const { CheckpointLog } = await import('../reliability/checkpoint-log.ts'); + const checkpointLog = new CheckpointLog(queue.getDb()); + const checkpoints = checkpointLog.latestPerState(meshName); + if (checkpoints.length > 0) { + console.log(`\n${chalk.bold('Available checkpoints')} (use --rewind-to=):`); + for (const cp of checkpoints) { + console.log(` ${chalk.cyan(cp.state_name.padEnd(20))} ${chalk.dim('sid:')}${cp.session_id.slice(0, 8)} ${chalk.dim('agent:')}${cp.agent_id} ${chalk.dim(cp.created_at)}`); + } + } + } + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); - console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue`); + const rewindNote = flags.rewindTo ? 
chalk.magenta(` (rewind-to: ${flags.rewindTo})`) : ''; + console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue${rewindNote}`); // Try SIGUSR2 to running dispatcher if (fs.existsSync(pidFile)) { const pid = parseInt(fs.readFileSync(pidFile, 'utf-8').trim(), 10); if (!isNaN(pid)) { const target = meshName && !flags.all ? meshName : '_all'; - fs.writeFileSync(controlFile, JSON.stringify({ action: 'dlq-recover', mesh: target })); + const ctrl: Record = { action: 'dlq-recover', mesh: target }; + if (flags.rewindTo) ctrl.rewindTo = flags.rewindTo; + fs.writeFileSync(controlFile, JSON.stringify(ctrl)); try { process.kill(pid, 'SIGUSR2'); @@ -3724,7 +3749,8 @@ async function meshRecover(meshName: string | undefined, flags: MeshFlags): Prom const timestamp = Date.now(); const filename = `${timestamp}-task-system-dlq-recovery--${meshName}-${entryPoint}-recover.md`; - const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; + const rewindLine = flags.rewindTo ? `\nrewind-to: ${flags.rewindTo}` : ''; + const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true${rewindLine}\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; fs.writeFileSync(path.join(msgsDir, filename), content); console.log(chalk.cyan(`Recovery message written. 
Will be processed on next tx start.`)); } else { diff --git a/src/cli/start.ts b/src/cli/start.ts index 2dda9920..3883a416 100644 --- a/src/cli/start.ts +++ b/src/cli/start.ts @@ -275,7 +275,7 @@ export async function start(workDir?: string, options?: StartOptions): Promise r.success).length; log.info('start', 'SIGUSR2: DLQ recovery', { mesh: ctrl.mesh, attempted: results.length, succeeded, diff --git a/src/prompt/core.ts b/src/prompt/core.ts index 44f390fb..d3ea5d9a 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -237,6 +237,65 @@ tx mesh status narrative-engine # Find the msg-id tx mesh resolve ask-123 "Approved, continue with the plan" \`\`\` +## Reliability & Recovery + +When mesh work fails, the system captures failures in a Dead Letter Queue (DLQ) with session context. You can recover failed work and rewind to specific checkpoints. + +**Check health:** +\`\`\`bash +tx mesh health # SLI, circuits, safe mode, DLQ summary +tx mesh health # Per-agent stats +tx mesh dlq # List failed entries with recovery modes +\`\`\` + +**Recover failed work (CLI):** +\`\`\`bash +tx mesh recover # Resume from crash point +tx mesh recover --rewind-to=build # Rewind to state checkpoint +\`\`\` + +**Recover via message** (when CLI isn't suitable or you want to trigger from a response): + +Simple recovery — resume from crash point: +\`\`\`markdown +--- +to: / +from: core/core +recover: true +msg-id: recover-${timestampMs} +headline: Recover failed work +timestamp: ${timestamp} +--- + +Recover failed work from the dead letter queue. +\`\`\` + +Rewind recovery — go back to a known-good state: +\`\`\`markdown +--- +to: / +from: core/core +recover: true +rewind-to: build +msg-id: recover-${timestampMs} +headline: Rewind to build checkpoint +timestamp: ${timestamp} +--- + +The verify step went wrong. Rewind to after build completed and retry. 
+\`\`\` + +**How rewind-to works:** +- Every FSM state transition saves a checkpoint (state name → session ID) +- \`rewind-to: build\` finds the session active when \`build\` completed +- Recovery resumes that exact session — full conversation history preserved +- The agent picks up where it left off, skipping the failed work + +**When to use:** +- User says "go back to step X" or "that went wrong" +- A later state failed but an earlier state was good +- \`tx mesh recover \` shows available checkpoints with state names + ## Message Directory: ${msgsDir}/ ## How to Start Work diff --git a/src/reliability/checkpoint-log.ts b/src/reliability/checkpoint-log.ts new file mode 100644 index 00000000..f3b80d3e --- /dev/null +++ b/src/reliability/checkpoint-log.ts @@ -0,0 +1,160 @@ +/** + * CheckpointLog - Persisted session state at FSM boundaries + * + * Saves session IDs at every FSM state transition so recovery can + * rewind to any named state, not just the crash point. + * + * Core agent uses `rewind-to: ` front-matter to specify which + * checkpoint to recover from. The system looks up the most recent + * session ID for that mesh+state. 
+ */ + +import type Database from 'better-sqlite3'; +import { log } from '../shared/logger.ts'; + +export interface Checkpoint { + id: number; + mesh_name: string; + state_name: string; + agent_id: string; + session_id: string; + from_state: string; + context_snapshot: string; // JSON: FSM context at transition time + created_at: string; +} + +export class CheckpointLog { + private db: Database.Database; + + constructor(db: Database.Database) { + this.db = db; + this.ensureSchema(); + } + + private ensureSchema(): void { + this.db.exec(` + CREATE TABLE IF NOT EXISTS checkpoint_log ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + mesh_name TEXT NOT NULL, + state_name TEXT NOT NULL, + agent_id TEXT NOT NULL, + session_id TEXT NOT NULL, + from_state TEXT NOT NULL, + context_snapshot TEXT DEFAULT '{}', + created_at TEXT NOT NULL DEFAULT (datetime('now')) + ); + CREATE INDEX IF NOT EXISTS idx_checkpoint_mesh_state + ON checkpoint_log(mesh_name, state_name, created_at DESC); + `); + } + + /** + * Save a checkpoint at an FSM state transition. + * Called by the dispatcher when fsm:transition fires. + */ + save(opts: { + meshName: string; + stateName: string; + agentId: string; + sessionId: string; + fromState: string; + context?: Record; + }): void { + this.db.prepare(` + INSERT INTO checkpoint_log (mesh_name, state_name, agent_id, session_id, from_state, context_snapshot, created_at) + VALUES (?, ?, ?, ?, ?, ?, datetime('now')) + `).run( + opts.meshName, + opts.stateName, + opts.agentId, + opts.sessionId, + opts.fromState, + JSON.stringify(opts.context || {}), + ); + + log.debug('checkpoint', 'Saved', { + mesh: opts.meshName, + state: opts.stateName, + agent: opts.agentId, + session: opts.sessionId.slice(0, 8), + }); + } + + /** + * Look up the most recent checkpoint for a mesh+state. + * This is what `rewind-to: ` resolves against. 
+ */ + lookup(meshName: string, stateName: string): Checkpoint | null { + return this.db.prepare(` + SELECT * FROM checkpoint_log + WHERE mesh_name = ? AND state_name = ? + ORDER BY created_at DESC + LIMIT 1 + `).get(meshName, stateName) as Checkpoint | null; + } + + /** + * Get all checkpoints for a mesh (most recent first). + * Used by `tx mesh health` and core agent to see available rewind points. + */ + listForMesh(meshName: string): Checkpoint[] { + return this.db.prepare(` + SELECT * FROM checkpoint_log + WHERE mesh_name = ? + ORDER BY created_at DESC + LIMIT 20 + `).all(meshName) as Checkpoint[]; + } + + /** + * Get the latest checkpoint per state for a mesh. + * Compact view: one row per state, most recent only. + */ + latestPerState(meshName: string): Checkpoint[] { + return this.db.prepare(` + SELECT c.* FROM checkpoint_log c + INNER JOIN ( + SELECT mesh_name, state_name, MAX(created_at) as max_created + FROM checkpoint_log + WHERE mesh_name = ? + GROUP BY mesh_name, state_name + ) latest ON c.mesh_name = latest.mesh_name + AND c.state_name = latest.state_name + AND c.created_at = latest.max_created + ORDER BY c.created_at DESC + `).all(meshName) as Checkpoint[]; + } + + /** + * Clear checkpoints for a mesh (on mesh completion or clear). + */ + clearForMesh(meshName: string): number { + const result = this.db.prepare( + `DELETE FROM checkpoint_log WHERE mesh_name = ?` + ).run(meshName); + return result.changes; + } + + /** + * GC old checkpoints (keep last N per mesh). + */ + gc(keepPerMesh = 50): number { + // Find meshes with more than `keepPerMesh` entries + const meshes = this.db.prepare(` + SELECT mesh_name, COUNT(*) as cnt FROM checkpoint_log + GROUP BY mesh_name HAVING cnt > ? + `).all(keepPerMesh) as Array<{ mesh_name: string; cnt: number }>; + + let total = 0; + for (const { mesh_name } of meshes) { + const result = this.db.prepare(` + DELETE FROM checkpoint_log WHERE mesh_name = ? 
AND id NOT IN ( + SELECT id FROM checkpoint_log WHERE mesh_name = ? + ORDER BY created_at DESC LIMIT ? + ) + `).run(mesh_name, mesh_name, keepPerMesh); + total += result.changes; + } + return total; + } +} diff --git a/src/reliability/index.ts b/src/reliability/index.ts index ee6861fb..7cb9b693 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -18,3 +18,4 @@ export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } f export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; export { SafeMode, type SafeModeConfig, type SafeModeState } from './safe-mode.ts'; +export { CheckpointLog, type Checkpoint } from './checkpoint-log.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 4537b6fe..508f6993 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -24,6 +24,7 @@ import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; +import { CheckpointLog, type Checkpoint } from './checkpoint-log.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -94,6 +95,7 @@ export class ReliabilityManager { readonly heartbeat: HeartbeatMonitor; readonly sli: SLITracker; readonly safeMode: SafeMode; + readonly checkpoints: CheckpointLog; private workDir: string; private bindings?: DispatcherBindings; @@ -109,6 +111,7 @@ export class ReliabilityManager { this.heartbeat = new HeartbeatMonitor(merged.heartbeat); this.sli = new SLITracker(merged.sli); this.safeMode = new SafeMode(merged.safeMode); + 
this.checkpoints = new CheckpointLog(db); log.info('reliability', 'ReliabilityManager initialized', { dlqMaxRetries: merged.dlq?.maxRetries || 3, @@ -364,16 +367,38 @@ export class ReliabilityManager { /** * Recover DLQ entries for a specific mesh. + * If rewindTo is specified, override the session ID with the + * checkpoint for that FSM state (instead of the crash-point session). */ - recoverForMesh(meshName: string): RecoveryResult[] { + recoverForMesh(meshName: string, rewindTo?: string): RecoveryResult[] { if (!this.bindings) return []; + // If rewind-to is specified, look up the checkpoint for that state + let overrideSessionId: string | undefined; + if (rewindTo) { + const checkpoint = this.checkpoints.lookup(meshName, rewindTo); + if (checkpoint) { + overrideSessionId = checkpoint.session_id; + log.info('reliability', `Rewind-to resolved`, { + meshName, + state: rewindTo, + sessionId: checkpoint.session_id.slice(0, 8), + agent: checkpoint.agent_id, + }); + } else { + log.warn('reliability', `No checkpoint found for rewind-to state`, { + meshName, state: rewindTo, + available: this.checkpoints.latestPerState(meshName).map(c => c.state_name), + }); + } + } + const entries = this.dlq.getForMesh(meshName); const results: RecoveryResult[] = []; for (const entry of entries) { if (entry.recovery_mode === 'manual') continue; - results.push(this.recoverEntry(entry)); + results.push(this.recoverEntry(entry, overrideSessionId)); } return results; @@ -403,33 +428,46 @@ export class ReliabilityManager { * * requeue: Re-inject the original message from→to with its payload. */ - private recoverEntry(entry: DLQEntry): RecoveryResult { + /** + * @param overrideSessionId - If set (from rewind-to), use this session + * instead of the DLQ entry's crash-point session. 
+ */ + private recoverEntry(entry: DLQEntry, overrideSessionId?: string): RecoveryResult { try { - if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Use override session (from rewind-to checkpoint) or the crash-point session + const sessionId = overrideSessionId || entry.session_id; + + if ((entry.recovery_mode === 'session_resume' || overrideSessionId) && sessionId) { // Write a recovery message with session-id front-matter // The dispatcher already handles session-id: spawns worker resuming that session + const isRewind = overrideSessionId && overrideSessionId !== entry.session_id; this.bindings!.requeueMessage( 'system/dlq-recovery', entry.agent_id, 'task', { - headline: `DLQ recovery: resuming session ${entry.session_id.slice(0, 8)}`, - body: `Resuming failed work. Original failure: ${entry.failure_reason}`, + headline: isRewind + ? `DLQ recovery: rewinding to checkpoint ${sessionId.slice(0, 8)}` + : `DLQ recovery: resuming session ${sessionId.slice(0, 8)}`, + body: isRewind + ? `Rewinding past failure. Original failure: ${entry.failure_reason}` + : `Resuming failed work. Original failure: ${entry.failure_reason}`, 'resume-mesh': 'true', }, - { 'session-id': entry.session_id } + { 'session-id': sessionId } ); this.dlq.markRecovered(entry.id); - log.info('reliability', 'DLQ entry recovered via session resume', { + log.info('reliability', `DLQ entry recovered via ${isRewind ? 'rewind' : 'session resume'}`, { id: entry.id, agent: entry.agent_id, - sessionId: entry.session_id.slice(0, 8), + sessionId: sessionId.slice(0, 8), + rewind: isRewind || false, }); log.activity('reliability:recovered', entry.agent_id, - `Session resume (sid:${entry.session_id.slice(0, 8)})`); + isRewind ? 
`Rewound to checkpoint (sid:${sessionId.slice(0, 8)})` : `Session resume (sid:${sessionId.slice(0, 8)})`); - return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + return { id: entry.id, success: true, mode: 'session_resume', sessionId }; } else if (entry.recovery_mode === 'requeue') { // Re-inject the original message diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index 5c3820c2..aacc1bfa 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -1589,10 +1589,13 @@ export class WorkerDispatcher extends EventEmitter { // Core agent or CLI can send: `recover: true` to trigger auto-recovery if (pendingMessage?.payload?.['recover'] === true || pendingMessage?.payload?.['recover'] === 'true') { if (this.reliability) { - const results = this.reliability.recoverForMesh(meshName); + // rewind-to: overrides DLQ session with checkpoint session + const rewindTo = pendingMessage?.payload?.['rewind-to'] as string | undefined; + const results = this.reliability.recoverForMesh(meshName, rewindTo || undefined); const succeeded = results.filter(r => r.success).length; log.info('dispatcher', 'DLQ recovery triggered by front-matter', { meshName, attempted: results.length, succeeded, + rewindTo: rewindTo || null, }); // Consume the recover message — its purpose is fulfilled @@ -4738,6 +4741,24 @@ You are working in an isolated git worktree for feature: **${hookContext.feature } } + // Save reliability checkpoint at FSM state boundaries + // Every agent completion in an FSM mesh records the session ID + // keyed by the current FSM state — enables rewind-to recovery + if (this.reliability && data.sessionId && meshName) { + const fsm = this.meshFSMs.get(meshName); + if (fsm?.isInitialized()) { + const fsmState = fsm.getStatus().currentState; + this.reliability.checkpoints.save({ + meshName, + stateName: fsmState, + agentId, + sessionId: data.sessionId, + fromState: fsmState, + context: fsm.getContext() as Record<string, unknown>,
}); + } + } + // Emit quality pass if we had preflight and made it here without errors if (workerHookContext.qualityPreflight) { this.emit('quality:pass', { From 0398edabec8fd3943deca2eb19e1fe5988d04fcb Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 10 Mar 2026 00:16:51 +0000 Subject: [PATCH 07/12] feat(reliability): Enforce human review before recovery Core prompt now requires a 4-step workflow for recovery: 1. Diagnose (tx mesh health + tx mesh dlq) 2. Present options to user (resume vs rewind vs drop + checkpoints) 3. Get explicit confirmation 4. Execute chosen recovery Never triggers recovery silently. The user sees what failed, why, and which checkpoints are available before approving any action. docs/reliability.md updated to match: human-initiated recovery is the preferred path, automatic recovery only on startup crash. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 21 +++++++----------- src/prompt/core.ts | 54 +++++++++++++++++++++++++++++++-------------- 2 files changed, 45 insertions(+), 30 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index 94f082c1..678da70b 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -98,22 +98,17 @@ tx logs --component reliability # Heartbeat kill events **How recovery works**: -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. +**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. -2. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher, triggering recovery for that mesh's DLQ entries. +1. 
**Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) -3. **Front-matter message**: An agent (or core) can write a message with `recover: true` to trigger DLQ recovery: - ```markdown - --- - to: reliability-test/planner - from: core/core - type: task - recover: true - --- - Recover failed work. - ``` +2. **Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. + +3. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. + +4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: <state>`) to trigger DLQ recovery. -4. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. **Observe it**: ```bash diff --git a/src/prompt/core.ts b/src/prompt/core.ts index d3ea5d9a..8167fb1a 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -241,22 +241,33 @@ tx mesh resolve ask-123 "Approved, continue with the plan" When mesh work fails, the system captures failures in a Dead Letter Queue (DLQ) with session context. You can recover failed work and rewind to specific checkpoints. -**Check health:** -\`\`\`bash -tx mesh health # SLI, circuits, safe mode, DLQ summary -tx mesh health <mesh> # Per-agent stats -tx mesh dlq # List failed entries with recovery modes -\`\`\` +**CRITICAL: Recovery requires human approval.** Never trigger recovery silently. 
Always diagnose, present options, and get explicit user confirmation first. -**Recover failed work (CLI):** +### Recovery Workflow (Always Follow These Steps) + +**Step 1: Diagnose** — Run these and present findings to the user: \`\`\`bash -tx mesh recover <mesh> # Resume from crash point -tx mesh recover <mesh> --rewind-to=build # Rewind to state checkpoint +tx mesh health # SLI, circuit breakers, safe mode level +tx mesh dlq # Failed entries: what failed, why, recovery mode \`\`\` -**Recover via message** (when CLI isn't suitable or you want to trigger from a response): +**Step 2: Present options** — Tell the user: +- What failed and why (failure category, reason) +- How many DLQ entries exist +- Recovery modes available (session_resume vs requeue) +- Available checkpoints if FSM mesh (state names the user can rewind to) + +Example: "The verify step failed after 3 retries (model_error). There's 1 DLQ entry with session_resume available. Checkpoints exist for: analyze, build. Options: +1. Resume from crash point (picks up where verify failed) +2. Rewind to build (redo verify from scratch with build context) +3. Rewind to analyze (start over from analysis) +4. Drop it (clear the DLQ entry)" + +**Step 3: Get confirmation** — Wait for the user to choose. Do NOT proceed without explicit approval. + +**Step 4: Execute** — Based on user choice: -Simple recovery — resume from crash point: +Resume from crash point: \`\`\`markdown --- to: <mesh>/<agent> @@ -270,7 +281,7 @@ timestamp: ${timestamp} Recover failed work from the dead letter queue. \`\`\` -Rewind recovery — go back to a known-good state: +Rewind to a checkpoint: \`\`\`markdown --- to: <mesh>/<agent> @@ -285,16 +296,25 @@ timestamp: ${timestamp} The verify step went wrong. Rewind to after build completed and retry. 
\`\`\` -**How rewind-to works:** +Drop / clear: +\`\`\`bash +tx mesh dlq clear # Clear recovered entries +tx mesh clear <mesh> # Full state reset +\`\`\` + +### How rewind-to works - Every FSM state transition saves a checkpoint (state name → session ID) - \`rewind-to: build\` finds the session active when \`build\` completed - Recovery resumes that exact session — full conversation history preserved - The agent picks up where it left off, skipping the failed work -**When to use:** -- User says "go back to step X" or "that went wrong" -- A later state failed but an earlier state was good -- \`tx mesh recover <mesh>\` shows available checkpoints with state names +### CLI equivalents (for reference) +\`\`\`bash +tx mesh recover <mesh> # Resume from crash point +tx mesh recover <mesh> --rewind-to=build # Rewind to state checkpoint +tx mesh health # Overall reliability dashboard +tx mesh dlq # All DLQ entries +\`\`\` ## Message Directory: ${msgsDir}/ ## How to Start Work From 01e5fa84539470f0907f58372169fbc5ed78434c Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:22:35 +0000 Subject: [PATCH 08/12] docs(reliability): Add human review gates for all 6 priority items Each reliability priority now has explicit human review steps: 1. Checkpoints + replay: checkpoint notification, replay approval, post-replay review 2. Metrics + tracking: threshold alerts, safe mode escalation/de-escalation approval 3. Retry-with-variation: failure notification, variation transparency, exhaustion review 4. Schema validation: failure notification, correction approval, partial pass handling 5. Agent classification: classification review, non-critical failure reporting, promotion decisions 6. Observability dashboard: anomaly alerts, trend review, cost gates, weekly digest Core principle: "The system does work. The human makes decisions." Core prompt updated with condensed human review gates checklist. 
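The rewind-to mechanism keys checkpoints by FSM state name and resolves to the most recent session ID recorded for that state. A minimal, hypothetical in-memory sketch of that lookup (the real `CheckpointLog` is SQLite-backed; all names here are illustrative, not the actual API):

```typescript
// Hypothetical in-memory sketch of the state-name → session-ID checkpoint
// map behind rewind-to. The real CheckpointLog persists rows to SQLite.
interface Checkpoint {
  stateName: string;
  sessionId: string;
  at: number;
}

class CheckpointMap {
  private byMesh = new Map<string, Checkpoint[]>();

  // Called at each FSM state boundary: record which session was active.
  save(mesh: string, stateName: string, sessionId: string): void {
    const list = this.byMesh.get(mesh) ?? [];
    list.push({ stateName, sessionId, at: Date.now() });
    this.byMesh.set(mesh, list);
  }

  // rewind-to resolution: the latest checkpoint recorded for that state.
  lookup(mesh: string, stateName: string): Checkpoint | undefined {
    const list = this.byMesh.get(mesh) ?? [];
    return [...list].reverse().find(c => c.stateName === stateName);
  }
}

const cp = new CheckpointMap();
cp.save('reliability-fsm', 'analyze', 'sess-aaa');
cp.save('reliability-fsm', 'build', 'sess-bbb');
cp.lookup('reliability-fsm', 'build')?.sessionId; // 'sess-bbb'
```

Recovery then overrides the DLQ entry's crash-point session with the looked-up session ID, so the resumed worker carries the full conversation history from that state.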
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 119 ++++++++++++++++++++++++++++++++++++++++++++ src/prompt/core.ts | 12 +++++ 2 files changed, 131 insertions(+) diff --git a/docs/reliability.md b/docs/reliability.md index 678da70b..8d126d31 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -293,3 +293,122 @@ Agents can interact with reliability features via message front-matter: │ + circuit │ │ + SLI │ │ + DLQ │ └────────────┘ └───────────┘ └───────────┘ ``` + +## Reliability Roadmap — Human Review Gates + +Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +### Priority 1: Default-On Checkpoints + Replay + +**Impact**: 10x — turns N-step recovery into 1-step problem +**Effort**: Medium + +**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. + +**Human review steps**: +1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." +2. **Replay approval**: Before any rewind-to replay, core presents: + - Which checkpoint to rewind to + - What work will be replayed (states after the checkpoint) + - What work will be discarded (failed states) +3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. + +**Never automatic**: Replay does not happen without the user choosing a checkpoint. + +--- + +### Priority 2: Reliability Metrics Table + Tracking + +**Impact**: Foundation for everything else +**Effort**: Low + +**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. + +**Human review steps**: +1. 
**Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." +2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." +3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" +4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. + +**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. + +--- + +### Priority 3: Retry-With-Variation on Routing/Protocol Failures + +**Impact**: 3-5x improvement on retry success +**Effort**: Low + +**What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. + +**Human review steps**: +1. **First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3." +2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model"). +3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step. +4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect. 
+ +**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask. + +--- + +### Priority 4: Output Schema Validation + +**Impact**: Catches semantic failures early +**Effort**: Medium + +**What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream. + +**Human review steps**: +1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars." +2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?" +3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates." +4. **Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. + +**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. + +--- + +### Priority 5: Critical / Non-Critical Agent Classification + +**Impact**: Prevents cascade from optional steps +**Effort**: Low + +**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. + +**Human review steps**: +1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." +2. 
**Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." +3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" +4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). + +**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. + +--- + +### Priority 6: Aggregate Observability Dashboard + +**Impact**: Needed to find the long-tail 0.01% +**Effort**: Medium + +**What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. + +**Human review steps**: +1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." +2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." +3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" +4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. + +**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). 
+ +--- + +### Human Review Principle + +Across all 6 priorities, the same principle applies: + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior diff --git a/src/prompt/core.ts b/src/prompt/core.ts index 8167fb1a..bfeafb84 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -316,6 +316,18 @@ tx mesh health # Overall reliability dashboard tx mesh dlq # All DLQ entries \`\`\` +### Human Review Gates (Apply to ALL Reliability Events) + +**Principle: The system does work. The human makes decisions.** + +- **Safe mode escalation**: Present SLI data and ask before moving to restricted/lockdown +- **Safe mode de-escalation**: Never auto-de-escalate. Present recovery metrics and ask +- **Retry exhaustion**: Present retry history (what variations were tried) and ask for next step +- **Schema validation failures**: Present what failed validation and ask: retry, accept partial, or drop +- **Non-critical agent failures**: Always report skipped outputs — never silently continue +- **Anomaly detection**: Surface spikes in failure rates, cost, or unusual patterns immediately +- **Cost gates**: Before expensive recovery (large context replay), present estimated token cost + ## Message Directory: ${msgsDir}/ ## How to Start Work From 625d2cd50ff3ba940f84242bef14050ab56327b3 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:25:49 +0000 Subject: [PATCH 09/12] docs(reliability): Add March of Nines status table with human review gates Documents all existing reliability features organized by nines level: - Nine 1 (90%): SQLite WAL, worker retries, injection retries, routing correction - Nine 2 (99%): Parity gate, FSM validation, mesh validator, identity gate, write gate - Nine 2.5: Nudge detector, deadlock breaker, stale cleaner, quality 
iteration loops - Nine 3 (99.9%): Circuit breaker, heartbeat, DLQ, SLI tracker, safe mode, checkpoints - Nine 4 (99.99%): Roadmap items with human review gates Each level includes a feature table (what/where) and explicit human review steps. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 50 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/docs/reliability.md b/docs/reliability.md index 8d126d31..fa20e765 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,56 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +## March of Nines — Current Status + +| Nines | Technique | TX Status | Human Review | +|-------|-----------|-----------|--------------| +| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection retries (poll loop), routing correction injection | Retry exhaustion → present failure + options to user | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate | Validation failures → surface to user with context. Identity/routing violations → warn or kill per guardrail mode | +| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops | Nudge fires → logged. Deadlock cycles > `autoBreakDepth` → escalate to human. Quality exhaustion → present feedback history + ask user | +| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log | Circuit open → notify user. Safe mode escalation → user approval. 
DLQ recovery → diagnose/present/confirm workflow | +| **4 (99.99%)** | [Roadmap] Retry-with-variation, schema validation, agent classification, observability | Planned — see Reliability Roadmap below | Every action requires human confirmation (see roadmap gates) | + +### Nine 1 — Basic Error Handling (90%) + +Foundational durability. Nothing silently drops. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | + +**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. + +### Nine 2 — Validation & Protocol Enforcement (99%) + +Catch bad outputs and protocol violations before they propagate. 
+ +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | +| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | + +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. + +### Nine 2.5 — Self-Healing & Auto-Recovery + +Detect stuck states and recover without human intervention where safe. 
+ +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Nudge detector** | Detects when a completing agent fails to forward work to the next route step; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | + +**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. 
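The deadlock breaker's DFS cycle detection over the ask graph can be sketched as follows — a hedged, self-contained illustration, not the actual `src/queue/deadlock-detector.ts` implementation (the `findCycle` helper and the graph shape are assumptions):

```typescript
// Hypothetical sketch: each key is an asking agent, each value the agents
// it is blocked on. A DFS back-edge reveals a wait cycle (deadlock).
type AskGraph = Map<string, string[]>;

function findCycle(graph: AskGraph): string[] | null {
  const visiting = new Set<string>(); // on the current DFS path
  const done = new Set<string>();     // fully explored, cycle-free
  const path: string[] = [];

  function dfs(node: string): string[] | null {
    if (visiting.has(node)) {
      // Back-edge: the cycle is the path suffix starting at `node`.
      return path.slice(path.indexOf(node)).concat(node);
    }
    if (done.has(node)) return null;
    visiting.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = dfs(next);
      if (cycle) return cycle;
    }
    visiting.delete(node);
    done.add(node);
    path.pop();
    return null;
  }

  for (const node of graph.keys()) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}

const graph: AskGraph = new Map([
  ['planner', ['builder']],
  ['builder', ['verifier']],
  ['verifier', ['planner']],
]);
const cycle = findCycle(graph); // ['planner', 'builder', 'verifier', 'planner']
```

A cycle whose length exceeds `autoBreakDepth` would be escalated to the human with a visualization rather than auto-broken.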
+ ## Quick Start ```bash From 2ddff31d4408797ce1c7231ad75143b3f765e35f Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:43:25 +0000 Subject: [PATCH 10/12] docs(reliability): Add all reliability features from codebase scan Adds features found across the codebase organized by nines level: - Nine 1: graceful shutdown, usage policy recovery, recovery handler escalation - Nine 2: manifest validator, guardrail config chain - Nine 2.5: session suspend/resume, FSM persistence + backup, session store backfill - Nine 3: rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 70 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 60 insertions(+), 10 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index fa20e765..898eb34f 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -4,13 +4,13 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r ## March of Nines — Current Status -| Nines | Technique | TX Status | Human Review | -|-------|-----------|-----------|--------------| -| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection retries (poll loop), routing correction injection | Retry exhaustion → present failure + options to user | -| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate | Validation failures → surface to user with context. Identity/routing violations → warn or kill per guardrail mode | -| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops | Nudge fires → logged. Deadlock cycles > `autoBreakDepth` → escalate to human. 
Quality exhaustion → present feedback history + ask user | -| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log | Circuit open → notify user. Safe mode escalation → user approval. DLQ recovery → diagnose/present/confirm workflow | -| **4 (99.99%)** | [Roadmap] Retry-with-variation, schema validation, agent classification, observability | Planned — see Reliability Roadmap below | Every action requires human confirmation (see roadmap gates) | +| Nines | Technique | TX Status | +|-------|-----------|-----------| +| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection poll loop, routing correction, graceful shutdown, usage policy recovery, recovery handler escalation | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, manifest validator, guardrail config chain | +| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops, session suspend/resume, FSM state persistence + backup, session store backfill | +| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | +| **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | ### Nine 1 — Basic Error Handling (90%) @@ -22,8 +22,11 @@ Foundational durability. Nothing silently drops. 
| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | | **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | | **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. +**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. ### Nine 2 — Validation & Protocol Enforcement (99%) @@ -36,8 +39,10 @@ Catch bad outputs and protocol violations before they propagate. 
| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | | **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | | **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. ### Nine 2.5 — Self-Healing & Auto-Recovery @@ -45,10 +50,13 @@ Detect stuck states and recover without human intervention where safe. 
| Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work to the next route step; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | | **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | | **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | | **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | +| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | **Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. 
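The deadlock breaker's cycle check can be sketched as follows. This is a simplified model, not the actual `src/queue/deadlock-detector.ts` API: the graph type, function names, and the length-based break rule are illustrative assumptions, with `autoBreakDepth` mirroring the documented default of 3.

```typescript
type AskGraph = Map<string, string[]>; // agent -> agents it is waiting on

// Classic DFS with 3-color marking: unvisited, on current path ("gray"), finished ("black").
function findCycle(graph: AskGraph): string[] | null {
  const color = new Map<string, "gray" | "black">();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    color.set(node, "gray");
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      if (color.get(next) === "gray") {
        // Back edge found: the cycle is everything on the path from `next` onward.
        return path.slice(path.indexOf(next));
      }
      if (!color.has(next)) {
        const cycle = visit(next);
        if (cycle) return cycle;
      }
    }
    path.pop();
    color.set(node, "black");
    return null;
  };

  for (const node of graph.keys()) {
    if (!color.has(node)) {
      const cycle = visit(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// Shallow cycles are auto-broken; deeper ones escalate to the human.
function resolveCycle(
  cycle: string[],
  autoBreakDepth = 3,
): "auto-break" | "escalate" {
  return cycle.length <= autoBreakDepth ? "auto-break" : "escalate";
}
```

Under this sketch, an ask chain A→B→C→A yields a cycle of length 3 and is auto-broken, while a five-agent loop would be escalated with the cycle path available for visualization.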
@@ -272,6 +280,48 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` +### 6. Rate Limiter + +**What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. + +**How it works**: +- Per-endpoint limits with configurable burst capacity +- Automatic bucket cleanup every 5 minutes +- Smooth rate limiting (not hard cutoff) + +**Source**: `src/server/rate-limiter.ts` + +### 7. Worker Pool Backpressure + +**What it does**: Adaptive polling with concurrency limits prevents queue overload. + +**How it works**: +- Polls for work at configurable intervals (default 100ms) +- Respects concurrency limits — won't spawn beyond capacity +- Graceful shutdown drains active workers before terminating + +**Source**: `src/server/worker-pool.ts` + +### 8. Metrics Aggregator + +**What it does**: Per-query metrics collection with token cost tracking. + +**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts. + +**Source**: `src/worker/metrics-aggregator.ts` + +### 9. Worker Lifecycle Tracking + +**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. + +**How it works**: +- Generates unique worker IDs (`agentId-uuid`) +- Tracks parallel execution per agent +- Persists worker state to disk +- Tracks nudge counts and completion frontier + +**Source**: `src/worker/worker-lifecycle.ts` + ## Test Mesh The `reliability-test` mesh is configured with tight thresholds for quick testing: From 6283ee3ea370e008983e53c26e8be442ee3f3852 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 06:20:24 +0000 Subject: [PATCH 11/12] docs(reliability): Add bash guard to Nine 2 reliability features Bash guard (write-gate.ts createBashHook) intercepts Bash redirects (>, >>, tee) and validates target paths against write manifest. Strike system: 1-2 errors with paths, 3+ kills worker. 
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index 898eb34f..55332699 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -7,7 +7,7 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | Nines | Technique | TX Status | |-------|-----------|-----------| | **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection poll loop, routing correction, graceful shutdown, usage policy recovery, recovery handler escalation | -| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, manifest validator, guardrail config chain | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, bash guard, manifest validator, guardrail config chain | | **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops, session suspend/resume, FSM state persistence + backup, session store backfill | | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | @@ -39,10 +39,11 @@ Catch bad outputs and protocol violations before they propagate. 
| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | | **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | | **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | +| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | | **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | | **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. 
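A rough sketch of that strike ladder — illustrative only, not `createBashHook()`'s actual implementation. The helper names and the deliberately naive redirect regex are assumptions (the real hook also handles `tee` and shell quoting):

```typescript
// Extract redirect targets from a shell command (naive: `>` and `>>` only).
function redirectTargets(cmd: string): string[] {
  const targets: string[] = [];
  const re = />{1,2}\s*(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(cmd)) !== null) {
    if (m[1]) targets.push(m[1]);
  }
  return targets;
}

type Verdict = { action: "allow" | "error" | "kill"; strikes: number };

// Strike ladder from the docs: violations 1-2 return an error (allowed paths
// are shown to the agent); the 3rd kills the worker. `/dev/null` is always allowed.
function checkBashWrite(
  cmd: string,
  allowedPrefixes: string[],
  strikes: number,
): Verdict {
  const violations = redirectTargets(cmd).filter(
    (t) => t !== "/dev/null" && !allowedPrefixes.some((p) => t.startsWith(p)),
  );
  if (violations.length === 0) return { action: "allow", strikes };
  const next = strikes + 1;
  return { action: next >= 3 ? "kill" : "error", strikes: next };
}
```

For example, `checkBashWrite("echo hi > /etc/passwd", ["out/"], 2)` reaches the third strike and returns a kill verdict, while the same command at zero strikes only produces an error carrying the allowed paths.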
### Nine 2.5 — Self-Healing & Auto-Recovery From 8ee54c17babd306498e32bff98eca5c8d6ca9182 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 06:39:18 +0000 Subject: [PATCH 12/12] docs(reliability): Consistent format across all nines + extract human review - Add summary table to Nine 3 (matching Nine 1/2/2.5 format) - Add detailed explanations for all Nine 1/2/2.5 features - Extract all human review gates to dedicated HUMAN_REVIEW.md - Restructure roadmap into table + explanations https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/HUMAN_REVIEW.md | 147 ++++++++++++ docs/reliability.md | 560 +++++++++++++++++++++++++++---------------- 2 files changed, 497 insertions(+), 210 deletions(-) create mode 100644 docs/HUMAN_REVIEW.md diff --git a/docs/HUMAN_REVIEW.md b/docs/HUMAN_REVIEW.md new file mode 100644 index 00000000..a6b14dd9 --- /dev/null +++ b/docs/HUMAN_REVIEW.md @@ -0,0 +1,147 @@ +# Human Review Gates — Reliability + +Every reliability feature includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior + +See [reliability.md](./reliability.md) for feature details. 
+ +--- + +## Nine 1 — Basic Error Handling + +### Worker Retries +- When retries exhaust → DLQ entry created → core presents failure to user +- User decides: retry with variation, recover from checkpoint, or drop + +### Injection Poll Loop +- Stale entries (>5min) are dropped but remain available via `tx inbox` +- If file-based fallback activates, user sees pending messages on next interaction + +### Routing Correction +- When routing retries exhaust → escalated to user with full attempt history +- User sees which targets were tried and picks correct one + +### Usage Policy Errors +- Human chooses: retry, skip, modify prompt, or abort +- Full diagnostic context (triggering prompt, recent history) included in ask-human message + +### Recovery Handler +- First 2 recovery requests: automatic guidance with FSM state and valid routes +- 3rd+ request in 60s: escalated to human — agent is repeatedly stuck + +--- + +## Nine 2 — Validation & Protocol Enforcement + +### Parity Gate +- Violations → reminder injected to agent +- If unresolved after reminder → surfaced to user with pending asks list + +### Identity Gate +- Kill events → logged with full reason (agent ID, expected vs actual `from:` field) +- User can audit identity violations via logs + +### Mesh Validator +- Validation errors → block mesh load, user sees what's wrong and how to fix it +- Warnings → logged but don't block (user can review in logs) + +### Manifest Validator +- Validation failures → surfaced to user with missing/invalid paths and responsible agents + +### Bash Guard +- 1-2 violations → error response with allowed paths shown to agent +- 3+ violations → worker killed, logged for forensics +- User can audit bash guard events in logs + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery + +### Nudge Detector +- Nudges are logged and visible in `tx spy` +- Max 1 nudge per agent prevents recovery loops + +### Deadlock Breaker +- Shallow cycles (depth ≤ 3) → auto-broken, logged +- Deep cycles (depth 5+) 
→ escalated to human with cycle visualization (A→B→C→A) +- User decides which agent's ask to drop + +### Stale Message Cleaner +- Stale messages archived with reason — no silent deletion +- User can audit via `tx spy` and review archived messages + +### Quality Iteration Loops +- Max iterations hit → presents feedback history to user +- User decides: retry, accept current output, or drop +- Each iteration's feedback is visible for review + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ + +### Circuit Breaker +- Circuit open → agent skipped, logged with failure count +- Half-open test spawn → user can monitor via `tx mesh health` +- Circuits don't auto-close silently — health dashboard shows state + +### Heartbeat Monitor +- Warn threshold → logged warning (no action) +- Stale threshold → logged stale warning +- Dead threshold → **worker killed**, failure recorded, routed to DLQ +- All events visible in `tx mesh health` with silence duration + +### Dead Letter Queue +- **Recovery always requires human review** (except crash recovery on restart) +- Core agent diagnoses, presents options (resume vs rewind vs drop), gets explicit confirmation +- Available checkpoints shown before any recovery action +- `tx mesh dlq` shows all pending entries with recovery mode and context + +### Checkpoint Log & Rewind-To +- Checkpoint notification: core can surface "Mesh X completed 'build' — checkpoint saved" +- Before rewind-to replay, core presents: which checkpoint, what replays, what's discarded +- Post-replay: result presented for user approval before mesh continues +- **Replay never happens without user choosing a checkpoint** + +### SLI Tracker +- Threshold alerts: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." 
+- Periodic health summary available via `tx mesh health` +- SLI data always visible — never hidden from user + +### Safe Mode +- **Escalation beyond cautious requires user confirmation** when surfaced via core +- Auto-escalation (if enabled) is logged with reason and SLI data +- **Never auto-de-escalates** — human must clear via `resetMesh()` or `resetAll()` +- Core presents: "SLI recovered to 98%. Clear restricted mode for mesh X?" + +--- + +## Roadmap — Nine 4 + +### Retry-With-Variation +- First failure: core reports "Agent X failed (model_error). Retrying with variation: [description]. Retry 1/3." +- Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context") +- Exhausted retries: core presents full retry history with variations tried — user decides next step +- New variation strategies require review before taking effect + +### Output Schema Validation +- Validation failure: core reports "Agent X output failed validation: missing required field 'summary'." +- Before retry with validation feedback, core presents: "Ask agent X to fix? Validation errors: [list]. Or drop?" +- Schema changes in mesh config: core surfaces impact on existing agents +- Partial pass: core shows what passed/failed, user decides accept partial, retry, or drop + +### Critical/Non-Critical Agent Classification +- On mesh load, core can surface classifications: "critical=[planner, builder], non-critical=[linter]" +- Non-critical failure: "Agent 'linter' failed (timeout). Mesh continues. Output from this step missing." +- Repeated non-critical failures: "Agent 'linter' failed 5 times. Promote to critical or disable?" +- Critical failures always stop the mesh and present recovery options + +### Aggregate Observability Dashboard +- Anomaly alerts: "Mesh X failure rate spiked from 2% to 15% in last hour. Category: model_error." +- Cost review gate: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" 
+- Dashboard is passive — all actions from insights go through standard human review diff --git a/docs/reliability.md b/docs/reliability.md index 55332699..4f5938a5 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,8 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +Human review gates for all features are documented in [HUMAN_REVIEW.md](./HUMAN_REVIEW.md). + ## March of Nines — Current Status | Nines | Technique | TX Status | @@ -12,96 +14,295 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | -### Nine 1 — Basic Error Handling (90%) +--- + +## Nine 1 — Basic Error Handling (90%) Foundational durability. Nothing silently drops. 
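As one concrete instance of this layer, the recovery handler's escalation window ("3 requests in 60s", per the table below) can be sketched like this. The class and method names are illustrative assumptions, not the actual `src/core/recovery.ts` API:

```typescript
// Sliding-window escalation: FSM guidance for the first two recovery requests,
// human escalation once an agent asks three or more times inside the window.
class RecoveryTracker {
  private requests = new Map<string, number[]>(); // agentId -> request timestamps (ms)

  constructor(
    private windowMs = 60_000,
    private escalateAt = 3,
  ) {}

  handle(agentId: string, now: number): "guidance" | "escalate" {
    // Drop requests that have aged out of the window, then record this one.
    const recent = (this.requests.get(agentId) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.requests.set(agentId, recent);
    return recent.length >= this.escalateAt ? "escalate" : "guidance";
  }
}
```

Outside the window the counter effectively resets: a request arriving 70 seconds after two earlier ones gets guidance again rather than escalating, matching the "resets counter outside escalation window" behavior.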
| Feature | What It Does | Where | |---------|-------------|-------| -| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | -| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | -| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | -| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | -| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | -| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | -| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | 
Tracks recovery requests per agent, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | + +### SQLite WAL Mode + +**What it does**: Prevents queue corruption on crash via Write-Ahead Logging. + +**How it works**: +- Enables WAL mode (`journal_mode=WAL`) on the SQLite message queue at init +- All writes are logged to WAL file before committing to main database +- Guarantees queue state is recoverable even if process crashes mid-write +- Allows concurrent readers while writes are in flight + +### Worker Retries (3x) + +**What it does**: Auto-retries failed workers before routing to DLQ. + +**How it works**: +- Each worker has a state machine tracking retry attempts +- On error, checks `canTransition('retry')` before respawning +- Differentiates retriable errors (crashes, model overload) vs non-retriable (suspension, max-turns, abort) +- After max retries exhausted, routes to Dead Letter Queue for recovery + +### Injection Poll Loop + +**What it does**: Ensures messages reach the core Claude session even when it's busy. + +**How it works**: +- Maintains an in-memory queue of messages waiting for injection into tmux +- Polls every 2s (`INJECTION_POLL_MS`) checking if Claude is idle, then injects +- Drops stale entries pending >5 minutes (they're available via `tx inbox`) +- Falls back to file-based delivery (`pending-for-core.json`) if active injection fails + +### Routing Correction Injection + +**What it does**: Recovers from bad routing by teaching the agent valid targets. 
+ +**How it works**: +- Detects messages targeting non-existent meshes/agents, increments retry counter per sender→target pair +- Injects corrective message back to sender listing valid available targets (up to max retries) +- After max retries exceeded, escalates to human via `ask-human` message +- Supports strict mode (block immediately) and warning mode (allow + notify) per guardrail config + +### Graceful Worker Pool Shutdown + +**What it does**: Prevents orphaned workers on shutdown. + +**How it works**: +- Sets `running = false` to prevent new spawns, stops polling loop +- Collects all active worker promises and awaits completion via `Promise.all()` +- Logs count of in-flight workers being drained + +### Usage Policy Error Handling + +**What it does**: Captures false-positive usage policy errors with full diagnostic context. + +**How it works**: +- Detects usage policy errors from Claude API via pattern matching +- Captures diagnostic context: triggering prompt, recent history, in-progress tool calls, agent/mesh info +- Writes `ask-human` message to core with full context for human decision (retry, skip, modify prompt, abort) +- Preserves session ID for potential resume -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. +### Recovery Handler with Escalation -### Nine 2 — Validation & Protocol Enforcement (99%) +**What it does**: Detects repeatedly stuck agents and escalates to human. 
+ +**How it works**: +- Intercepts messages routed to `system/recovery` +- Tracks frequency per agent with time window; resets counter outside escalation window +- First 2 attempts: returns guidance with current FSM state, pending asks, and valid exit routes +- 3rd+ attempt: escalates to `core/core` for human intervention + +--- + +## Nine 2 — Validation & Protocol Enforcement (99%) Catch bad outputs and protocol violations before they propagate. | Feature | What It Does | Where | |---------|-------------|-------| -| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | -| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | -| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | -| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | -| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | -| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. 
Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | -| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | -| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` | +| **Write gate** | Controls which paths agents can write to based on manifest | `src/worker/write-gate.ts` | +| **Bash guard** | PreToolUse hook blocks dangerous Bash patterns outside project boundary | `src/worker/bash-guard.ts` | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | + +### Parity Gate + +**What it does**: Prevents agents from completing a mesh while unanswered questions remain. 
+ +**How it works**: +- Tracks pending asks (questions sent to human boundary `core/core`) in SQLite queue +- Validates responses from `core/core` have a matching pending ask by msg-id (fallback to agent-level matching) +- Blocks `task-complete` messages with unresolved asks; deletes offending file and emits `parity-reminder` +- Terminal-by-default: asks to `core/core` require parity; agent-to-agent asks don't trigger tracking + +### FSM Validation + +**What it does**: Enforces state machine rules before message routing. + +**How it works**: +- Type-safe state transitions with guard validation and middleware hooks (pre/post) +- Consumer calls `validateMessageWithFSM()` on all incoming messages BEFORE type-specific routing +- Centralized validation ensures all routing respects mesh-defined FSM rules +- Emits transition history and immutable state snapshots for replay/debugging + +### Mesh Validator -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. +**What it does**: Catches config errors before a mesh can load. -### Nine 2.5 — Self-Healing & Auto-Recovery +**How it works**: +- Static `validate()` checks mesh config structure, required fields, agent definitions, routing rules, FSM definitions, and manifest entries +- Validates field types, agent presence, entry/exit points, task distribution config, guardrail overrides, and parallelism blocks +- Returns `ValidationResult` with errors and warnings — errors block load, warnings log +- Catches typos early (e.g., agent routing to nonexistent agents) + +### Identity Gate + +**What it does**: Prevents agents from impersonating other agents. 
+ +**How it works**: +- PreToolUse hook intercepts Write tool calls to `.ai/tx/msgs/` +- Extracts `from:` field from message YAML frontmatter, compares against expected agent identity +- Enforces fully-qualified names (rejects bare `worker` when agent is `dev/worker`) to prevent cross-mesh routing leaks +- Strike counter with configurable kill threshold; strict (block) vs warning (allow + feedback) modes + +### Write Gate + +**What it does**: Restricts file writes to declared manifest paths. + +**How it works**: +- PreToolUse hooks intercept Write/Edit/NotebookEdit tools and Bash redirects (`>`, `>>`, `tee`) +- Validates target paths against agent's declared allowed paths from manifest +- Auto-exempts `.ai/tx/msgs/` and `.ai/tx/logs/`; allows `/dev/null` +- Tracks file-tool and bash-redirect strikes separately; kill threshold on accumulated violations + +### Bash Guard + +**What it does**: Docker-like isolation — full Bash inside project, can't escape. + +**How it works**: +- Two security layers: workDir boundary enforcement + catastrophic damage prevention +- Blocks all filesystem operations (read/write/symlink) outside project directory +- Blocks privilege escalation, root destruction, system service manipulation, raw disk ops +- Network access explicitly allowed (Docker parity): curl, wget, ssh, npm publish are safe + +### Manifest Validator + +**What it does**: Validates agent artifacts against declared manifest paths. + +**How it works**: +- Resolves manifest variable references (game-id, campaign-id, etc.) from `session.yaml` with caching +- Builds path context from mesh workspace config (locations, variables, source mappings) +- `validateAgentArtifacts()` checks agent reads/writes against declared manifest entries +- `findWriters()` identifies responsible agents for given file IDs (used in error messages) + +### Guardrail Config Chain + +**What it does**: Unified enforcement with flexible per-agent overrides. 
+ +**How it works**: +- Loads global guardrails from `.ai/tx/data/config.yaml` and mesh-local overrides from mesh config +- Resolution chain: agent-level > mesh-level > global agent > global mesh > global default > hardcoded default +- Each guardrail has `strict` and `warning` flags that resolve independently +- Supports backward-compatible bare numbers or structured `{strict, warning, limit}` objects + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery Detect stuck states and recover without human intervention where safe. | Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | -| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | -| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | -| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | -| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | -| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | -| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes 
and writes recovery task | `src/worker/nudge-detector.ts` | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery | `src/worker/session-manager.ts` | +| **FSM state persistence + backup** | Atomic backup-before-update; auto-restores from backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills from filesystem on startup | `src/session/session-store.ts` | -**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. +### Nudge Detector -## Quick Start +**What it does**: Auto-recovers from missed route transitions. 
-```bash -# View reliability dashboard -tx mesh health +**How it works**: +- Scheduled check runs after agent completion (15s delay), evaluates if routing targets received work +- Resolves expected targets using `DispatchRouter` with agent's declared routing rules (default outcome = `complete`) +- Skips terminal agents (core/core targets) and agents with already-sent messages +- Summarizes dead agent output with Haiku and writes recovery task via SystemMessageWriter +- Limits nudges per agent to prevent loops -# View per-mesh reliability -tx mesh health reliability-test +### Deadlock Breaker -# View dead letter queue -tx mesh dlq +**What it does**: Detects and breaks circular wait loops between agents. -# Recover failed work -tx mesh recover reliability-test -``` +**How it works**: +- Periodic DFS-based cycle detection in pending asks graph (~every 60s) using 3-color marking +- Builds adjacency graph from queue pending asks; identifies circular chains (A→B→C→A) +- Auto-breaks cycles up to `autoBreakDepth` (default 3) +- Escalates deeper cycles (5+) to human via SystemMessageWriter with cycle visualization -## Configuration +### Stale Message Cleaner -Set reliability thresholds in `.ai/tx/data/config.yaml`: +**What it does**: Garbage collects unprocessed messages from crashed workers or typos. 
-```yaml -reliability: - circuitBreaker: - failureThreshold: 3 # Failures before circuit opens - cooldownMs: 30000 # How long circuit stays open - heartbeat: - warnMs: 60000 # Warn after 60s silence - staleMs: 120000 # Stale after 120s - deadMs: 300000 # Kill worker after 300s silence - safeMode: - autoEscalate: true # Auto-restrict on SLI drop - cautiousThreshold: 0.95 - restrictedThreshold: 0.90 - lockdownThreshold: 0.80 - dlq: - maxRetries: 3 -``` +**How it works**: +- Periodic scanner (every 5 minutes) checks queue messages against TTL (30 minutes default) +- Archives stale messages to `stale_messages` table with reason: `ttl_expired`, `no_target_mesh`, or `manual` +- Actions configurable: `warn`, `archive`, or `delete` +- Tracks known meshes to identify messages routed to non-existent targets; preserves audit trail + +### Quality Iteration Loops + +**What it does**: Validates output quality before routing, with iterative refinement. -## Features +**How it works**: +- Post-hook runs quality stack on worker output after message reception +- Runs gates (required + suggested) on output; returns `{passed, feedback}` +- Three failure modes: `halt` (stop), `loop` (retry if under max iterations), `skip` (allow through) +- Injects feedback messages on failure for agent self-correction + +### Session Suspend/Resume -### 1. Circuit Breaker +**What it does**: Non-destructive pause for external input with crash recovery. + +**How it works**: +- Suspends sessions (kills worker, saves state to SQLite) when agent hits ask-human or await-response boundaries +- Buffers incoming responses while awaiting multiple targets (tracks `pendingResponseCount`) +- Persists to `suspended_sessions` table with reason, target agents, and hook context +- Dispatcher handles resume: loading state, creating new runner, wiring event handlers + +### FSM State Persistence + Backup + +**What it does**: Durable state across crashes with automatic corruption recovery. 
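The quality loop's three failure modes reduce to a small decision function; a sketch under assumed types (illustrative, not the real hook API):

```typescript
// Hypothetical sketch of the quality-loop decision: what happens when a gate
// evaluates worker output. Types and names are illustrative.

type FailureMode = "halt" | "loop" | "skip";

interface GateResult { passed: boolean; feedback: string }

type Decision =
  | { action: "route" }                   // gates passed → continue routing
  | { action: "retry"; feedback: string } // loop mode, iteration budget left
  | { action: "halt" }                    // halt mode, or loop budget exhausted
  | { action: "route_anyway" };           // skip mode lets the output through

function decide(
  result: GateResult,
  mode: FailureMode,
  iteration: number,
  maxIterations: number,
): Decision {
  if (result.passed) return { action: "route" };
  switch (mode) {
    case "skip": return { action: "route_anyway" };
    case "halt": return { action: "halt" };
    case "loop":
      return iteration < maxIterations
        ? { action: "retry", feedback: result.feedback } // inject feedback, re-run agent
        : { action: "halt" };                            // budget exhausted → human decides
  }
}
```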
+ +**How it works**: +- SQLite tables: `mesh_state` (current) and `mesh_state_backup` (versioned backups) +- `saveState()` creates backup of previous state before updating (atomic via transaction) +- On corruption (JSON parse error), `loadState()` auto-restores from latest backup +- Indexes on `mesh_name + created_at` for efficient backup lookup + +### Session Store with Backfill + +**What it does**: Persistent session metadata with full-text search. + +**How it works**: +- SQLite `sessions` table stores metadata: agent_id, mesh_id, timestamps, transcript path, message counts, final status +- FTS5 virtual table `sessions_fts` enables full-text search on content, headline, tags +- Prepared statements for fast CRUD; cache for summary types (e.g., `file_changes`, `decisions`) +- Backfills existing sessions from disk on startup (migration-friendly) + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ (99.9%) + +Active monitoring, automatic circuit-breaking, and dead letter recovery. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Circuit breaker** | Stops spawning agents that keep failing; auto-recovers after cooldown | `src/reliability/circuit-breaker.ts` | +| **Heartbeat monitor** | Detects stuck workers via silence thresholds; kills dead workers | `src/reliability/heartbeat-monitor.ts` | +| **Dead letter queue** | Captures failed work with session context for recovery | `src/reliability/dead-letter-queue.ts` | +| **SLI tracker** | Measures success rate, failure categories, MTTR, nines level | `src/reliability/sli-tracker.ts` | +| **Safe mode** | Restricts agent capabilities when reliability drops | `src/reliability/safe-mode.ts` | +| **Checkpoint log** | Saves session IDs at FSM transitions; enables rewind-to recovery | `src/reliability/checkpoint-log.ts` | +| **Rate limiter** | Token bucket rate limiting for server endpoints | `src/server/rate-limiter.ts` | +| **Worker pool backpressure** | Adaptive polling with concurrency limits | 
`src/server/worker-pool.ts` | +| **Metrics aggregator** | Per-query metrics with token cost tracking | `src/worker/metrics-aggregator.ts` | +| **Worker lifecycle tracking** | Unique instance IDs for deduplication and debugging | `src/worker/worker-lifecycle.ts` | + +### Circuit Breaker **What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. @@ -122,7 +323,7 @@ tx mesh health # Shows open/half_open circuits tx spy # Watch for reliability:blocked activity ``` -### 2. Heartbeat Monitor +### Heartbeat Monitor **What it does**: Detects stuck workers and kills them. @@ -142,7 +343,7 @@ tx mesh health # Shows unhealthy agents with silence duration tx logs --component reliability # Heartbeat kill events ``` -### 3. Dead Letter Queue (DLQ) +### Dead Letter Queue (DLQ) **What it does**: Captures failed work with enough context to recover it. @@ -152,22 +353,16 @@ tx logs --component reliability # Heartbeat kill events - `manual`: Retries exhausted → needs human decision. **How entries are created**: -- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with sessionId, messages sent, and failure category - Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion **How recovery works**: -**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. - -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) - -2. 
**Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. - -3. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. - +1. **Automatic on startup**: `tx start` calls `recoverAll()` — recovers pending session_resume and requeue entries from the previous run (crash recovery only). +2. **Human-initiated via core agent** (preferred): User investigates via `tx mesh health` + `tx mesh dlq`, picks recovery strategy, core writes recovery message. +3. **CLI**: `tx mesh recover <mesh>` sends SIGUSR2 to running dispatcher. Shows available checkpoints first. 4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: <state>`) to trigger DLQ recovery. - -5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If dispatcher isn't running, `tx mesh recover` writes a recovery message to msgs dir for next start. **Observe it**: ```bash @@ -182,13 +377,13 @@ tx mesh dlq clear # GC recovered entries **What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. **How checkpoints are saved**: -- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Every FSM mesh state transition saves the completing agent's session ID to SQLite - Checkpoint key: `mesh_name + state_name` → `session_id` -- Multiple checkpoints per state are kept (most recent wins on lookup) +- Multiple checkpoints per state kept (most recent wins on lookup) **How rewind-to works**: -When recovering from the DLQ, you can specify `rewind-to: <state>` to use a checkpoint's session ID instead of the crash-point session.
This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. +When recovering from the DLQ, specify `rewind-to: <state>` to use a checkpoint's session ID instead of the crash-point session. The recovered worker resumes from after that state completed — skipping all bad work that happened after. ``` FSM: analyze → build → verify → complete @@ -232,7 +427,7 @@ Available checkpoints (use --rewind-to=<state>): **When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). -### 4. SLI Tracker +### SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -254,7 +449,7 @@ tx mesh health my-mesh # Per-agent success rates tx mesh health --json # Full snapshot ``` -### 5. Safe Mode +### Safe Mode **What it does**: Restricts agent capabilities when reliability drops. @@ -281,7 +476,7 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` -### 6. Rate Limiter +### Rate Limiter **What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. @@ -290,9 +485,7 @@ tx spy # Watch safe-mode:blocked activity events - Automatic bucket cleanup every 5 minutes - Smooth rate limiting (not hard cutoff) -**Source**: `src/server/rate-limiter.ts` - -### 7. Worker Pool Backpressure +### Worker Pool Backpressure **What it does**: Adaptive polling with concurrency limits prevents queue overload. @@ -301,19 +494,18 @@ tx spy # Watch safe-mode:blocked activity events - Respects concurrency limits — won't spawn beyond capacity - Graceful shutdown drains active workers before terminating -**Source**: `src/server/worker-pool.ts` - -### 8. Metrics Aggregator +### Metrics Aggregator **What it does**: Per-query metrics collection with token cost tracking. -**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts.
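The token-bucket behavior described for the rate limiter can be sketched as follows (a minimal stand-in, not the actual `src/server/rate-limiter.ts` implementation):

```typescript
// Minimal token-bucket sketch: a burst allowance that refills at a steady rate.
// Illustrative only; the real limiter lives in src/server/rate-limiter.ts.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // max burst size
    private refillPerSec: number, // steady-state request rate
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request is allowed, false if rate-limited.
  tryRemove(now: number = Date.now()): boolean {
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // smooth back-pressure, not a hard cutoff window
  }
}

const bucket = new TokenBucket(2, 1, 0); // burst of 2, then 1 request/sec
bucket.tryRemove(0);    // true
bucket.tryRemove(0);    // true
bucket.tryRemove(0);    // false — burst spent
bucket.tryRemove(1500); // true — 1.5 tokens refilled after 1.5s
```

Because tokens accrue continuously, throughput degrades smoothly under load instead of slamming shut at a window boundary.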
- -**Source**: `src/worker/metrics-aggregator.ts` +**How it works**: +- Tracks input/output tokens, duration, cost per query +- Aggregate totals for worker lifetime +- Tool call counts per worker -### 9. Worker Lifecycle Tracking +### Worker Lifecycle Tracking -**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. +**What it does**: Tracks parallel worker execution with unique instance IDs. **How it works**: - Generates unique worker IDs (`agentId-uuid`) @@ -321,27 +513,28 @@ tx spy # Watch safe-mode:blocked activity events - Persists worker state to disk - Tracks nudge counts and completion frontier -**Source**: `src/worker/worker-lifecycle.ts` - -## Test Mesh - -The `reliability-test` mesh is configured with tight thresholds for quick testing: -- Circuit breaker opens after 2 failures (not 3) -- Heartbeat kills after 120s (not 300s) -- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) - -```bash -# Run the test mesh -tx msg "Write a hello world function" --to reliability-test/planner +--- -# Monitor reliability during execution -tx mesh health reliability-test +## Configuration -# If failures occur, check DLQ -tx mesh dlq reliability-test +Set reliability thresholds in `.ai/tx/data/config.yaml`: -# Recover failed work -tx mesh recover reliability-test +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 ``` ## Front-Matter Options @@ -367,6 +560,27 @@ Agents can interact with reliability features via message front-matter: | `tx mesh recover <mesh> --rewind-to=<state>` | Recover rewinding to a specific FSM state | | `tx mesh
recover --all` | Recover all pending DLQ entries | +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + ## Architecture ``` @@ -395,121 +609,47 @@ Agents can interact with reliability features via message front-matter: └────────────┘ └───────────┘ └───────────┘ ``` -## Reliability Roadmap — Human Review Gates - -Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. - -### Priority 1: Default-On Checkpoints + Replay - -**Impact**: 10x — turns N-step recovery into 1-step problem -**Effort**: Medium - -**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. - -**Human review steps**: -1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." -2. **Replay approval**: Before any rewind-to replay, core presents: - - Which checkpoint to rewind to - - What work will be replayed (states after the checkpoint) - - What work will be discarded (failed states) -3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. - -**Never automatic**: Replay does not happen without the user choosing a checkpoint. 
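The checkpoint scheme described earlier (`mesh_name + state_name` → `session_id`, most recent wins) can be sketched with an in-memory stand-in for the SQLite table (illustrative only; the real store is `src/reliability/checkpoint-log.ts`):

```typescript
// In-memory stand-in for the SQLite checkpoint table. Hypothetical names;
// shows only the save + "most recent wins" lookup behavior.

interface Checkpoint {
  mesh: string;
  state: string;
  sessionId: string;
  createdAt: number;
}

const checkpoints: Checkpoint[] = [];

function saveCheckpoint(mesh: string, state: string, sessionId: string, now: number): void {
  checkpoints.push({ mesh, state, sessionId, createdAt: now });
}

// rewind-to picks the newest checkpoint recorded for mesh + state.
function lookupRewind(mesh: string, state: string): string | undefined {
  return checkpoints
    .filter((c) => c.mesh === mesh && c.state === state)
    .sort((a, b) => b.createdAt - a.createdAt)[0]?.sessionId;
}

saveCheckpoint("reliability-fsm", "build", "sess-001", 1);
saveCheckpoint("reliability-fsm", "build", "sess-002", 2); // retry of the same state
// lookupRewind("reliability-fsm", "build") → "sess-002" (most recent wins)
```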
- ---- +## Roadmap — Nine 4 (99.99%) -### Priority 2: Reliability Metrics Table + Tracking +| Priority | Feature | Impact | Effort | +|----------|---------|--------|--------| +| 1 | Retry-with-variation | 3-5x retry success improvement | Low | +| 2 | Output schema validation | Catches semantic failures early | Medium | +| 3 | Critical/non-critical agent classification | Prevents cascade from optional steps | Low | +| 4 | Aggregate observability dashboard | Finds the long-tail 0.01% | Medium | -**Impact**: Foundation for everything else -**Effort**: Low - -**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. - -**Human review steps**: -1. **Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." -2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." -3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" -4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. - -**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. - ---- - -### Priority 3: Retry-With-Variation on Routing/Protocol Failures - -**Impact**: 3-5x improvement on retry success -**Effort**: Low +### Retry-With-Variation **What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. -**Human review steps**: -1. 
**First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3." -2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model"). -3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step. -4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect. +**How it will work**: +- First failure retries with variation: simplified prompt, dropped optional context, or fallback model +- Each retry logs what changed for transparency +- Exhausted retries present full retry history with variations tried -**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask. - ---- - -### Priority 4: Output Schema Validation - -**Impact**: Catches semantic failures early -**Effort**: Medium +### Output Schema Validation **What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream. -**Human review steps**: -1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars." -2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?" -3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates." -4. 
**Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. - -**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. - ---- - -### Priority 5: Critical / Non-Critical Agent Classification +**How it will work**: +- Mesh config defines `output_schema` per agent +- Post-completion hook validates output against schema +- Partial pass handling: presents what passed and what failed for human decision -**Impact**: Prevents cascade from optional steps -**Effort**: Low +### Critical/Non-Critical Agent Classification -**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. +**What it does**: Agents classified as `critical` (failure blocks mesh) or `non-critical` (failure logged, mesh continues). Prevents optional agents from taking down the whole workflow. -**Human review steps**: -1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." -2. **Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." -3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" -4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). 
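The output-schema check from the bullets above might look like this (hypothetical sketch — the feature is roadmap, not shipped code):

```typescript
// Illustrative required-field validation with partial-pass reporting.
// Names and shapes are hypothetical; output_schema support is roadmap.

interface SchemaReport {
  passed: string[]; // required fields that are present
  failed: string[]; // required fields that are missing
  ok: boolean;
}

function validateOutput(
  output: Record<string, unknown>,
  requiredFields: string[],
): SchemaReport {
  const passed = requiredFields.filter((f) => output[f] !== undefined);
  const failed = requiredFields.filter((f) => output[f] === undefined);
  return { passed, failed, ok: failed.length === 0 };
}

const report = validateOutput(
  { summary: "built the parser", files: ["parser.ts"] },
  ["summary", "files", "tests"],
);
// report.ok === false, report.failed === ["tests"] → surface to the human:
// accept partial, retry with validation feedback, or drop
```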
+**How it will work**: +- Agent config adds `critical: true|false` field (default: true) +- Non-critical failures logged and surfaced but don't block mesh +- Repeated non-critical failures prompt promotion decision -**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. - ---- - -### Priority 6: Aggregate Observability Dashboard - -**Impact**: Needed to find the long-tail 0.01% -**Effort**: Medium +### Aggregate Observability Dashboard **What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. -**Human review steps**: -1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." -2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." -3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" -4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. - -**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). - ---- - -### Human Review Principle - -Across all 6 priorities, the same principle applies: - -> **The system does work. 
The human makes decisions.** - -- Retries within limits → automatic (but visible) -- Recovery, replay, escalation → always human-approved -- Failures → always surfaced with context and options -- No silent state changes that affect mesh behavior +**How it will work**: +- Anomaly detection: sudden SLI drops, unusual failure patterns, cost spikes +- Trend data: success rates, DLQ utilization, MTTR over time +- Cost estimation before expensive recovery operations
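For reference, the "nines level" the SLI tracker reports can be derived from a success rate roughly like this (hypothetical helper; the real metric lives in `src/reliability/sli-tracker.ts`):

```typescript
// Illustrative mapping from a success rate to a "nines" level.
// Hypothetical helper, not the actual sli-tracker.ts implementation.

function ninesLevel(successes: number, total: number): number {
  if (total === 0) return 0;             // no data yet
  const failureRate = (total - successes) / total;
  if (failureRate <= 0) return Infinity; // no observed failures yet
  // 0.9 → 1 nine, 0.99 → 2, 0.999 → 3, 0.9999 → 4
  return Math.floor(-Math.log10(failureRate) + 1e-9); // epsilon absorbs float rounding
}

ninesLevel(90, 100);     // 1 — one nine (90%)
ninesLevel(999, 1000);   // 3 — three nines (99.9%)
ninesLevel(9999, 10000); // 4 — four nines, the target of this roadmap
```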