diff --git a/docs/HUMAN_REVIEW.md b/docs/HUMAN_REVIEW.md new file mode 100644 index 00000000..a6b14dd9 --- /dev/null +++ b/docs/HUMAN_REVIEW.md @@ -0,0 +1,147 @@ +# Human Review Gates — Reliability + +Every reliability feature includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior + +See [reliability.md](./reliability.md) for feature details. + +--- + +## Nine 1 — Basic Error Handling + +### Worker Retries +- When retries exhaust → DLQ entry created → core presents failure to user +- User decides: retry with variation, recover from checkpoint, or drop + +### Injection Poll Loop +- Stale entries (>5min) are dropped but remain available via `tx inbox` +- If file-based fallback activates, user sees pending messages on next interaction + +### Routing Correction +- When routing retries exhaust → escalated to user with full attempt history +- User sees which targets were tried and picks correct one + +### Usage Policy Errors +- Human chooses: retry, skip, modify prompt, or abort +- Full diagnostic context (triggering prompt, recent history) included in ask-human message + +### Recovery Handler +- First 2 recovery requests: automatic guidance with FSM state and valid routes +- 3rd+ request in 60s: escalated to human — agent is repeatedly stuck + +--- + +## Nine 2 — Validation & Protocol Enforcement + +### Parity Gate +- Violations → reminder injected to agent +- If unresolved after reminder → surfaced to user with pending asks list + +### Identity Gate +- Kill events → logged with full reason (agent ID, expected vs actual `from:` field) +- User can audit identity violations via logs + +### Mesh Validator +- Validation errors → block mesh 
load, user sees what's wrong and how to fix it +- Warnings → logged but don't block (user can review in logs) + +### Manifest Validator +- Validation failures → surfaced to user with missing/invalid paths and responsible agents + +### Bash Guard +- 1-2 violations → error response with allowed paths shown to agent +- 3+ violations → worker killed, logged for forensics +- User can audit bash guard events in logs + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery + +### Nudge Detector +- Nudges are logged and visible in `tx spy` +- Max 1 nudge per agent prevents recovery loops + +### Deadlock Breaker +- Shallow cycles (depth ≤ 3) → auto-broken, logged +- Deep cycles (depth 5+) → escalated to human with cycle visualization (A→B→C→A) +- User decides which agent's ask to drop + +### Stale Message Cleaner +- Stale messages archived with reason — no silent deletion +- User can audit via `tx spy` and review archived messages + +### Quality Iteration Loops +- Max iterations hit → presents feedback history to user +- User decides: retry, accept current output, or drop +- Each iteration's feedback is visible for review + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ + +### Circuit Breaker +- Circuit open → agent skipped, logged with failure count +- Half-open test spawn → user can monitor via `tx mesh health` +- Circuits don't auto-close silently — health dashboard shows state + +### Heartbeat Monitor +- Warn threshold → logged warning (no action) +- Stale threshold → logged stale warning +- Dead threshold → **worker killed**, failure recorded, routed to DLQ +- All events visible in `tx mesh health` with silence duration + +### Dead Letter Queue +- **Recovery always requires human review** (except crash recovery on restart) +- Core agent diagnoses, presents options (resume vs rewind vs drop), gets explicit confirmation +- Available checkpoints shown before any recovery action +- `tx mesh dlq` shows all pending entries with recovery mode and context + +### Checkpoint 
Log & Rewind-To +- Checkpoint notification: core can surface "Mesh X completed 'build' — checkpoint saved" +- Before rewind-to replay, core presents: which checkpoint, what replays, what's discarded +- Post-replay: result presented for user approval before mesh continues +- **Replay never happens without user choosing a checkpoint** + +### SLI Tracker +- Threshold alerts: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." +- Periodic health summary available via `tx mesh health` +- SLI data always visible — never hidden from user + +### Safe Mode +- **Escalation beyond cautious requires user confirmation** when surfaced via core +- Auto-escalation (if enabled) is logged with reason and SLI data +- **Never auto-de-escalates** — human must clear via `resetMesh()` or `resetAll()` +- Core presents: "SLI recovered to 98%. Clear restricted mode for mesh X?" + +--- + +## Roadmap — Nine 4 + +### Retry-With-Variation +- First failure: core reports "Agent X failed (model_error). Retrying with variation: [description]. Retry 1/3." +- Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context") +- Exhausted retries: core presents full retry history with variations tried — user decides next step +- New variation strategies require review before taking effect + +### Output Schema Validation +- Validation failure: core reports "Agent X output failed validation: missing required field 'summary'." +- Before retry with validation feedback, core presents: "Ask agent X to fix? Validation errors: [list]. Or drop?" 
+- Schema changes in mesh config: core surfaces impact on existing agents +- Partial pass: core shows what passed/failed, user decides accept partial, retry, or drop + +### Critical/Non-Critical Agent Classification +- On mesh load, core can surface classifications: "critical=[planner, builder], non-critical=[linter]" +- Non-critical failure: "Agent 'linter' failed (timeout). Mesh continues. Output from this step missing." +- Repeated non-critical failures: "Agent 'linter' failed 5 times. Promote to critical or disable?" +- Critical failures always stop the mesh and present recovery options + +### Aggregate Observability Dashboard +- Anomaly alerts: "Mesh X failure rate spiked from 2% to 15% in last hour. Category: model_error." +- Cost review gate: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" +- Dashboard is passive — all actions from insights go through standard human review diff --git a/docs/reliability.md b/docs/reliability.md index 55332699..4f5938a5 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,8 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +Human review gates for all features are documented in [HUMAN_REVIEW.md](./HUMAN_REVIEW.md). + ## March of Nines — Current Status | Nines | Technique | TX Status | @@ -12,96 +14,295 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | -### Nine 1 — Basic Error Handling (90%) +--- + +## Nine 1 — Basic Error Handling (90%) Foundational durability. Nothing silently drops. 
| Feature | What It Does | Where | |---------|-------------|-------| -| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | -| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | -| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | -| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | -| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | -| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | -| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | 
Tracks recovery requests per agent, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | + +### SQLite WAL Mode + +**What it does**: Prevents queue corruption on crash via Write-Ahead Logging. + +**How it works**: +- Enables WAL mode (`journal_mode=WAL`) on the SQLite message queue at init +- All writes are logged to WAL file before committing to main database +- Guarantees queue state is recoverable even if process crashes mid-write +- Allows concurrent readers while writes are in flight + +### Worker Retries (3x) + +**What it does**: Auto-retries failed workers before routing to DLQ. + +**How it works**: +- Each worker has a state machine tracking retry attempts +- On error, checks `canTransition('retry')` before respawning +- Differentiates retriable errors (crashes, model overload) vs non-retriable (suspension, max-turns, abort) +- After max retries exhausted, routes to Dead Letter Queue for recovery + +### Injection Poll Loop + +**What it does**: Ensures messages reach the core Claude session even when it's busy. + +**How it works**: +- Maintains an in-memory queue of messages waiting for injection into tmux +- Polls every 2s (`INJECTION_POLL_MS`) checking if Claude is idle, then injects +- Drops stale entries pending >5 minutes (they're available via `tx inbox`) +- Falls back to file-based delivery (`pending-for-core.json`) if active injection fails + +### Routing Correction Injection + +**What it does**: Recovers from bad routing by teaching the agent valid targets. 
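In spirit, the correction loop is a small decision function keyed on the sender→target pair. A minimal sketch (all names hypothetical — the real logic lives in `handleRoutingError()` in `src/worker/dispatcher.ts`):

```typescript
// Hypothetical sketch of per-pair routing correction — not the real tx API.
type RoutingAction =
  | { kind: "deliver" }
  | { kind: "correct"; note: string }   // corrective prompt injected back to sender
  | { kind: "escalate"; note: string }; // ask-human after retries exhaust

const attempts = new Map<string, number>(); // "sender→target" → retry count

function routeOrCorrect(
  sender: string,
  target: string,
  validTargets: string[],
  maxRetries = 3,
): RoutingAction {
  const key = `${sender}→${target}`;
  if (validTargets.includes(target)) {
    attempts.delete(key); // a good route clears the counter
    return { kind: "deliver" };
  }
  const n = (attempts.get(key) ?? 0) + 1;
  attempts.set(key, n);
  if (n > maxRetries) {
    return { kind: "escalate", note: `routing to '${target}' failed ${n - 1} times` };
  }
  return { kind: "correct", note: `unknown target '${target}'; valid: ${validTargets.join(", ")}` };
}
```

The real dispatcher additionally distinguishes strict mode (block immediately) from warning mode (allow + notify) per guardrail config.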
+ +**How it works**: +- Detects messages targeting non-existent meshes/agents, increments retry counter per sender→target pair +- Injects corrective message back to sender listing valid available targets (up to max retries) +- After max retries exceeded, escalates to human via `ask-human` message +- Supports strict mode (block immediately) and warning mode (allow + notify) per guardrail config + +### Graceful Worker Pool Shutdown + +**What it does**: Prevents orphaned workers on shutdown. + +**How it works**: +- Sets `running = false` to prevent new spawns, stops polling loop +- Collects all active worker promises and awaits completion via `Promise.all()` +- Logs count of in-flight workers being drained + +### Usage Policy Error Handling + +**What it does**: Captures false-positive usage policy errors with full diagnostic context. + +**How it works**: +- Detects usage policy errors from Claude API via pattern matching +- Captures diagnostic context: triggering prompt, recent history, in-progress tool calls, agent/mesh info +- Writes `ask-human` message to core with full context for human decision (retry, skip, modify prompt, abort) +- Preserves session ID for potential resume -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. +### Recovery Handler with Escalation -### Nine 2 — Validation & Protocol Enforcement (99%) +**What it does**: Detects repeatedly stuck agents and escalates to human. 
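The escalation rule ("guidance twice, then human") reduces to a sliding-window counter. A hedged sketch, with hypothetical names standing in for the internals of `src/core/recovery.ts`:

```typescript
// Illustrative sliding-window escalation counter (names hypothetical).
type RecoveryAction = "guidance" | "escalate";

const windows = new Map<string, number[]>(); // agentId → recovery-request timestamps (ms)

function onRecoveryRequest(
  agentId: string,
  nowMs: number,
  windowMs = 60_000, // the 60s escalation window
  escalateAt = 3,    // 3rd request in the window → human
): RecoveryAction {
  // keep only requests inside the window, then record this one
  const recent = (windows.get(agentId) ?? []).filter((t) => nowMs - t < windowMs);
  recent.push(nowMs);
  windows.set(agentId, recent);
  return recent.length >= escalateAt ? "escalate" : "guidance";
}
```

Requests outside the window age out naturally, which is why the counter "resets" for an agent that was stuck an hour ago but is healthy now.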
+ +**How it works**: +- Intercepts messages routed to `system/recovery` +- Tracks frequency per agent with time window; resets counter outside escalation window +- First 2 attempts: returns guidance with current FSM state, pending asks, and valid exit routes +- 3rd+ attempt: escalates to `core/core` for human intervention + +--- + +## Nine 2 — Validation & Protocol Enforcement (99%) Catch bad outputs and protocol violations before they propagate. | Feature | What It Does | Where | |---------|-------------|-------| -| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | -| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | -| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | -| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | -| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | -| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. 
Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | -| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | -| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` | +| **Write gate** | Controls which paths agents can write to based on manifest | `src/worker/write-gate.ts` | +| **Bash guard** | PreToolUse hook blocks dangerous Bash patterns outside project boundary | `src/worker/bash-guard.ts` | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | + +### Parity Gate + +**What it does**: Prevents agents from completing a mesh while unanswered questions remain. 
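At its core the gate is a set of outstanding msg-ids per agent that must be empty before completion. A minimal sketch under assumed shapes (the real tracking uses a SQLite `pending_asks` table, not an in-memory map):

```typescript
// Minimal parity-gate sketch — hypothetical function names.
const pendingAsks = new Map<string, string>(); // msgId → asking agent

function onAskHuman(msgId: string, agent: string): void {
  pendingAsks.set(msgId, agent); // ask sent to core/core — now tracked
}

function onHumanResponse(msgId: string): boolean {
  return pendingAsks.delete(msgId); // false → no matching pending ask
}

function canComplete(agent: string): { ok: boolean; blocking: string[] } {
  // completion is blocked while this agent still has unanswered asks
  const blocking = [...pendingAsks.entries()]
    .filter(([, asker]) => asker === agent)
    .map(([id]) => id);
  return { ok: blocking.length === 0, blocking };
}
```

A blocked `task-complete` is what triggers the `parity-reminder` described below.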
+ +**How it works**: +- Tracks pending asks (questions sent to human boundary `core/core`) in SQLite queue +- Validates responses from `core/core` have a matching pending ask by msg-id (fallback to agent-level matching) +- Blocks `task-complete` messages with unresolved asks; deletes offending file and emits `parity-reminder` +- Terminal-by-default: asks to `core/core` require parity; agent-to-agent asks don't trigger tracking + +### FSM Validation + +**What it does**: Enforces state machine rules before message routing. + +**How it works**: +- Type-safe state transitions with guard validation and middleware hooks (pre/post) +- Consumer calls `validateMessageWithFSM()` on all incoming messages BEFORE type-specific routing +- Centralized validation ensures all routing respects mesh-defined FSM rules +- Emits transition history and immutable state snapshots for replay/debugging + +### Mesh Validator -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. +**What it does**: Catches config errors before a mesh can load. -### Nine 2.5 — Self-Healing & Auto-Recovery +**How it works**: +- Static `validate()` checks mesh config structure, required fields, agent definitions, routing rules, FSM definitions, and manifest entries +- Validates field types, agent presence, entry/exit points, task distribution config, guardrail overrides, and parallelism blocks +- Returns `ValidationResult` with errors and warnings — errors block load, warnings log +- Catches typos early (e.g., agent routing to nonexistent agents) + +### Identity Gate + +**What it does**: Prevents agents from impersonating other agents. 
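The check itself is a comparison of the declared `from:` field against the worker's known identity, plus a strike counter. A hedged sketch (hypothetical names — the actual hook lives in `src/worker/identity-gate.ts`):

```typescript
// Hypothetical sketch of the identity check with strikes.
type GateResult = "allow" | "block" | "kill";

const strikes = new Map<string, number>(); // expected identity → violation count

function checkIdentity(
  expected: string,  // fully qualified, e.g. "dev/worker"
  fromField: string, // value extracted from the message frontmatter
  killThreshold = 3, // threshold is configurable in the real gate
): GateResult {
  if (fromField === expected) return "allow";
  // bare names ("worker") are rejected too — prevents cross-mesh routing leaks
  const n = (strikes.get(expected) ?? 0) + 1;
  strikes.set(expected, n);
  return n >= killThreshold ? "kill" : "block";
}
```

In warning mode the real gate allows the write but feeds the mismatch back to the agent instead of blocking.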
+ +**How it works**: +- PreToolUse hook intercepts Write tool calls to `.ai/tx/msgs/` +- Extracts `from:` field from message YAML frontmatter, compares against expected agent identity +- Enforces fully-qualified names (rejects bare `worker` when agent is `dev/worker`) to prevent cross-mesh routing leaks +- Strike counter with configurable kill threshold; strict (block) vs warning (allow + feedback) modes + +### Write Gate + +**What it does**: Restricts file writes to declared manifest paths. + +**How it works**: +- PreToolUse hooks intercept Write/Edit/NotebookEdit tools and Bash redirects (`>`, `>>`, `tee`) +- Validates target paths against agent's declared allowed paths from manifest +- Auto-exempts `.ai/tx/msgs/` and `.ai/tx/logs/`; allows `/dev/null` +- Tracks file-tool and bash-redirect strikes separately; kill threshold on accumulated violations + +### Bash Guard + +**What it does**: Docker-like isolation — full Bash inside project, can't escape. + +**How it works**: +- Two security layers: workDir boundary enforcement + catastrophic damage prevention +- Blocks all filesystem operations (read/write/symlink) outside project directory +- Blocks privilege escalation, root destruction, system service manipulation, raw disk ops +- Network access explicitly allowed (Docker parity): curl, wget, ssh, npm publish are safe + +### Manifest Validator + +**What it does**: Validates agent artifacts against declared manifest paths. + +**How it works**: +- Resolves manifest variable references (game-id, campaign-id, etc.) from `session.yaml` with caching +- Builds path context from mesh workspace config (locations, variables, source mappings) +- `validateAgentArtifacts()` checks agent reads/writes against declared manifest entries +- `findWriters()` identifies responsible agents for given file IDs (used in error messages) + +### Guardrail Config Chain + +**What it does**: Unified enforcement with flexible per-agent overrides. 
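Because each flag resolves independently, the chain is just "first level that sets the flag wins." A sketch under assumed names (`resolveFlag` is illustrative, not the real API in `src/worker/guardrail-config.ts`):

```typescript
// Hedged sketch of the resolution chain — `resolveFlag` is hypothetical.
function resolveFlag(values: Array<boolean | undefined>, hardcoded: boolean): boolean {
  // most specific first: agent > mesh > global agent > global mesh > global default
  for (const v of values) {
    if (v !== undefined) return v; // first level that sets the flag wins
  }
  return hardcoded; // nothing set anywhere → hardcoded default
}

// Agent leaves `strict` unset, mesh sets it false, global sets it true → mesh wins.
const strict = resolveFlag([undefined, false, true], true);
// No level sets `warning` → the hardcoded default applies.
const warning = resolveFlag([undefined, undefined, undefined], false);
```

Running `strict` and `warning` through the chain separately is what lets an agent tighten one flag without inheriting the mesh's setting for the other.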
+ +**How it works**: +- Loads global guardrails from `.ai/tx/data/config.yaml` and mesh-local overrides from mesh config +- Resolution chain: agent-level > mesh-level > global agent > global mesh > global default > hardcoded default +- Each guardrail has `strict` and `warning` flags that resolve independently +- Supports backward-compatible bare numbers or structured `{strict, warning, limit}` objects + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery Detect stuck states and recover without human intervention where safe. | Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | -| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | -| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | -| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | -| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | -| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | -| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes 
and writes recovery task | `src/worker/nudge-detector.ts` | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery | `src/worker/session-manager.ts` | +| **FSM state persistence + backup** | Atomic backup-before-update; auto-restores from backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills from filesystem on startup | `src/session/session-store.ts` | -**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. +### Nudge Detector -## Quick Start +**What it does**: Auto-recovers from missed route transitions. 
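The core question the detector answers is "did every expected routing target actually receive a message?" A minimal sketch, with hypothetical names standing in for the logic in `src/worker/nudge-detector.ts`:

```typescript
// Hypothetical sketch of the "did the work get forwarded?" check.
const nudged = new Set<string>(); // agents already nudged — max 1 nudge per agent

function findMissedTargets(
  agentId: string,
  expectedTargets: string[], // from the agent's routing rules (default outcome)
  messagesSentTo: string[],  // targets that actually received a message
): string[] {
  if (nudged.has(agentId)) return []; // never nudge twice — prevents recovery loops
  const missed = expectedTargets.filter((t) => !messagesSentTo.includes(t));
  if (missed.length > 0) nudged.add(agentId);
  return missed; // non-empty → write a recovery task for these targets
}
```

A non-empty result is what feeds the summarize-and-recover step; an empty result means routing completed normally.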
-```bash -# View reliability dashboard -tx mesh health +**How it works**: +- Scheduled check runs after agent completion (15s delay), evaluates if routing targets received work +- Resolves expected targets using `DispatchRouter` with agent's declared routing rules (default outcome = `complete`) +- Skips terminal agents (core/core targets) and agents with already-sent messages +- Summarizes dead agent output with Haiku and writes recovery task via SystemMessageWriter +- Limits nudges per agent to prevent loops -# View per-mesh reliability -tx mesh health reliability-test +### Deadlock Breaker -# View dead letter queue -tx mesh dlq +**What it does**: Detects and breaks circular wait loops between agents. -# Recover failed work -tx mesh recover reliability-test -``` +**How it works**: +- Periodic DFS-based cycle detection in pending asks graph (~every 60s) using 3-color marking +- Builds adjacency graph from queue pending asks; identifies circular chains (A→B→C→A) +- Auto-breaks cycles up to `autoBreakDepth` (default 3) +- Escalates deeper cycles (5+) to human via SystemMessageWriter with cycle visualization -## Configuration +### Stale Message Cleaner -Set reliability thresholds in `.ai/tx/data/config.yaml`: +**What it does**: Garbage collects unprocessed messages from crashed workers or typos. 
-```yaml -reliability: - circuitBreaker: - failureThreshold: 3 # Failures before circuit opens - cooldownMs: 30000 # How long circuit stays open - heartbeat: - warnMs: 60000 # Warn after 60s silence - staleMs: 120000 # Stale after 120s - deadMs: 300000 # Kill worker after 300s silence - safeMode: - autoEscalate: true # Auto-restrict on SLI drop - cautiousThreshold: 0.95 - restrictedThreshold: 0.90 - lockdownThreshold: 0.80 - dlq: - maxRetries: 3 -``` +**How it works**: +- Periodic scanner (every 5 minutes) checks queue messages against TTL (30 minutes default) +- Archives stale messages to `stale_messages` table with reason: `ttl_expired`, `no_target_mesh`, or `manual` +- Actions configurable: `warn`, `archive`, or `delete` +- Tracks known meshes to identify messages routed to non-existent targets; preserves audit trail + +### Quality Iteration Loops + +**What it does**: Validates output quality before routing, with iterative refinement. -## Features +**How it works**: +- Post-hook runs quality stack on worker output after message reception +- Runs gates (required + suggested) on output; returns `{passed, feedback}` +- Three failure modes: `halt` (stop), `loop` (retry if under max iterations), `skip` (allow through) +- Injects feedback messages on failure for agent self-correction + +### Session Suspend/Resume -### 1. Circuit Breaker +**What it does**: Non-destructive pause for external input with crash recovery. + +**How it works**: +- Suspends sessions (kills worker, saves state to SQLite) when agent hits ask-human or await-response boundaries +- Buffers incoming responses while awaiting multiple targets (tracks `pendingResponseCount`) +- Persists to `suspended_sessions` table with reason, target agents, and hook context +- Dispatcher handles resume: loading state, creating new runner, wiring event handlers + +### FSM State Persistence + Backup + +**What it does**: Durable state across crashes with automatic corruption recovery. 
+ +**How it works**: +- SQLite tables: `mesh_state` (current) and `mesh_state_backup` (versioned backups) +- `saveState()` creates backup of previous state before updating (atomic via transaction) +- On corruption (JSON parse error), `loadState()` auto-restores from latest backup +- Indexes on `mesh_name + created_at` for efficient backup lookup + +### Session Store with Backfill + +**What it does**: Persistent session metadata with full-text search. + +**How it works**: +- SQLite `sessions` table stores metadata: agent_id, mesh_id, timestamps, transcript path, message counts, final status +- FTS5 virtual table `sessions_fts` enables full-text search on content, headline, tags +- Prepared statements for fast CRUD; cache for summary types (e.g., `file_changes`, `decisions`) +- Backfills existing sessions from disk on startup (migration-friendly) + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ (99.9%) + +Active monitoring, automatic circuit-breaking, and dead letter recovery. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Circuit breaker** | Stops spawning agents that keep failing; auto-recovers after cooldown | `src/reliability/circuit-breaker.ts` | +| **Heartbeat monitor** | Detects stuck workers via silence thresholds; kills dead workers | `src/reliability/heartbeat-monitor.ts` | +| **Dead letter queue** | Captures failed work with session context for recovery | `src/reliability/dead-letter-queue.ts` | +| **SLI tracker** | Measures success rate, failure categories, MTTR, nines level | `src/reliability/sli-tracker.ts` | +| **Safe mode** | Restricts agent capabilities when reliability drops | `src/reliability/safe-mode.ts` | +| **Checkpoint log** | Saves session IDs at FSM transitions; enables rewind-to recovery | `src/reliability/checkpoint-log.ts` | +| **Rate limiter** | Token bucket rate limiting for server endpoints | `src/server/rate-limiter.ts` | +| **Worker pool backpressure** | Adaptive polling with concurrency limits | 
`src/server/worker-pool.ts` | +| **Metrics aggregator** | Per-query metrics with token cost tracking | `src/worker/metrics-aggregator.ts` | +| **Worker lifecycle tracking** | Unique instance IDs for deduplication and debugging | `src/worker/worker-lifecycle.ts` | + +### Circuit Breaker **What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. @@ -122,7 +323,7 @@ tx mesh health # Shows open/half_open circuits tx spy # Watch for reliability:blocked activity ``` -### 2. Heartbeat Monitor +### Heartbeat Monitor **What it does**: Detects stuck workers and kills them. @@ -142,7 +343,7 @@ tx mesh health # Shows unhealthy agents with silence duration tx logs --component reliability # Heartbeat kill events ``` -### 3. Dead Letter Queue (DLQ) +### Dead Letter Queue (DLQ) **What it does**: Captures failed work with enough context to recover it. @@ -152,22 +353,16 @@ tx logs --component reliability # Heartbeat kill events - `manual`: Retries exhausted → needs human decision. **How entries are created**: -- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with sessionId, messages sent, and failure category - Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion **How recovery works**: -**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. - -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) - -2. 
**Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. - -3. **CLI**: `tx mesh recover ` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. - +1. **Automatic on startup**: `tx start` calls `recoverAll()` — recovers pending session_resume and requeue entries from the previous run (crash recovery only). +2. **Human-initiated via core agent** (preferred): User investigates via `tx mesh health` + `tx mesh dlq`, picks recovery strategy, core writes recovery message. +3. **CLI**: `tx mesh recover ` sends SIGUSR2 to running dispatcher. Shows available checkpoints first. 4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: `) to trigger DLQ recovery. - -5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If dispatcher isn't running, `tx mesh recover` writes a recovery message to msgs dir for next start. **Observe it**: ```bash @@ -182,13 +377,13 @@ tx mesh dlq clear # GC recovered entries **What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. **How checkpoints are saved**: -- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Every FSM mesh state transition saves the completing agent's session ID to SQLite - Checkpoint key: `mesh_name + state_name` → `session_id` -- Multiple checkpoints per state are kept (most recent wins on lookup) +- Multiple checkpoints per state kept (most recent wins on lookup) **How rewind-to works**: -When recovering from the DLQ, you can specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. 
This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. +When recovering from the DLQ, specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. The recovered worker resumes from after that state completed — skipping all bad work that happened after. ``` FSM: analyze → build → verify → complete @@ -232,7 +427,7 @@ Available checkpoints (use --rewind-to=): **When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). -### 4. SLI Tracker +### SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -254,7 +449,7 @@ tx mesh health my-mesh # Per-agent success rates tx mesh health --json # Full snapshot ``` -### 5. Safe Mode +### Safe Mode **What it does**: Restricts agent capabilities when reliability drops. @@ -281,7 +476,7 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` -### 6. Rate Limiter +### Rate Limiter **What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. @@ -290,9 +485,7 @@ tx spy # Watch safe-mode:blocked activity events - Automatic bucket cleanup every 5 minutes - Smooth rate limiting (not hard cutoff) -**Source**: `src/server/rate-limiter.ts` - -### 7. Worker Pool Backpressure +### Worker Pool Backpressure **What it does**: Adaptive polling with concurrency limits prevents queue overload. @@ -301,19 +494,18 @@ tx spy # Watch safe-mode:blocked activity events - Respects concurrency limits — won't spawn beyond capacity - Graceful shutdown drains active workers before terminating -**Source**: `src/server/worker-pool.ts` - -### 8. Metrics Aggregator +### Metrics Aggregator **What it does**: Per-query metrics collection with token cost tracking. -**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts. 
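The per-query tracking and lifetime totals can be sketched as follows. This is a hypothetical shape for illustration only, not the actual `src/worker/metrics-aggregator.ts` API:

```typescript
// Sketch: per-query metrics with lifetime aggregation.
// Interface and class names are illustrative, not the real aggregator's API.
interface QueryMetrics {
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  costUsd: number;
  toolCalls: number;
}

class MetricsAggregator {
  private queries: QueryMetrics[] = [];

  // Record one completed query's metrics.
  record(m: QueryMetrics): void {
    this.queries.push(m);
  }

  // Aggregate totals over the worker's lifetime: a single reduce
  // over every recorded query.
  totals(): QueryMetrics {
    return this.queries.reduce(
      (acc, m) => ({
        inputTokens: acc.inputTokens + m.inputTokens,
        outputTokens: acc.outputTokens + m.outputTokens,
        durationMs: acc.durationMs + m.durationMs,
        costUsd: acc.costUsd + m.costUsd,
        toolCalls: acc.toolCalls + m.toolCalls,
      }),
      { inputTokens: 0, outputTokens: 0, durationMs: 0, costUsd: 0, toolCalls: 0 },
    );
  }
}

const agg = new MetricsAggregator();
agg.record({ inputTokens: 1200, outputTokens: 300, durationMs: 4200, costUsd: 0.011, toolCalls: 3 });
agg.record({ inputTokens: 800, outputTokens: 150, durationMs: 2100, costUsd: 0.006, toolCalls: 1 });
console.log(agg.totals().inputTokens); // 2000
```

Keeping raw per-query records (rather than running sums) is what makes both the per-query view and the lifetime totals available from one structure.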
- -**Source**: `src/worker/metrics-aggregator.ts` +**How it works**: +- Tracks input/output tokens, duration, cost per query +- Aggregate totals for worker lifetime +- Tool call counts per worker -### 9. Worker Lifecycle Tracking +### Worker Lifecycle Tracking -**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. +**What it does**: Tracks parallel worker execution with unique instance IDs. **How it works**: - Generates unique worker IDs (`agentId-uuid`) @@ -321,27 +513,28 @@ tx spy # Watch safe-mode:blocked activity events - Persists worker state to disk - Tracks nudge counts and completion frontier -**Source**: `src/worker/worker-lifecycle.ts` - -## Test Mesh - -The `reliability-test` mesh is configured with tight thresholds for quick testing: -- Circuit breaker opens after 2 failures (not 3) -- Heartbeat kills after 120s (not 300s) -- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) - -```bash -# Run the test mesh -tx msg "Write a hello world function" --to reliability-test/planner +--- -# Monitor reliability during execution -tx mesh health reliability-test +## Configuration -# If failures occur, check DLQ -tx mesh dlq reliability-test +Set reliability thresholds in `.ai/tx/data/config.yaml`: -# Recover failed work -tx mesh recover reliability-test +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 ``` ## Front-Matter Options @@ -367,6 +560,27 @@ Agents can interact with reliability features via message front-matter: | `tx mesh recover --rewind-to=` | Recover rewinding to a specific FSM state | | `tx mesh 
recover --all` | Recover all pending DLQ entries | +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + ## Architecture ``` @@ -395,121 +609,47 @@ Agents can interact with reliability features via message front-matter: └────────────┘ └───────────┘ └───────────┘ ``` -## Reliability Roadmap — Human Review Gates - -Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. - -### Priority 1: Default-On Checkpoints + Replay - -**Impact**: 10x — turns N-step recovery into 1-step problem -**Effort**: Medium - -**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. - -**Human review steps**: -1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." -2. **Replay approval**: Before any rewind-to replay, core presents: - - Which checkpoint to rewind to - - What work will be replayed (states after the checkpoint) - - What work will be discarded (failed states) -3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. - -**Never automatic**: Replay does not happen without the user choosing a checkpoint. 
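Once the user has chosen a checkpoint, the replay is triggered with a recovery message. A sketch, using the `recover` and `rewind-to` front-matter options documented above; the `to:` target, state name, and body text are illustrative:

```markdown
---
to: my-mesh/builder
recover: true
rewind-to: build
---
User approved rewinding to the "build" checkpoint; drop the failed work after it and replay.
```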
- ---- +## Roadmap — Nine 4 (99.99%) -### Priority 2: Reliability Metrics Table + Tracking +| Priority | Feature | Impact | Effort | +|----------|---------|--------|--------| +| 1 | Retry-with-variation | 3-5x retry success improvement | Low | +| 2 | Output schema validation | Catches semantic failures early | Medium | +| 3 | Critical/non-critical agent classification | Prevents cascade from optional steps | Low | +| 4 | Aggregate observability dashboard | Finds the long-tail 0.01% | Medium | -**Impact**: Foundation for everything else -**Effort**: Low - -**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. - -**Human review steps**: -1. **Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." -2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." -3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" -4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. - -**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. - ---- - -### Priority 3: Retry-With-Variation on Routing/Protocol Failures - -**Impact**: 3-5x improvement on retry success -**Effort**: Low +### Retry-With-Variation **What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. -**Human review steps**: -1. 
**First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3."
-2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model").
-3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step.
-4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect.
+**How it will work**:
+- On first failure, the retry varies the approach: simplified prompt, dropped optional context, or fallback model
+- Each retry logs what changed for transparency
+- Exhausted retries present the full retry history with variations tried
-**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask.
-
---
-
-### Priority 4: Output Schema Validation
-
-**Impact**: Catches semantic failures early
-**Effort**: Medium
+### Output Schema Validation
**What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream.
-**Human review steps**:
-1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars."
-2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?"
-3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates."
-4. 
**Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. - -**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. - ---- - -### Priority 5: Critical / Non-Critical Agent Classification +**How it will work**: +- Mesh config defines `output_schema` per agent +- Post-completion hook validates output against schema +- Partial pass handling: presents what passed and what failed for human decision -**Impact**: Prevents cascade from optional steps -**Effort**: Low +### Critical/Non-Critical Agent Classification -**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. +**What it does**: Agents classified as `critical` (failure blocks mesh) or `non-critical` (failure logged, mesh continues). Prevents optional agents from taking down the whole workflow. -**Human review steps**: -1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." -2. **Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." -3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" -4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). 
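In a mesh config, the classification might look like this sketch. The `critical` flag matches the classification described above; the agent names and surrounding structure are illustrative:

```yaml
agents:
  planner:
    critical: true     # failure stops the mesh and presents recovery options
  builder:
    critical: true
  linter:
    critical: false    # failure is logged and surfaced; the mesh continues
```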
+**How it will work**: +- Agent config adds `critical: true|false` field (default: true) +- Non-critical failures logged and surfaced but don't block mesh +- Repeated non-critical failures prompt promotion decision -**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. - ---- - -### Priority 6: Aggregate Observability Dashboard - -**Impact**: Needed to find the long-tail 0.01% -**Effort**: Medium +### Aggregate Observability Dashboard **What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. -**Human review steps**: -1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." -2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." -3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" -4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. - -**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). - ---- - -### Human Review Principle - -Across all 6 priorities, the same principle applies: - -> **The system does work. 
The human makes decisions.** - -- Retries within limits → automatic (but visible) -- Recovery, replay, escalation → always human-approved -- Failures → always surfaced with context and options -- No silent state changes that affect mesh behavior +**How it will work**: +- Anomaly detection: sudden SLI drops, unusual failure patterns, cost spikes +- Trend data: success rates, DLQ utilization, MTTR over time +- Cost estimation before expensive recovery operations
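A first cut of the anomaly detection could be a windowed success-rate comparison. The sketch below is illustrative; the window size and drop threshold are placeholder values, not tx defaults:

```typescript
// Sketch: flag a sudden SLI (success-rate) drop between two windows of runs.
// Window size and threshold are illustrative placeholders.
function sliOf(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 1;
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Compare the most recent `window` runs against the `window` runs before them.
function anomalousDrop(outcomes: boolean[], window = 10, maxDrop = 0.1): boolean {
  if (outcomes.length < window * 2) return false; // not enough history
  const recent = sliOf(outcomes.slice(-window));
  const previous = sliOf(outcomes.slice(-window * 2, -window));
  return previous - recent > maxDrop;
}

// 10 successes, then a window with 3 failures: SLI drops 1.0 to 0.7.
const history = [...Array(10).fill(true), ...Array(7).fill(true), false, false, false];
console.log(anomalousDrop(history)); // true
```

A real implementation would presumably read run outcomes from the SLI tracker's store and route flagged meshes through the standard review workflow (diagnose, present, confirm, execute) rather than acting on its own.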