diff --git a/docs/HUMAN_REVIEW.md b/docs/HUMAN_REVIEW.md new file mode 100644 index 00000000..a6b14dd9 --- /dev/null +++ b/docs/HUMAN_REVIEW.md @@ -0,0 +1,147 @@ +# Human Review Gates — Reliability + +Every reliability feature includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior + +See [reliability.md](./reliability.md) for feature details. + +--- + +## Nine 1 — Basic Error Handling + +### Worker Retries +- When retries exhaust → DLQ entry created → core presents failure to user +- User decides: retry with variation, recover from checkpoint, or drop + +### Injection Poll Loop +- Stale entries (>5min) are dropped but remain available via `tx inbox` +- If file-based fallback activates, user sees pending messages on next interaction + +### Routing Correction +- When routing retries exhaust → escalated to user with full attempt history +- User sees which targets were tried and picks correct one + +### Usage Policy Errors +- Human chooses: retry, skip, modify prompt, or abort +- Full diagnostic context (triggering prompt, recent history) included in ask-human message + +### Recovery Handler +- First 2 recovery requests: automatic guidance with FSM state and valid routes +- 3rd+ request in 60s: escalated to human — agent is repeatedly stuck + +--- + +## Nine 2 — Validation & Protocol Enforcement + +### Parity Gate +- Violations → reminder injected to agent +- If unresolved after reminder → surfaced to user with pending asks list + +### Identity Gate +- Kill events → logged with full reason (agent ID, expected vs actual `from:` field) +- User can audit identity violations via logs + +### Mesh Validator +- Validation errors → block mesh 
load, user sees what's wrong and how to fix it +- Warnings → logged but don't block (user can review in logs) + +### Manifest Validator +- Validation failures → surfaced to user with missing/invalid paths and responsible agents + +### Bash Guard +- 1-2 violations → error response with allowed paths shown to agent +- 3+ violations → worker killed, logged for forensics +- User can audit bash guard events in logs + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery + +### Nudge Detector +- Nudges are logged and visible in `tx spy` +- Max 1 nudge per agent prevents recovery loops + +### Deadlock Breaker +- Shallow cycles (depth ≤ 3) → auto-broken, logged +- Deep cycles (depth 5+) → escalated to human with cycle visualization (A→B→C→A) +- User decides which agent's ask to drop + +### Stale Message Cleaner +- Stale messages archived with reason — no silent deletion +- User can audit via `tx spy` and review archived messages + +### Quality Iteration Loops +- Max iterations hit → presents feedback history to user +- User decides: retry, accept current output, or drop +- Each iteration's feedback is visible for review + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ + +### Circuit Breaker +- Circuit open → agent skipped, logged with failure count +- Half-open test spawn → user can monitor via `tx mesh health` +- Circuits don't auto-close silently — health dashboard shows state + +### Heartbeat Monitor +- Warn threshold → logged warning (no action) +- Stale threshold → logged stale warning +- Dead threshold → **worker killed**, failure recorded, routed to DLQ +- All events visible in `tx mesh health` with silence duration + +### Dead Letter Queue +- **Recovery always requires human review** (except crash recovery on restart) +- Core agent diagnoses, presents options (resume vs rewind vs drop), gets explicit confirmation +- Available checkpoints shown before any recovery action +- `tx mesh dlq` shows all pending entries with recovery mode and context + +### Checkpoint 
Log & Rewind-To +- Checkpoint notification: core can surface "Mesh X completed 'build' — checkpoint saved" +- Before rewind-to replay, core presents: which checkpoint, what replays, what's discarded +- Post-replay: result presented for user approval before mesh continues +- **Replay never happens without user choosing a checkpoint** + +### SLI Tracker +- Threshold alerts: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." +- Periodic health summary available via `tx mesh health` +- SLI data always visible — never hidden from user + +### Safe Mode +- **Escalation beyond cautious requires user confirmation** when surfaced via core +- Auto-escalation (if enabled) is logged with reason and SLI data +- **Never auto-de-escalates** — human must clear via `resetMesh()` or `resetAll()` +- Core presents: "SLI recovered to 98%. Clear restricted mode for mesh X?" + +--- + +## Roadmap — Nine 4 + +### Retry-With-Variation +- First failure: core reports "Agent X failed (model_error). Retrying with variation: [description]. Retry 1/3." +- Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context") +- Exhausted retries: core presents full retry history with variations tried — user decides next step +- New variation strategies require review before taking effect + +### Output Schema Validation +- Validation failure: core reports "Agent X output failed validation: missing required field 'summary'." +- Before retry with validation feedback, core presents: "Ask agent X to fix? Validation errors: [list]. Or drop?" 
+- Schema changes in mesh config: core surfaces impact on existing agents +- Partial pass: core shows what passed/failed, user decides accept partial, retry, or drop + +### Critical/Non-Critical Agent Classification +- On mesh load, core can surface classifications: "critical=[planner, builder], non-critical=[linter]" +- Non-critical failure: "Agent 'linter' failed (timeout). Mesh continues. Output from this step missing." +- Repeated non-critical failures: "Agent 'linter' failed 5 times. Promote to critical or disable?" +- Critical failures always stop the mesh and present recovery options + +### Aggregate Observability Dashboard +- Anomaly alerts: "Mesh X failure rate spiked from 2% to 15% in last hour. Category: model_error." +- Cost review gate: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" +- Dashboard is passive — all actions from insights go through standard human review diff --git a/docs/reliability.md b/docs/reliability.md index 55332699..4f5938a5 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,8 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +Human review gates for all features are documented in [HUMAN_REVIEW.md](./HUMAN_REVIEW.md). + ## March of Nines — Current Status | Nines | Technique | TX Status | @@ -12,96 +14,295 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | -### Nine 1 — Basic Error Handling (90%) +--- + +## Nine 1 — Basic Error Handling (90%) Foundational durability. Nothing silently drops. 
| Feature | What It Does | Where | |---------|-------------|-------| -| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | -| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | -| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | -| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | -| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | -| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | -| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | 
Tracks recovery requests per agent, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | + +### SQLite WAL Mode + +**What it does**: Prevents queue corruption on crash via Write-Ahead Logging. + +**How it works**: +- Enables WAL mode (`journal_mode=WAL`) on the SQLite message queue at init +- All writes are logged to WAL file before committing to main database +- Guarantees queue state is recoverable even if process crashes mid-write +- Allows concurrent readers while writes are in flight + +### Worker Retries (3x) + +**What it does**: Auto-retries failed workers before routing to DLQ. + +**How it works**: +- Each worker has a state machine tracking retry attempts +- On error, checks `canTransition('retry')` before respawning +- Differentiates retriable errors (crashes, model overload) vs non-retriable (suspension, max-turns, abort) +- After max retries exhausted, routes to Dead Letter Queue for recovery + +### Injection Poll Loop + +**What it does**: Ensures messages reach the core Claude session even when it's busy. + +**How it works**: +- Maintains an in-memory queue of messages waiting for injection into tmux +- Polls every 2s (`INJECTION_POLL_MS`) checking if Claude is idle, then injects +- Drops stale entries pending >5 minutes (they're available via `tx inbox`) +- Falls back to file-based delivery (`pending-for-core.json`) if active injection fails + +### Routing Correction Injection + +**What it does**: Recovers from bad routing by teaching the agent valid targets. 
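In spirit, the correction loop is a small decision function keyed on the sender→target pair. A minimal sketch (all names hypothetical — the real logic lives in `handleRoutingError()` in `src/worker/dispatcher.ts`):

```typescript
// Hypothetical sketch of per-pair routing correction — not the real tx API.
type RoutingAction =
  | { kind: "deliver" }
  | { kind: "correct"; note: string }   // corrective prompt injected back to sender
  | { kind: "escalate"; note: string }; // ask-human after retries exhaust

const attempts = new Map<string, number>(); // "sender→target" → retry count

function routeOrCorrect(
  sender: string,
  target: string,
  validTargets: string[],
  maxRetries = 3,
): RoutingAction {
  const key = `${sender}→${target}`;
  if (validTargets.includes(target)) {
    attempts.delete(key); // a good route clears the counter
    return { kind: "deliver" };
  }
  const n = (attempts.get(key) ?? 0) + 1;
  attempts.set(key, n);
  if (n > maxRetries) {
    return { kind: "escalate", note: `routing to '${target}' failed ${n - 1} times` };
  }
  return { kind: "correct", note: `unknown target '${target}'; valid: ${validTargets.join(", ")}` };
}
```

The real dispatcher additionally distinguishes strict mode (block immediately) from warning mode (allow + notify) per guardrail config.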
+ +**How it works**: +- Detects messages targeting non-existent meshes/agents, increments retry counter per sender→target pair +- Injects corrective message back to sender listing valid available targets (up to max retries) +- After max retries exceeded, escalates to human via `ask-human` message +- Supports strict mode (block immediately) and warning mode (allow + notify) per guardrail config + +### Graceful Worker Pool Shutdown + +**What it does**: Prevents orphaned workers on shutdown. + +**How it works**: +- Sets `running = false` to prevent new spawns, stops polling loop +- Collects all active worker promises and awaits completion via `Promise.all()` +- Logs count of in-flight workers being drained + +### Usage Policy Error Handling + +**What it does**: Captures false-positive usage policy errors with full diagnostic context. + +**How it works**: +- Detects usage policy errors from Claude API via pattern matching +- Captures diagnostic context: triggering prompt, recent history, in-progress tool calls, agent/mesh info +- Writes `ask-human` message to core with full context for human decision (retry, skip, modify prompt, abort) +- Preserves session ID for potential resume -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. +### Recovery Handler with Escalation -### Nine 2 — Validation & Protocol Enforcement (99%) +**What it does**: Detects repeatedly stuck agents and escalates to human. 
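The escalation rule ("guidance twice, then human") reduces to a sliding-window counter. A hedged sketch, with hypothetical names standing in for the internals of `src/core/recovery.ts`:

```typescript
// Illustrative sliding-window escalation counter (names hypothetical).
type RecoveryAction = "guidance" | "escalate";

const windows = new Map<string, number[]>(); // agentId → recovery-request timestamps (ms)

function onRecoveryRequest(
  agentId: string,
  nowMs: number,
  windowMs = 60_000, // the 60s escalation window
  escalateAt = 3,    // 3rd request in the window → human
): RecoveryAction {
  // keep only requests inside the window, then record this one
  const recent = (windows.get(agentId) ?? []).filter((t) => nowMs - t < windowMs);
  recent.push(nowMs);
  windows.set(agentId, recent);
  return recent.length >= escalateAt ? "escalate" : "guidance";
}
```

Requests outside the window age out naturally, which is why the counter "resets" for an agent that was stuck an hour ago but is healthy now.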
+ +**How it works**: +- Intercepts messages routed to `system/recovery` +- Tracks frequency per agent with time window; resets counter outside escalation window +- First 2 attempts: returns guidance with current FSM state, pending asks, and valid exit routes +- 3rd+ attempt: escalates to `core/core` for human intervention + +--- + +## Nine 2 — Validation & Protocol Enforcement (99%) Catch bad outputs and protocol violations before they propagate. | Feature | What It Does | Where | |---------|-------------|-------| -| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | -| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | -| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | -| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | -| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | -| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. 
Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | -| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | -| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` | +| **Write gate** | Controls which paths agents can write to based on manifest | `src/worker/write-gate.ts` | +| **Bash guard** | PreToolUse hook blocks dangerous Bash patterns outside project boundary | `src/worker/bash-guard.ts` | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | + +### Parity Gate + +**What it does**: Prevents agents from completing a mesh while unanswered questions remain. 
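At its core the gate is a set of outstanding msg-ids per agent that must be empty before completion. A minimal sketch under assumed shapes (the real tracking uses a SQLite `pending_asks` table, not an in-memory map):

```typescript
// Minimal parity-gate sketch — hypothetical function names.
const pendingAsks = new Map<string, string>(); // msgId → asking agent

function onAskHuman(msgId: string, agent: string): void {
  pendingAsks.set(msgId, agent); // ask sent to core/core — now tracked
}

function onHumanResponse(msgId: string): boolean {
  return pendingAsks.delete(msgId); // false → no matching pending ask
}

function canComplete(agent: string): { ok: boolean; blocking: string[] } {
  // completion is blocked while this agent still has unanswered asks
  const blocking = [...pendingAsks.entries()]
    .filter(([, asker]) => asker === agent)
    .map(([id]) => id);
  return { ok: blocking.length === 0, blocking };
}
```

A blocked `task-complete` is what triggers the `parity-reminder` described below.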
+ +**How it works**: +- Tracks pending asks (questions sent to human boundary `core/core`) in SQLite queue +- Validates responses from `core/core` have a matching pending ask by msg-id (fallback to agent-level matching) +- Blocks `task-complete` messages with unresolved asks; deletes offending file and emits `parity-reminder` +- Terminal-by-default: asks to `core/core` require parity; agent-to-agent asks don't trigger tracking + +### FSM Validation + +**What it does**: Enforces state machine rules before message routing. + +**How it works**: +- Type-safe state transitions with guard validation and middleware hooks (pre/post) +- Consumer calls `validateMessageWithFSM()` on all incoming messages BEFORE type-specific routing +- Centralized validation ensures all routing respects mesh-defined FSM rules +- Emits transition history and immutable state snapshots for replay/debugging + +### Mesh Validator -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. +**What it does**: Catches config errors before a mesh can load. -### Nine 2.5 — Self-Healing & Auto-Recovery +**How it works**: +- Static `validate()` checks mesh config structure, required fields, agent definitions, routing rules, FSM definitions, and manifest entries +- Validates field types, agent presence, entry/exit points, task distribution config, guardrail overrides, and parallelism blocks +- Returns `ValidationResult` with errors and warnings — errors block load, warnings log +- Catches typos early (e.g., agent routing to nonexistent agents) + +### Identity Gate + +**What it does**: Prevents agents from impersonating other agents. 
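The check itself is a comparison of the declared `from:` field against the worker's known identity, plus a strike counter. A hedged sketch (hypothetical names — the actual hook lives in `src/worker/identity-gate.ts`):

```typescript
// Hypothetical sketch of the identity check with strikes.
type GateResult = "allow" | "block" | "kill";

const strikes = new Map<string, number>(); // expected identity → violation count

function checkIdentity(
  expected: string,  // fully qualified, e.g. "dev/worker"
  fromField: string, // value extracted from the message frontmatter
  killThreshold = 3, // threshold is configurable in the real gate
): GateResult {
  if (fromField === expected) return "allow";
  // bare names ("worker") are rejected too — prevents cross-mesh routing leaks
  const n = (strikes.get(expected) ?? 0) + 1;
  strikes.set(expected, n);
  return n >= killThreshold ? "kill" : "block";
}
```

In warning mode the real gate allows the write but feeds the mismatch back to the agent instead of blocking.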
+ +**How it works**: +- PreToolUse hook intercepts Write tool calls to `.ai/tx/msgs/` +- Extracts `from:` field from message YAML frontmatter, compares against expected agent identity +- Enforces fully-qualified names (rejects bare `worker` when agent is `dev/worker`) to prevent cross-mesh routing leaks +- Strike counter with configurable kill threshold; strict (block) vs warning (allow + feedback) modes + +### Write Gate + +**What it does**: Restricts file writes to declared manifest paths. + +**How it works**: +- PreToolUse hooks intercept Write/Edit/NotebookEdit tools and Bash redirects (`>`, `>>`, `tee`) +- Validates target paths against agent's declared allowed paths from manifest +- Auto-exempts `.ai/tx/msgs/` and `.ai/tx/logs/`; allows `/dev/null` +- Tracks file-tool and bash-redirect strikes separately; kill threshold on accumulated violations + +### Bash Guard + +**What it does**: Docker-like isolation — full Bash inside project, can't escape. + +**How it works**: +- Two security layers: workDir boundary enforcement + catastrophic damage prevention +- Blocks all filesystem operations (read/write/symlink) outside project directory +- Blocks privilege escalation, root destruction, system service manipulation, raw disk ops +- Network access explicitly allowed (Docker parity): curl, wget, ssh, npm publish are safe + +### Manifest Validator + +**What it does**: Validates agent artifacts against declared manifest paths. + +**How it works**: +- Resolves manifest variable references (game-id, campaign-id, etc.) from `session.yaml` with caching +- Builds path context from mesh workspace config (locations, variables, source mappings) +- `validateAgentArtifacts()` checks agent reads/writes against declared manifest entries +- `findWriters()` identifies responsible agents for given file IDs (used in error messages) + +### Guardrail Config Chain + +**What it does**: Unified enforcement with flexible per-agent overrides. 
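Because each flag resolves independently, the chain is just "first level that sets the flag wins." A sketch under assumed names (`resolveFlag` is illustrative, not the real API in `src/worker/guardrail-config.ts`):

```typescript
// Hedged sketch of the resolution chain — `resolveFlag` is hypothetical.
function resolveFlag(values: Array<boolean | undefined>, hardcoded: boolean): boolean {
  // most specific first: agent > mesh > global agent > global mesh > global default
  for (const v of values) {
    if (v !== undefined) return v; // first level that sets the flag wins
  }
  return hardcoded; // nothing set anywhere → hardcoded default
}

// Agent leaves `strict` unset, mesh sets it false, global sets it true → mesh wins.
const strict = resolveFlag([undefined, false, true], true);
// No level sets `warning` → the hardcoded default applies.
const warning = resolveFlag([undefined, undefined, undefined], false);
```

Running `strict` and `warning` through the chain separately is what lets an agent tighten one flag without inheriting the mesh's setting for the other.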
+ +**How it works**: +- Loads global guardrails from `.ai/tx/data/config.yaml` and mesh-local overrides from mesh config +- Resolution chain: agent-level > mesh-level > global agent > global mesh > global default > hardcoded default +- Each guardrail has `strict` and `warning` flags that resolve independently +- Supports backward-compatible bare numbers or structured `{strict, warning, limit}` objects + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery Detect stuck states and recover without human intervention where safe. | Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | -| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | -| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | -| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | -| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | -| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | -| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes 
and writes recovery task | `src/worker/nudge-detector.ts` | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery | `src/worker/session-manager.ts` | +| **FSM state persistence + backup** | Atomic backup-before-update; auto-restores from backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills from filesystem on startup | `src/session/session-store.ts` | -**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. +### Nudge Detector -## Quick Start +**What it does**: Auto-recovers from missed route transitions. 
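The core question the detector answers is "did every expected routing target actually receive a message?" A minimal sketch, with hypothetical names standing in for the logic in `src/worker/nudge-detector.ts`:

```typescript
// Hypothetical sketch of the "did the work get forwarded?" check.
const nudged = new Set<string>(); // agents already nudged — max 1 nudge per agent

function findMissedTargets(
  agentId: string,
  expectedTargets: string[], // from the agent's routing rules (default outcome)
  messagesSentTo: string[],  // targets that actually received a message
): string[] {
  if (nudged.has(agentId)) return []; // never nudge twice — prevents recovery loops
  const missed = expectedTargets.filter((t) => !messagesSentTo.includes(t));
  if (missed.length > 0) nudged.add(agentId);
  return missed; // non-empty → write a recovery task for these targets
}
```

A non-empty result is what feeds the summarize-and-recover step; an empty result means routing completed normally.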
-```bash -# View reliability dashboard -tx mesh health +**How it works**: +- Scheduled check runs after agent completion (15s delay), evaluates if routing targets received work +- Resolves expected targets using `DispatchRouter` with agent's declared routing rules (default outcome = `complete`) +- Skips terminal agents (core/core targets) and agents with already-sent messages +- Summarizes dead agent output with Haiku and writes recovery task via SystemMessageWriter +- Limits nudges per agent to prevent loops -# View per-mesh reliability -tx mesh health reliability-test +### Deadlock Breaker -# View dead letter queue -tx mesh dlq +**What it does**: Detects and breaks circular wait loops between agents. -# Recover failed work -tx mesh recover reliability-test -``` +**How it works**: +- Periodic DFS-based cycle detection in pending asks graph (~every 60s) using 3-color marking +- Builds adjacency graph from queue pending asks; identifies circular chains (A→B→C→A) +- Auto-breaks cycles up to `autoBreakDepth` (default 3) +- Escalates deeper cycles (5+) to human via SystemMessageWriter with cycle visualization -## Configuration +### Stale Message Cleaner -Set reliability thresholds in `.ai/tx/data/config.yaml`: +**What it does**: Garbage collects unprocessed messages from crashed workers or typos. 
-```yaml -reliability: - circuitBreaker: - failureThreshold: 3 # Failures before circuit opens - cooldownMs: 30000 # How long circuit stays open - heartbeat: - warnMs: 60000 # Warn after 60s silence - staleMs: 120000 # Stale after 120s - deadMs: 300000 # Kill worker after 300s silence - safeMode: - autoEscalate: true # Auto-restrict on SLI drop - cautiousThreshold: 0.95 - restrictedThreshold: 0.90 - lockdownThreshold: 0.80 - dlq: - maxRetries: 3 -``` +**How it works**: +- Periodic scanner (every 5 minutes) checks queue messages against TTL (30 minutes default) +- Archives stale messages to `stale_messages` table with reason: `ttl_expired`, `no_target_mesh`, or `manual` +- Actions configurable: `warn`, `archive`, or `delete` +- Tracks known meshes to identify messages routed to non-existent targets; preserves audit trail + +### Quality Iteration Loops + +**What it does**: Validates output quality before routing, with iterative refinement. -## Features +**How it works**: +- Post-hook runs quality stack on worker output after message reception +- Runs gates (required + suggested) on output; returns `{passed, feedback}` +- Three failure modes: `halt` (stop), `loop` (retry if under max iterations), `skip` (allow through) +- Injects feedback messages on failure for agent self-correction + +### Session Suspend/Resume -### 1. Circuit Breaker +**What it does**: Non-destructive pause for external input with crash recovery. + +**How it works**: +- Suspends sessions (kills worker, saves state to SQLite) when agent hits ask-human or await-response boundaries +- Buffers incoming responses while awaiting multiple targets (tracks `pendingResponseCount`) +- Persists to `suspended_sessions` table with reason, target agents, and hook context +- Dispatcher handles resume: loading state, creating new runner, wiring event handlers + +### FSM State Persistence + Backup + +**What it does**: Durable state across crashes with automatic corruption recovery. 
+ +**How it works**: +- SQLite tables: `mesh_state` (current) and `mesh_state_backup` (versioned backups) +- `saveState()` creates backup of previous state before updating (atomic via transaction) +- On corruption (JSON parse error), `loadState()` auto-restores from latest backup +- Indexes on `mesh_name + created_at` for efficient backup lookup + +### Session Store with Backfill + +**What it does**: Persistent session metadata with full-text search. + +**How it works**: +- SQLite `sessions` table stores metadata: agent_id, mesh_id, timestamps, transcript path, message counts, final status +- FTS5 virtual table `sessions_fts` enables full-text search on content, headline, tags +- Prepared statements for fast CRUD; cache for summary types (e.g., `file_changes`, `decisions`) +- Backfills existing sessions from disk on startup (migration-friendly) + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ (99.9%) + +Active monitoring, automatic circuit-breaking, and dead letter recovery. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Circuit breaker** | Stops spawning agents that keep failing; auto-recovers after cooldown | `src/reliability/circuit-breaker.ts` | +| **Heartbeat monitor** | Detects stuck workers via silence thresholds; kills dead workers | `src/reliability/heartbeat-monitor.ts` | +| **Dead letter queue** | Captures failed work with session context for recovery | `src/reliability/dead-letter-queue.ts` | +| **SLI tracker** | Measures success rate, failure categories, MTTR, nines level | `src/reliability/sli-tracker.ts` | +| **Safe mode** | Restricts agent capabilities when reliability drops | `src/reliability/safe-mode.ts` | +| **Checkpoint log** | Saves session IDs at FSM transitions; enables rewind-to recovery | `src/reliability/checkpoint-log.ts` | +| **Rate limiter** | Token bucket rate limiting for server endpoints | `src/server/rate-limiter.ts` | +| **Worker pool backpressure** | Adaptive polling with concurrency limits | 
`src/server/worker-pool.ts` | +| **Metrics aggregator** | Per-query metrics with token cost tracking | `src/worker/metrics-aggregator.ts` | +| **Worker lifecycle tracking** | Unique instance IDs for deduplication and debugging | `src/worker/worker-lifecycle.ts` | + +### Circuit Breaker **What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. @@ -122,7 +323,7 @@ tx mesh health # Shows open/half_open circuits tx spy # Watch for reliability:blocked activity ``` -### 2. Heartbeat Monitor +### Heartbeat Monitor **What it does**: Detects stuck workers and kills them. @@ -142,7 +343,7 @@ tx mesh health # Shows unhealthy agents with silence duration tx logs --component reliability # Heartbeat kill events ``` -### 3. Dead Letter Queue (DLQ) +### Dead Letter Queue (DLQ) **What it does**: Captures failed work with enough context to recover it. @@ -152,22 +353,16 @@ tx logs --component reliability # Heartbeat kill events - `manual`: Retries exhausted → needs human decision. **How entries are created**: -- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with sessionId, messages sent, and failure category - Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion **How recovery works**: -**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. - -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) - -2. 
**Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. - -3. **CLI**: `tx mesh recover ` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. - +1. **Automatic on startup**: `tx start` calls `recoverAll()` — recovers pending session_resume and requeue entries from the previous run (crash recovery only). +2. **Human-initiated via core agent** (preferred): User investigates via `tx mesh health` + `tx mesh dlq`, picks recovery strategy, core writes recovery message. +3. **CLI**: `tx mesh recover ` sends SIGUSR2 to running dispatcher. Shows available checkpoints first. 4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: `) to trigger DLQ recovery. - -5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If dispatcher isn't running, `tx mesh recover` writes a recovery message to msgs dir for next start. **Observe it**: ```bash @@ -182,13 +377,13 @@ tx mesh dlq clear # GC recovered entries **What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. **How checkpoints are saved**: -- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Every FSM mesh state transition saves the completing agent's session ID to SQLite - Checkpoint key: `mesh_name + state_name` → `session_id` -- Multiple checkpoints per state are kept (most recent wins on lookup) +- Multiple checkpoints per state kept (most recent wins on lookup) **How rewind-to works**: -When recovering from the DLQ, you can specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. 
This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. +When recovering from the DLQ, specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. The recovered worker resumes from after that state completed — skipping all bad work that happened after. ``` FSM: analyze → build → verify → complete @@ -232,7 +427,7 @@ Available checkpoints (use --rewind-to=): **When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). -### 4. SLI Tracker +### SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -254,7 +449,7 @@ tx mesh health my-mesh # Per-agent success rates tx mesh health --json # Full snapshot ``` -### 5. Safe Mode +### Safe Mode **What it does**: Restricts agent capabilities when reliability drops. @@ -281,7 +476,7 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` -### 6. Rate Limiter +### Rate Limiter **What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. @@ -290,9 +485,7 @@ tx spy # Watch safe-mode:blocked activity events - Automatic bucket cleanup every 5 minutes - Smooth rate limiting (not hard cutoff) -**Source**: `src/server/rate-limiter.ts` - -### 7. Worker Pool Backpressure +### Worker Pool Backpressure **What it does**: Adaptive polling with concurrency limits prevents queue overload. @@ -301,19 +494,18 @@ tx spy # Watch safe-mode:blocked activity events - Respects concurrency limits — won't spawn beyond capacity - Graceful shutdown drains active workers before terminating -**Source**: `src/server/worker-pool.ts` - -### 8. Metrics Aggregator +### Metrics Aggregator **What it does**: Per-query metrics collection with token cost tracking. -**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts. 
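The per-query tracking and lifetime totals can be sketched as follows. This is a hypothetical shape for illustration only, not the actual `src/worker/metrics-aggregator.ts` API:

```typescript
// Sketch: per-query metrics with lifetime aggregation.
// Interface and class names are illustrative, not the real aggregator's API.
interface QueryMetrics {
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  costUsd: number;
  toolCalls: number;
}

class MetricsAggregator {
  private queries: QueryMetrics[] = [];

  // Record one completed query's metrics.
  record(m: QueryMetrics): void {
    this.queries.push(m);
  }

  // Aggregate totals over the worker's lifetime: a single reduce
  // over every recorded query.
  totals(): QueryMetrics {
    return this.queries.reduce(
      (acc, m) => ({
        inputTokens: acc.inputTokens + m.inputTokens,
        outputTokens: acc.outputTokens + m.outputTokens,
        durationMs: acc.durationMs + m.durationMs,
        costUsd: acc.costUsd + m.costUsd,
        toolCalls: acc.toolCalls + m.toolCalls,
      }),
      { inputTokens: 0, outputTokens: 0, durationMs: 0, costUsd: 0, toolCalls: 0 },
    );
  }
}

const agg = new MetricsAggregator();
agg.record({ inputTokens: 1200, outputTokens: 300, durationMs: 4200, costUsd: 0.011, toolCalls: 3 });
agg.record({ inputTokens: 800, outputTokens: 150, durationMs: 2100, costUsd: 0.006, toolCalls: 1 });
console.log(agg.totals().inputTokens); // 2000
```

Keeping raw per-query records (rather than running sums) is what makes both the per-query view and the lifetime totals available from one structure.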
- -**Source**: `src/worker/metrics-aggregator.ts` +**How it works**: +- Tracks input/output tokens, duration, cost per query +- Aggregate totals for worker lifetime +- Tool call counts per worker -### 9. Worker Lifecycle Tracking +### Worker Lifecycle Tracking -**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. +**What it does**: Tracks parallel worker execution with unique instance IDs. **How it works**: - Generates unique worker IDs (`agentId-uuid`) @@ -321,27 +513,28 @@ tx spy # Watch safe-mode:blocked activity events - Persists worker state to disk - Tracks nudge counts and completion frontier -**Source**: `src/worker/worker-lifecycle.ts` - -## Test Mesh - -The `reliability-test` mesh is configured with tight thresholds for quick testing: -- Circuit breaker opens after 2 failures (not 3) -- Heartbeat kills after 120s (not 300s) -- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) - -```bash -# Run the test mesh -tx msg "Write a hello world function" --to reliability-test/planner +--- -# Monitor reliability during execution -tx mesh health reliability-test +## Configuration -# If failures occur, check DLQ -tx mesh dlq reliability-test +Set reliability thresholds in `.ai/tx/data/config.yaml`: -# Recover failed work -tx mesh recover reliability-test +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 ``` ## Front-Matter Options @@ -367,6 +560,27 @@ Agents can interact with reliability features via message front-matter: | `tx mesh recover --rewind-to=` | Recover rewinding to a specific FSM state | | `tx mesh 
recover --all` | Recover all pending DLQ entries | +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + ## Architecture ``` @@ -395,121 +609,47 @@ Agents can interact with reliability features via message front-matter: └────────────┘ └───────────┘ └───────────┘ ``` -## Reliability Roadmap — Human Review Gates - -Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. - -### Priority 1: Default-On Checkpoints + Replay - -**Impact**: 10x — turns N-step recovery into 1-step problem -**Effort**: Medium - -**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. - -**Human review steps**: -1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." -2. **Replay approval**: Before any rewind-to replay, core presents: - - Which checkpoint to rewind to - - What work will be replayed (states after the checkpoint) - - What work will be discarded (failed states) -3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. - -**Never automatic**: Replay does not happen without the user choosing a checkpoint. 
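Once the user has chosen a checkpoint, the replay is triggered with a recovery message. A sketch, using the `recover` and `rewind-to` front-matter options documented above; the `to:` target, state name, and body text are illustrative:

```markdown
---
to: my-mesh/builder
recover: true
rewind-to: build
---
User approved rewinding to the "build" checkpoint; drop the failed work after it and replay.
```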
- ---- +## Roadmap — Nine 4 (99.99%) -### Priority 2: Reliability Metrics Table + Tracking +| Priority | Feature | Impact | Effort | +|----------|---------|--------|--------| +| 1 | Retry-with-variation | 3-5x retry success improvement | Low | +| 2 | Output schema validation | Catches semantic failures early | Medium | +| 3 | Critical/non-critical agent classification | Prevents cascade from optional steps | Low | +| 4 | Aggregate observability dashboard | Finds the long-tail 0.01% | Medium | -**Impact**: Foundation for everything else -**Effort**: Low - -**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. - -**Human review steps**: -1. **Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." -2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." -3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" -4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. - -**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. - ---- - -### Priority 3: Retry-With-Variation on Routing/Protocol Failures - -**Impact**: 3-5x improvement on retry success -**Effort**: Low +### Retry-With-Variation **What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. -**Human review steps**: -1. 
**First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3."
-2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model").
-3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step.
-4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect.
+**How it will work**:
+- On first failure, the retry varies the approach: simplified prompt, dropped optional context, or fallback model
+- Each retry logs what changed for transparency
+- Exhausted retries present the full retry history with variations tried
-**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask.
-
---
-
-### Priority 4: Output Schema Validation
-
-**Impact**: Catches semantic failures early
-**Effort**: Medium
+### Output Schema Validation
**What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream.
-**Human review steps**:
-1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars."
-2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?"
-3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates."
-4. 
**Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. - -**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. - ---- - -### Priority 5: Critical / Non-Critical Agent Classification +**How it will work**: +- Mesh config defines `output_schema` per agent +- Post-completion hook validates output against schema +- Partial pass handling: presents what passed and what failed for human decision -**Impact**: Prevents cascade from optional steps -**Effort**: Low +### Critical/Non-Critical Agent Classification -**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. +**What it does**: Agents classified as `critical` (failure blocks mesh) or `non-critical` (failure logged, mesh continues). Prevents optional agents from taking down the whole workflow. -**Human review steps**: -1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." -2. **Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." -3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" -4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). 
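In a mesh config, the classification might look like this sketch. The `critical` flag matches the classification described above; the agent names and surrounding structure are illustrative:

```yaml
agents:
  planner:
    critical: true     # failure stops the mesh and presents recovery options
  builder:
    critical: true
  linter:
    critical: false    # failure is logged and surfaced; the mesh continues
```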
+**How it will work**: +- Agent config adds `critical: true|false` field (default: true) +- Non-critical failures logged and surfaced but don't block mesh +- Repeated non-critical failures prompt promotion decision -**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. - ---- - -### Priority 6: Aggregate Observability Dashboard - -**Impact**: Needed to find the long-tail 0.01% -**Effort**: Medium +### Aggregate Observability Dashboard **What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. -**Human review steps**: -1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." -2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." -3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" -4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. - -**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). - ---- - -### Human Review Principle - -Across all 6 priorities, the same principle applies: - -> **The system does work. 
The human makes decisions.** - -- Retries within limits → automatic (but visible) -- Recovery, replay, escalation → always human-approved -- Failures → always surfaced with context and options -- No silent state changes that affect mesh behavior +**How it will work**: +- Anomaly detection: sudden SLI drops, unusual failure patterns, cost spikes +- Trend data: success rates, DLQ utilization, MTTR over time +- Cost estimation before expensive recovery operations
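A first cut of the anomaly detection could be a windowed success-rate comparison. The sketch below is illustrative; the window size and drop threshold are placeholder values, not tx defaults:

```typescript
// Sketch: flag a sudden SLI (success-rate) drop between two windows of runs.
// Window size and threshold are illustrative placeholders.
function sliOf(outcomes: boolean[]): number {
  if (outcomes.length === 0) return 1;
  return outcomes.filter(Boolean).length / outcomes.length;
}

// Compare the most recent `window` runs against the `window` runs before them.
function anomalousDrop(outcomes: boolean[], window = 10, maxDrop = 0.1): boolean {
  if (outcomes.length < window * 2) return false; // not enough history
  const recent = sliOf(outcomes.slice(-window));
  const previous = sliOf(outcomes.slice(-window * 2, -window));
  return previous - recent > maxDrop;
}

// 10 successes, then a window with 3 failures: SLI drops 1.0 to 0.7.
const history = [...Array(10).fill(true), ...Array(7).fill(true), false, false, false];
console.log(anomalousDrop(history)); // true
```

A real implementation would presumably read run outcomes from the SLI tracker's store and route flagged meshes through the standard review workflow (diagnose, present, confirm, execute) rather than acting on its own.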