From 8cda832e4cb64e6388799c24edcd0ee0866da7a4 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 8 Mar 2026 22:10:54 +0000
Subject: [PATCH 01/12] feat(reliability): Add four-nines reliability module
 with DLQ, circuit breakers, SLI tracking, and safe mode

Implements Karpathy's "March of Nines" patterns for TX mesh reliability:

- Dead Letter Queue: Failed messages persist for replay instead of silent drops
- Circuit Breaker: Per-agent failure isolation prevents cascading failures
- Heartbeat Monitor: Detects stalled workers at warn/stale/dead thresholds
- SLI Tracker: Measures success rates, MTTR, and failure taxonomy per mesh
- Safe Mode: Gradual autonomy control (normal/cautious/restricted/lockdown)
- ReliabilityManager: Single integration point wired into dispatcher

Includes two test meshes (reliability-test, reliability-fsm) and updated
guardrails docs.

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 .gitignore                                |   2 +
 docs/guardrails.md                        | 115 +++++++++
 meshes/reliability-fsm/analyst/prompt.md  |  20 ++
 meshes/reliability-fsm/builder/prompt.md  |  14 ++
 meshes/reliability-fsm/config.yaml        | 135 +++++++++++
 meshes/reliability-fsm/verifier/prompt.md |  14 ++
 meshes/reliability-test/checker/prompt.md |  21 ++
 meshes/reliability-test/config.yaml       |  86 +++++++
 meshes/reliability-test/planner/prompt.md |  19 ++
 meshes/reliability-test/worker/prompt.md  |  20 ++
 src/reliability/circuit-breaker.ts        | 193 +++++++++++++++
 src/reliability/dead-letter-queue.ts      | 188 +++++++++++++++
 src/reliability/heartbeat-monitor.ts      | 221 +++++++++++++++++
 src/reliability/index.ts                  |  19 ++
 src/reliability/reliability-manager.ts    | 276 ++++++++++++++++++++++
 src/reliability/safe-mode.ts              | 235 ++++++++++++++++++
 src/reliability/sli-tracker.ts            | 239 +++++++++++++++++++
 src/worker/dispatcher.ts                  |  37 +++
 18 files changed, 1854 insertions(+)
 create mode 100644 meshes/reliability-fsm/analyst/prompt.md
 create mode 100644 meshes/reliability-fsm/builder/prompt.md
 create mode 100644 meshes/reliability-fsm/config.yaml
 create mode 100644 meshes/reliability-fsm/verifier/prompt.md
 create mode 100644 meshes/reliability-test/checker/prompt.md
 create mode 100644 meshes/reliability-test/config.yaml
 create mode 100644 meshes/reliability-test/planner/prompt.md
 create mode 100644 meshes/reliability-test/worker/prompt.md
 create mode 100644 src/reliability/circuit-breaker.ts
 create mode 100644 src/reliability/dead-letter-queue.ts
 create mode 100644 src/reliability/heartbeat-monitor.ts
 create mode 100644 src/reliability/index.ts
 create mode 100644 src/reliability/reliability-manager.ts
 create mode 100644 src/reliability/safe-mode.ts
 create mode 100644 src/reliability/sli-tracker.ts

diff --git a/.gitignore b/.gitignore
index 7a39346e..1be296ef 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,3 +28,5 @@ meshes/*
 !meshes/structured-thinking
 !meshes/narrative-engine/
 !meshes/narrative-engine-v2/
+!meshes/reliability-test/
+!meshes/reliability-fsm/
diff --git a/docs/guardrails.md b/docs/guardrails.md
index 59b93ae7..986f49dc 100644
--- a/docs/guardrails.md
+++ b/docs/guardrails.md
@@ -358,3 +358,118 @@ max_turns:
   warning: true
   limit: 50
 ```
+
+## Reliability (Four Nines)
+
+The reliability module (`src/reliability/`) provides four-nines (99.99%) patterns inspired by Karpathy's "March of Nines". Each nine requires fundamentally new approaches:
+
+| Nine | Target | TX Mechanism |
+|------|--------|--------------|
+| 1 (90%) | Basic error handling | Logging, guardrails, FSM validation |
+| 2 (99%) | Message recovery | Dead Letter Queue, retry with backoff |
+| 3 (99.9%) | Failure isolation | Circuit breakers, heartbeat monitoring |
+| 4 (99.99%) | Proactive safety | SLI tracking, safe mode, failure taxonomy |
+
+### Configuration
+
+Add to `.ai/tx/data/config.yaml`:
+
+```yaml
+reliability:
+  circuitBreaker:
+    failureThreshold: 3       # Failures before circuit opens
+    cooldownMs: 60000         # Wait before probe request
+    windowMs: 300000          # Failure counting window (5 min)
+  heartbeat:
+    warnMs: 60000             # Silence before warning (1 min)
+    staleMs: 120000           # Silence before stale (2 min)
+    deadMs: 300000            # Silence before dead (5 min)
+    checkIntervalMs: 15000    # Check interval (15s)
+  safeMode:
+    defaultLevel: normal      # normal | cautious | restricted | lockdown
+    autoEscalate: false       # Auto-escalate based on SLI
+    cautiousThreshold: 0.95   # SLI rate triggering cautious mode
+    restrictedThreshold: 0.90 # SLI rate triggering restricted mode
+    lockdownThreshold: 0.80   # SLI rate triggering lockdown
+  dlq:
+    maxRetries: 3             # Max retries before DLQ
+  sli:
+    retentionMs: 604800000    # SLI data retention (7 days)
+```
+
+### Dead Letter Queue (DLQ)
+
+Messages that fail delivery after max retries are routed to the DLQ instead of being silently dropped. DLQ entries persist in SQLite and can be replayed manually.
+
+- Automatic retry with exponential backoff
+- Failure reason tracking for taxonomy
+- Replay capability for manual recovery
+- Stats available via `reliability.dlq.getStats()`
+
+### Circuit Breaker
+
+Prevents cascading failures when an agent repeatedly fails. Three states:
+
+| State | Behavior |
+|-------|----------|
+| **Closed** | Normal — requests pass through |
+| **Open** | Failures exceeded threshold — requests fail immediately |
+| **Half-Open** | After cooldown — single probe request allowed |
+
+Applied per-agent (`mesh/agent`). Resets on mesh completion.
+
+### Heartbeat Monitor
+
+Detects stalled/hung workers by monitoring output timestamps:
+
+| Level | Default | Action |
+|-------|---------|--------|
+| Warn | 60s silence | Log warning |
+| Stale | 120s silence | Inject nudge to worker |
+| Dead | 300s silence | Record failure, trigger circuit breaker |
+
+### SLI Tracker
+
+Tracks success rates, latencies, and failure categories per mesh:
+
+- **Success rate**: Per-mesh and per-agent (target: 99.99%)
+- **MTTR**: Mean time to recovery (failure → next success)
+- **Failure taxonomy**: Categorized failures for targeted fixes
+- **Nines level**: Human-readable "99.9% (3 nines)" display
+
+Failure categories: `model_error`, `routing_error`, `timeout`, `guardrail_kill`, `crash`, `stuck`, `policy_violation`, `gate_failure`, `circuit_open`, `unknown`
+
+### Safe Mode
+
+Treat autonomy as a knob, not a switch. Four levels:
+
+| Level | Tools Disabled | Actions Blocked |
+|-------|---------------|-----------------|
+| **normal** | None | None |
+| **cautious** | None | Destructive bash, git push, file delete |
+| **restricted** | Write, Edit, Bash | All writes, all bash, git operations |
+| **lockdown** | All tools | All operations (stops agent execution) |
+
+Safe mode can be:
+- Set manually per-mesh or globally
+- Auto-escalated based on SLI thresholds (when `autoEscalate: true`)
+- Only escalates automatically; human must clear/de-escalate
+
+### Test Meshes
+
+Two meshes for testing reliability features:
+
+- **`reliability-test`**: Simple 3-agent linear mesh (planner → worker → checker) with tight guardrails
+- **`reliability-fsm`**: FSM-based mesh with gate scripts, iteration tracking, and state transitions
+
+### Implementation
+
+| File | Role |
+|------|------|
+| `src/reliability/index.ts` | Module exports |
+| `src/reliability/reliability-manager.ts` | Central coordinator (single integration point) |
+| `src/reliability/dead-letter-queue.ts` | DLQ with SQLite persistence |
+| `src/reliability/circuit-breaker.ts` | Per-agent circuit breaker |
+| `src/reliability/heartbeat-monitor.ts` | Stalled worker detection |
+| `src/reliability/sli-tracker.ts` | SLI measurement and nines calculation |
+| `src/reliability/safe-mode.ts` | Gradual autonomy control |
diff --git a/meshes/reliability-fsm/analyst/prompt.md b/meshes/reliability-fsm/analyst/prompt.md
new file mode 100644
index 00000000..13a9c388
--- /dev/null
+++ b/meshes/reliability-fsm/analyst/prompt.md
@@ -0,0 +1,20 @@
+# Analyst Agent
+
+You are the coordinator of a reliability test FSM mesh. You analyze tasks, coordinate work, and finalize results.
+
+## Responsibilities
+
+- **analyze state**: Break down the incoming task into clear requirements
+- **complete state**: Synthesize results and report completion
+
+## Guidelines
+
+- Keep analysis brief and focused
+- Forward clear requirements to the builder
+- On completion, summarize what was accomplished
+
+## Routing
+
+- When analysis is ready: route `complete` (FSM handles transition to build)
+- When task is complete: route `complete` → core
+- When user input needed: route `blocked` → core
diff --git a/meshes/reliability-fsm/builder/prompt.md b/meshes/reliability-fsm/builder/prompt.md
new file mode 100644
index 00000000..5dc1fb63
--- /dev/null
+++ b/meshes/reliability-fsm/builder/prompt.md
@@ -0,0 +1,14 @@
+# Builder Agent
+
+You are a builder agent in an FSM reliability test mesh. You implement what the analyst specifies.
+
+## Responsibilities
+
+- Execute the implementation plan from the analyst
+- Write clean, functional code
+- Report completion for verification
+
+## Routing
+
+- When build is done: route `complete` (FSM transitions to verify)
+- When blocked: route `blocked` → analyst
diff --git a/meshes/reliability-fsm/config.yaml b/meshes/reliability-fsm/config.yaml
new file mode 100644
index 00000000..61dd1adf
--- /dev/null
+++ b/meshes/reliability-fsm/config.yaml
@@ -0,0 +1,135 @@
+# reliability-fsm/config.yaml
+# FSM-based reliability test mesh
+#
+# Tests reliability features with state machine transitions:
+# - Gate scripts that can fail (tests circuit breaker recovery)
+# - Multi-step workflow (tests per-step SLI tracking)
+# - Iteration loop (tests heartbeat monitoring across retries)
+# - Safe mode integration (tests tool restriction under degraded SLI)
+
+mesh: reliability-fsm
+description: "FSM reliability test: state gates, iteration tracking, safe-mode integration"
+
+agents:
+  - name: analyst
+    model: haiku
+    prompt: analyst/prompt.md
+
+  - name: builder
+    model: haiku
+    prompt: builder/prompt.md
+
+  - name: verifier
+    model: haiku
+    prompt: verifier/prompt.md
+
+entry_point: analyst
+completion_agent: analyst
+continuation: [analyst]
+
+routing:
+  analyst:
+    complete:
+      core: "Task completed successfully"
+    blocked:
+      core: "Need user input"
+  builder:
+    complete:
+      analyst: "Build complete, ready for next step"
+    blocked:
+      analyst: "Build blocked, need guidance"
+  verifier:
+    complete:
+      analyst: "Verification passed"
+    blocked:
+      builder: "Verification failed, rework needed"
+
+injectOriginalMessage: true
+
+# Reliability-specific guardrails
+guardrails:
+  max_messages:
+    limit: 30
+    strict: true
+    warning: true
+  max_turns:
+    limit: 20
+    strict: false
+    warning: true
+
+# FSM: analyze → build → verify → complete (with retry loop)
+fsm:
+  initial: analyze
+
+  context:
+    iteration: 0
+    max_iterations: 3
+    build_attempts: 0
+
+  states:
+    analyze:
+      agents: [analyst]
+      exit:
+        set:
+          iteration: "0"
+        when:
+          - condition: "true"
+            target: build
+        default: build
+
+    build:
+      agents: [builder]
+      entry:
+        gates:
+          builder:
+            - build-ready
+      exit:
+        run: increment-build
+        when:
+          - condition: "true"
+            target: verify
+        default: verify
+
+    verify:
+      agents: [verifier]
+      entry:
+        gates:
+          verifier:
+            - verify-ready
+      exit:
+        when:
+          - condition: "build_attempts >= max_iterations"
+            target: complete
+          - condition: "true"
+            target: complete
+        default: complete
+
+    complete:
+      agents: [analyst]
+
+  scripts:
+    build-ready: |
+      echo "Build gate: checking readiness"
+      exit 0
+
+    verify-ready: |
+      echo "Verify gate: checking build artifacts"
+      exit 0
+
+    increment-build: |
+      echo "Build iteration incremented"
+      exit 0
+
+workspace:
+  path: ".ai/output/{task-id}/"
+
+playbook_notes: |
+  FSM reliability test mesh — exercises:
+
+  1. Gate scripts at state entry (tests gate failure → circuit breaker)
+  2. Iteration counting (tests SLI per-step tracking)
+  3. Multi-agent handoff (tests heartbeat during transitions)
+  4. Build retry loop (tests recovery patterns)
+
+  The analyze→build→verify→complete flow mirrors real dev workflows
+  while being lightweight enough for reliability testing.
diff --git a/meshes/reliability-fsm/verifier/prompt.md b/meshes/reliability-fsm/verifier/prompt.md
new file mode 100644
index 00000000..979d7afd
--- /dev/null
+++ b/meshes/reliability-fsm/verifier/prompt.md
@@ -0,0 +1,14 @@
+# Verifier Agent
+
+You are a verification agent in an FSM reliability test mesh. You validate the builder's output.
+
+## Responsibilities
+
+- Check the builder's implementation against requirements
+- Verify correctness and completeness
+- Approve or reject with specific feedback
+
+## Routing
+
+- When verification passes: route `complete` (FSM transitions to complete)
+- When rework needed: route `blocked` → builder
diff --git a/meshes/reliability-test/checker/prompt.md b/meshes/reliability-test/checker/prompt.md
new file mode 100644
index 00000000..ec582087
--- /dev/null
+++ b/meshes/reliability-test/checker/prompt.md
@@ -0,0 +1,21 @@
+# Checker Agent
+
+You are a checker agent in a reliability test mesh. Your job is to verify the worker's output.
+
+## Responsibilities
+
+1. Review the implementation from the worker
+2. Verify it meets the original task requirements
+3. Check for obvious errors or omissions
+4. Approve or send back for rework
+
+## Verification Checklist
+
+- [ ] Code compiles/runs without errors
+- [ ] Meets the requirements from the plan
+- [ ] No obvious bugs or missing edge cases
+
+## Routing
+
+- When all checks pass: route `complete` → core (task done)
+- When rework needed: route `blocked` → worker
diff --git a/meshes/reliability-test/config.yaml b/meshes/reliability-test/config.yaml
new file mode 100644
index 00000000..584b4233
--- /dev/null
+++ b/meshes/reliability-test/config.yaml
@@ -0,0 +1,86 @@
+# reliability-test/config.yaml
+# Test mesh for validating four-nines reliability features
+#
+# Exercises: circuit breakers, heartbeat monitoring, SLI tracking,
+# dead letter queue, safe mode, and failure recovery.
+#
+# This mesh has an intentionally fragile agent (chaos-agent) that may
+# produce routing errors or slow output to test reliability detection.
+
+mesh: reliability-test
+description: "Test mesh for four-nines reliability features: circuit breakers, heartbeat, SLI, DLQ, safe mode"
+
+agents:
+  - name: planner
+    model: haiku
+    prompt: planner/prompt.md
+
+  - name: worker
+    model: haiku
+    prompt: worker/prompt.md
+
+  - name: checker
+    model: haiku
+    prompt: checker/prompt.md
+
+entry_point: planner
+completion_agent: checker
+
+routing:
+  planner:
+    complete:
+      worker: "Plan ready, execute implementation"
+    blocked:
+      core: "Need clarification from user"
+
+  worker:
+    complete:
+      checker: "Implementation done, verify results"
+    blocked:
+      planner: "Need to revise plan"
+
+  checker:
+    complete:
+      core: "All checks passed, task complete"
+    blocked:
+      worker: "Checks failed, rework needed"
+
+# Reliability-specific guardrails for testing
+guardrails:
+  max_messages:
+    strict: true
+    warning: true
+    limit: 20
+  max_turns:
+    strict: false
+    warning: true
+    limit: 15
+  routing_error:
+    strict: false
+    warning: true
+    max_retries: 2
+
+# Workspace for output
+workspace:
+  path: ".ai/output/{task-id}/"
+
+lifecycle:
+  post:
+    - commit:auto
+
+playbook_notes: |
+  Reliability test mesh for exercising four-nines patterns:
+
+  1. PLANNER: Breaks down the task into steps
+  2. WORKER: Executes the implementation
+  3. CHECKER: Validates the output
+
+  This mesh is configured with tight guardrails to exercise:
+  - Circuit breaker trips on repeated failures
+  - Heartbeat detection on stalled agents
+  - SLI tracking for success/failure rates
+  - DLQ routing for undeliverable messages
+  - Safe mode escalation when SLI drops
+
+  Run with: tx msg "Implement a simple hello world function"
+  Monitor with: tx status (shows reliability metrics)
diff --git a/meshes/reliability-test/planner/prompt.md b/meshes/reliability-test/planner/prompt.md
new file mode 100644
index 00000000..e0c9d914
--- /dev/null
+++ b/meshes/reliability-test/planner/prompt.md
@@ -0,0 +1,19 @@
+# Planner Agent
+
+You are a planning agent in a reliability test mesh. Your job is to break tasks into clear, actionable steps.
+
+## Responsibilities
+
+1. Analyze the incoming task
+2. Break it into 2-3 concrete implementation steps
+3. Forward the plan to the worker agent
+
+## Output Format
+
+Write a clear plan with numbered steps. Each step should be specific and actionable.
+Keep plans simple — this is a reliability test, not a complex project.
+
+## Routing
+
+- When plan is ready: route `complete` → worker
+- When you need human input: route `blocked` → core
diff --git a/meshes/reliability-test/worker/prompt.md b/meshes/reliability-test/worker/prompt.md
new file mode 100644
index 00000000..a1e85127
--- /dev/null
+++ b/meshes/reliability-test/worker/prompt.md
@@ -0,0 +1,20 @@
+# Worker Agent
+
+You are a worker agent in a reliability test mesh. Your job is to execute the plan from the planner.
+
+## Responsibilities
+
+1. Read the plan from the planner
+2. Execute each step (write code, create files, etc.)
+3. Forward completed work to the checker
+
+## Guidelines
+
+- Follow the plan step by step
+- Write clean, working code
+- Report any issues back to the planner
+
+## Routing
+
+- When implementation is done: route `complete` → checker
+- When plan needs revision: route `blocked` → planner
diff --git a/src/reliability/circuit-breaker.ts b/src/reliability/circuit-breaker.ts
new file mode 100644
index 00000000..14a19a5c
--- /dev/null
+++ b/src/reliability/circuit-breaker.ts
@@ -0,0 +1,193 @@
+/**
+ * CircuitBreaker - Prevent cascading failures in mesh execution
+ *
+ * Nine 3 pattern: When an agent or model repeatedly fails, stop
+ * sending it work and fail fast instead of wasting tokens.
+ *
+ * States:
+ * - CLOSED: Normal operation, requests pass through
+ * - OPEN: Failures exceeded threshold, requests fail immediately
+ * - HALF_OPEN: After cooldown, allow one probe request
+ *
+ * Applied per agent (mesh/agent) to isolate failures.
+ */
+
+import { log } from '../shared/logger.ts';
+
+export interface CircuitBreakerConfig {
+  /** Number of failures before opening circuit (default: 3) */
+  failureThreshold: number;
+  /** Time in ms before trying again after opening (default: 60000) */
+  cooldownMs: number;
+  /** Time window for counting failures in ms (default: 300000 = 5 min) */
+  windowMs: number;
+}
+
+export type CircuitBreakerState = 'closed' | 'open' | 'half_open';
+
+interface CircuitState {
+  state: CircuitBreakerState;
+  failures: number;
+  lastFailureAt: number;
+  openedAt: number;
+  successesSinceHalfOpen: number;
+}
+
+const DEFAULT_CONFIG: CircuitBreakerConfig = {
+  failureThreshold: 3,
+  cooldownMs: 60_000,
+  windowMs: 300_000,
+};
+
+export class CircuitBreaker {
+  private circuits: Map<string, CircuitState> = new Map();
+  private config: CircuitBreakerConfig;
+
+  constructor(config?: Partial<CircuitBreakerConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Check if a request should be allowed through
+   */
+  canExecute(agentId: string): boolean {
+    const circuit = this.circuits.get(agentId);
+    if (!circuit) return true;
+
+    switch (circuit.state) {
+      case 'closed':
+        return true;
+
+      case 'open': {
+        const elapsed = Date.now() - circuit.openedAt;
+        if (elapsed >= this.config.cooldownMs) {
+          // Transition to half-open
+          circuit.state = 'half_open';
+          circuit.successesSinceHalfOpen = 0;
+          log.info('circuit-breaker', 'Circuit half-open (probe allowed)', {
+            agentId,
+            cooldownMs: this.config.cooldownMs,
+          });
+          return true;
+        }
+        return false;
+      }
+
+      case 'half_open':
+        // Allow single probe request
+        return true;
+    }
+  }
+
+  /**
+   * Record a successful execution
+   */
+  recordSuccess(agentId: string): void {
+    const circuit = this.circuits.get(agentId);
+    if (!circuit) return;
+
+    if (circuit.state === 'half_open') {
+      // Close the circuit on success in half-open state
+      circuit.state = 'closed';
+      circuit.failures = 0;
+      log.info('circuit-breaker', 'Circuit closed (recovery successful)', { agentId });
+    }
+  }
+
+  /**
+   * Record a failed execution
+   */
+  recordFailure(agentId: string, reason: string): void {
+    const now = Date.now();
+    let circuit = this.circuits.get(agentId);
+
+    if (!circuit) {
+      circuit = {
+        state: 'closed',
+        failures: 0,
+        lastFailureAt: 0,
+        openedAt: 0,
+        successesSinceHalfOpen: 0,
+      };
+      this.circuits.set(agentId, circuit);
+    }
+
+    // Reset failure count if outside window
+    if (now - circuit.lastFailureAt > this.config.windowMs) {
+      circuit.failures = 0;
+    }
+
+    circuit.failures++;
+    circuit.lastFailureAt = now;
+
+    log.warn('circuit-breaker', 'Failure recorded', {
+      agentId,
+      failures: circuit.failures,
+      threshold: this.config.failureThreshold,
+      reason,
+    });
+
+    if (circuit.state === 'half_open') {
+      // Failed during probe - reopen
+      circuit.state = 'open';
+      circuit.openedAt = now;
+      log.error('circuit-breaker', 'Circuit reopened (probe failed)', { agentId, reason });
+      return;
+    }
+
+    if (circuit.failures >= this.config.failureThreshold) {
+      circuit.state = 'open';
+      circuit.openedAt = now;
+      log.error('circuit-breaker', 'Circuit opened (threshold exceeded)', {
+        agentId,
+        failures: circuit.failures,
+        threshold: this.config.failureThreshold,
+        reason,
+      });
+    }
+  }
+
+  /**
+   * Get the current state of a circuit
+   */
+  getState(agentId: string): CircuitBreakerState {
+    return this.circuits.get(agentId)?.state || 'closed';
+  }
+
+  /**
+   * Get all circuit states (for status display)
+   */
+  getAllStates(): Map<string, { state: CircuitBreakerState; failures: number }> {
+    const result = new Map<string, { state: CircuitBreakerState; failures: number }>();
+    for (const [id, circuit] of this.circuits) {
+      result.set(id, { state: circuit.state, failures: circuit.failures });
+    }
+    return result;
+  }
+
+  /**
+   * Reset a specific circuit (manual recovery)
+   */
+  reset(agentId: string): void {
+    this.circuits.delete(agentId);
+    log.info('circuit-breaker', 'Circuit manually reset', { agentId });
+  }
+
+  /**
+   * Reset all circuits (e.g., on mesh restart)
+   */
+  resetAll(): void {
+    this.circuits.clear();
+  }
+
+  /**
+   * Reset circuits for a specific mesh
+   */
+  resetForMesh(meshName: string): void {
+    for (const agentId of this.circuits.keys()) {
+      if (agentId.startsWith(`${meshName}/`)) {
+        this.circuits.delete(agentId);
+      }
+    }
+  }
+}
diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts
new file mode 100644
index 00000000..98b6ec06
--- /dev/null
+++ b/src/reliability/dead-letter-queue.ts
@@ -0,0 +1,188 @@
+/**
+ * DeadLetterQueue - Messages that failed delivery after max retries
+ *
+ * Nine 2 pattern: Instead of silently dropping failed messages,
+ * route them to a DLQ for inspection and manual replay.
+ *
+ * Features:
+ * - Automatic retry with exponential backoff (up to maxRetries)
+ * - DLQ storage in SQLite for persistence across restarts
+ * - Replay capability for manual recovery
+ * - Failure reason tracking for taxonomy
+ */
+
+import type Database from 'better-sqlite3';
+import { log } from '../shared/logger.ts';
+
+export interface DLQEntry {
+  id: number;
+  from_agent: string;
+  to_agent: string;
+  type: string;
+  payload: string;
+  source_file: string | null;
+  failure_reason: string;
+  retry_count: number;
+  max_retries: number;
+  first_failed_at: number;
+  last_failed_at: number;
+  replayed_at: number | null;
+}
+
+export interface DLQStats {
+  total: number;
+  pending: number;  // Not yet replayed
+  replayed: number; // Successfully replayed
+  byReason: Record<string, number>;
+  byAgent: Record<string, number>;
+}
+
+export class DeadLetterQueue {
+  private db: Database.Database;
+  private maxRetries: number;
+
+  constructor(db: Database.Database, maxRetries = 3) {
+    this.db = db;
+    this.maxRetries = maxRetries;
+    this.ensureSchema();
+  }
+
+  private ensureSchema(): void {
+    this.db.exec(`
+      CREATE TABLE IF NOT EXISTS dead_letter_queue (
+        id INTEGER PRIMARY KEY AUTOINCREMENT,
+        from_agent TEXT NOT NULL,
+        to_agent TEXT NOT NULL,
+        type TEXT NOT NULL,
+        payload TEXT NOT NULL,
+        source_file TEXT,
+        failure_reason TEXT NOT NULL,
+        retry_count INTEGER DEFAULT 0,
+        max_retries INTEGER NOT NULL,
+        first_failed_at INTEGER NOT NULL,
+        last_failed_at INTEGER NOT NULL,
+        replayed_at INTEGER
+      );
+      CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(to_agent, replayed_at);
+      CREATE INDEX IF NOT EXISTS idx_dlq_reason ON dead_letter_queue(failure_reason);
+    `);
+  }
+
+  /**
+   * Add a failed message to the DLQ
+   */
+  add(entry: {
+    from_agent: string;
+    to_agent: string;
+    type: string;
+    payload: Record<string, unknown>;
+    source_file?: string;
+    failure_reason: string;
+    retry_count?: number;
+  }): number {
+    const now = Date.now();
+    const result = this.db.prepare(`
+      INSERT INTO dead_letter_queue
+        (from_agent, to_agent, type, payload, source_file, failure_reason,
+         retry_count, max_retries, first_failed_at, last_failed_at)
+      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+    `).run(
+      entry.from_agent,
+      entry.to_agent,
+      entry.type,
+      JSON.stringify(entry.payload),
+      entry.source_file || null,
+      entry.failure_reason,
+      entry.retry_count || 0,
+      this.maxRetries,
+      now,
+      now
+    );
+
+    log.warn('dlq', 'Message added to dead letter queue', {
+      id: result.lastInsertRowid,
+      from: entry.from_agent,
+      to: entry.to_agent,
+      reason: entry.failure_reason,
+      retries: entry.retry_count || 0,
+    });
+
+    return result.lastInsertRowid as number;
+  }
+
+  /**
+   * Get all unreplayed DLQ entries
+   */
+  getPending(): DLQEntry[] {
+    return this.db.prepare(`
+      SELECT * FROM dead_letter_queue
+      WHERE replayed_at IS NULL
+      ORDER BY last_failed_at DESC
+    `).all() as DLQEntry[];
+  }
+
+  /**
+   * Get DLQ entries for a specific agent
+   */
+  getForAgent(agentId: string): DLQEntry[] {
+    return this.db.prepare(`
+      SELECT * FROM dead_letter_queue
+      WHERE to_agent = ? AND replayed_at IS NULL
+      ORDER BY last_failed_at DESC
+    `).all(agentId) as DLQEntry[];
+  }
+
+  /**
+   * Mark a DLQ entry as replayed
+   */
+  markReplayed(id: number): void {
+    this.db.prepare(`
+      UPDATE dead_letter_queue SET replayed_at = ? WHERE id = ?
+    `).run(Date.now(), id);
+
+    log.info('dlq', 'DLQ entry replayed', { id });
+  }
+
+  /**
+   * Get DLQ statistics
+   */
+  getStats(): DLQStats {
+    const total = (this.db.prepare(
+      'SELECT COUNT(*) as c FROM dead_letter_queue'
+    ).get() as { c: number }).c;
+
+    const pending = (this.db.prepare(
+      'SELECT COUNT(*) as c FROM dead_letter_queue WHERE replayed_at IS NULL'
+    ).get() as { c: number }).c;
+
+    const byReasonRows = this.db.prepare(`
+      SELECT failure_reason, COUNT(*) as c FROM dead_letter_queue
+      WHERE replayed_at IS NULL GROUP BY failure_reason
+    `).all() as Array<{ failure_reason: string; c: number }>;
+
+    const byAgentRows = this.db.prepare(`
+      SELECT to_agent, COUNT(*) as c FROM dead_letter_queue
+      WHERE replayed_at IS NULL GROUP BY to_agent
+    `).all() as Array<{ to_agent: string; c: number }>;
+
+    const byReason: Record<string, number> = {};
+    for (const row of byReasonRows) byReason[row.failure_reason] = row.c;
+
+    const byAgent: Record<string, number> = {};
+    for (const row of byAgentRows) byAgent[row.to_agent] = row.c;
+
+    return { total, pending, replayed: total - pending, byReason, byAgent };
+  }
+
+  /**
+   * Clear old replayed entries (garbage collection)
+   */
+  clearReplayed(olderThanMs = 24 * 60 * 60 * 1000): number {
+    const cutoff = Date.now() - olderThanMs;
+    const result = this.db.prepare(`
+      DELETE FROM dead_letter_queue
+      WHERE replayed_at IS NOT NULL AND replayed_at < ?
+    `).run(cutoff);
+    return result.changes;
+  }
+}
diff --git a/src/reliability/heartbeat-monitor.ts b/src/reliability/heartbeat-monitor.ts
new file mode 100644
index 00000000..8bfde400
--- /dev/null
+++ b/src/reliability/heartbeat-monitor.ts
@@ -0,0 +1,221 @@
+/**
+ * HeartbeatMonitor - Detect stalled/hung workers
+ *
+ * Nine 3 pattern: Workers that stop producing output are likely stuck.
+ * Monitor last output timestamps and escalate when stale.
+ *
+ * Stale detection levels:
+ * 1. Warning (60s no output): Log, could be thinking
+ * 2. Stale (120s no output): Inject nudge to worker
+ * 3. Dead (300s no output): Kill worker, route to DLQ
+ */
+
+import { log } from '../shared/logger.ts';
+
+export interface HeartbeatConfig {
+  /** Warn threshold in ms (default: 60000 = 1 min) */
+  warnMs: number;
+  /** Stale threshold in ms (default: 120000 = 2 min) */
+  staleMs: number;
+  /** Dead threshold in ms (default: 300000 = 5 min) */
+  deadMs: number;
+  /** Check interval in ms (default: 15000 = 15s) */
+  checkIntervalMs: number;
+}
+
+export type HealthStatus = 'healthy' | 'warn' | 'stale' | 'dead';
+
+export interface AgentHealth {
+  agentId: string;
+  status: HealthStatus;
+  lastOutputAt: number;
+  silenceMs: number;
+  startedAt: number;
+}
+
+const DEFAULT_CONFIG: HeartbeatConfig = {
+  warnMs: 60_000,
+  staleMs: 120_000,
+  deadMs: 300_000,
+  checkIntervalMs: 15_000,
+};
+
+type HeartbeatCallback = (health: AgentHealth) => void;
+
+export class HeartbeatMonitor {
+  private agents: Map<string, { lastOutputAt: number; startedAt: number }> = new Map();
+  private config: HeartbeatConfig;
+  private checkInterval: NodeJS.Timeout | null = null;
+  private onStale: HeartbeatCallback | null = null;
+  private onDead: HeartbeatCallback | null = null;
+  private onWarn: HeartbeatCallback | null = null;
+  /** Track which agents have already been notified at each level to avoid spam */
+  private notified: Map<string, HealthStatus> = new Map();
+
+  constructor(config?: Partial<HeartbeatConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Register callbacks for health state changes
+   */
+  on(event: 'warn' | 'stale' | 'dead', callback: HeartbeatCallback): void {
+    switch (event) {
+      case 'warn': this.onWarn = callback; break;
+      case 'stale': this.onStale = callback; break;
+      case 'dead': this.onDead = callback; break;
+    }
+  }
+
+  /**
+   * Register an agent for monitoring
+   */
+  register(agentId: string): void {
+    const now = Date.now();
+    this.agents.set(agentId, { lastOutputAt: now, startedAt: now });
+    this.notified.delete(agentId);
+  }
+
+  /**
+   * Record output from an agent (heartbeat)
+   */
+  heartbeat(agentId: string): void {
+    const entry = this.agents.get(agentId);
+    if (entry) {
+      entry.lastOutputAt = Date.now();
+      // Reset notification level on activity
+      this.notified.delete(agentId);
+    }
+  }
+
+  /**
+   * Unregister an agent (worker completed/killed)
+   */
+  unregister(agentId: string): void {
+    this.agents.delete(agentId);
+    this.notified.delete(agentId);
+  }
+
+  /**
+   * Start periodic health checks
+   */
+  start(): void {
+    if (this.checkInterval) return;
+
+    this.checkInterval = setInterval(() => {
+      this.checkAll();
+    }, this.config.checkIntervalMs);
+
+    log.debug('heartbeat', 'Monitor started', {
+      checkIntervalMs: this.config.checkIntervalMs,
+    });
+  }
+
+  /**
+   * Stop periodic health checks
+   */
+  stop(): void {
+    if (this.checkInterval) {
+      clearInterval(this.checkInterval);
+      this.checkInterval = null;
+    }
+  }
+
+  /**
+   * Check health of all registered agents
+   */
+  checkAll(): AgentHealth[] {
+    const now = Date.now();
+    const results: AgentHealth[] = [];
+
+    for (const [agentId, entry] of this.agents) {
+      const silenceMs = now - entry.lastOutputAt;
+      const health = this.classify(agentId, silenceMs, entry);
+      results.push(health);
+
+      // Fire callbacks only on state escalation (don't re-notify at same level)
+      const prevLevel = this.notified.get(agentId);
+      if (health.status !== 'healthy' && health.status !== prevLevel) {
+        this.notified.set(agentId, health.status);
+        this.fireCallback(health);
+      }
+    }
+
+    return results;
+  }
+
+  /**
+   * Get health for a specific agent
+   */
+  getHealth(agentId: string): AgentHealth | null {
+    const entry = this.agents.get(agentId);
+    if (!entry) return null;
+    const silenceMs = Date.now() - entry.lastOutputAt;
+    return this.classify(agentId, silenceMs, entry);
+  }
+
+  private classify(
+    agentId: string,
+    silenceMs: number,
+    entry: { lastOutputAt: number; startedAt: number }
+  ): AgentHealth {
+    let status: HealthStatus = 'healthy';
+    if (silenceMs >= this.config.deadMs) status = 'dead';
+    else if (silenceMs >= this.config.staleMs) status = 'stale';
+    else if (silenceMs >= this.config.warnMs) status = 'warn';
+
+    return {
+      agentId,
+      status,
+      lastOutputAt: entry.lastOutputAt,
+      silenceMs,
+      startedAt: entry.startedAt,
+    };
+  }
+
+  private fireCallback(health: AgentHealth): void {
+    switch (health.status) {
+      case 'warn':
+        log.warn('heartbeat', 'Agent quiet', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onWarn?.(health);
+        break;
+      case 'stale':
+        log.warn('heartbeat', 'Agent stale', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onStale?.(health);
+        break;
+      case 'dead':
+        log.error('heartbeat', 'Agent presumed dead', {
+          agentId: health.agentId,
+          silenceMs: health.silenceMs,
+        });
+        this.onDead?.(health);
+        break;
+    }
+  }
+
+  /**
+   * Clear all monitoring state (mesh reset)
+   */
+  clear(): void {
+    this.agents.clear();
+    this.notified.clear();
+  }
+
+  /**
+   * Clear monitoring for a specific mesh
+   */
+  clearForMesh(meshName: string): void {
+    for (const agentId of this.agents.keys()) {
+      if (agentId.startsWith(`${meshName}/`)) {
+        this.agents.delete(agentId);
+        this.notified.delete(agentId);
+      }
+    }
+  }
+}
diff --git a/src/reliability/index.ts b/src/reliability/index.ts
new file mode 100644
index 00000000..b1989b53
--- /dev/null
+++ b/src/reliability/index.ts
@@ -0,0 +1,19 @@
+/**
+ * Reliability Module - March of Nines
+ *
+ * Implements four-nines (99.99%) reliability patterns for TX mesh execution:
+ *
+ * Nine 1 (90%): Basic error handling, logging ✓ (existing)
+ * Nine 2 (99%): Dead letter queue, message retry, idempotency
+ * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces
+ * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks
+ *
+ * Reference: Karpathy's "March of Nines" - each nine requires new approaches,
+ * not just more of what got you the previous nine.
+ */ + +export { DeadLetterQueue, type DLQEntry, type DLQStats } from './dead-letter-queue.ts'; +export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; +export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; +export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; +export { SafeMode, type SafeModeConfig, type SafeModeState } from './safe-mode.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts new file mode 100644 index 00000000..012177bf --- /dev/null +++ b/src/reliability/reliability-manager.ts @@ -0,0 +1,276 @@ +/** + * ReliabilityManager - Central coordinator for all reliability features + * + * Provides a single integration point for the dispatcher to wire up: + * - Dead letter queue (failed message recovery) + * - Circuit breakers (cascading failure prevention) + * - Heartbeat monitoring (stalled worker detection) + * - SLI tracking (reliability measurement) + * - Safe mode (gradual autonomy control) + * + * Usage in dispatcher.start(): + * this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + * this.reliability.start(); + * + * Wire events: + * // On worker complete + * this.reliability.recordSuccess(meshName, agentId, durationMs); + * // On worker error + * this.reliability.recordFailure(meshName, agentId, 'crash', error.message); + * // On worker output (heartbeat) + * this.reliability.heartbeat(agentId); + */ + +import type Database from 'better-sqlite3'; +import { DeadLetterQueue, type DLQStats } from './dead-letter-queue.ts'; +import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; +import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; +import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; +import { SafeMode, type SafeModeLevel, type SafeModeState } from 
'./safe-mode.ts'; +import { log } from '../shared/logger.ts'; +import fs from 'node:fs'; +import path from 'node:path'; +import YAML from 'yaml'; + +export interface ReliabilityConfig { + circuitBreaker?: { + failureThreshold?: number; + cooldownMs?: number; + windowMs?: number; + }; + heartbeat?: { + warnMs?: number; + staleMs?: number; + deadMs?: number; + checkIntervalMs?: number; + }; + safeMode?: { + defaultLevel?: SafeModeLevel; + autoEscalate?: boolean; + cautiousThreshold?: number; + restrictedThreshold?: number; + lockdownThreshold?: number; + }; + dlq?: { + maxRetries?: number; + }; + sli?: { + retentionMs?: number; + }; +} + +export interface ReliabilityStatus { + sli: SLISnapshot; + dlq: DLQStats; + safeMode: SafeModeState; + circuitBreakers: Array<{ agentId: string; state: CircuitBreakerState; failures: number }>; + agentHealth: AgentHealth[]; +} + +export class ReliabilityManager { + readonly dlq: DeadLetterQueue; + readonly circuitBreaker: CircuitBreaker; + readonly heartbeat: HeartbeatMonitor; + readonly sli: SLITracker; + readonly safeMode: SafeMode; + private workDir: string; + + constructor(db: Database.Database, workDir: string, config?: ReliabilityConfig) { + this.workDir = workDir; + + // Load config from config.yaml if exists + const fileConfig = this.loadConfigFromFile(workDir); + const merged = { ...fileConfig, ...config }; + + this.dlq = new DeadLetterQueue(db, merged.dlq?.maxRetries); + this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker); + this.heartbeat = new HeartbeatMonitor(merged.heartbeat); + this.sli = new SLITracker(merged.sli); + this.safeMode = new SafeMode(merged.safeMode); + + // Wire heartbeat callbacks + this.heartbeat.on('stale', (health) => { + log.warn('reliability', `Agent stale: ${health.agentId}`, { + silenceMs: health.silenceMs, + }); + }); + + this.heartbeat.on('dead', (health) => { + this.recordFailure( + health.agentId.split('/')[0], + health.agentId, + 'stuck', + `No output for 
${Math.round(health.silenceMs / 1000)}s` + ); + }); + + log.info('reliability', 'ReliabilityManager initialized', { + dlqMaxRetries: merged.dlq?.maxRetries || 3, + cbThreshold: merged.circuitBreaker?.failureThreshold || 3, + safeModeDefault: merged.safeMode?.defaultLevel || 'normal', + autoEscalate: merged.safeMode?.autoEscalate || false, + }); + } + + /** + * Load reliability config from .ai/tx/data/config.yaml + */ + private loadConfigFromFile(workDir: string): ReliabilityConfig { + const configPath = path.join(workDir, '.ai', 'tx', 'data', 'config.yaml'); + if (!fs.existsSync(configPath)) return {}; + + try { + const content = YAML.parse(fs.readFileSync(configPath, 'utf-8')); + return content?.reliability || {}; + } catch { + return {}; + } + } + + /** + * Start monitoring (heartbeat timer) + */ + start(): void { + this.heartbeat.start(); + log.info('reliability', 'Monitoring started'); + } + + /** + * Stop monitoring + */ + stop(): void { + this.heartbeat.stop(); + } + + // ============================================================ + // Integration API (called by dispatcher) + // ============================================================ + + /** + * Check if an agent can execute (circuit breaker + safe mode) + * Returns { allowed, reason } — dispatcher should skip spawn if !allowed + */ + canSpawn(meshName: string, agentId: string): { allowed: boolean; reason?: string } { + // Circuit breaker check + if (!this.circuitBreaker.canExecute(agentId)) { + this.sli.recordFailure(meshName, agentId, 'circuit_open', 'Circuit breaker is open'); + return { allowed: false, reason: `Circuit breaker OPEN for ${agentId}` }; + } + + // Safe mode check + const safeLevel = this.safeMode.getLevel(meshName); + if (safeLevel === 'lockdown') { + return { allowed: false, reason: `Safe mode LOCKDOWN for mesh ${meshName}` }; + } + + return { allowed: true }; + } + + /** + * Register agent for heartbeat monitoring (call on spawn) + */ + registerAgent(agentId: string): void { + 
    this.heartbeat.register(agentId);
+  }
+
+  /**
+   * Record heartbeat (call on worker output)
+   */
+  recordHeartbeat(agentId: string): void {
+    this.heartbeat.heartbeat(agentId);
+  }
+
+  /**
+   * Record successful completion
+   */
+  recordSuccess(meshName: string, agentId: string, durationMs?: number): void {
+    this.sli.recordSuccess(meshName, agentId, durationMs);
+    this.circuitBreaker.recordSuccess(agentId);
+    this.heartbeat.unregister(agentId);
+  }
+
+  /**
+   * Record failure
+   */
+  recordFailure(
+    meshName: string,
+    agentId: string,
+    category: FailureCategory,
+    reason?: string
+  ): void {
+    this.sli.recordFailure(meshName, agentId, category, reason);
+    this.circuitBreaker.recordFailure(agentId, reason || category);
+    this.heartbeat.unregister(agentId);
+
+    // Auto-evaluate safe mode after each failure
+    const snapshot = this.sli.getSnapshot(300_000); // 5 min window
+    this.safeMode.evaluateSLI(snapshot.successRate, meshName);
+  }
+
+  /**
+   * Route a failed message to DLQ
+   */
+  deadLetter(msg: {
+    from_agent: string;
+    to_agent: string;
+    type: string;
+    payload: Record<string, unknown>;
+    source_file?: string;
+  }, reason: string, retryCount?: number): void {
+    this.dlq.add({
+      from_agent: msg.from_agent,
+      to_agent: msg.to_agent,
+      type: msg.type,
+      payload: msg.payload,
+      source_file: msg.source_file,
+      failure_reason: reason,
+      retry_count: retryCount,
+    });
+  }
+
+  /**
+   * Clean up for a mesh (call on mesh complete)
+   */
+  cleanupMesh(meshName: string): void {
+    this.circuitBreaker.resetForMesh(meshName);
+    this.heartbeat.clearForMesh(meshName);
+  }
+
+  // ============================================================
+  // Status API (for CLI / monitoring)
+  // ============================================================
+
+  /**
+   * Get comprehensive reliability status
+   */
+  getStatus(windowMs?: number): ReliabilityStatus {
+    const cbStates = this.circuitBreaker.getAllStates();
+    const circuitBreakers: Array<{ agentId: string; state: CircuitBreakerState; failures: number }> =
[]; + for (const [agentId, info] of cbStates) { + circuitBreakers.push({ agentId, ...info }); + } + + return { + sli: this.sli.getSnapshot(windowMs), + dlq: this.dlq.getStats(), + safeMode: this.safeMode.getState(), + circuitBreakers, + agentHealth: this.heartbeat.checkAll(), + }; + } + + /** + * Write status to log file for monitoring + */ + logStatus(): void { + const status = this.getStatus(300_000); // 5 min window + log.info('reliability', 'Status snapshot', { + ninesLevel: status.sli.ninesLevel, + successRate: status.sli.successRate, + totalEvents: status.sli.totalEvents, + dlqPending: status.dlq.pending, + safeModeLevel: status.safeMode.level, + openCircuits: status.circuitBreakers.filter(cb => cb.state === 'open').length, + }); + } +} diff --git a/src/reliability/safe-mode.ts b/src/reliability/safe-mode.ts new file mode 100644 index 00000000..66bed950 --- /dev/null +++ b/src/reliability/safe-mode.ts @@ -0,0 +1,235 @@ +/** + * SafeMode - Gradual autonomy toggle for mesh execution + * + * Nine 4 pattern: Treat autonomy as a knob, not a switch. + * When reliability drops below SLI thresholds, automatically + * restrict agent capabilities to prevent further damage. + * + * Levels: + * - normal: Full autonomy (all tools, all actions) + * - cautious: Disable risky tools (Bash write ops), require confirmation + * - restricted: Read-only mode, no file writes, no bash commands + * - lockdown: Stop all agent execution, alert human + * + * Safe mode can be triggered manually or automatically via SLI thresholds. 
+ */
+
+import { log } from '../shared/logger.ts';
+
+export type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';
+
+export interface SafeModeConfig {
+  /** Default safe mode level (default: 'normal') */
+  defaultLevel: SafeModeLevel;
+  /** SLI threshold for auto-escalation to cautious (default: 0.95) */
+  cautiousThreshold: number;
+  /** SLI threshold for auto-escalation to restricted (default: 0.90) */
+  restrictedThreshold: number;
+  /** SLI threshold for auto-escalation to lockdown (default: 0.80) */
+  lockdownThreshold: number;
+  /** Enable auto-escalation based on SLI (default: false) */
+  autoEscalate: boolean;
+}
+
+export interface SafeModeState {
+  level: SafeModeLevel;
+  reason: string;
+  changedAt: number;
+  autoEscalated: boolean;
+  /** Tools disabled at this level */
+  disabledTools: string[];
+  /** Actions blocked at this level */
+  blockedActions: string[];
+}
+
+const DEFAULT_CONFIG: SafeModeConfig = {
+  defaultLevel: 'normal',
+  cautiousThreshold: 0.95,
+  restrictedThreshold: 0.90,
+  lockdownThreshold: 0.80,
+  autoEscalate: false,
+};
+
+/** Tools restricted at each level */
+const TOOL_RESTRICTIONS: Record<SafeModeLevel, string[]> = {
+  normal: [],
+  cautious: [], // No tool blocks, but Bash writes require confirmation via guardrails
+  restricted: ['Write', 'Edit', 'NotebookEdit', 'Bash'],
+  lockdown: ['Write', 'Edit', 'NotebookEdit', 'Bash', 'Glob', 'Grep', 'Read'],
+};
+
+/** Actions blocked at each level */
+const ACTION_RESTRICTIONS: Record<SafeModeLevel, string[]> = {
+  normal: [],
+  cautious: ['destructive_bash', 'git_push', 'file_delete'],
+  restricted: ['all_writes', 'all_bash', 'git_operations'],
+  lockdown: ['all_operations'],
+};
+
+export class SafeMode {
+  private config: SafeModeConfig;
+  private levels: Map<string, SafeModeLevel> = new Map(); // per-mesh
+  private globalLevel: SafeModeLevel;
+  private changeHistory: Array<{ meshName: string | null; from: SafeModeLevel; to: SafeModeLevel; reason: string; at: number }> = [];
+
+  constructor(config?: Partial<SafeModeConfig>) {
+    this.config = {
...DEFAULT_CONFIG, ...config }; + this.globalLevel = this.config.defaultLevel; + } + + /** + * Get effective safe mode level for a mesh + * Mesh-specific overrides take priority over global + */ + getLevel(meshName?: string): SafeModeLevel { + if (meshName && this.levels.has(meshName)) { + return this.levels.get(meshName)!; + } + return this.globalLevel; + } + + /** + * Get full state for display/API + */ + getState(meshName?: string): SafeModeState { + const level = this.getLevel(meshName); + const lastChange = this.changeHistory.filter( + h => h.meshName === (meshName || null) + ).pop(); + + return { + level, + reason: lastChange?.reason || 'default', + changedAt: lastChange?.at || 0, + autoEscalated: lastChange?.reason.startsWith('auto:') || false, + disabledTools: TOOL_RESTRICTIONS[level], + blockedActions: ACTION_RESTRICTIONS[level], + }; + } + + /** + * Set safe mode level for a specific mesh + */ + setLevel(meshName: string, level: SafeModeLevel, reason: string): void { + const prev = this.levels.get(meshName) || this.globalLevel; + this.levels.set(meshName, level); + + this.changeHistory.push({ + meshName, + from: prev, + to: level, + reason, + at: Date.now(), + }); + + log.info('safe-mode', `Level changed: ${prev} → ${level}`, { + meshName, + reason, + }); + } + + /** + * Set global safe mode level + */ + setGlobalLevel(level: SafeModeLevel, reason: string): void { + const prev = this.globalLevel; + this.globalLevel = level; + + this.changeHistory.push({ + meshName: null, + from: prev, + to: level, + reason, + at: Date.now(), + }); + + log.info('safe-mode', `Global level changed: ${prev} → ${level}`, { reason }); + } + + /** + * Check if a tool is allowed at the current safe mode level + */ + isToolAllowed(toolName: string, meshName?: string): boolean { + const level = this.getLevel(meshName); + return !TOOL_RESTRICTIONS[level].includes(toolName); + } + + /** + * Check if an action is allowed + */ + isActionAllowed(action: string, meshName?: string): 
boolean {
+    const level = this.getLevel(meshName);
+    const blocked = ACTION_RESTRICTIONS[level];
+    return !blocked.includes(action) && !blocked.includes('all_operations');
+  }
+
+  /**
+   * Auto-evaluate safe mode based on current SLI success rate
+   * Only acts if autoEscalate is enabled
+   */
+  evaluateSLI(successRate: number, meshName?: string): SafeModeLevel {
+    if (!this.config.autoEscalate) {
+      return this.getLevel(meshName);
+    }
+
+    let targetLevel: SafeModeLevel = 'normal';
+    let reason = '';
+
+    if (successRate < this.config.lockdownThreshold) {
+      targetLevel = 'lockdown';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.lockdownThreshold * 100}% lockdown threshold`;
+    } else if (successRate < this.config.restrictedThreshold) {
+      targetLevel = 'restricted';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.restrictedThreshold * 100}% restricted threshold`;
+    } else if (successRate < this.config.cautiousThreshold) {
+      targetLevel = 'cautious';
+      reason = `auto: SLI ${(successRate * 100).toFixed(1)}% < ${this.config.cautiousThreshold * 100}% cautious threshold`;
+    }
+
+    const currentLevel = this.getLevel(meshName);
+    // Only escalate, never auto-de-escalate (human must clear)
+    if (this.severity(targetLevel) > this.severity(currentLevel)) {
+      if (meshName) {
+        this.setLevel(meshName, targetLevel, reason);
+      } else {
+        this.setGlobalLevel(targetLevel, reason);
+      }
+    }
+
+    return this.getLevel(meshName);
+  }
+
+  private severity(level: SafeModeLevel): number {
+    const map: Record<SafeModeLevel, number> = {
+      normal: 0,
+      cautious: 1,
+      restricted: 2,
+      lockdown: 3,
+    };
+    return map[level];
+  }
+
+  /**
+   * Get change history (for forensics)
+   */
+  getHistory(): typeof this.changeHistory {
+    return [...this.changeHistory];
+  }
+
+  /**
+   * Reset safe mode for a mesh (manual recovery)
+   */
+  resetMesh(meshName: string): void {
+    this.levels.delete(meshName);
+    log.info('safe-mode', 'Mesh safe mode reset', { meshName });
+  }
+
+  /**
+   * Reset
all safe mode state + */ + resetAll(): void { + this.levels.clear(); + this.globalLevel = this.config.defaultLevel; + this.changeHistory = []; + } +} diff --git a/src/reliability/sli-tracker.ts b/src/reliability/sli-tracker.ts new file mode 100644 index 00000000..0a802462 --- /dev/null +++ b/src/reliability/sli-tracker.ts @@ -0,0 +1,239 @@ +/** + * SLITracker - Service Level Indicator tracking for mesh reliability + * + * Nine 4 pattern: You can't improve what you can't measure. + * Track success rates, latencies, and failure categories per mesh. + * + * Tracks: + * - Message delivery success rate (target: 99.99%) + * - Worker completion rate + * - Mean time to recovery (MTTR) + * - Failure taxonomy (categorized failures for targeted fixes) + * - Per-step success rate (for multi-step workflows) + */ + +import { log } from '../shared/logger.ts'; + +export type FailureCategory = + | 'model_error' // API/model failure + | 'routing_error' // Message sent to wrong/missing agent + | 'timeout' // Worker exceeded time limit + | 'guardrail_kill' // Killed by guardrail enforcement + | 'crash' // Unexpected process crash + | 'stuck' // Agent stopped producing output + | 'policy_violation' // Usage policy error + | 'gate_failure' // FSM gate/script failure + | 'circuit_open' // Circuit breaker prevented execution + | 'unknown'; // Uncategorized + +export interface SLIConfig { + /** How long to retain data in ms (default: 7 days) */ + retentionMs: number; + /** Bucketing interval for rate calculations (default: 60000 = 1 min) */ + bucketMs: number; +} + +interface EventRecord { + timestamp: number; + meshName: string; + agentId: string; + success: boolean; + durationMs?: number; + category?: FailureCategory; + reason?: string; +} + +export interface SLISnapshot { + /** Overall success rate (0-1) */ + successRate: number; + /** Total events tracked */ + totalEvents: number; + /** Total successes */ + totalSuccesses: number; + /** Total failures */ + totalFailures: number; + /** 
Mean time to recovery in ms (avg time from failure to next success for same agent) */
+  mttrMs: number | null;
+  /** Failure breakdown by category */
+  failuresByCategory: Record<string, number>;
+  /** Per-mesh success rates */
+  byMesh: Record<string, { success: number; total: number; rate: number }>;
+  /** Per-agent success rates */
+  byAgent: Record<string, { success: number; total: number; rate: number }>;
+  /** Current nines level (e.g., "99.9%") */
+  ninesLevel: string;
+  /** Window start timestamp */
+  windowStart: number;
+  /** Window end timestamp */
+  windowEnd: number;
+}
+
+const DEFAULT_CONFIG: SLIConfig = {
+  retentionMs: 7 * 24 * 60 * 60 * 1000, // 7 days
+  bucketMs: 60_000,
+};
+
+export class SLITracker {
+  private events: EventRecord[] = [];
+  private config: SLIConfig;
+  private lastFailureByAgent: Map<string, number> = new Map();
+  private mttrSamples: number[] = [];
+
+  constructor(config?: Partial<SLIConfig>) {
+    this.config = { ...DEFAULT_CONFIG, ...config };
+  }
+
+  /**
+   * Record a successful operation
+   */
+  recordSuccess(meshName: string, agentId: string, durationMs?: number): void {
+    const now = Date.now();
+    this.events.push({
+      timestamp: now,
+      meshName,
+      agentId,
+      success: true,
+      durationMs,
+    });
+
+    // MTTR: if this agent had a recent failure, record recovery time
+    const lastFailure = this.lastFailureByAgent.get(agentId);
+    if (lastFailure) {
+      this.mttrSamples.push(now - lastFailure);
+      this.lastFailureByAgent.delete(agentId);
+    }
+
+    this.gc();
+  }
+
+  /**
+   * Record a failed operation
+   */
+  recordFailure(
+    meshName: string,
+    agentId: string,
+    category: FailureCategory,
+    reason?: string
+  ): void {
+    const now = Date.now();
+    this.events.push({
+      timestamp: now,
+      meshName,
+      agentId,
+      success: false,
+      category,
+      reason,
+    });
+
+    this.lastFailureByAgent.set(agentId, now);
+
+    log.warn('sli', 'Failure recorded', {
+      meshName,
+      agentId,
+      category,
+      reason,
+    });
+
+    this.gc();
+  }
+
+  /**
+   * Get SLI snapshot for a time window
+   */
+  getSnapshot(windowMs?: number): SLISnapshot {
+    const now = Date.now();
+    const windowStart = windowMs ? now - windowMs : 0;
+    const events = this.events.filter(e => e.timestamp >= windowStart);
+
+    const totalEvents = events.length;
+    const totalSuccesses = events.filter(e => e.success).length;
+    const totalFailures = totalEvents - totalSuccesses;
+    const successRate = totalEvents > 0 ? totalSuccesses / totalEvents : 1;
+
+    // Failure breakdown
+    const failuresByCategory: Record<string, number> = {};
+    for (const e of events) {
+      if (!e.success && e.category) {
+        failuresByCategory[e.category] = (failuresByCategory[e.category] || 0) + 1;
+      }
+    }
+
+    // Per-mesh rates
+    const byMesh: Record<string, { success: number; total: number; rate: number }> = {};
+    for (const e of events) {
+      if (!byMesh[e.meshName]) {
+        byMesh[e.meshName] = { success: 0, total: 0, rate: 0 };
+      }
+      byMesh[e.meshName].total++;
+      if (e.success) byMesh[e.meshName].success++;
+    }
+    for (const mesh of Object.values(byMesh)) {
+      mesh.rate = mesh.total > 0 ? mesh.success / mesh.total : 1;
+    }
+
+    // Per-agent rates
+    const byAgent: Record<string, { success: number; total: number; rate: number }> = {};
+    for (const e of events) {
+      if (!byAgent[e.agentId]) {
+        byAgent[e.agentId] = { success: 0, total: 0, rate: 0 };
+      }
+      byAgent[e.agentId].total++;
+      if (e.success) byAgent[e.agentId].success++;
+    }
+    for (const agent of Object.values(byAgent)) {
+      agent.rate = agent.total > 0 ? agent.success / agent.total : 1;
+    }
+
+    // MTTR
+    const mttrMs = this.mttrSamples.length > 0
+      ?
this.mttrSamples.reduce((a, b) => a + b, 0) / this.mttrSamples.length + : null; + + return { + successRate, + totalEvents, + totalSuccesses, + totalFailures, + mttrMs, + failuresByCategory, + byMesh, + byAgent, + ninesLevel: this.calculateNines(successRate), + windowStart, + windowEnd: now, + }; + } + + /** + * Calculate human-readable nines level + */ + private calculateNines(rate: number): string { + if (rate >= 0.9999) return '99.99% (4 nines)'; + if (rate >= 0.999) return '99.9% (3 nines)'; + if (rate >= 0.99) return '99% (2 nines)'; + if (rate >= 0.9) return '90% (1 nine)'; + return `${(rate * 100).toFixed(1)}% (< 1 nine)`; + } + + /** + * Garbage collect old events + */ + private gc(): void { + const cutoff = Date.now() - this.config.retentionMs; + this.events = this.events.filter(e => e.timestamp >= cutoff); + + // Also clean MTTR samples (keep last 100) + if (this.mttrSamples.length > 100) { + this.mttrSamples = this.mttrSamples.slice(-100); + } + } + + /** + * Reset all tracking (e.g., fresh start) + */ + reset(): void { + this.events = []; + this.lastFailureByAgent.clear(); + this.mttrSamples = []; + } +} diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index 5a119a31..f2d6d578 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -46,6 +46,7 @@ import { GuardrailConfig } from './guardrail-config.ts'; import { buildPathContext, validateAgentArtifacts, findWriters, resolveManifestVariables } from './manifest-validator.ts'; import { SystemMessageWriter } from '../core/system-message-writer.ts'; import { NudgeDetector } from './nudge-detector.ts'; +import { ReliabilityManager } from '../reliability/reliability-manager.ts'; import YAML from 'yaml'; /** @@ -351,6 +352,8 @@ export class WorkerDispatcher extends EventEmitter { systemWriter!: SystemMessageWriter; // Auto-nudge recovery for stalled routes private nudgeDetector?: NudgeDetector; + // Reliability: circuit breakers, heartbeat, SLI, DLQ, safe-mode + reliability?: 
ReliabilityManager; constructor(config: DispatcherConfig, queue: MessageQueue) { super(); @@ -1185,6 +1188,10 @@ export class WorkerDispatcher extends EventEmitter { const nudgeConfig = this.guardrails.getNudgeConfig?.() ?? {}; this.nudgeDetector = new NudgeDetector(this.systemWriter, this.queue, nudgeConfig); + // Initialize reliability manager (circuit breakers, heartbeat, SLI, DLQ, safe-mode) + this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + this.reliability.start(); + // Subscribe to consumer events for event-driven dispatch if (consumer) { this.boundMessageHandler = (event: { agentId: string }) => { @@ -3999,6 +4006,19 @@ You are working in an isolated git worktree for feature: **${hookContext.feature const worker = new SdkRunner(runnerConfig, this.queue); workerRef.current = worker; // Populate ref for write-gate kill callback + // Reliability: register agent for heartbeat monitoring + circuit breaker check + if (this.reliability) { + const spawnCheck = this.reliability.canSpawn(meshName!, agentId); + if (!spawnCheck.allowed) { + log.warn('dispatcher', `Spawn blocked by reliability`, { + agentId, reason: spawnCheck.reason, + }); + log.activity('reliability:blocked', agentId, spawnCheck.reason || 'blocked'); + return; + } + this.reliability.registerAgent(agentId); + } + // Parity gate: emit session-start for consumer to clear stale pending asks this.emit('session-start', { agentId }); @@ -4036,6 +4056,8 @@ You are working in an isolated git worktree for feature: **${hookContext.feature result.worker.lastOutputAt = Date.now(); } } + // Reliability: heartbeat on output + this.reliability?.recordHeartbeat(agentId); this.emit('worker:output', data); }); @@ -4745,6 +4767,10 @@ You are working in an isolated git worktree for feature: **${hookContext.feature : undefined, }); + // Reliability: record successful completion + const durationMs = Date.now() - (activeWorker?.startedAt || Date.now()); + 
this.reliability?.recordSuccess(meshName!, agentId, durationMs); + // OAOM: Check queue for next message this.processNextQueuedMessage(agentId); }); @@ -4894,6 +4920,13 @@ You are working in an isolated git worktree for feature: **${hookContext.feature } this.emit('worker:error', { ...data, workerId: errorWorkerId, transitionName: 'error' }); + + // Reliability: record failure with categorization + const category = data.error?.includes('usage policy') ? 'policy_violation' + : data.error?.includes('timeout') ? 'timeout' + : data.error?.includes('overloaded') ? 'model_error' + : 'crash'; + this.reliability?.recordFailure(meshName!, agentId, category as any, data.error); }); // Add worker to active workers with unique workerId for parallel execution @@ -5870,6 +5903,10 @@ ${output} // Cancel any pending nudge timers for this mesh this.nudgeDetector?.cancelForMesh(meshName); + // Reliability: cleanup mesh-level state, log status + this.reliability?.cleanupMesh(meshName); + this.reliability?.logStatus(); + // Find session by meshName (delegates to MetricsAggregator) const result = this.metricsAggregator.findSessionByMeshName(meshName); From 8fd11cc2ce4eee453e66247bfb18d65eae0f0787 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 8 Mar 2026 23:57:38 +0000 Subject: [PATCH 02/12] feat(reliability): Add DLQ replay via SystemMessageWriter and circuit breaker checkpointing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - DLQ: replayOne(), replayAll(), replayForAgent() — re-injects failed messages back into the live system through SystemMessageWriter with [DLQ REPLAY] prefix and original failure context - Circuit Breaker: SQLite checkpointing — persists open/half_open circuit states to circuit_breaker_checkpoints table, restores on restart so agents that were failing before a crash stay circuit-broken - HeartbeatMonitor: Fix NodeJS.Timeout type to ReturnType - ReliabilityManager: Expose replayDLQ(), replayDLQEntry(), 
replayDLQForAgent() and pass DB to CircuitBreaker constructor for persistence

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 src/reliability/circuit-breaker.ts     | 100 ++++++++++++++++++++++++-
 src/reliability/dead-letter-queue.ts   |  88 +++++++++++++++++++++-
 src/reliability/heartbeat-monitor.ts   |   2 +-
 src/reliability/index.ts               |   3 +-
 src/reliability/reliability-manager.ts |  31 +++++++-
 5 files changed, 218 insertions(+), 6 deletions(-)

diff --git a/src/reliability/circuit-breaker.ts b/src/reliability/circuit-breaker.ts
index 14a19a5c..6b27f370 100644
--- a/src/reliability/circuit-breaker.ts
+++ b/src/reliability/circuit-breaker.ts
@@ -10,8 +10,10 @@
  *   - HALF_OPEN: After cooldown, allow one probe request
  *
  * Applied per agent (mesh/agent) to isolate failures.
+ * Checkpoints to SQLite for persistence across restarts.
  */

+import type Database from 'better-sqlite3';
 import { log } from '../shared/logger.ts';

 export interface CircuitBreakerConfig {
@@ -42,9 +44,99 @@ const DEFAULT_CONFIG: CircuitBreakerConfig = {
 export class CircuitBreaker {
   private circuits: Map<string, { state: CircuitBreakerState; failures: number; lastFailureAt: number; openedAt: number; successesSinceHalfOpen: number }> = new Map();
   private config: CircuitBreakerConfig;
+  private db: Database.Database | null = null;

-  constructor(config?: Partial<CircuitBreakerConfig>) {
+  constructor(config?: Partial<CircuitBreakerConfig>, db?: Database.Database) {
     this.config = { ...DEFAULT_CONFIG, ...config };
+
+    if (db) {
+      this.db = db;
+      this.ensureSchema();
+      this.restoreCheckpoint();
+    }
+  }
+
+  /**
+   * Create checkpoint table if it doesn't exist
+   */
+  private ensureSchema(): void {
+    this.db?.exec(`
+      CREATE TABLE IF NOT EXISTS circuit_breaker_checkpoints (
+        agent_id TEXT PRIMARY KEY,
+        state TEXT NOT NULL,
+        failures INTEGER NOT NULL,
+        last_failure_at INTEGER NOT NULL,
+        opened_at INTEGER NOT NULL,
+        updated_at INTEGER NOT NULL
+      );
+    `);
+  }
+
+  /**
+   * Restore circuit states from SQLite checkpoint on startup
+   */
+  private restoreCheckpoint(): void {
+    if (!this.db) return;
+
+    const rows = this.db.prepare(
+      'SELECT * FROM circuit_breaker_checkpoints'
+
).all() as Array<{ + agent_id: string; + state: CircuitBreakerState; + failures: number; + last_failure_at: number; + opened_at: number; + }>; + + for (const row of rows) { + // Only restore non-closed circuits (closed is default) + if (row.state !== 'closed') { + this.circuits.set(row.agent_id, { + state: row.state, + failures: row.failures, + lastFailureAt: row.last_failure_at, + openedAt: row.opened_at, + successesSinceHalfOpen: 0, + }); + } + } + + if (rows.length > 0) { + const nonClosed = rows.filter(r => r.state !== 'closed').length; + log.info('circuit-breaker', 'Restored checkpoints', { + total: rows.length, + nonClosed, + }); + } + } + + /** + * Persist current circuit state to SQLite + */ + private checkpoint(agentId: string): void { + if (!this.db) return; + + const circuit = this.circuits.get(agentId); + if (!circuit || circuit.state === 'closed') { + // Remove checkpoint for closed circuits (default state) + this.db.prepare( + 'DELETE FROM circuit_breaker_checkpoints WHERE agent_id = ?' + ).run(agentId); + return; + } + + this.db.prepare(` + INSERT OR REPLACE INTO circuit_breaker_checkpoints + (agent_id, state, failures, last_failure_at, opened_at, updated_at) + VALUES (?, ?, ?, ?, ?, ?) 
+ `).run( + agentId, + circuit.state, + circuit.failures, + circuit.lastFailureAt, + circuit.openedAt, + Date.now() + ); } /** @@ -91,6 +183,7 @@ export class CircuitBreaker { circuit.state = 'closed'; circuit.failures = 0; log.info('circuit-breaker', 'Circuit closed (recovery successful)', { agentId }); + this.checkpoint(agentId); } } @@ -132,6 +225,7 @@ export class CircuitBreaker { circuit.state = 'open'; circuit.openedAt = now; log.error('circuit-breaker', 'Circuit reopened (probe failed)', { agentId, reason }); + this.checkpoint(agentId); return; } @@ -144,6 +238,7 @@ export class CircuitBreaker { threshold: this.config.failureThreshold, reason, }); + this.checkpoint(agentId); } } @@ -170,6 +265,7 @@ export class CircuitBreaker { */ reset(agentId: string): void { this.circuits.delete(agentId); + this.checkpoint(agentId); log.info('circuit-breaker', 'Circuit manually reset', { agentId }); } @@ -178,6 +274,7 @@ export class CircuitBreaker { */ resetAll(): void { this.circuits.clear(); + this.db?.prepare('DELETE FROM circuit_breaker_checkpoints').run(); } /** @@ -187,6 +284,7 @@ export class CircuitBreaker { for (const agentId of this.circuits.keys()) { if (agentId.startsWith(`${meshName}/`)) { this.circuits.delete(agentId); + this.checkpoint(agentId); } } } diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts index 98b6ec06..494a3003 100644 --- a/src/reliability/dead-letter-queue.ts +++ b/src/reliability/dead-letter-queue.ts @@ -7,12 +7,13 @@ * Features: * - Automatic retry with exponential backoff (up to maxRetries) * - DLQ storage in SQLite for persistence across restarts - * - Replay capability for manual recovery + * - Replay via SystemMessageWriter (re-injects into live system) * - Failure reason tracking for taxonomy */ import type Database from 'better-sqlite3'; import { log } from '../shared/logger.ts'; +import type { SystemMessageWriter } from '../core/system-message-writer.ts'; export interface DLQEntry { id: number; 
@@ -37,6 +38,12 @@ export interface DLQStats {
   byAgent: Record<string, number>;
 }

+export interface ReplayResult {
+  id: number;
+  success: boolean;
+  error?: string;
+}
+
 export class DeadLetterQueue {
   private db: Database.Database;
   private maxRetries: number;
@@ -143,6 +150,85 @@ export class DeadLetterQueue {
     log.info('dlq', 'DLQ entry replayed', { id });
   }

+  /**
+   * Replay a single DLQ entry via SystemMessageWriter.
+   * Re-injects the message into the live system with a [DLQ REPLAY] prefix.
+   */
+  replayOne(id: number, writer: SystemMessageWriter): ReplayResult {
+    const entry = this.db.prepare(
+      'SELECT * FROM dead_letter_queue WHERE id = ? AND replayed_at IS NULL'
+    ).get(id) as DLQEntry | undefined;
+
+    if (!entry) {
+      return { id, success: false, error: 'Entry not found or already replayed' };
+    }
+
+    try {
+      const payload = JSON.parse(entry.payload) as Record<string, unknown>;
+      const headline = (payload.headline as string) || 'DLQ Replay';
+      const body = (payload.body as string) || JSON.stringify(payload, null, 2);
+
+      writer.write({
+        to: entry.to_agent,
+        from: entry.from_agent,
+        type: entry.type,
+        headline: `[DLQ REPLAY] ${headline}`,
+        body: `> Replayed from dead letter queue (DLQ #${entry.id})\n> Original failure: ${entry.failure_reason}\n> Failed at: ${new Date(entry.first_failed_at).toISOString()}\n> Retries: ${entry.retry_count}/${entry.max_retries}\n\n${body}`,
+        msgId: `dlq-replay-${entry.id}-${Date.now()}`,
+      });
+
+      this.markReplayed(id);
+
+      log.info('dlq', 'Message replayed via SystemMessageWriter', {
+        id,
+        to: entry.to_agent,
+        from: entry.from_agent,
+        originalReason: entry.failure_reason,
+      });
+
+      return { id, success: true };
+    } catch (err) {
+      const error = (err as Error).message;
+      log.error('dlq', 'Replay failed', { id, error });
+      return { id, success: false, error };
+    }
+  }
+
+  /**
+   * Replay all pending DLQ entries via SystemMessageWriter.
+   * Returns results for each entry.
+ */ + replayAll(writer: SystemMessageWriter): ReplayResult[] { + const pending = this.getPending(); + const results: ReplayResult[] = []; + + for (const entry of pending) { + results.push(this.replayOne(entry.id, writer)); + } + + log.info('dlq', 'Bulk replay complete', { + total: pending.length, + succeeded: results.filter(r => r.success).length, + failed: results.filter(r => !r.success).length, + }); + + return results; + } + + /** + * Replay all pending DLQ entries for a specific agent. + */ + replayForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { + const entries = this.getForAgent(agentId); + const results: ReplayResult[] = []; + + for (const entry of entries) { + results.push(this.replayOne(entry.id, writer)); + } + + return results; + } + /** * Get DLQ statistics */ diff --git a/src/reliability/heartbeat-monitor.ts b/src/reliability/heartbeat-monitor.ts index 8bfde400..19e4d7ad 100644 --- a/src/reliability/heartbeat-monitor.ts +++ b/src/reliability/heartbeat-monitor.ts @@ -45,7 +45,7 @@ type HeartbeatCallback = (health: AgentHealth) => void; export class HeartbeatMonitor { private agents: Map<string, AgentHealth> = new Map(); private config: HeartbeatConfig; - private checkInterval: NodeJS.Timeout | null = null; + private checkInterval: ReturnType<typeof setInterval> | null = null; private onStale: HeartbeatCallback | null = null; private onDead: HeartbeatCallback | null = null; private onWarn: HeartbeatCallback | null = null; diff --git a/src/reliability/index.ts b/src/reliability/index.ts index b1989b53..796b0c14 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -12,7 +12,8 @@ * not just more of what got you the previous nine.
*/ -export { DeadLetterQueue, type DLQEntry, type DLQStats } from './dead-letter-queue.ts'; +export { DeadLetterQueue, type DLQEntry, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 012177bf..069103de 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -22,11 +22,12 @@ */ import type Database from 'better-sqlite3'; -import { DeadLetterQueue, type DLQStats } from './dead-letter-queue.ts'; +import { DeadLetterQueue, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; +import type { SystemMessageWriter } from '../core/system-message-writer.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -83,7 +84,7 @@ export class ReliabilityManager { const merged = { ...fileConfig, ...config }; this.dlq = new DeadLetterQueue(db, merged.dlq?.maxRetries); - this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker); + this.circuitBreaker = new CircuitBreaker(merged.circuitBreaker, db); this.heartbeat = new HeartbeatMonitor(merged.heartbeat); this.sli = new SLITracker(merged.sli); this.safeMode = new 
SafeMode(merged.safeMode); @@ -236,6 +237,32 @@ export class ReliabilityManager { this.heartbeat.clearForMesh(meshName); } + // ============================================================ + // DLQ Replay API + // ============================================================ + + /** + * Replay all pending DLQ entries via SystemMessageWriter. + * Re-injects failed messages back into the live system. + */ + replayDLQ(writer: SystemMessageWriter): ReplayResult[] { + return this.dlq.replayAll(writer); + } + + /** + * Replay a single DLQ entry by ID. + */ + replayDLQEntry(id: number, writer: SystemMessageWriter): ReplayResult { + return this.dlq.replayOne(id, writer); + } + + /** + * Replay all DLQ entries for a specific agent. + */ + replayDLQForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { + return this.dlq.replayForAgent(agentId, writer); + } + // ============================================================ // Status API (for CLI / monitoring) // ============================================================ From ce10e2c08634a83c282f23ebbb3e8036fa1f5fa8 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 17:28:56 +0000 Subject: [PATCH 03/12] feat(reliability): Session-aware DLQ recovery instead of raw message replay Replace naive message replay with session-aware recovery that preserves conversation history. DLQ now captures sessionId at failure time and uses RecoveryMode (session_resume/requeue/manual) to determine the right recovery strategy. When a worker crashes mid-work with an active session, recovery resumes the SDK session instead of replaying a raw message. 
- Rewrite DLQ schema with session_id, recovery_mode, failure_category - Update ReliabilityManager with session-aware deadLetter() and recover*() APIs - Wire dispatcher error handler to capture sessionId and route exhausted retries to DLQ with full session context - Export RecoveryMode, RecoveryResult, FailureContext types from index https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- src/reliability/dead-letter-queue.ts | 274 +++++++++++++++---------- src/reliability/index.ts | 10 +- src/reliability/reliability-manager.ts | 185 ++++++++++++++--- src/worker/dispatcher.ts | 28 ++- 4 files changed, 345 insertions(+), 152 deletions(-) diff --git a/src/reliability/dead-letter-queue.ts b/src/reliability/dead-letter-queue.ts index 494a3003..94a4c2df 100644 --- a/src/reliability/dead-letter-queue.ts +++ b/src/reliability/dead-letter-queue.ts @@ -1,46 +1,71 @@ /** - * DeadLetterQueue - Messages that failed delivery after max retries + * DeadLetterQueue - Session-aware failure recovery * - * Nine 2 pattern: Instead of silently dropping failed messages, - * route them to a DLQ for inspection and manual replay. + * Nine 2 pattern: Instead of silently dropping failed work, + * capture the session context at failure time and enable recovery + * via session resume (not raw message replay). * - * Features: - * - Automatic retry with exponential backoff (up to maxRetries) - * - DLQ storage in SQLite for persistence across restarts - * - Replay via SystemMessageWriter (re-injects into live system) - * - Failure reason tracking for taxonomy + * Two recovery modes: + * 1. Session resume: Agent crashed mid-work → resume with sessionId + * (preserves full conversation history + tool state) + * 2. Message re-queue: Message undeliverable → re-queue to dispatcher + * (for circuit-open or routing failures where no session exists) + * + * The key insight: replaying a raw message loses all conversation context. + * Session resume picks up exactly where the agent left off. 
*/ import type Database from 'better-sqlite3'; import { log } from '../shared/logger.ts'; -import type { SystemMessageWriter } from '../core/system-message-writer.ts'; + +/** + * Recovery mode determines how to restore failed work + */ +export type RecoveryMode = + | 'session_resume' // Crashed mid-work: resume via sessionId + | 'requeue' // Undeliverable: re-insert into message queue + | 'manual'; // Needs human intervention export interface DLQEntry { id: number; + agent_id: string; // The agent that failed (mesh/agent) + mesh_name: string; + recovery_mode: RecoveryMode; + session_id: string | null; // For session_resume: SDK session to resume + /** Original message context (for requeue mode) */ from_agent: string; to_agent: string; type: string; - payload: string; + payload: string; // JSON-serialized original payload source_file: string | null; + /** Failure context */ failure_reason: string; + failure_category: string; // SLI failure category retry_count: number; max_retries: number; + /** Worker state at failure time */ + messages_sent: number; // How many messages worker sent before failing + output_snapshot: string | null; // Last output (truncated) for diagnostics + /** Timestamps */ first_failed_at: number; last_failed_at: number; - replayed_at: number | null; + recovered_at: number | null; } export interface DLQStats { total: number; - pending: number; // Not yet replayed - replayed: number; // Successfully replayed + pending: number; // Not yet recovered + recovered: number; // Successfully recovered byReason: Record<string, number>; byAgent: Record<string, number>; + byMode: Record<RecoveryMode, number>; } -export interface ReplayResult { +export interface RecoveryResult { id: number; success: boolean; + mode: RecoveryMode; + sessionId?: string; error?: string; } @@ -58,72 +83,137 @@ export class DeadLetterQueue { this.db.exec(` CREATE TABLE IF NOT EXISTS dead_letter_queue ( id INTEGER PRIMARY KEY AUTOINCREMENT, + agent_id TEXT NOT NULL, + mesh_name TEXT NOT NULL, + recovery_mode TEXT NOT NULL DEFAULT
'requeue', + session_id TEXT, from_agent TEXT NOT NULL, to_agent TEXT NOT NULL, type TEXT NOT NULL, payload TEXT NOT NULL, source_file TEXT, failure_reason TEXT NOT NULL, + failure_category TEXT NOT NULL DEFAULT 'unknown', retry_count INTEGER DEFAULT 0, max_retries INTEGER NOT NULL, + messages_sent INTEGER DEFAULT 0, + output_snapshot TEXT, first_failed_at INTEGER NOT NULL, last_failed_at INTEGER NOT NULL, - replayed_at INTEGER + recovered_at INTEGER ); - CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(to_agent, replayed_at); - CREATE INDEX IF NOT EXISTS idx_dlq_reason ON dead_letter_queue(failure_reason); + CREATE INDEX IF NOT EXISTS idx_dlq_agent ON dead_letter_queue(agent_id, recovered_at); + CREATE INDEX IF NOT EXISTS idx_dlq_mesh ON dead_letter_queue(mesh_name, recovered_at); + CREATE INDEX IF NOT EXISTS idx_dlq_mode ON dead_letter_queue(recovery_mode, recovered_at); `); } /** - * Add a failed message to the DLQ + * Add a failed operation to the DLQ with full session context. 
+ * + * The recovery_mode is determined by what state existed at failure: + * - session_resume: Agent had an active sessionId → can resume + * - requeue: No session (e.g., failed before starting, or routing error) + * - manual: Repeated failures, needs human decision */ add(entry: { + agent_id: string; + mesh_name: string; + session_id?: string; from_agent: string; to_agent: string; type: string; payload: Record<string, unknown>; source_file?: string; failure_reason: string; + failure_category: string; retry_count?: number; + messages_sent?: number; + output_snapshot?: string; }): number { const now = Date.now(); + const retryCount = entry.retry_count || 0; + + // Determine recovery mode from context + let mode: RecoveryMode; + if (retryCount >= this.maxRetries) { + mode = 'manual'; // Exhausted retries + } else if (entry.session_id) { + mode = 'session_resume'; // Has session → can resume + } else { + mode = 'requeue'; // No session → re-inject message + } + const result = this.db.prepare(` INSERT INTO dead_letter_queue - (from_agent, to_agent, type, payload, source_file, failure_reason, - retry_count, max_retries, first_failed_at, last_failed_at) - VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) + (agent_id, mesh_name, recovery_mode, session_id, + from_agent, to_agent, type, payload, source_file, + failure_reason, failure_category, retry_count, max_retries, + messages_sent, output_snapshot, + first_failed_at, last_failed_at) + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
`).run( + entry.agent_id, + entry.mesh_name, + mode, + entry.session_id || null, entry.from_agent, entry.to_agent, entry.type, JSON.stringify(entry.payload), entry.source_file || null, entry.failure_reason, - entry.retry_count || 0, + entry.failure_category, + retryCount, this.maxRetries, + entry.messages_sent || 0, + entry.output_snapshot?.slice(0, 2000) || null, // Truncate snapshot now, now ); - log.warn('dlq', 'Message added to dead letter queue', { + log.warn('dlq', 'Added to dead letter queue', { id: result.lastInsertRowid, - from: entry.from_agent, - to: entry.to_agent, + agent: entry.agent_id, + mode, + sessionId: entry.session_id?.slice(0, 8), reason: entry.failure_reason, - retries: entry.retry_count || 0, + category: entry.failure_category, + retries: retryCount, }); return result.lastInsertRowid as number; } /** - * Get all unreplayed DLQ entries + * Get all unrecovered DLQ entries */ getPending(): DLQEntry[] { return this.db.prepare(` SELECT * FROM dead_letter_queue - WHERE replayed_at IS NULL + WHERE recovered_at IS NULL + ORDER BY last_failed_at DESC + `).all() as DLQEntry[]; + } + + /** + * Get DLQ entries that can be auto-recovered (session_resume or requeue) + */ + getRecoverable(): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE recovered_at IS NULL AND recovery_mode != 'manual' + ORDER BY last_failed_at ASC + `).all() as DLQEntry[]; + } + + /** + * Get entries requiring manual intervention + */ + getManual(): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE recovered_at IS NULL AND recovery_mode = 'manual' ORDER BY last_failed_at DESC `).all() as DLQEntry[]; } @@ -134,99 +224,55 @@ export class DeadLetterQueue { getForAgent(agentId: string): DLQEntry[] { return this.db.prepare(` SELECT * FROM dead_letter_queue - WHERE to_agent = ? AND replayed_at IS NULL + WHERE agent_id = ? 
AND recovered_at IS NULL ORDER BY last_failed_at DESC `).all(agentId) as DLQEntry[]; } /** - * Mark a DLQ entry as replayed + * Get DLQ entries for a specific mesh */ - markReplayed(id: number): void { - this.db.prepare(` - UPDATE dead_letter_queue SET replayed_at = ? WHERE id = ? - `).run(Date.now(), id); - - log.info('dlq', 'DLQ entry replayed', { id }); + getForMesh(meshName: string): DLQEntry[] { + return this.db.prepare(` + SELECT * FROM dead_letter_queue + WHERE mesh_name = ? AND recovered_at IS NULL + ORDER BY last_failed_at DESC + `).all(meshName) as DLQEntry[]; } /** - * Replay a single DLQ entry via SystemMessageWriter. - * Re-injects the message into the live system with a [DLQ REPLAY] prefix. + * Get a single entry by ID */ - replayOne(id: number, writer: SystemMessageWriter): ReplayResult { - const entry = this.db.prepare( - 'SELECT * FROM dead_letter_queue WHERE id = ? AND replayed_at IS NULL' + getById(id: number): DLQEntry | undefined { + return this.db.prepare( + 'SELECT * FROM dead_letter_queue WHERE id = ?' 
).get(id) as DLQEntry | undefined; - - if (!entry) { - return { id, success: false, error: 'Entry not found or already replayed' }; - } - - try { - const payload = JSON.parse(entry.payload) as Record; - const headline = (payload.headline as string) || 'DLQ Replay'; - const body = (payload.body as string) || JSON.stringify(payload, null, 2); - - writer.write({ - to: entry.to_agent, - from: entry.from_agent, - type: entry.type, - headline: `[DLQ REPLAY] ${headline}`, - body: `> Replayed from dead letter queue (DLQ #${entry.id})\n> Original failure: ${entry.failure_reason}\n> Failed at: ${new Date(entry.first_failed_at).toISOString()}\n> Retries: ${entry.retry_count}/${entry.max_retries}\n\n${body}`, - msgId: `dlq-replay-${entry.id}-${Date.now()}`, - }); - - this.markReplayed(id); - - log.info('dlq', 'Message replayed via SystemMessageWriter', { - id, - to: entry.to_agent, - from: entry.from_agent, - originalReason: entry.failure_reason, - }); - - return { id, success: true }; - } catch (err) { - const error = (err as Error).message; - log.error('dlq', 'Replay failed', { id, error }); - return { id, success: false, error }; - } } /** - * Replay all pending DLQ entries via SystemMessageWriter. - * Returns results for each entry. + * Mark a DLQ entry as recovered */ - replayAll(writer: SystemMessageWriter): ReplayResult[] { - const pending = this.getPending(); - const results: ReplayResult[] = []; - - for (const entry of pending) { - results.push(this.replayOne(entry.id, writer)); - } - - log.info('dlq', 'Bulk replay complete', { - total: pending.length, - succeeded: results.filter(r => r.success).length, - failed: results.filter(r => !r.success).length, - }); + markRecovered(id: number): void { + this.db.prepare(` + UPDATE dead_letter_queue SET recovered_at = ? WHERE id = ? + `).run(Date.now(), id); - return results; + log.info('dlq', 'Entry recovered', { id }); } /** - * Replay all pending DLQ entries for a specific agent. 
+ * Escalate a requeue entry to manual (e.g., after failed recovery attempt) */ - replayForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { - const entries = this.getForAgent(agentId); - const results: ReplayResult[] = []; - - for (const entry of entries) { - results.push(this.replayOne(entry.id, writer)); - } + escalateToManual(id: number, reason: string): void { + const now = Date.now(); + this.db.prepare(` + UPDATE dead_letter_queue + SET recovery_mode = 'manual', failure_reason = ?, last_failed_at = ?, + retry_count = retry_count + 1 + WHERE id = ? + `).run(`${reason} (escalated from auto-recovery)`, now, id); - return results; + log.warn('dlq', 'Entry escalated to manual recovery', { id, reason }); } /** @@ -238,36 +284,44 @@ export class DeadLetterQueue { ).get() as { c: number }).c; const pending = (this.db.prepare( - 'SELECT COUNT(*) as c FROM dead_letter_queue WHERE replayed_at IS NULL' + 'SELECT COUNT(*) as c FROM dead_letter_queue WHERE recovered_at IS NULL' ).get() as { c: number }).c; const byReasonRows = this.db.prepare(` SELECT failure_reason, COUNT(*) as c FROM dead_letter_queue - WHERE replayed_at IS NULL GROUP BY failure_reason + WHERE recovered_at IS NULL GROUP BY failure_reason `).all() as Array<{ failure_reason: string; c: number }>; const byAgentRows = this.db.prepare(` - SELECT to_agent, COUNT(*) as c FROM dead_letter_queue - WHERE replayed_at IS NULL GROUP BY to_agent - `).all() as Array<{ to_agent: string; c: number }>; + SELECT agent_id, COUNT(*) as c FROM dead_letter_queue + WHERE recovered_at IS NULL GROUP BY agent_id + `).all() as Array<{ agent_id: string; c: number }>; + + const byModeRows = this.db.prepare(` + SELECT recovery_mode, COUNT(*) as c FROM dead_letter_queue + WHERE recovered_at IS NULL GROUP BY recovery_mode + `).all() as Array<{ recovery_mode: RecoveryMode; c: number }>; const byReason: Record<string, number> = {}; for (const row of byReasonRows) byReason[row.failure_reason] = row.c; const byAgent: Record<string, number> = {}; - for (const row of byAgentRows) byAgent[row.to_agent] = row.c; + for (const row of byAgentRows) byAgent[row.agent_id] = row.c; + + const byMode: Record<RecoveryMode, number> = { session_resume: 0, requeue: 0, manual: 0 }; + for (const row of byModeRows) byMode[row.recovery_mode] = row.c; - return { total, pending, replayed: total - pending, byReason, byAgent }; + return { total, pending, recovered: total - pending, byReason, byAgent, byMode }; } /** - * Clear old replayed entries (garbage collection) + * Clear old recovered entries (garbage collection) */ - clearReplayed(olderThanMs = 24 * 60 * 60 * 1000): number { + clearRecovered(olderThanMs = 24 * 60 * 60 * 1000): number { const cutoff = Date.now() - olderThanMs; const result = this.db.prepare(` DELETE FROM dead_letter_queue - WHERE replayed_at IS NOT NULL AND replayed_at < ? + WHERE recovered_at IS NOT NULL AND recovered_at < ? `).run(cutoff); return result.changes; } diff --git a/src/reliability/index.ts b/src/reliability/index.ts index 796b0c14..718ecc3d 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -4,16 +4,16 @@ * Implements four-nines (99.99%) reliability patterns for TX mesh execution: * * Nine 1 (90%): Basic error handling, logging ✓ (existing) - * Nine 2 (99%): Dead letter queue, message retry, idempotency + * Nine 2 (99%): Dead letter queue, session-aware recovery, idempotency * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks * - * Reference: Karpathy's "March of Nines" - each nine requires new approaches, - * not just more of what got you the previous nine. + * Key insight: Recovery via session resume (not raw message replay) preserves + * full conversation history and tool state.
Reference: Karpathy's "March of Nines" */ -export { DeadLetterQueue, type DLQEntry, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; -export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus } from './reliability-manager.ts'; +export { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type SessionResumeHandler, type RequeueHandler } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 069103de..73e50b31 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -2,7 +2,7 @@ * ReliabilityManager - Central coordinator for all reliability features * * Provides a single integration point for the dispatcher to wire up: - * - Dead letter queue (failed message recovery) + * - Dead letter queue (session-aware failure recovery) * - Circuit breakers (cascading failure prevention) * - Heartbeat monitoring (stalled worker detection) * - SLI tracking (reliability measurement) @@ -15,19 +15,18 @@ * Wire events: * // On worker complete * this.reliability.recordSuccess(meshName, agentId, durationMs); - * // On worker error - * this.reliability.recordFailure(meshName, agentId, 'crash', error.message); + * // On worker error (with session context for DLQ) + * this.reliability.recordFailure(meshName, agentId, 'crash', error.message, { sessionId, messagesSent }); * // On worker output (heartbeat) * this.reliability.heartbeat(agentId); */ import type Database from 
'better-sqlite3'; -import { DeadLetterQueue, type DLQStats, type ReplayResult } from './dead-letter-queue.ts'; +import { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; -import type { SystemMessageWriter } from '../core/system-message-writer.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -68,6 +67,28 @@ export interface ReliabilityStatus { agentHealth: AgentHealth[]; } +/** Context captured at failure time for session-aware DLQ */ +export interface FailureContext { + sessionId?: string | null; + messagesSent?: number; + outputSnapshot?: string; + sourceFile?: string; + fromAgent?: string; + toAgent?: string; + msgType?: string; + payload?: Record<string, unknown>; +} + +/** Callback for session resume recovery */ +export type SessionResumeHandler = ( + agentId: string, + sessionId: string, + meshName: string +) => Promise<{ success: boolean; error?: string }>; + +/** Callback for message requeue recovery */ +export type RequeueHandler = (entry: DLQEntry) => { success: boolean; error?: string }; + export class ReliabilityManager { readonly dlq: DeadLetterQueue; readonly circuitBreaker: CircuitBreaker; @@ -191,13 +212,18 @@ export class ReliabilityManager { } /** - * Record failure + * Record failure with optional session context for DLQ routing. + * + * When failureCtx includes a sessionId, the DLQ entry is marked for + * session_resume recovery (picks up exactly where the agent left off). + * Without a sessionId, it falls back to message requeue.
*/ recordFailure( meshName: string, agentId: string, category: FailureCategory, - reason?: string + reason?: string, + failureCtx?: FailureContext ): void { this.sli.recordFailure(meshName, agentId, category, reason); this.circuitBreaker.recordFailure(agentId, reason || category); @@ -209,23 +235,33 @@ export class ReliabilityManager { } /** - * Route a failed message to DLQ + * Route a failed operation to the DLQ with full session context. + * + * The DLQ auto-determines recovery mode: + * - session_resume: sessionId present → can resume conversation + * - requeue: no session → re-inject message into queue + * - manual: retries exhausted → needs human intervention */ - deadLetter(msg: { - from_agent: string; - to_agent: string; - type: string; - payload: Record; - source_file?: string; - }, reason: string, retryCount?: number): void { + deadLetter( + meshName: string, + agentId: string, + category: FailureCategory, + reason: string, + ctx?: FailureContext + ): void { this.dlq.add({ - from_agent: msg.from_agent, - to_agent: msg.to_agent, - type: msg.type, - payload: msg.payload, - source_file: msg.source_file, + agent_id: agentId, + mesh_name: meshName, + session_id: ctx?.sessionId || undefined, + from_agent: ctx?.fromAgent || agentId, + to_agent: ctx?.toAgent || agentId, + type: ctx?.msgType || 'task', + payload: ctx?.payload || {}, + source_file: ctx?.sourceFile, failure_reason: reason, - retry_count: retryCount, + failure_category: category, + messages_sent: ctx?.messagesSent, + output_snapshot: ctx?.outputSnapshot, }); } @@ -238,29 +274,114 @@ export class ReliabilityManager { } // ============================================================ - // DLQ Replay API + // Session-Aware Recovery API // ============================================================ /** - * Replay all pending DLQ entries via SystemMessageWriter. - * Re-injects failed messages back into the live system. + * Recover all auto-recoverable DLQ entries. 
+ * + * For session_resume entries: calls sessionResumeHandler to resume + * the SDK session where it left off (preserves conversation history). + * + * For requeue entries: calls requeueHandler to re-inject the message + * into the queue for fresh dispatch. */ - replayDLQ(writer: SystemMessageWriter): ReplayResult[] { - return this.dlq.replayAll(writer); + async recoverAll( + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult[]> { + const entries = this.dlq.getRecoverable(); + const results: RecoveryResult[] = []; + + for (const entry of entries) { + const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + results.push(result); + } + + return results; } /** - * Replay a single DLQ entry by ID. + * Recover DLQ entries for a specific mesh. */ - replayDLQEntry(id: number, writer: SystemMessageWriter): ReplayResult { - return this.dlq.replayOne(id, writer); + async recoverForMesh( + meshName: string, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult[]> { + const entries = this.dlq.getForMesh(meshName); + const results: RecoveryResult[] = []; + + for (const entry of entries) { + if (entry.recovery_mode === 'manual') continue; + const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + results.push(result); + } + + return results; } /** - * Replay all DLQ entries for a specific agent. + * Recover a single DLQ entry by ID.
*/ - replayDLQForAgent(agentId: string, writer: SystemMessageWriter): ReplayResult[] { - return this.dlq.replayForAgent(agentId, writer); + async recoverById( + id: number, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult> { + const entry = this.dlq.getById(id); + if (!entry) { + return { id, success: false, mode: 'manual', error: 'DLQ entry not found' }; + } + return this.recoverEntry(entry, sessionResumeHandler, requeueHandler); + } + + /** + * Recover a single DLQ entry using the appropriate recovery mode. + */ + private async recoverEntry( + entry: DLQEntry, + sessionResumeHandler: SessionResumeHandler, + requeueHandler: RequeueHandler + ): Promise<RecoveryResult> { + if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Resume the SDK session — preserves full conversation history + try { + const result = await sessionResumeHandler(entry.agent_id, entry.session_id, entry.mesh_name); + if (result.success) { + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via session resume', { + id: entry.id, + agent: entry.agent_id, + sessionId: entry.session_id.slice(0, 8), + }); + return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + } else { + // Session resume failed — escalate to manual + this.dlq.escalateToManual(entry.id, result.error || 'Session resume failed'); + return { id: entry.id, success: false, mode: 'session_resume', error: result.error }; + } + } catch (err) { + this.dlq.escalateToManual(entry.id, (err as Error).message); + return { id: entry.id, success: false, mode: 'session_resume', error: (err as Error).message }; + } + } else if (entry.recovery_mode === 'requeue') { + // Re-inject message into the queue + const result = requeueHandler(entry); + if (result.success) { + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via requeue', { + id: entry.id, + agent: entry.agent_id, + }); + return { id: entry.id,
success: true, mode: 'requeue' }; + } else { + this.dlq.escalateToManual(entry.id, result.error || 'Requeue failed'); + return { id: entry.id, success: false, mode: 'requeue', error: result.error }; + } + } else { + return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; + } } // ============================================================ diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index f2d6d578..bc8bf5ea 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -4886,6 +4886,12 @@ You are working in an isolated git worktree for feature: **${hookContext.feature await machine.error(data.error); + // Reliability: categorize failure + const category = data.error?.includes('usage policy') ? 'policy_violation' + : data.error?.includes('timeout') ? 'timeout' + : data.error?.includes('overloaded') ? 'model_error' + : 'crash'; + // Check if we can retry const canRetry = await machine.canTransition('retry', { status: 'initializing', @@ -4915,17 +4921,29 @@ You are working in an isolated git worktree for feature: **${hookContext.feature }, 1000); } else { log.error('dispatcher', `Worker exhausted retries`, { agentId, workerId: errorWorkerId }); + + // Reliability: route to DLQ with session context for recovery + if (this.reliability) { + const sessionId = activeWorker?.runner.getSessionId() || undefined; + const msgsSent = activeWorker?.messagesSent?.length || 0; + this.reliability.deadLetter(meshName!, agentId, category, data.error || 'Unknown error', { + sessionId, + messagesSent: msgsSent, + fromAgent: nextMsg?.from_agent, + toAgent: agentId, + msgType: nextMsg?.type, + payload: nextMsg?.payload as Record<string, unknown>, + sourceFile: nextMsg?.source_file, + }); + } + + // Cleanup using consolidated helper this.cleanupWorker(agentId, errorWorkerId); } this.emit('worker:error', { ...data, workerId: errorWorkerId, transitionName: 'error' }); - // Reliability: record failure with categorization - const category =
data.error?.includes('usage policy') ? 'policy_violation' - : data.error?.includes('timeout') ? 'timeout' - : data.error?.includes('overloaded') ? 'model_error' - : 'crash'; + // Reliability: record failure this.reliability?.recordFailure(meshName!, agentId, category as any, data.error); }); From 962acfe4f9d4a5bdd35efdb18f9e5fb5e95e8ce2 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 17:56:29 +0000 Subject: [PATCH 04/12] feat(cli): Add tx mesh health and tx mesh dlq commands MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Surface reliability internals via CLI: - tx mesh health [mesh] — SLI nines dashboard with success rate, MTTR, failure categories, circuit breaker states, safe mode level, agent health, and DLQ summary. Per-mesh/per-agent breakdown when mesh name provided. - tx mesh dlq [mesh] — List pending dead letter queue entries with recovery mode (session_resume/requeue/manual), failure context, retry counts, and session hints. - tx mesh dlq clear — Garbage collect recovered DLQ entries. Both support --json for programmatic consumption. 
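
As a usage sketch (the mesh name `my-mesh` is a placeholder; only the subcommands and the `--json` flag described above come from this patch):

```sh
# SLI nines dashboard for one mesh, machine-readable
tx mesh health my-mesh --json

# List pending DLQ entries with recovery modes and retry counts
tx mesh dlq my-mesh

# Garbage-collect entries that have already been recovered
tx mesh dlq clear
```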
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- src/cli/mesh.ts | 214 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 214 insertions(+) diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index 1e89f8f9..c746ab82 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -20,6 +20,9 @@ * tx mesh ideal Ideal execution stages from routing + manifest * tx mesh drain Drain all pending messages (mark delivered, unblock queue) * tx mesh dump [mesh] Chronological dump of all events (logs, messages, sessions, prompts) + * tx mesh health [mesh] Reliability dashboard (SLI nines, circuit breakers, safe mode) + * tx mesh dlq [mesh] Dead letter queue entries with recovery modes + * tx mesh dlq clear Clear recovered DLQ entries */ import { MessageQueue, FSMPersistence } from '../queue/index.ts'; @@ -28,6 +31,8 @@ import { HeadlessRunner } from '../worker/headless-runner.ts'; import { SessionStore } from '../session/index.ts'; import { validateMesh } from './validate-mesh.ts'; import { MeshValidator } from '../worker/mesh-validator.ts'; +import { ReliabilityManager } from '../reliability/reliability-manager.ts'; +import { DeadLetterQueue } from '../reliability/dead-letter-queue.ts'; import { log } from '../shared/logger.ts'; import { chalk } from '../shared/colors.ts'; import { formatTimeAgo } from '../shared/time.ts'; @@ -3435,6 +3440,204 @@ async function meshDump(meshName: string | undefined, flags: MeshFlags): Promise /** * Print usage help */ +/** + * Show reliability health: SLI nines, circuit breakers, safe mode, heartbeat + */ +async function meshHealth(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.
Run a mesh first.')); + return; + } + + const queue = new MessageQueue(queuePath); + const reliability = new ReliabilityManager(queue.getDb(), cwd); + const status = reliability.getStatus(300_000); // 5 min window + + if (flags.json) { + console.log(JSON.stringify(status, null, 2)); + return; + } + + // Header + const nines = status.sli.ninesLevel; + const rate = (status.sli.successRate * 100).toFixed(2); + const ninesColor = status.sli.successRate >= 0.9999 ? chalk.green + : status.sli.successRate >= 0.999 ? chalk.cyan + : status.sli.successRate >= 0.99 ? chalk.yellow + : chalk.red; + + console.log(); + console.log(chalk.bold('Reliability Health')); + console.log(chalk.dim('─'.repeat(50))); + + // SLI + console.log(` Nines: ${ninesColor(nines)} (${rate}% success)`); + console.log(` Events: ${status.sli.totalEvents} total ${chalk.green(String(status.sli.totalSuccesses))} ok ${chalk.red(String(status.sli.totalFailures))} fail`); + if (status.sli.mttrMs != null) { + console.log(` MTTR: ${(status.sli.mttrMs / 1000).toFixed(1)}s`); + } + + // Failure categories + const cats = status.sli.failuresByCategory; + if (Object.keys(cats).length > 0) { + console.log(` Failures: ${Object.entries(cats).map(([k, v]) => `${k}=${v}`).join(' ')}`); + } + + // Safe mode + const safeLevelColor = status.safeMode.level === 'normal' ? chalk.green + : status.safeMode.level === 'cautious' ? chalk.yellow + : status.safeMode.level === 'restricted' ? chalk.red + : chalk.bgRed; + console.log(` Safe mode: ${safeLevelColor(status.safeMode.level)}${status.safeMode.autoEscalated ? 
chalk.dim(' (auto)') : ''}`); + + // Circuit breakers + const open = status.circuitBreakers.filter(cb => cb.state === 'open'); + const halfOpen = status.circuitBreakers.filter(cb => cb.state === 'half_open'); + if (open.length > 0 || halfOpen.length > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(' Circuit Breakers')); + for (const cb of open) { + console.log(` ${chalk.red('OPEN')} ${cb.agentId} (${cb.failures} failures)`); + } + for (const cb of halfOpen) { + console.log(` ${chalk.yellow('HALF_OPEN')} ${cb.agentId} (${cb.failures} failures)`); + } + } + + // Agent health + const unhealthy = status.agentHealth.filter(h => h.status !== 'healthy'); + if (unhealthy.length > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(' Agent Health')); + for (const h of unhealthy) { + const statusColor = h.status === 'dead' ? chalk.red : h.status === 'stale' ? chalk.yellow : chalk.dim; + console.log(` ${statusColor(h.status.padEnd(8))} ${h.agentId} silent ${(h.silenceMs / 1000).toFixed(0)}s`); + } + } + + // DLQ summary + if (status.dlq.pending > 0) { + console.log(chalk.dim('─'.repeat(50))); + console.log(` DLQ: ${chalk.red(String(status.dlq.pending) + ' pending')} ${chalk.dim(String(status.dlq.recovered) + ' recovered')}`); + const modes = status.dlq.byMode; + if (modes.session_resume > 0) console.log(` ${chalk.cyan(String(modes.session_resume))} session_resume`); + if (modes.requeue > 0) console.log(` ${chalk.yellow(String(modes.requeue))} requeue`); + if (modes.manual > 0) console.log(` ${chalk.red(String(modes.manual))} manual`); + console.log(` ${chalk.dim('Use')} tx mesh dlq ${chalk.dim('for details')}`); + } else { + console.log(` DLQ: ${chalk.green('clean')}`); + } + + // Per-mesh breakdown if requested + if (meshName) { + const meshSLI = status.sli.byMesh[meshName]; + if (meshSLI) { + console.log(chalk.dim('─'.repeat(50))); + console.log(chalk.bold(` Mesh: ${meshName}`)); + console.log(` Rate: ${(meshSLI.rate * 
100).toFixed(1)}% (${meshSLI.success}/${meshSLI.total})`); + } + // Per-agent within mesh + const agentEntries = Object.entries(status.sli.byAgent) + .filter(([id]) => id.startsWith(`${meshName}/`)); + for (const [id, data] of agentEntries) { + const agentName = id.split('/').slice(1).join('/'); + const rateColor = data.rate >= 0.99 ? chalk.green : data.rate >= 0.9 ? chalk.yellow : chalk.red; + console.log(` ${agentName.padEnd(20)} ${rateColor((data.rate * 100).toFixed(0) + '%')} (${data.success}/${data.total})`); + } + } + + console.log(); +} + +/** + * Show and manage dead letter queue entries + * + * tx mesh dlq List pending DLQ entries + * tx mesh dlq <mesh> List DLQ entries for a mesh + * tx mesh dlq clear Clear recovered entries + */ +async function meshDLQ(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.')); + return; + } + + const queue = new MessageQueue(queuePath); + const dlq = new DeadLetterQueue(queue.getDb()); + + // Special action: clear recovered + if (meshName === 'clear') { + const cleared = dlq.clearRecovered(); + console.log(cleared > 0 + ? chalk.green(`Cleared ${cleared} recovered DLQ entries.`) + : chalk.dim('No recovered entries to clear.')); + return; + } + + const entries = meshName ?
dlq.getForMesh(meshName) : dlq.getPending(); + const stats = dlq.getStats(); + + if (flags.json) { + console.log(JSON.stringify({ stats, entries }, null, 2)); + return; + } + + console.log(); + console.log(chalk.bold('Dead Letter Queue')); + console.log(chalk.dim('─'.repeat(70))); + console.log(` Total: ${stats.total} Pending: ${chalk.red(String(stats.pending))} Recovered: ${chalk.green(String(stats.recovered))}`); + + if (entries.length === 0) { + console.log(chalk.green('\n No pending DLQ entries.')); + console.log(); + return; + } + + console.log(); + + for (const entry of entries) { + const modeColor = entry.recovery_mode === 'session_resume' ? chalk.cyan + : entry.recovery_mode === 'requeue' ? chalk.yellow + : chalk.red; + const age = formatTimeAgo(entry.first_failed_at); + const sessionHint = entry.session_id ? chalk.dim(` sid:${entry.session_id.slice(0, 8)}`) : ''; + + console.log(` ${chalk.dim('#' + entry.id)} ${modeColor(entry.recovery_mode.padEnd(16))} ${chalk.bold(entry.agent_id)}`); + console.log(` ${chalk.dim('mesh:')} ${entry.mesh_name} ${chalk.dim('category:')} ${entry.failure_category} ${chalk.dim('retries:')} ${entry.retry_count}/${entry.max_retries}${sessionHint}`); + console.log(` ${chalk.dim('reason:')} ${entry.failure_reason.slice(0, 80)}`); + if (entry.messages_sent > 0) { + console.log(` ${chalk.dim('msgs sent:')} ${entry.messages_sent} before failure`); + } + console.log(` ${chalk.dim('failed:')} ${age}${entry.recovered_at ? 
chalk.green(' recovered') : ''}`); + console.log(); + } + + // Recovery hints + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); + const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); + const manual = entries.filter(e => e.recovery_mode === 'manual'); + + if (resumable.length > 0 || requeueable.length > 0 || manual.length > 0) { + console.log(chalk.dim('─'.repeat(70))); + if (resumable.length > 0) { + console.log(` ${chalk.cyan(String(resumable.length))} can resume session (conversation preserved)`); + } + if (requeueable.length > 0) { + console.log(` ${chalk.yellow(String(requeueable.length))} can be requeued (fresh dispatch)`); + } + if (manual.length > 0) { + console.log(` ${chalk.red(String(manual.length))} need manual intervention`); + } + console.log(); + } +} + function printUsage(): void { console.log(` ${chalk.bold('Usage:')} tx mesh [mesh] [options] @@ -3456,6 +3659,9 @@ ${chalk.bold('Actions:')} ${chalk.cyan('guardrails')} [mesh] Show guardrail violations from activity logs ${chalk.cyan('ideal')} Ideal execution stages from routing + manifest ${chalk.cyan('dump')} [mesh] Chronological dump of all events (logs, msgs, sessions, prompts) + ${chalk.cyan('health')} [mesh] Reliability dashboard (SLI nines, circuits, safe mode, DLQ) + ${chalk.cyan('dlq')} [mesh] Dead letter queue (pending failures, recovery modes) + ${chalk.cyan('dlq clear')} Clear recovered DLQ entries ${chalk.bold('Options:')} ${chalk.dim('--json')} Output as JSON @@ -3647,6 +3853,14 @@ export async function mesh(args: string[]): Promise { await meshDump(meshName, flags); break; + case 'health': + await meshHealth(meshName, flags); + break; + + case 'dlq': + await meshDLQ(meshName, flags); + break; + default: printUsage(); } From 96bc2a0f8a5f5a11b106d1095862414754d27699 Mon Sep 17 00:00:00 2001 From: Claude Date: Mon, 9 Mar 2026 23:52:38 +0000 Subject: [PATCH 05/12] feat(reliability): Wire all features end-to-end with CLI and docs MIME-Version: 
1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Every reliability feature now has actuation, not just observation: Heartbeat dead → kill: ReliabilityManager.bindDispatcher() receives killAgent callback. When heartbeat fires 'dead', it kills the stuck worker via AbortController.abort() and records the failure. DLQ recovery (3 trigger paths): 1. Automatic on startup — dispatcher calls recoverAll() 2. CLI — tx mesh recover sends SIGUSR2 to dispatcher 3. Front-matter — message with recover: true triggers recovery Session resume: writes message with session-id front-matter so dispatcher spawns worker resuming the SDK conversation. Requeue: re-injects original message via SystemMessageWriter. Safe mode enforcement: createSafeModeHook() returns a PreToolUse hook (same pattern as write-gate) that blocks Write/Edit/Bash at restricted+ levels. Hook is registered per-agent at spawn time. SIGUSR2 dlq-recover control signal in start.ts. tx mesh recover CLI with SIGUSR2 + message fallback. Test mesh config with tight thresholds for quick testing. docs/reliability.md — complete guide for all features. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 242 +++++++++++++++++++++ meshes/reliability-test/config.yaml | 56 ++--- src/cli/mesh.ts | 100 +++++++++ src/cli/start.ts | 10 + src/reliability/index.ts | 8 +- src/reliability/reliability-manager.ts | 277 ++++++++++++++++--------- src/worker/dispatcher.ts | 57 +++++ 7 files changed, 615 insertions(+), 135 deletions(-) create mode 100644 docs/reliability.md diff --git a/docs/reliability.md b/docs/reliability.md new file mode 100644 index 00000000..9e62dfd0 --- /dev/null +++ b/docs/reliability.md @@ -0,0 +1,242 @@ +# Reliability — Four Nines + +TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. 
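To make the ladder concrete, here is a small illustrative sketch (not part of the module; the function name is hypothetical) of how a raw success rate maps to a nines level. The thresholds mirror the ones `tx mesh health` uses when coloring its output:

```typescript
// Illustrative only: map a raw success rate onto the "nines" ladder.
// Thresholds match the health dashboard's coloring (99.99 / 99.9 / 99 / 90).
function ninesLevel(successRate: number): string {
  if (successRate >= 0.9999) return 'four nines';
  if (successRate >= 0.999) return 'three nines';
  if (successRate >= 0.99) return 'two nines';
  if (successRate >= 0.9) return 'one nine';
  return 'below one nine';
}

// e.g. a mesh at a 99.9% success rate sits at three nines
console.log(ninesLevel(0.999)); // 'three nines'
```

Each step up the ladder demands roughly a tenfold reduction in failures, which is why each nine needs new machinery rather than more of the same.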
+ +## Quick Start + +```bash +# View reliability dashboard +tx mesh health + +# View per-mesh reliability +tx mesh health reliability-test + +# View dead letter queue +tx mesh dlq + +# Recover failed work +tx mesh recover reliability-test +``` + +## Configuration + +Set reliability thresholds in `.ai/tx/data/config.yaml`: + +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 +``` + +## Features + +### 1. Circuit Breaker + +**What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. + +**States**: `closed` (normal) → `open` (blocked) → `half_open` (testing) + +**How it works**: +- Each agent has an independent circuit +- After `failureThreshold` consecutive failures, circuit opens +- While open, `canSpawn()` returns false — dispatcher skips that agent +- After `cooldownMs`, circuit moves to half_open — allows one test spawn +- Success closes the circuit; failure re-opens it + +**State persists to SQLite** — survives restarts. + +**Observe it**: +```bash +tx mesh health # Shows open/half_open circuits +tx spy # Watch for reliability:blocked activity +``` + +### 2. Heartbeat Monitor + +**What it does**: Detects stuck workers and kills them. 
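The silence-threshold check can be sketched as follows (illustrative only; `classifySilence` and its config shape are hypothetical, but the default values match the `heartbeat` configuration shown above):

```typescript
type AgentHealthStatus = 'healthy' | 'warn' | 'stale' | 'dead';

// Illustrative sketch: classify an agent by how long it has been silent.
// Defaults mirror the documented config: warn 60s, stale 120s, dead 300s.
function classifySilence(
  silenceMs: number,
  cfg = { warnMs: 60_000, staleMs: 120_000, deadMs: 300_000 },
): AgentHealthStatus {
  if (silenceMs >= cfg.deadMs) return 'dead'; // the monitor kills the worker here
  if (silenceMs >= cfg.staleMs) return 'stale';
  if (silenceMs >= cfg.warnMs) return 'warn';
  return 'healthy';
}
```

In the real monitor a background timer runs this kind of check per agent; every worker output event resets the silence clock.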
+ +**Thresholds**: `warn` → `stale` → `dead` + +**How it works**: +- On spawn, agent is registered with the heartbeat monitor +- Every worker output event records a heartbeat +- A background timer checks silence intervals +- At `warnMs`: logs a warning +- At `staleMs`: logs a stale warning +- At `deadMs`: **kills the worker** via `AbortController.abort()`, records failure, routes to DLQ + +**Observe it**: +```bash +tx mesh health # Shows unhealthy agents with silence duration +tx logs --component reliability # Heartbeat kill events +``` + +### 3. Dead Letter Queue (DLQ) + +**What it does**: Captures failed work with enough context to recover it. + +**Recovery modes**: +- `session_resume`: Agent had an active SDK session → recovery spawns a new worker with `session-id` front-matter, resuming the conversation where it left off. **Conversation history preserved.** +- `requeue`: No session existed → original message is re-injected into the queue for fresh dispatch. +- `manual`: Retries exhausted → needs human decision. + +**How entries are created**: +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion + +**How recovery works**: + +1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. + +2. **CLI**: `tx mesh recover ` sends a SIGUSR2 signal to the running dispatcher, triggering recovery for that mesh's DLQ entries. + +3. **Front-matter message**: An agent (or core) can write a message with `recover: true` to trigger DLQ recovery: + ```markdown + --- + to: reliability-test/planner + from: core/core + type: task + recover: true + --- + Recover failed work. + ``` + +4. 
**Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. + +**Observe it**: +```bash +tx mesh dlq # List pending entries with recovery mode +tx mesh dlq my-mesh # Filter by mesh +tx mesh dlq --json # Machine-readable output +tx mesh dlq clear # GC recovered entries +``` + +### 4. SLI Tracker + +**What it does**: Measures success rate, failure categories, MTTR, and nines level. + +**Metrics tracked**: +- Success rate (per-mesh, per-agent, overall) +- Nines level (90%, 99%, 99.9%, 99.99%) +- Mean Time To Recovery (MTTR) +- Failure taxonomy: `crash`, `timeout`, `model_error`, `policy_violation`, `circuit_open`, `stuck` + +**How it works**: +- `recordSuccess()` on worker completion, `recordFailure()` on worker error +- In-memory with configurable retention window +- Feeds safe mode auto-escalation + +**Observe it**: +```bash +tx mesh health # Nines level, MTTR, failure breakdown +tx mesh health my-mesh # Per-agent success rates +tx mesh health --json # Full snapshot +``` + +### 5. Safe Mode + +**What it does**: Restricts agent capabilities when reliability drops. 
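Under the documented default thresholds, the auto-escalation rule can be sketched like this (illustrative only; `levelForSLI` is a hypothetical name, not the actual implementation):

```typescript
type SafeModeLevel = 'normal' | 'cautious' | 'restricted' | 'lockdown';

// Illustrative sketch: pick the most restrictive level whose threshold
// the current SLI has fallen below. Defaults: 0.95 / 0.90 / 0.80.
function levelForSLI(
  sli: number,
  t = { cautious: 0.95, restricted: 0.90, lockdown: 0.80 },
): SafeModeLevel {
  if (sli < t.lockdown) return 'lockdown';
  if (sli < t.restricted) return 'restricted';
  if (sli < t.cautious) return 'cautious';
  return 'normal';
}
```

A real caller would ratchet: take the max of the current level and the computed one, since escalation is one-way until a human clears it.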
+ +**Levels**: +| Level | Tool restrictions | Trigger | +|-------|------------------|---------| +| `normal` | None | Default | +| `cautious` | None (action-level blocks only) | SLI < cautiousThreshold | +| `restricted` | Write, Edit, NotebookEdit, Bash blocked | SLI < restrictedThreshold | +| `lockdown` | All tools blocked, spawns blocked | SLI < lockdownThreshold | + +**How it works**: +- After every failure, SLI is evaluated against thresholds +- If `autoEscalate: true` and SLI drops below a threshold, safe mode escalates +- **Only escalates, never auto-de-escalates** — human must clear it +- At `restricted`+: a PreToolUse hook blocks Write/Edit/Bash calls +- At `lockdown`: `canSpawn()` blocks all new workers for that mesh + +**Enforcement**: Safe mode hook is registered as a PreToolUse hook alongside write-gate and identity-gate. When an agent tries to use a blocked tool, it gets a rejection message explaining the restriction. + +**Observe it**: +```bash +tx mesh health # Shows current safe mode level +tx spy # Watch safe-mode:blocked activity events +``` + +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + +## Front-Matter Options + +Agents can interact with reliability features via message front-matter: + +| Field | Value | Effect | +|-------|-------|--------| +| `recover` | `true` | Triggers DLQ recovery for the target mesh | +| `session-id` | SDK session ID | Spawns worker resuming that session | +| `resume-mesh` | `true` | Preserves mesh state instead of 
clearing on entry | + +## CLI Reference + +| Command | Description | +|---------|-------------| +| `tx mesh health [mesh]` | Reliability dashboard (SLI, circuits, safe mode, DLQ) | +| `tx mesh health --json` | Machine-readable health output | +| `tx mesh dlq [mesh]` | List dead letter queue entries | +| `tx mesh dlq clear` | Clear recovered DLQ entries | +| `tx mesh recover ` | Trigger DLQ recovery via running dispatcher | +| `tx mesh recover --all` | Recover all pending DLQ entries | + +## Architecture + +``` + ┌──────────────────────┐ + │ ReliabilityManager │ + │ │ + │ ┌─ SLI Tracker │ + │ ├─ Circuit Breaker │ ← SQLite persisted + │ ├─ Heartbeat Monitor│ ← kills via bindings + │ ├─ Dead Letter Queue│ ← SQLite persisted + │ └─ Safe Mode │ ← PreToolUse hook + │ │ + │ bindDispatcher({ │ + │ killAgent, │ ← WorkerLifecycle.killForAgent + │ requeueMessage, │ ← SystemMessageWriter.write + │ }) │ + └──────────┬───────────┘ + │ + ┌────────────────┼────────────────┐ + │ │ │ + ┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐ + │ canSpawn() │ │recordFail │ │ heartbeat │ + │ safe mode │ │ + DLQ │ │ dead→kill │ + │ + circuit │ │ + SLI │ │ + DLQ │ + └────────────┘ └───────────┘ └───────────┘ +``` diff --git a/meshes/reliability-test/config.yaml b/meshes/reliability-test/config.yaml index 584b4233..7f7fedc6 100644 --- a/meshes/reliability-test/config.yaml +++ b/meshes/reliability-test/config.yaml @@ -1,14 +1,11 @@ # reliability-test/config.yaml # Test mesh for validating four-nines reliability features # -# Exercises: circuit breakers, heartbeat monitoring, SLI tracking, -# dead letter queue, safe mode, and failure recovery. -# -# This mesh has an intentionally fragile agent (chaos-agent) that may -# produce routing errors or slow output to test reliability detection. +# Uses tight thresholds so reliability events trigger quickly during testing. +# See docs/reliability.md for how to exercise each feature. 
mesh: reliability-test -description: "Test mesh for four-nines reliability features: circuit breakers, heartbeat, SLI, DLQ, safe mode" +description: "Reliability test mesh: circuit breakers, heartbeat, SLI, DLQ, safe mode" agents: - name: planner @@ -45,42 +42,31 @@ routing: blocked: worker: "Checks failed, rework needed" -# Reliability-specific guardrails for testing +# Tight guardrails to trigger reliability events quickly guardrails: max_messages: strict: true warning: true - limit: 20 + limit: 10 max_turns: strict: false warning: true - limit: 15 - routing_error: - strict: false - warning: true - max_retries: 2 + limit: 8 + +# Reliability config — tight thresholds for testing +reliability: + circuitBreaker: + failureThreshold: 2 # Opens after just 2 failures (default: 3) + cooldownMs: 15000 # 15s cooldown (default: 30s) + heartbeat: + warnMs: 30000 # Warn after 30s silence (default: 60s) + staleMs: 60000 # Stale after 60s (default: 120s) + deadMs: 120000 # Dead after 120s (default: 300s) + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.80 # Cautious when <80% success + restrictedThreshold: 0.50 # Restricted when <50% + lockdownThreshold: 0.25 # Lockdown when <25% -# Workspace for output workspace: path: ".ai/output/{task-id}/" - -lifecycle: - post: - - commit:auto - -playbook_notes: | - Reliability test mesh for exercising four-nines patterns: - - 1. PLANNER: Breaks down the task into steps - 2. WORKER: Executes the implementation - 3. 
CHECKER: Validates the output - - This mesh is configured with tight guardrails to exercise: - - Circuit breaker trips on repeated failures - - Heartbeat detection on stalled agents - - SLI tracking for success/failure rates - - DLQ routing for undeliverable messages - - Safe mode escalation when SLI drops - - Run with: tx msg "Implement a simple hello world function" - Monitor with: tx status (shows reliability metrics) diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index c746ab82..93360739 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -3638,6 +3638,100 @@ async function meshDLQ(meshName: string | undefined, flags: MeshFlags): Promise< } +/** + * Trigger DLQ recovery for a mesh via the running dispatcher. + * Uses SIGUSR2 control signal (same pattern as tx mesh kill). + * + * If the dispatcher is not running, falls back to writing a recovery + * message directly so it's picked up on next start. + * + * tx mesh recover <mesh> Recover DLQ entries for a mesh + * tx mesh recover --all Recover all DLQ entries + */ +async function meshRecover(meshName: string | undefined, flags: MeshFlags): Promise<void> { + const cwd = process.env.TX_CWD || process.cwd(); + const queuePath = path.join(cwd, '.ai/tx/queue.db'); + const dataDir = path.join(cwd, '.ai/tx/data'); + const pidFile = path.join(dataDir, '.pid'); + const controlFile = path.join(dataDir, 'control.json'); + + if (!fs.existsSync(queuePath)) { + console.log(chalk.yellow('No queue database found.')); + return; + } + + // Check what's in the DLQ first + const queue = new MessageQueue(queuePath); + const dlq = new DeadLetterQueue(queue.getDb()); + + const entries = meshName && !flags.all + ?
dlq.getForMesh(meshName).filter(e => e.recovery_mode !== 'manual') + : dlq.getRecoverable(); + + if (entries.length === 0) { + console.log(chalk.green('No recoverable DLQ entries.')); + return; + } + + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); + const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); + console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue`); + + // Try SIGUSR2 to running dispatcher + if (fs.existsSync(pidFile)) { + const pid = parseInt(fs.readFileSync(pidFile, 'utf-8').trim(), 10); + if (!isNaN(pid)) { + const target = meshName && !flags.all ? meshName : '_all'; + fs.writeFileSync(controlFile, JSON.stringify({ action: 'dlq-recover', mesh: target })); + + try { + process.kill(pid, 'SIGUSR2'); + + // Wait for ACK + for (let i = 0; i < 50; i++) { + if (!fs.existsSync(controlFile)) { + console.log(chalk.green(`Recovery triggered successfully.`)); + return; + } + await new Promise(r => setTimeout(r, 100)); + } + console.log(chalk.yellow('Timeout waiting for dispatcher. 
Entries will be recovered on next start.')); + if (fs.existsSync(controlFile)) fs.unlinkSync(controlFile); + return; + } catch { + // Process not running — fall through to message-based recovery + if (fs.existsSync(controlFile)) fs.unlinkSync(controlFile); + } + } + } + + // Fallback: write a recovery message so next start picks it up + // Use SystemMessageWriter pattern — write directly to msgs dir + if (meshName) { + const msgsDir = path.join(cwd, '.ai/tx/msgs'); + if (!fs.existsSync(msgsDir)) fs.mkdirSync(msgsDir, { recursive: true }); + + // Look up entry point from mesh config + const meshDir = path.join(cwd, 'meshes', meshName); + let entryPoint = 'worker'; + const configPath = path.join(meshDir, 'config.yaml'); + if (fs.existsSync(configPath)) { + try { + const cfg = YAML.parse(fs.readFileSync(configPath, 'utf-8')); + entryPoint = cfg.entry_point || cfg.agents?.[0]?.name || 'worker'; + } catch { /* use default */ } + } + + const timestamp = Date.now(); + const filename = `${timestamp}-task-system-dlq-recovery--${meshName}-${entryPoint}-recover.md`; + const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; + fs.writeFileSync(path.join(msgsDir, filename), content); + console.log(chalk.cyan(`Recovery message written. Will be processed on next tx start.`)); + } else { + console.log(chalk.yellow('Cannot write fallback recovery without mesh name. 
Use: tx mesh recover ')); + } +} + function printUsage(): void { console.log(` ${chalk.bold('Usage:')} tx mesh [mesh] [options] @@ -3662,6 +3756,8 @@ ${chalk.bold('Actions:')} ${chalk.cyan('health')} [mesh] Reliability dashboard (SLI nines, circuits, safe mode, DLQ) ${chalk.cyan('dlq')} [mesh] Dead letter queue (pending failures, recovery modes) ${chalk.cyan('dlq clear')} Clear recovered DLQ entries + ${chalk.cyan('recover')} Trigger DLQ recovery (session resume or requeue) + ${chalk.cyan('recover')} --all Recover all pending DLQ entries ${chalk.bold('Options:')} ${chalk.dim('--json')} Output as JSON @@ -3861,6 +3957,10 @@ export async function mesh(args: string[]): Promise { await meshDLQ(meshName, flags); break; + case 'recover': + await meshRecover(meshName, flags); + break; + default: printUsage(); } diff --git a/src/cli/start.ts b/src/cli/start.ts index 3906c4d0..2dda9920 100644 --- a/src/cli/start.ts +++ b/src/cli/start.ts @@ -271,6 +271,16 @@ export async function start(workDir?: string, options?: StartOptions): Promise r.success).length; + log.info('start', 'SIGUSR2: DLQ recovery', { + mesh: ctrl.mesh, attempted: results.length, succeeded, + }); + } } // Delete control file as ACK diff --git a/src/reliability/index.ts b/src/reliability/index.ts index 718ecc3d..ee6861fb 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -3,17 +3,17 @@ * * Implements four-nines (99.99%) reliability patterns for TX mesh execution: * - * Nine 1 (90%): Basic error handling, logging ✓ (existing) + * Nine 1 (90%): Basic error handling, logging (existing) * Nine 2 (99%): Dead letter queue, session-aware recovery, idempotency - * Nine 3 (99.9%): Circuit breakers, heartbeat detection, structured traces - * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode, canary checks + * Nine 3 (99.9%): Circuit breakers, heartbeat detection + kill, structured traces + * Nine 4 (99.99%): SLI tracking, failure taxonomy, safe-mode enforcement * * Key insight: Recovery 
via session resume (not raw message replay) preserves * full conversation history and tool state. Reference: Karpathy's "March of Nines" */ export { DeadLetterQueue, type DLQEntry, type DLQStats, type RecoveryMode, type RecoveryResult } from './dead-letter-queue.ts'; -export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type SessionResumeHandler, type RequeueHandler } from './reliability-manager.ts'; +export { ReliabilityManager, type ReliabilityConfig, type ReliabilityStatus, type FailureContext, type DispatcherBindings } from './reliability-manager.ts'; export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } from './circuit-breaker.ts'; export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 73e50b31..4537b6fe 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -1,24 +1,21 @@ /** * ReliabilityManager - Central coordinator for all reliability features * - * Provides a single integration point for the dispatcher to wire up: + * Provides a single integration point for the dispatcher: * - Dead letter queue (session-aware failure recovery) * - Circuit breakers (cascading failure prevention) - * - Heartbeat monitoring (stalled worker detection) + * - Heartbeat monitoring (stalled worker detection + kill) * - SLI tracking (reliability measurement) - * - Safe mode (gradual autonomy control) + * - Safe mode (gradual autonomy control via PreToolUse hooks) * * Usage in dispatcher.start(): - * this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + * this.reliability = new ReliabilityManager(db, workDir); + * this.reliability.bindDispatcher({ + * killAgent: (agentId, reason) => 
this.workerLifecycle.killForAgent(agentId, reason), + * requeueMessage: (from, to, type, payload) => this.systemWriter.write({...}), + * getActiveSessionId: (agentId) => worker?.runner.getSessionId(), + * }); * this.reliability.start(); - * - * Wire events: - * // On worker complete - * this.reliability.recordSuccess(meshName, agentId, durationMs); - * // On worker error (with session context for DLQ) - * this.reliability.recordFailure(meshName, agentId, 'crash', error.message, { sessionId, messagesSent }); - * // On worker output (heartbeat) - * this.reliability.heartbeat(agentId); */ import type Database from 'better-sqlite3'; @@ -79,15 +76,17 @@ export interface FailureContext { payload?: Record; } -/** Callback for session resume recovery */ -export type SessionResumeHandler = ( - agentId: string, - sessionId: string, - meshName: string -) => Promise<{ success: boolean; error?: string }>; - -/** Callback for message requeue recovery */ -export type RequeueHandler = (entry: DLQEntry) => { success: boolean; error?: string }; +/** + * Dispatcher callbacks — these let the reliability manager + * take real action (kill workers, requeue messages) without + * importing the dispatcher directly. 
+ */ +export interface DispatcherBindings { + /** Kill all workers for an agent, returns count killed */ + killAgent: (agentId: string, reason: string) => number; + /** Write a message into the queue (for requeue recovery) */ + requeueMessage: (from: string, to: string, type: string, payload: Record<string, unknown>, extraFrontmatter?: Record<string, unknown>) => void; +} export class ReliabilityManager { readonly dlq: DeadLetterQueue; @@ -96,6 +95,7 @@ readonly circuitBreaker: CircuitBreaker; readonly heartbeat: HeartbeatMonitor; readonly sli: SLITracker; readonly safeMode: SafeMode; private workDir: string; + private bindings?: DispatcherBindings; constructor(db: Database.Database, workDir: string, config?: ReliabilityConfig) { this.workDir = workDir; @@ -110,7 +110,23 @@ this.sli = new SLITracker(merged.sli); this.safeMode = new SafeMode(merged.safeMode); - // Wire heartbeat callbacks + log.info('reliability', 'ReliabilityManager initialized', { + dlqMaxRetries: merged.dlq?.maxRetries || 3, + cbThreshold: merged.circuitBreaker?.failureThreshold || 3, + safeModeDefault: merged.safeMode?.defaultLevel || 'normal', + autoEscalate: merged.safeMode?.autoEscalate || false, + }); + } + + /** + * Bind dispatcher actions. Must be called before start(). + * This gives the reliability manager the ability to actually + * kill stuck workers and requeue messages — not just observe.
+ */ + bindDispatcher(bindings: DispatcherBindings): void { + this.bindings = bindings; + + // Now that we can kill, wire the heartbeat dead callback this.heartbeat.on('stale', (health) => { log.warn('reliability', `Agent stale: ${health.agentId}`, { silenceMs: health.silenceMs, @@ -118,20 +134,25 @@ export class ReliabilityManager { }); this.heartbeat.on('dead', (health) => { - this.recordFailure( - health.agentId.split('/')[0], - health.agentId, - 'stuck', - `No output for ${Math.round(health.silenceMs / 1000)}s` - ); - }); + const meshName = health.agentId.split('/')[0]; - log.info('reliability', 'ReliabilityManager initialized', { - dlqMaxRetries: merged.dlq?.maxRetries || 3, - cbThreshold: merged.circuitBreaker?.failureThreshold || 3, - safeModeDefault: merged.safeMode?.defaultLevel || 'normal', - autoEscalate: merged.safeMode?.autoEscalate || false, + // Record failure (updates SLI, circuit breaker, safe mode) + this.recordFailure(meshName, health.agentId, 'stuck', + `No output for ${Math.round(health.silenceMs / 1000)}s`); + + // Kill the stuck worker + const killed = this.bindings!.killAgent(health.agentId, `heartbeat dead: ${Math.round(health.silenceMs / 1000)}s silent`); + log.warn('reliability', `Killed stuck agent`, { + agentId: health.agentId, + silenceMs: health.silenceMs, + workersKilled: killed, + }); + + log.activity('reliability:heartbeat-kill', health.agentId, + `Killed after ${Math.round(health.silenceMs / 1000)}s silence`); }); + + log.info('reliability', 'Dispatcher bindings attached'); } /** @@ -213,17 +234,12 @@ export class ReliabilityManager { /** * Record failure with optional session context for DLQ routing. - * - * When failureCtx includes a sessionId, the DLQ entry is marked for - * session_resume recovery (picks up exactly where the agent left off). - * Without a sessionId, it falls back to message requeue. 
*/ recordFailure( meshName: string, agentId: string, category: FailureCategory, reason?: string, - failureCtx?: FailureContext ): void { this.sli.recordFailure(meshName, agentId, category, reason); this.circuitBreaker.recordFailure(agentId, reason || category); @@ -265,6 +281,46 @@ export class ReliabilityManager { }); } + /** + * Create a PreToolUse hook that enforces safe mode tool restrictions. + * Returns a hook object compatible with the dispatcher's chaos hooks. + * + * At 'restricted' level: blocks Write, Edit, NotebookEdit, Bash + * At 'lockdown' level: blocks everything (spawn already blocked) + * At 'cautious' level: allows all tools (restrictions are action-level) + */ + createSafeModeHook(meshName: string, agentId: string): { matcher: string; hooks: Array<(input: unknown) => { decision: string; reason?: string }> } | null { + const level = this.safeMode.getLevel(meshName); + if (level === 'normal') return null; + + const state = this.safeMode.getState(meshName); + const disabledTools = state.disabledTools; + if (disabledTools.length === 0) return null; + + return { + matcher: '*', // Check all tools + hooks: [(input: unknown) => { + const toolInput = input as { tool_name?: string }; + const toolName = toolInput?.tool_name || ''; + + if (disabledTools.includes(toolName)) { + log.warn('safe-mode', `Blocked tool ${toolName}`, { + agentId, + meshName, + level, + }); + log.activity('safe-mode:blocked', agentId, `${toolName} blocked at ${level} level`); + + return { + decision: 'block', + reason: `Safe mode ${level}: ${toolName} is disabled. 
Current restrictions: ${disabledTools.join(', ')}`, + }; + } + return { decision: 'allow' }; + }], + }; + } + /** * Clean up for a mesh (call on mesh complete) */ @@ -274,28 +330,33 @@ export class ReliabilityManager { } // ============================================================ - // Session-Aware Recovery API + // DLQ Recovery — triggered by CLI or front-matter message // ============================================================ /** * Recover all auto-recoverable DLQ entries. * - * For session_resume entries: calls sessionResumeHandler to resume - * the SDK session where it left off (preserves conversation history). + * For session_resume: writes a new message to the target agent + * with session-id front-matter so the dispatcher spawns with resume. + * + * For requeue: re-injects the original message into the queue. * - * For requeue entries: calls requeueHandler to re-inject the message - * into the queue for fresh dispatch. + * Requires bindings — call bindDispatcher() first. */ - async recoverAll( - sessionResumeHandler: SessionResumeHandler, - requeueHandler: RequeueHandler - ): Promise { + recoverAll(): RecoveryResult[] { + if (!this.bindings) { + log.error('reliability', 'Cannot recover: no dispatcher bindings'); + return []; + } + const entries = this.dlq.getRecoverable(); + if (entries.length === 0) return []; + + log.info('reliability', `Recovering ${entries.length} DLQ entries`); const results: RecoveryResult[] = []; for (const entry of entries) { - const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler); - results.push(result); + results.push(this.recoverEntry(entry)); } return results; @@ -304,18 +365,15 @@ export class ReliabilityManager { /** * Recover DLQ entries for a specific mesh. 
 */
-  async recoverForMesh(
-    meshName: string,
-    sessionResumeHandler: SessionResumeHandler,
-    requeueHandler: RequeueHandler
-  ): Promise<RecoveryResult[]> {
+  recoverForMesh(meshName: string): RecoveryResult[] {
+    if (!this.bindings) return [];
+
     const entries = this.dlq.getForMesh(meshName);
     const results: RecoveryResult[] = [];

     for (const entry of entries) {
       if (entry.recovery_mode === 'manual') continue;
-      const result = await this.recoverEntry(entry, sessionResumeHandler, requeueHandler);
-      results.push(result);
+      results.push(this.recoverEntry(entry));
     }

     return results;
@@ -324,63 +382,90 @@
   /**
    * Recover a single DLQ entry by ID.
    */
-  async recoverById(
-    id: number,
-    sessionResumeHandler: SessionResumeHandler,
-    requeueHandler: RequeueHandler
-  ): Promise<RecoveryResult> {
+  recoverById(id: number): RecoveryResult {
+    if (!this.bindings) {
+      return { id, success: false, mode: 'manual', error: 'No dispatcher bindings' };
+    }
+
     const entry = this.dlq.getById(id);
     if (!entry) {
       return { id, success: false, mode: 'manual', error: 'DLQ entry not found' };
     }

-    return this.recoverEntry(entry, sessionResumeHandler, requeueHandler);
+    return this.recoverEntry(entry);
   }

   /**
    * Recover a single DLQ entry using the appropriate recovery mode.
+   *
+   * session_resume: Write a message to the agent with session-id in
+   * front-matter. The dispatcher's existing session-id handling spawns
+   * a new worker that resumes the SDK conversation.
+   *
+   * requeue: Re-inject the original message from→to with its payload.
*/ - private async recoverEntry( - entry: DLQEntry, - sessionResumeHandler: SessionResumeHandler, - requeueHandler: RequeueHandler - ): Promise { - if (entry.recovery_mode === 'session_resume' && entry.session_id) { - // Resume the SDK session — preserves full conversation history - try { - const result = await sessionResumeHandler(entry.agent_id, entry.session_id, entry.mesh_name); - if (result.success) { - this.dlq.markRecovered(entry.id); - log.info('reliability', 'DLQ entry recovered via session resume', { - id: entry.id, - agent: entry.agent_id, - sessionId: entry.session_id.slice(0, 8), - }); - return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; - } else { - // Session resume failed — escalate to manual - this.dlq.escalateToManual(entry.id, result.error || 'Session resume failed'); - return { id: entry.id, success: false, mode: 'session_resume', error: result.error }; + private recoverEntry(entry: DLQEntry): RecoveryResult { + try { + if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Write a recovery message with session-id front-matter + // The dispatcher already handles session-id: spawns worker resuming that session + this.bindings!.requeueMessage( + 'system/dlq-recovery', + entry.agent_id, + 'task', + { + headline: `DLQ recovery: resuming session ${entry.session_id.slice(0, 8)}`, + body: `Resuming failed work. 
Original failure: ${entry.failure_reason}`, + 'resume-mesh': 'true', + }, + { 'session-id': entry.session_id } + ); + + this.dlq.markRecovered(entry.id); + log.info('reliability', 'DLQ entry recovered via session resume', { + id: entry.id, + agent: entry.agent_id, + sessionId: entry.session_id.slice(0, 8), + }); + log.activity('reliability:recovered', entry.agent_id, + `Session resume (sid:${entry.session_id.slice(0, 8)})`); + + return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + + } else if (entry.recovery_mode === 'requeue') { + // Re-inject the original message + let payload: Record; + try { + payload = JSON.parse(entry.payload); + } catch { + payload = { body: entry.payload }; } - } catch (err) { - this.dlq.escalateToManual(entry.id, (err as Error).message); - return { id: entry.id, success: false, mode: 'session_resume', error: (err as Error).message }; - } - } else if (entry.recovery_mode === 'requeue') { - // Re-inject message into the queue - const result = requeueHandler(entry); - if (result.success) { + + this.bindings!.requeueMessage( + entry.from_agent, + entry.to_agent, + entry.type, + { ...payload, headline: payload.headline || `DLQ requeue: ${entry.failure_reason.slice(0, 50)}` }, + ); + this.dlq.markRecovered(entry.id); log.info('reliability', 'DLQ entry recovered via requeue', { id: entry.id, agent: entry.agent_id, + from: entry.from_agent, + to: entry.to_agent, }); + log.activity('reliability:recovered', entry.agent_id, 'Message requeued'); + return { id: entry.id, success: true, mode: 'requeue' }; + } else { - this.dlq.escalateToManual(entry.id, result.error || 'Requeue failed'); - return { id: entry.id, success: false, mode: 'requeue', error: result.error }; + return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; } - } else { - return { id: entry.id, success: false, mode: 'manual', error: 'Requires manual intervention' }; + } catch (err) { + const msg = (err as 
Error).message; + this.dlq.escalateToManual(entry.id, msg); + log.error('reliability', 'Recovery failed', { id: entry.id, error: msg }); + return { id: entry.id, success: false, mode: entry.recovery_mode, error: msg }; } } diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index bc8bf5ea..5c3820c2 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -1190,8 +1190,35 @@ export class WorkerDispatcher extends EventEmitter { // Initialize reliability manager (circuit breakers, heartbeat, SLI, DLQ, safe-mode) this.reliability = new ReliabilityManager(this.queue.getDb(), this.config.workDir); + this.reliability.bindDispatcher({ + killAgent: (agentId: string, reason: string) => { + return this.workerLifecycle.killForAgent(agentId, reason); + }, + requeueMessage: (from: string, to: string, type: string, payload: Record, extraFrontmatter?: Record) => { + this.systemWriter.write({ + from, + to, + type, + headline: (payload.headline as string) || 'DLQ recovery', + body: (payload.body as string) || '', + extraFrontmatter: { ...extraFrontmatter, ...Object.fromEntries( + Object.entries(payload).filter(([k]) => !['headline', 'body'].includes(k)).map(([k, v]) => [k, String(v)]) + )}, + }); + }, + }); this.reliability.start(); + // Recover any pending DLQ entries from previous crash + const dlqRecovery = this.reliability.recoverAll(); + if (dlqRecovery.length > 0) { + log.info('dispatcher', 'DLQ startup recovery', { + attempted: dlqRecovery.length, + succeeded: dlqRecovery.filter(r => r.success).length, + failed: dlqRecovery.filter(r => !r.success).length, + }); + } + // Subscribe to consumer events for event-driven dispatch if (consumer) { this.boundMessageHandler = (event: { agentId: string }) => { @@ -1558,6 +1585,24 @@ export class WorkerDispatcher extends EventEmitter { } } + // DLQ RECOVERY: recover front-matter triggers DLQ recovery for this mesh + // Core agent or CLI can send: `recover: true` to trigger auto-recovery + if 
(pendingMessage?.payload?.['recover'] === true || pendingMessage?.payload?.['recover'] === 'true') { + if (this.reliability) { + const results = this.reliability.recoverForMesh(meshName); + const succeeded = results.filter(r => r.success).length; + log.info('dispatcher', 'DLQ recovery triggered by front-matter', { + meshName, attempted: results.length, succeeded, + }); + + // Consume the recover message — its purpose is fulfilled + this.queue.pollOne(agentId); + + // If entries were recovered, they'll flow through as new messages + if (results.length > 0) return; + } + } + // NEW MESH RUN: Clear stale state when task arrives at entry point // This handles crashed/abandoned meshes that never sent task-complete to core const entryPoint = meshConfig.entry_point || 'worker'; @@ -3679,6 +3724,18 @@ Please advise the agent or check mesh configuration.`; }); } + // Safe mode gate: block tools based on current safe mode level + if (this.reliability) { + const safeModeHook = this.reliability.createSafeModeHook(meshName!, agentId); + if (safeModeHook) { + preToolUseHooks.push(safeModeHook); + log.info('safe-mode', 'Safe mode hook enabled', { + agentId, + level: this.reliability.safeMode.getLevel(meshName!), + }); + } + } + // Orchestrator gate: restrict Write to msgs dir only if (agent.orchestrator) { const msgsDir = this.config.msgsDir; From f1d1334e3bf36ed42441184816bf58391bfa1e8a Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 10 Mar 2026 00:12:39 +0000 Subject: [PATCH 06/12] feat(reliability): Checkpoint log + rewind-to recovery MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adds the ability to rewind recovery to any FSM state checkpoint, not just the crash point. Core agent can now say "rewind-to: build" to skip failed work and resume from a known-good state. Checkpoint log (SQLite): Saves session IDs at every FSM state transition in the dispatcher's onWorkerComplete handler. Key: mesh_name + state_name → session_id. 
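The mesh_name + state_name → session_id key scheme above can be sketched with a plain in-memory structure before reaching for SQLite. This is an illustrative stand-in only (CheckpointEntry and InMemoryCheckpointLog are invented names, not part of the patch); the real CheckpointLog persists rows and orders by created_at:

```typescript
// Hypothetical in-memory stand-in for the SQLite-backed checkpoint log.
// Key scheme: mesh_name + state_name -> session_id, most recent entry wins.
interface CheckpointEntry {
  meshName: string;
  stateName: string;
  sessionId: string;
  savedAt: number; // monotonic counter standing in for created_at
}

class InMemoryCheckpointLog {
  private entries: CheckpointEntry[] = [];
  private tick = 0;

  // Called at every FSM state transition.
  save(meshName: string, stateName: string, sessionId: string): void {
    this.entries.push({ meshName, stateName, sessionId, savedAt: this.tick++ });
  }

  // What `rewind-to: <state>` resolves against: newest checkpoint for mesh+state.
  lookup(meshName: string, stateName: string): CheckpointEntry | undefined {
    return this.entries
      .filter(c => c.meshName === meshName && c.stateName === stateName)
      .sort((a, b) => b.savedAt - a.savedAt)[0];
  }
}

const cpLog = new InMemoryCheckpointLog();
cpLog.save('my-mesh', 'build', 'sess-old');
cpLog.save('my-mesh', 'build', 'sess-new'); // later pass through the same state
console.log(cpLog.lookup('my-mesh', 'build')?.sessionId); // prints "sess-new"
```

Keeping every row instead of upserting is the same design choice the patch makes: history is retained for GC and audit, and recency is resolved at lookup time.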
Lookup, list, GC, and clear operations.

rewind-to front-matter: recover: true + rewind-to: <state> on a message
looks up the checkpoint for that state and uses its session ID instead
of the DLQ entry's crash-point session.

Three trigger paths:
1. CLI: tx mesh recover <mesh> --rewind-to=build
2. Message: recover: true + rewind-to: build front-matter
3. SIGUSR2: {"action":"dlq-recover","mesh":"x","rewindTo":"build"}

tx mesh recover <mesh> now shows available checkpoints before recovering.

Core prompt updated with Reliability & Recovery section teaching the
agent how to use recover, rewind-to, and check health.

mesh-builder skill updated with reliability front-matter fields.
docs/reliability.md updated with checkpoint log docs.

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
---
 .claude/skills/mesh-builder/SKILL.md   |   6 +
 docs/reliability.md                    |  60 +++++++++-
 src/cli/mesh.ts                        |  32 ++++-
 src/cli/start.ts                       |   2 +-
 src/prompt/core.ts                     |  59 +++++++++
 src/reliability/checkpoint-log.ts      | 160 +++++++++++++++++++++++++
 src/reliability/index.ts               |   1 +
 src/reliability/reliability-manager.ts |  60 ++++++++--
 src/worker/dispatcher.ts               |  23 +++-
 9 files changed, 386 insertions(+), 17 deletions(-)
 create mode 100644 src/reliability/checkpoint-log.ts

diff --git a/.claude/skills/mesh-builder/SKILL.md b/.claude/skills/mesh-builder/SKILL.md
index 8f08ebcf..313dd902 100644
--- a/.claude/skills/mesh-builder/SKILL.md
+++ b/.claude/skills/mesh-builder/SKILL.md
@@ -167,6 +167,12 @@ agents:

 **Propagation:** Upstream agents must include the key in their completion message frontmatter for downstream agents to receive it. The consumer maps frontmatter fields to `payload` automatically.
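Once the consumer has mapped front-matter into `payload`, flag values may arrive as strings (`recover: "true"`) rather than booleans, as the dispatcher's own `=== true || === 'true'` check suggests. A minimal sketch of normalizing the recovery fields (readRecoveryDirectives is an invented helper for illustration, not the dispatcher's actual parser):

```typescript
// Invented helper: normalize string-valued reliability front-matter
// fields from a message payload into typed recovery directives.
interface RecoveryDirectives {
  recover: boolean;
  rewindTo?: string;
  sessionId?: string;
  resumeMesh: boolean;
}

function readRecoveryDirectives(payload: Record<string, unknown>): RecoveryDirectives {
  // Front-matter booleans can arrive as true or the string 'true'.
  const truthy = (v: unknown) => v === true || v === 'true';
  return {
    recover: truthy(payload['recover']),
    rewindTo: typeof payload['rewind-to'] === 'string' ? payload['rewind-to'] : undefined,
    sessionId: typeof payload['session-id'] === 'string' ? payload['session-id'] : undefined,
    resumeMesh: truthy(payload['resume-mesh']),
  };
}

const d = readRecoveryDirectives({ recover: 'true', 'rewind-to': 'build' });
console.log(d.recover, d.rewindTo); // true build
```

Centralizing the string-vs-boolean check in one place avoids repeating the `=== true || === 'true'` pattern at every call site.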
+**Reliability front-matter fields** (used by core agent for recovery, not in mesh configs): +- `recover: true` — triggers DLQ recovery for the target mesh +- `rewind-to: ` — override recovery session with checkpoint from named FSM state +- `session-id: ` — resume a specific SDK session +- `resume-mesh: true` — preserve mesh state instead of clearing on new entry + ``` User message: feature: auth → prebuild gets "/know:prebuild auth" Prebuild msg: feature: auth → builder gets "/know:build auth" diff --git a/docs/reliability.md b/docs/reliability.md index 9e62dfd0..94f082c1 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -123,6 +123,61 @@ tx mesh dlq --json # Machine-readable output tx mesh dlq clear # GC recovered entries ``` +### Checkpoint Log & Rewind-To + +**What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. + +**How checkpoints are saved**: +- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Checkpoint key: `mesh_name + state_name` → `session_id` +- Multiple checkpoints per state are kept (most recent wins on lookup) + +**How rewind-to works**: + +When recovering from the DLQ, you can specify `rewind-to: ` to use a checkpoint's session ID instead of the crash-point session. This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. + +``` +FSM: analyze → build → verify → complete + ↑ ✗ (crashed here) + └── rewind-to: build (resumes from here) +``` + +**Three ways to trigger rewind-to**: + +1. **CLI**: + ```bash + tx mesh recover my-mesh --rewind-to=build + ``` + +2. **Front-matter message** (core agent): + ```markdown + --- + to: my-mesh/worker + from: core/core + recover: true + rewind-to: build + --- + The verify step went wrong. Rewind to after build completed. + ``` + +3. 
**SIGUSR2 control signal** (programmatic): + ```json + {"action": "dlq-recover", "mesh": "my-mesh", "rewindTo": "build"} + ``` + +**Viewing available checkpoints**: +```bash +tx mesh recover my-mesh # Lists checkpoints before recovering +``` +Output: +``` +Available checkpoints (use --rewind-to=): + analyze sid:a1b2c3d4 agent:my-mesh/analyst 2026-03-10 14:30:00 + build sid:e5f6g7h8 agent:my-mesh/builder 2026-03-10 14:31:15 +``` + +**When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). + ### 4. SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -200,6 +255,7 @@ Agents can interact with reliability features via message front-matter: | Field | Value | Effect | |-------|-------|--------| | `recover` | `true` | Triggers DLQ recovery for the target mesh | +| `rewind-to` | FSM state name | Override recovery session with checkpoint from this state | | `session-id` | SDK session ID | Spawns worker resuming that session | | `resume-mesh` | `true` | Preserves mesh state instead of clearing on entry | @@ -211,7 +267,8 @@ Agents can interact with reliability features via message front-matter: | `tx mesh health --json` | Machine-readable health output | | `tx mesh dlq [mesh]` | List dead letter queue entries | | `tx mesh dlq clear` | Clear recovered DLQ entries | -| `tx mesh recover ` | Trigger DLQ recovery via running dispatcher | +| `tx mesh recover ` | Trigger DLQ recovery (shows checkpoints first) | +| `tx mesh recover --rewind-to=` | Recover rewinding to a specific FSM state | | `tx mesh recover --all` | Recover all pending DLQ entries | ## Architecture @@ -224,6 +281,7 @@ Agents can interact with reliability features via message front-matter: │ ├─ Circuit Breaker │ ← SQLite persisted │ ├─ Heartbeat Monitor│ ← kills via bindings │ ├─ Dead Letter Queue│ ← SQLite persisted + │ ├─ Checkpoint Log │ ← SQLite, rewind-to │ └─ Safe Mode │ ← PreToolUse 
hook │ │ │ bindDispatcher({ │ diff --git a/src/cli/mesh.ts b/src/cli/mesh.ts index 93360739..8517c030 100644 --- a/src/cli/mesh.ts +++ b/src/cli/mesh.ts @@ -52,6 +52,7 @@ interface MeshFlags { next?: boolean; all?: boolean; verbose?: boolean; + rewindTo?: string; } /** @@ -73,6 +74,14 @@ function parseFlags(args: string[]): MeshFlags { flags.all = true; } else if (arg === '--verbose') { flags.verbose = true; + } else if (arg.startsWith('--rewind-to=')) { + flags.rewindTo = arg.split('=')[1]; + } else if (arg === '--rewind-to') { + // Next arg will be picked up as a positional, but we handle it here + const idx = args.indexOf(arg); + if (idx < args.length - 1 && !args[idx + 1].startsWith('-')) { + flags.rewindTo = args[idx + 1]; + } } } @@ -3673,16 +3682,32 @@ async function meshRecover(meshName: string | undefined, flags: MeshFlags): Prom return; } + // Show available checkpoints for rewind-to + if (meshName) { + const { CheckpointLog } = await import('../reliability/checkpoint-log.ts'); + const checkpointLog = new CheckpointLog(queue.getDb()); + const checkpoints = checkpointLog.latestPerState(meshName); + if (checkpoints.length > 0) { + console.log(`\n${chalk.bold('Available checkpoints')} (use --rewind-to=):`); + for (const cp of checkpoints) { + console.log(` ${chalk.cyan(cp.state_name.padEnd(20))} ${chalk.dim('sid:')}${cp.session_id.slice(0, 8)} ${chalk.dim('agent:')}${cp.agent_id} ${chalk.dim(cp.created_at)}`); + } + } + } + const resumable = entries.filter(e => e.recovery_mode === 'session_resume'); const requeueable = entries.filter(e => e.recovery_mode === 'requeue'); - console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue`); + const rewindNote = flags.rewindTo ? 
chalk.magenta(` (rewind-to: ${flags.rewindTo})`) : ''; + console.log(`\nRecovering ${entries.length} entries: ${chalk.cyan(String(resumable.length))} session_resume, ${chalk.yellow(String(requeueable.length))} requeue${rewindNote}`); // Try SIGUSR2 to running dispatcher if (fs.existsSync(pidFile)) { const pid = parseInt(fs.readFileSync(pidFile, 'utf-8').trim(), 10); if (!isNaN(pid)) { const target = meshName && !flags.all ? meshName : '_all'; - fs.writeFileSync(controlFile, JSON.stringify({ action: 'dlq-recover', mesh: target })); + const ctrl: Record = { action: 'dlq-recover', mesh: target }; + if (flags.rewindTo) ctrl.rewindTo = flags.rewindTo; + fs.writeFileSync(controlFile, JSON.stringify(ctrl)); try { process.kill(pid, 'SIGUSR2'); @@ -3724,7 +3749,8 @@ async function meshRecover(meshName: string | undefined, flags: MeshFlags): Prom const timestamp = Date.now(); const filename = `${timestamp}-task-system-dlq-recovery--${meshName}-${entryPoint}-recover.md`; - const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; + const rewindLine = flags.rewindTo ? `\nrewind-to: ${flags.rewindTo}` : ''; + const content = `---\nto: ${meshName}/${entryPoint}\nfrom: system/dlq-recovery\ntype: task\nheadline: DLQ recovery\nrecover: true${rewindLine}\ntimestamp: ${new Date(timestamp).toISOString()}\n---\n\nRecover failed work from dead letter queue.\n`; fs.writeFileSync(path.join(msgsDir, filename), content); console.log(chalk.cyan(`Recovery message written. 
Will be processed on next tx start.`)); } else { diff --git a/src/cli/start.ts b/src/cli/start.ts index 2dda9920..3883a416 100644 --- a/src/cli/start.ts +++ b/src/cli/start.ts @@ -275,7 +275,7 @@ export async function start(workDir?: string, options?: StartOptions): Promise r.success).length; log.info('start', 'SIGUSR2: DLQ recovery', { mesh: ctrl.mesh, attempted: results.length, succeeded, diff --git a/src/prompt/core.ts b/src/prompt/core.ts index 44f390fb..d3ea5d9a 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -237,6 +237,65 @@ tx mesh status narrative-engine # Find the msg-id tx mesh resolve ask-123 "Approved, continue with the plan" \`\`\` +## Reliability & Recovery + +When mesh work fails, the system captures failures in a Dead Letter Queue (DLQ) with session context. You can recover failed work and rewind to specific checkpoints. + +**Check health:** +\`\`\`bash +tx mesh health # SLI, circuits, safe mode, DLQ summary +tx mesh health # Per-agent stats +tx mesh dlq # List failed entries with recovery modes +\`\`\` + +**Recover failed work (CLI):** +\`\`\`bash +tx mesh recover # Resume from crash point +tx mesh recover --rewind-to=build # Rewind to state checkpoint +\`\`\` + +**Recover via message** (when CLI isn't suitable or you want to trigger from a response): + +Simple recovery — resume from crash point: +\`\`\`markdown +--- +to: / +from: core/core +recover: true +msg-id: recover-${timestampMs} +headline: Recover failed work +timestamp: ${timestamp} +--- + +Recover failed work from the dead letter queue. +\`\`\` + +Rewind recovery — go back to a known-good state: +\`\`\`markdown +--- +to: / +from: core/core +recover: true +rewind-to: build +msg-id: recover-${timestampMs} +headline: Rewind to build checkpoint +timestamp: ${timestamp} +--- + +The verify step went wrong. Rewind to after build completed and retry. 
+\`\`\` + +**How rewind-to works:** +- Every FSM state transition saves a checkpoint (state name → session ID) +- \`rewind-to: build\` finds the session active when \`build\` completed +- Recovery resumes that exact session — full conversation history preserved +- The agent picks up where it left off, skipping the failed work + +**When to use:** +- User says "go back to step X" or "that went wrong" +- A later state failed but an earlier state was good +- \`tx mesh recover \` shows available checkpoints with state names + ## Message Directory: ${msgsDir}/ ## How to Start Work diff --git a/src/reliability/checkpoint-log.ts b/src/reliability/checkpoint-log.ts new file mode 100644 index 00000000..f3b80d3e --- /dev/null +++ b/src/reliability/checkpoint-log.ts @@ -0,0 +1,160 @@ +/** + * CheckpointLog - Persisted session state at FSM boundaries + * + * Saves session IDs at every FSM state transition so recovery can + * rewind to any named state, not just the crash point. + * + * Core agent uses `rewind-to: ` front-matter to specify which + * checkpoint to recover from. The system looks up the most recent + * session ID for that mesh+state. 
+ */ + +import type Database from 'better-sqlite3'; +import { log } from '../shared/logger.ts'; + +export interface Checkpoint { + id: number; + mesh_name: string; + state_name: string; + agent_id: string; + session_id: string; + from_state: string; + context_snapshot: string; // JSON: FSM context at transition time + created_at: string; +} + +export class CheckpointLog { + private db: Database.Database; + + constructor(db: Database.Database) { + this.db = db; + this.ensureSchema(); + } + + private ensureSchema(): void { + this.db.exec(` + CREATE TABLE IF NOT EXISTS checkpoint_log ( + id INTEGER PRIMARY KEY AUTOINCREMENT, + mesh_name TEXT NOT NULL, + state_name TEXT NOT NULL, + agent_id TEXT NOT NULL, + session_id TEXT NOT NULL, + from_state TEXT NOT NULL, + context_snapshot TEXT DEFAULT '{}', + created_at TEXT NOT NULL DEFAULT (datetime('now')) + ); + CREATE INDEX IF NOT EXISTS idx_checkpoint_mesh_state + ON checkpoint_log(mesh_name, state_name, created_at DESC); + `); + } + + /** + * Save a checkpoint at an FSM state transition. + * Called by the dispatcher when fsm:transition fires. + */ + save(opts: { + meshName: string; + stateName: string; + agentId: string; + sessionId: string; + fromState: string; + context?: Record; + }): void { + this.db.prepare(` + INSERT INTO checkpoint_log (mesh_name, state_name, agent_id, session_id, from_state, context_snapshot, created_at) + VALUES (?, ?, ?, ?, ?, ?, datetime('now')) + `).run( + opts.meshName, + opts.stateName, + opts.agentId, + opts.sessionId, + opts.fromState, + JSON.stringify(opts.context || {}), + ); + + log.debug('checkpoint', 'Saved', { + mesh: opts.meshName, + state: opts.stateName, + agent: opts.agentId, + session: opts.sessionId.slice(0, 8), + }); + } + + /** + * Look up the most recent checkpoint for a mesh+state. + * This is what `rewind-to: ` resolves against. 
+ */ + lookup(meshName: string, stateName: string): Checkpoint | null { + return this.db.prepare(` + SELECT * FROM checkpoint_log + WHERE mesh_name = ? AND state_name = ? + ORDER BY created_at DESC + LIMIT 1 + `).get(meshName, stateName) as Checkpoint | null; + } + + /** + * Get all checkpoints for a mesh (most recent first). + * Used by `tx mesh health` and core agent to see available rewind points. + */ + listForMesh(meshName: string): Checkpoint[] { + return this.db.prepare(` + SELECT * FROM checkpoint_log + WHERE mesh_name = ? + ORDER BY created_at DESC + LIMIT 20 + `).all(meshName) as Checkpoint[]; + } + + /** + * Get the latest checkpoint per state for a mesh. + * Compact view: one row per state, most recent only. + */ + latestPerState(meshName: string): Checkpoint[] { + return this.db.prepare(` + SELECT c.* FROM checkpoint_log c + INNER JOIN ( + SELECT mesh_name, state_name, MAX(created_at) as max_created + FROM checkpoint_log + WHERE mesh_name = ? + GROUP BY mesh_name, state_name + ) latest ON c.mesh_name = latest.mesh_name + AND c.state_name = latest.state_name + AND c.created_at = latest.max_created + ORDER BY c.created_at DESC + `).all(meshName) as Checkpoint[]; + } + + /** + * Clear checkpoints for a mesh (on mesh completion or clear). + */ + clearForMesh(meshName: string): number { + const result = this.db.prepare( + `DELETE FROM checkpoint_log WHERE mesh_name = ?` + ).run(meshName); + return result.changes; + } + + /** + * GC old checkpoints (keep last N per mesh). + */ + gc(keepPerMesh = 50): number { + // Find meshes with more than `keepPerMesh` entries + const meshes = this.db.prepare(` + SELECT mesh_name, COUNT(*) as cnt FROM checkpoint_log + GROUP BY mesh_name HAVING cnt > ? + `).all(keepPerMesh) as Array<{ mesh_name: string; cnt: number }>; + + let total = 0; + for (const { mesh_name } of meshes) { + const result = this.db.prepare(` + DELETE FROM checkpoint_log WHERE mesh_name = ? 
AND id NOT IN ( + SELECT id FROM checkpoint_log WHERE mesh_name = ? + ORDER BY created_at DESC LIMIT ? + ) + `).run(mesh_name, mesh_name, keepPerMesh); + total += result.changes; + } + return total; + } +} diff --git a/src/reliability/index.ts b/src/reliability/index.ts index ee6861fb..7cb9b693 100644 --- a/src/reliability/index.ts +++ b/src/reliability/index.ts @@ -18,3 +18,4 @@ export { CircuitBreaker, type CircuitBreakerConfig, type CircuitBreakerState } f export { HeartbeatMonitor, type HeartbeatConfig, type AgentHealth } from './heartbeat-monitor.ts'; export { SLITracker, type SLIConfig, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; export { SafeMode, type SafeModeConfig, type SafeModeState } from './safe-mode.ts'; +export { CheckpointLog, type Checkpoint } from './checkpoint-log.ts'; diff --git a/src/reliability/reliability-manager.ts b/src/reliability/reliability-manager.ts index 4537b6fe..508f6993 100644 --- a/src/reliability/reliability-manager.ts +++ b/src/reliability/reliability-manager.ts @@ -24,6 +24,7 @@ import { CircuitBreaker, type CircuitBreakerState } from './circuit-breaker.ts'; import { HeartbeatMonitor, type AgentHealth } from './heartbeat-monitor.ts'; import { SLITracker, type SLISnapshot, type FailureCategory } from './sli-tracker.ts'; import { SafeMode, type SafeModeLevel, type SafeModeState } from './safe-mode.ts'; +import { CheckpointLog, type Checkpoint } from './checkpoint-log.ts'; import { log } from '../shared/logger.ts'; import fs from 'node:fs'; import path from 'node:path'; @@ -94,6 +95,7 @@ export class ReliabilityManager { readonly heartbeat: HeartbeatMonitor; readonly sli: SLITracker; readonly safeMode: SafeMode; + readonly checkpoints: CheckpointLog; private workDir: string; private bindings?: DispatcherBindings; @@ -109,6 +111,7 @@ export class ReliabilityManager { this.heartbeat = new HeartbeatMonitor(merged.heartbeat); this.sli = new SLITracker(merged.sli); this.safeMode = new SafeMode(merged.safeMode); + 
this.checkpoints = new CheckpointLog(db); log.info('reliability', 'ReliabilityManager initialized', { dlqMaxRetries: merged.dlq?.maxRetries || 3, @@ -364,16 +367,38 @@ export class ReliabilityManager { /** * Recover DLQ entries for a specific mesh. + * If rewindTo is specified, override the session ID with the + * checkpoint for that FSM state (instead of the crash-point session). */ - recoverForMesh(meshName: string): RecoveryResult[] { + recoverForMesh(meshName: string, rewindTo?: string): RecoveryResult[] { if (!this.bindings) return []; + // If rewind-to is specified, look up the checkpoint for that state + let overrideSessionId: string | undefined; + if (rewindTo) { + const checkpoint = this.checkpoints.lookup(meshName, rewindTo); + if (checkpoint) { + overrideSessionId = checkpoint.session_id; + log.info('reliability', `Rewind-to resolved`, { + meshName, + state: rewindTo, + sessionId: checkpoint.session_id.slice(0, 8), + agent: checkpoint.agent_id, + }); + } else { + log.warn('reliability', `No checkpoint found for rewind-to state`, { + meshName, state: rewindTo, + available: this.checkpoints.latestPerState(meshName).map(c => c.state_name), + }); + } + } + const entries = this.dlq.getForMesh(meshName); const results: RecoveryResult[] = []; for (const entry of entries) { if (entry.recovery_mode === 'manual') continue; - results.push(this.recoverEntry(entry)); + results.push(this.recoverEntry(entry, overrideSessionId)); } return results; @@ -403,33 +428,46 @@ export class ReliabilityManager { * * requeue: Re-inject the original message from→to with its payload. */ - private recoverEntry(entry: DLQEntry): RecoveryResult { + /** + * @param overrideSessionId - If set (from rewind-to), use this session + * instead of the DLQ entry's crash-point session. 
+ */ + private recoverEntry(entry: DLQEntry, overrideSessionId?: string): RecoveryResult { try { - if (entry.recovery_mode === 'session_resume' && entry.session_id) { + // Use override session (from rewind-to checkpoint) or the crash-point session + const sessionId = overrideSessionId || entry.session_id; + + if ((entry.recovery_mode === 'session_resume' || overrideSessionId) && sessionId) { // Write a recovery message with session-id front-matter // The dispatcher already handles session-id: spawns worker resuming that session + const isRewind = overrideSessionId && overrideSessionId !== entry.session_id; this.bindings!.requeueMessage( 'system/dlq-recovery', entry.agent_id, 'task', { - headline: `DLQ recovery: resuming session ${entry.session_id.slice(0, 8)}`, - body: `Resuming failed work. Original failure: ${entry.failure_reason}`, + headline: isRewind + ? `DLQ recovery: rewinding to checkpoint ${sessionId.slice(0, 8)}` + : `DLQ recovery: resuming session ${sessionId.slice(0, 8)}`, + body: isRewind + ? `Rewinding past failure. Original failure: ${entry.failure_reason}` + : `Resuming failed work. Original failure: ${entry.failure_reason}`, 'resume-mesh': 'true', }, - { 'session-id': entry.session_id } + { 'session-id': sessionId } ); this.dlq.markRecovered(entry.id); - log.info('reliability', 'DLQ entry recovered via session resume', { + log.info('reliability', `DLQ entry recovered via ${isRewind ? 'rewind' : 'session resume'}`, { id: entry.id, agent: entry.agent_id, - sessionId: entry.session_id.slice(0, 8), + sessionId: sessionId.slice(0, 8), + rewind: isRewind || false, }); log.activity('reliability:recovered', entry.agent_id, - `Session resume (sid:${entry.session_id.slice(0, 8)})`); + isRewind ? 
`Rewound to checkpoint (sid:${sessionId.slice(0, 8)})` : `Session resume (sid:${sessionId.slice(0, 8)})`); - return { id: entry.id, success: true, mode: 'session_resume', sessionId: entry.session_id }; + return { id: entry.id, success: true, mode: 'session_resume', sessionId }; } else if (entry.recovery_mode === 'requeue') { // Re-inject the original message diff --git a/src/worker/dispatcher.ts b/src/worker/dispatcher.ts index 5c3820c2..aacc1bfa 100644 --- a/src/worker/dispatcher.ts +++ b/src/worker/dispatcher.ts @@ -1589,10 +1589,13 @@ export class WorkerDispatcher extends EventEmitter { // Core agent or CLI can send: `recover: true` to trigger auto-recovery if (pendingMessage?.payload?.['recover'] === true || pendingMessage?.payload?.['recover'] === 'true') { if (this.reliability) { - const results = this.reliability.recoverForMesh(meshName); + // rewind-to: overrides DLQ session with checkpoint session + const rewindTo = pendingMessage?.payload?.['rewind-to'] as string | undefined; + const results = this.reliability.recoverForMesh(meshName, rewindTo || undefined); const succeeded = results.filter(r => r.success).length; log.info('dispatcher', 'DLQ recovery triggered by front-matter', { meshName, attempted: results.length, succeeded, + rewindTo: rewindTo || null, }); // Consume the recover message — its purpose is fulfilled @@ -4738,6 +4741,24 @@ You are working in an isolated git worktree for feature: **${hookContext.feature } } + // Save reliability checkpoint at FSM state boundaries + // Every agent completion in an FSM mesh records the session ID + // keyed by the current FSM state — enables rewind-to recovery + if (this.reliability && data.sessionId && meshName) { + const fsm = this.meshFSMs.get(meshName); + if (fsm?.isInitialized()) { + const fsmState = fsm.getStatus().currentState; + this.reliability.checkpoints.save({ + meshName, + stateName: fsmState, + agentId, + sessionId: data.sessionId, + fromState: fsmState, + context: fsm.getContext() as Record<string, unknown>,
}); + } + } + // Emit quality pass if we had preflight and made it here without errors if (workerHookContext.qualityPreflight) { this.emit('quality:pass', { From 0398edabec8fd3943deca2eb19e1fe5988d04fcb Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 10 Mar 2026 00:16:51 +0000 Subject: [PATCH 07/12] feat(reliability): Enforce human review before recovery Core prompt now requires a 4-step workflow for recovery: 1. Diagnose (tx mesh health + tx mesh dlq) 2. Present options to user (resume vs rewind vs drop + checkpoints) 3. Get explicit confirmation 4. Execute chosen recovery Never triggers recovery silently. The user sees what failed, why, and which checkpoints are available before approving any action. docs/reliability.md updated to match: human-initiated recovery is the preferred path, automatic recovery only on startup crash. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 21 +++++++----------- src/prompt/core.ts | 54 +++++++++++++++++++++++++++++++-------------- 2 files changed, 45 insertions(+), 30 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index 94f082c1..678da70b 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -98,22 +98,17 @@ tx logs --component reliability # Heartbeat kill events **How recovery works**: -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. +**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. -2. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher, triggering recovery for that mesh's DLQ entries. +1. 
**Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) -3. **Front-matter message**: An agent (or core) can write a message with `recover: true` to trigger DLQ recovery: - ```markdown - --- - to: reliability-test/planner - from: core/core - type: task - recover: true - --- - Recover failed work. - ``` +2. **Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. + +3. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. + +4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: <state>`) to trigger DLQ recovery. -4. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. **Observe it**: ```bash diff --git a/src/prompt/core.ts b/src/prompt/core.ts index d3ea5d9a..8167fb1a 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -241,22 +241,33 @@ tx mesh resolve ask-123 "Approved, continue with the plan" When mesh work fails, the system captures failures in a Dead Letter Queue (DLQ) with session context. You can recover failed work and rewind to specific checkpoints. -**Check health:** -\`\`\`bash -tx mesh health # SLI, circuits, safe mode, DLQ summary -tx mesh health <mesh> # Per-agent stats -tx mesh dlq # List failed entries with recovery modes -\`\`\` +**CRITICAL: Recovery requires human approval.** Never trigger recovery silently. 
Always diagnose, present options, and get explicit user confirmation first. -**Recover failed work (CLI):** +### Recovery Workflow (Always Follow These Steps) + +**Step 1: Diagnose** — Run these and present findings to the user: \`\`\`bash -tx mesh recover <mesh> # Resume from crash point -tx mesh recover <mesh> --rewind-to=build # Rewind to state checkpoint +tx mesh health # SLI, circuit breakers, safe mode level +tx mesh dlq # Failed entries: what failed, why, recovery mode \`\`\` -**Recover via message** (when CLI isn't suitable or you want to trigger from a response): +**Step 2: Present options** — Tell the user: +- What failed and why (failure category, reason) +- How many DLQ entries exist +- Recovery modes available (session_resume vs requeue) +- Available checkpoints if FSM mesh (state names the user can rewind to) + +Example: "The verify step failed after 3 retries (model_error). There's 1 DLQ entry with session_resume available. Checkpoints exist for: analyze, build. Options: +1. Resume from crash point (picks up where verify failed) +2. Rewind to build (redo verify from scratch with build context) +3. Rewind to analyze (start over from analysis) +4. Drop it (clear the DLQ entry)" + +**Step 3: Get confirmation** — Wait for the user to choose. Do NOT proceed without explicit approval. + +**Step 4: Execute** — Based on user choice: -Simple recovery — resume from crash point: +Resume from crash point: \`\`\`markdown --- to: <mesh>/<agent> @@ -270,7 +281,7 @@ timestamp: ${timestamp} Recover failed work from the dead letter queue. \`\`\` -Rewind recovery — go back to a known-good state: +Rewind to a checkpoint: \`\`\`markdown --- to: <mesh>/<agent> @@ -285,16 +296,25 @@ timestamp: ${timestamp} The verify step went wrong. Rewind to after build completed and retry. 
\`\`\` -**How rewind-to works:** +Drop / clear: +\`\`\`bash +tx mesh dlq clear # Clear recovered entries +tx mesh clear <mesh> # Full state reset +\`\`\` + +### How rewind-to works - Every FSM state transition saves a checkpoint (state name → session ID) - \`rewind-to: build\` finds the session active when \`build\` completed - Recovery resumes that exact session — full conversation history preserved - The agent picks up where it left off, skipping the failed work -**When to use:** -- User says "go back to step X" or "that went wrong" -- A later state failed but an earlier state was good -- \`tx mesh recover <mesh>\` shows available checkpoints with state names +### CLI equivalents (for reference) +\`\`\`bash +tx mesh recover <mesh> # Resume from crash point +tx mesh recover <mesh> --rewind-to=build # Rewind to state checkpoint +tx mesh health # Overall reliability dashboard +tx mesh dlq # All DLQ entries +\`\`\` ## Message Directory: ${msgsDir}/ ## How to Start Work From 01e5fa84539470f0907f58372169fbc5ed78434c Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:22:35 +0000 Subject: [PATCH 08/12] docs(reliability): Add human review gates for all 6 priority items Each reliability priority now has explicit human review steps: 1. Checkpoints + replay: checkpoint notification, replay approval, post-replay review 2. Metrics + tracking: threshold alerts, safe mode escalation/de-escalation approval 3. Retry-with-variation: failure notification, variation transparency, exhaustion review 4. Schema validation: failure notification, correction approval, partial pass handling 5. Agent classification: classification review, non-critical failure reporting, promotion decisions 6. Observability dashboard: anomaly alerts, trend review, cost gates, weekly digest Core principle: "The system does work. The human makes decisions." Core prompt updated with condensed human review gates checklist. 
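The rewind-to mechanism keys checkpoints by FSM state name and resolves to the most recent session ID recorded for that state. A minimal, hypothetical in-memory sketch of that lookup (the real `CheckpointLog` is SQLite-backed; all names here are illustrative, not the actual API):

```typescript
// Hypothetical in-memory sketch of the state-name → session-ID checkpoint
// map behind rewind-to. The real CheckpointLog persists rows to SQLite.
interface Checkpoint {
  stateName: string;
  sessionId: string;
  at: number;
}

class CheckpointMap {
  private byMesh = new Map<string, Checkpoint[]>();

  // Called at each FSM state boundary: record which session was active.
  save(mesh: string, stateName: string, sessionId: string): void {
    const list = this.byMesh.get(mesh) ?? [];
    list.push({ stateName, sessionId, at: Date.now() });
    this.byMesh.set(mesh, list);
  }

  // rewind-to resolution: the latest checkpoint recorded for that state.
  lookup(mesh: string, stateName: string): Checkpoint | undefined {
    const list = this.byMesh.get(mesh) ?? [];
    return [...list].reverse().find(c => c.stateName === stateName);
  }
}

const cp = new CheckpointMap();
cp.save('reliability-fsm', 'analyze', 'sess-aaa');
cp.save('reliability-fsm', 'build', 'sess-bbb');
cp.lookup('reliability-fsm', 'build')?.sessionId; // 'sess-bbb'
```

Recovery then overrides the DLQ entry's crash-point session with the looked-up session ID, so the resumed worker carries the full conversation history from that state.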
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 119 ++++++++++++++++++++++++++++++++++++++++++++ src/prompt/core.ts | 12 +++++ 2 files changed, 131 insertions(+) diff --git a/docs/reliability.md b/docs/reliability.md index 678da70b..8d126d31 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -293,3 +293,122 @@ Agents can interact with reliability features via message front-matter: │ + circuit │ │ + SLI │ │ + DLQ │ └────────────┘ └───────────┘ └───────────┘ ``` + +## Reliability Roadmap — Human Review Gates + +Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +### Priority 1: Default-On Checkpoints + Replay + +**Impact**: 10x — turns N-step recovery into 1-step problem +**Effort**: Medium + +**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. + +**Human review steps**: +1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." +2. **Replay approval**: Before any rewind-to replay, core presents: + - Which checkpoint to rewind to + - What work will be replayed (states after the checkpoint) + - What work will be discarded (failed states) +3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. + +**Never automatic**: Replay does not happen without the user choosing a checkpoint. + +--- + +### Priority 2: Reliability Metrics Table + Tracking + +**Impact**: Foundation for everything else +**Effort**: Low + +**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. + +**Human review steps**: +1. 
**Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." +2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." +3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" +4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. + +**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. + +--- + +### Priority 3: Retry-With-Variation on Routing/Protocol Failures + +**Impact**: 3-5x improvement on retry success +**Effort**: Low + +**What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. + +**Human review steps**: +1. **First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3." +2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model"). +3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step. +4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect. 
+ +**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask. + +--- + +### Priority 4: Output Schema Validation + +**Impact**: Catches semantic failures early +**Effort**: Medium + +**What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream. + +**Human review steps**: +1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars." +2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?" +3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates." +4. **Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. + +**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. + +--- + +### Priority 5: Critical / Non-Critical Agent Classification + +**Impact**: Prevents cascade from optional steps +**Effort**: Low + +**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. + +**Human review steps**: +1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." +2. 
**Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." +3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" +4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). + +**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. + +--- + +### Priority 6: Aggregate Observability Dashboard + +**Impact**: Needed to find the long-tail 0.01% +**Effort**: Medium + +**What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. + +**Human review steps**: +1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." +2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." +3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" +4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. + +**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). 
+ +--- + +### Human Review Principle + +Across all 6 priorities, the same principle applies: + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior diff --git a/src/prompt/core.ts b/src/prompt/core.ts index 8167fb1a..bfeafb84 100644 --- a/src/prompt/core.ts +++ b/src/prompt/core.ts @@ -316,6 +316,18 @@ tx mesh health # Overall reliability dashboard tx mesh dlq # All DLQ entries \`\`\` +### Human Review Gates (Apply to ALL Reliability Events) + +**Principle: The system does work. The human makes decisions.** + +- **Safe mode escalation**: Present SLI data and ask before moving to restricted/lockdown +- **Safe mode de-escalation**: Never auto-de-escalate. Present recovery metrics and ask +- **Retry exhaustion**: Present retry history (what variations were tried) and ask for next step +- **Schema validation failures**: Present what failed validation and ask: retry, accept partial, or drop +- **Non-critical agent failures**: Always report skipped outputs — never silently continue +- **Anomaly detection**: Surface spikes in failure rates, cost, or unusual patterns immediately +- **Cost gates**: Before expensive recovery (large context replay), present estimated token cost + ## Message Directory: ${msgsDir}/ ## How to Start Work From 625d2cd50ff3ba940f84242bef14050ab56327b3 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:25:49 +0000 Subject: [PATCH 09/12] docs(reliability): Add March of Nines status table with human review gates Documents all existing reliability features organized by nines level: - Nine 1 (90%): SQLite WAL, worker retries, injection retries, routing correction - Nine 2 (99%): Parity gate, FSM validation, mesh validator, identity gate, write gate - Nine 2.5: Nudge detector, deadlock breaker, stale cleaner, quality 
iteration loops - Nine 3 (99.9%): Circuit breaker, heartbeat, DLQ, SLI tracker, safe mode, checkpoints - Nine 4 (99.99%): Roadmap items with human review gates Each level includes a feature table (what/where) and explicit human review steps. https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 50 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 50 insertions(+) diff --git a/docs/reliability.md b/docs/reliability.md index 8d126d31..fa20e765 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,56 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +## March of Nines — Current Status + +| Nines | Technique | TX Status | Human Review | +|-------|-----------|-----------|--------------| +| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection retries (poll loop), routing correction injection | Retry exhaustion → present failure + options to user | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate | Validation failures → surface to user with context. Identity/routing violations → warn or kill per guardrail mode | +| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops | Nudge fires → logged. Deadlock cycles > `autoBreakDepth` → escalate to human. Quality exhaustion → present feedback history + ask user | +| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log | Circuit open → notify user. Safe mode escalation → user approval. 
DLQ recovery → diagnose/present/confirm workflow | +| **4 (99.99%)** | [Roadmap] Retry-with-variation, schema validation, agent classification, observability | Planned — see Reliability Roadmap below | Every action requires human confirmation (see roadmap gates) | + +### Nine 1 — Basic Error Handling (90%) + +Foundational durability. Nothing silently drops. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | + +**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. + +### Nine 2 — Validation & Protocol Enforcement (99%) + +Catch bad outputs and protocol violations before they propagate. 
+ +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | +| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | + +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. + +### Nine 2.5 — Self-Healing & Auto-Recovery + +Detect stuck states and recover without human intervention where safe. 
+ +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Nudge detector** | Detects when a completing agent fails to forward work to the next route step; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | + +**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. 
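The deadlock breaker's DFS cycle detection over the ask graph can be sketched as follows — a hedged, self-contained illustration, not the actual `src/queue/deadlock-detector.ts` implementation (the `findCycle` helper and the graph shape are assumptions):

```typescript
// Hypothetical sketch: each key is an asking agent, each value the agents
// it is blocked on. A DFS back-edge reveals a wait cycle (deadlock).
type AskGraph = Map<string, string[]>;

function findCycle(graph: AskGraph): string[] | null {
  const visiting = new Set<string>(); // on the current DFS path
  const done = new Set<string>();     // fully explored, cycle-free
  const path: string[] = [];

  function dfs(node: string): string[] | null {
    if (visiting.has(node)) {
      // Back-edge: the cycle is the path suffix starting at `node`.
      return path.slice(path.indexOf(node)).concat(node);
    }
    if (done.has(node)) return null;
    visiting.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = dfs(next);
      if (cycle) return cycle;
    }
    visiting.delete(node);
    done.add(node);
    path.pop();
    return null;
  }

  for (const node of graph.keys()) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}

const graph: AskGraph = new Map([
  ['planner', ['builder']],
  ['builder', ['verifier']],
  ['verifier', ['planner']],
]);
const cycle = findCycle(graph); // ['planner', 'builder', 'verifier', 'planner']
```

A cycle whose length exceeds `autoBreakDepth` would be escalated to the human with a visualization rather than auto-broken.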
+ ## Quick Start ```bash From 2ddff31d4408797ce1c7231ad75143b3f765e35f Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 05:43:25 +0000 Subject: [PATCH 10/12] docs(reliability): Add all reliability features from codebase scan Adds features found across the codebase organized by nines level: - Nine 1: graceful shutdown, usage policy recovery, recovery handler escalation - Nine 2: manifest validator, guardrail config chain - Nine 2.5: session suspend/resume, FSM persistence + backup, session store backfill - Nine 3: rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 70 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 60 insertions(+), 10 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index fa20e765..898eb34f 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -4,13 +4,13 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r ## March of Nines — Current Status -| Nines | Technique | TX Status | Human Review | -|-------|-----------|-----------|--------------| -| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection retries (poll loop), routing correction injection | Retry exhaustion → present failure + options to user | -| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate | Validation failures → surface to user with context. Identity/routing violations → warn or kill per guardrail mode | -| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops | Nudge fires → logged. Deadlock cycles > `autoBreakDepth` → escalate to human. 
Quality exhaustion → present feedback history + ask user | -| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log | Circuit open → notify user. Safe mode escalation → user approval. DLQ recovery → diagnose/present/confirm workflow | -| **4 (99.99%)** | [Roadmap] Retry-with-variation, schema validation, agent classification, observability | Planned — see Reliability Roadmap below | Every action requires human confirmation (see roadmap gates) | +| Nines | Technique | TX Status | +|-------|-----------|-----------| +| **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection poll loop, routing correction, graceful shutdown, usage policy recovery, recovery handler escalation | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, manifest validator, guardrail config chain | +| **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops, session suspend/resume, FSM state persistence + backup, session store backfill | +| **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | +| **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | ### Nine 1 — Basic Error Handling (90%) @@ -22,8 +22,11 @@ Foundational durability. Nothing silently drops. 
| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | | **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | | **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. +**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. ### Nine 2 — Validation & Protocol Enforcement (99%) @@ -36,8 +39,10 @@ Catch bad outputs and protocol violations before they propagate. 
| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | | **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | | **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. ### Nine 2.5 — Self-Healing & Auto-Recovery @@ -45,10 +50,13 @@ Detect stuck states and recover without human intervention where safe. 
| Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work to the next route step; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | | **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | | **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | | **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | +| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | **Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. 
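The deadlock breaker's cycle check can be sketched as follows. This is a simplified model, not the actual `src/queue/deadlock-detector.ts` API: the graph type, function names, and the length-based break rule are illustrative assumptions, with `autoBreakDepth` mirroring the documented default of 3.

```typescript
type AskGraph = Map<string, string[]>; // agent -> agents it is waiting on

// Classic DFS with 3-color marking: unvisited, on current path ("gray"), finished ("black").
function findCycle(graph: AskGraph): string[] | null {
  const color = new Map<string, "gray" | "black">();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    color.set(node, "gray");
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      if (color.get(next) === "gray") {
        // Back edge found: the cycle is everything on the path from `next` onward.
        return path.slice(path.indexOf(next));
      }
      if (!color.has(next)) {
        const cycle = visit(next);
        if (cycle) return cycle;
      }
    }
    path.pop();
    color.set(node, "black");
    return null;
  };

  for (const node of graph.keys()) {
    if (!color.has(node)) {
      const cycle = visit(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// Shallow cycles are auto-broken; deeper ones escalate to the human.
function resolveCycle(
  cycle: string[],
  autoBreakDepth = 3,
): "auto-break" | "escalate" {
  return cycle.length <= autoBreakDepth ? "auto-break" : "escalate";
}
```

Under this sketch, an ask chain A→B→C→A yields a cycle of length 3 and is auto-broken, while a five-agent loop would be escalated with the cycle path available for visualization.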
@@ -272,6 +280,48 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` +### 6. Rate Limiter + +**What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. + +**How it works**: +- Per-endpoint limits with configurable burst capacity +- Automatic bucket cleanup every 5 minutes +- Smooth rate limiting (not hard cutoff) + +**Source**: `src/server/rate-limiter.ts` + +### 7. Worker Pool Backpressure + +**What it does**: Adaptive polling with concurrency limits prevents queue overload. + +**How it works**: +- Polls for work at configurable intervals (default 100ms) +- Respects concurrency limits — won't spawn beyond capacity +- Graceful shutdown drains active workers before terminating + +**Source**: `src/server/worker-pool.ts` + +### 8. Metrics Aggregator + +**What it does**: Per-query metrics collection with token cost tracking. + +**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts. + +**Source**: `src/worker/metrics-aggregator.ts` + +### 9. Worker Lifecycle Tracking + +**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. + +**How it works**: +- Generates unique worker IDs (`agentId-uuid`) +- Tracks parallel execution per agent +- Persists worker state to disk +- Tracks nudge counts and completion frontier + +**Source**: `src/worker/worker-lifecycle.ts` + ## Test Mesh The `reliability-test` mesh is configured with tight thresholds for quick testing: From 6283ee3ea370e008983e53c26e8be442ee3f3852 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 06:20:24 +0000 Subject: [PATCH 11/12] docs(reliability): Add bash guard to Nine 2 reliability features Bash guard (write-gate.ts createBashHook) intercepts Bash redirects (>, >>, tee) and validates target paths against write manifest. Strike system: 1-2 errors with paths, 3+ kills worker. 
https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/reliability.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/reliability.md b/docs/reliability.md index 898eb34f..55332699 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -7,7 +7,7 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | Nines | Technique | TX Status | |-------|-----------|-----------| | **1 (90%)** | Basic error handling, retries | SQLite WAL, worker retries (3x), injection poll loop, routing correction, graceful shutdown, usage policy recovery, recovery handler escalation | -| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, manifest validator, guardrail config chain | +| **2 (99%)** | Validation, protocol enforcement | Parity gate, FSM validation, mesh validator, identity gate, write gate, bash guard, manifest validator, guardrail config chain | | **~2.5** | Self-healing / auto-recovery | Nudge detector, deadlock breaker, stale cleaner, quality iteration loops, session suspend/resume, FSM state persistence + backup, session store backfill | | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | @@ -39,10 +39,11 @@ Catch bad outputs and protocol violations before they propagate. 
| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | | **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | | **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | +| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | | **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | | **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. +**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. 
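A rough sketch of that strike ladder — illustrative only, not `createBashHook()`'s actual implementation. The helper names and the deliberately naive redirect regex are assumptions (the real hook also handles `tee` and shell quoting):

```typescript
// Extract redirect targets from a shell command (naive: `>` and `>>` only).
function redirectTargets(cmd: string): string[] {
  const targets: string[] = [];
  const re = />{1,2}\s*(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(cmd)) !== null) {
    if (m[1]) targets.push(m[1]);
  }
  return targets;
}

type Verdict = { action: "allow" | "error" | "kill"; strikes: number };

// Strike ladder from the docs: violations 1-2 return an error (allowed paths
// are shown to the agent); the 3rd kills the worker. `/dev/null` is always allowed.
function checkBashWrite(
  cmd: string,
  allowedPrefixes: string[],
  strikes: number,
): Verdict {
  const violations = redirectTargets(cmd).filter(
    (t) => t !== "/dev/null" && !allowedPrefixes.some((p) => t.startsWith(p)),
  );
  if (violations.length === 0) return { action: "allow", strikes };
  const next = strikes + 1;
  return { action: next >= 3 ? "kill" : "error", strikes: next };
}
```

For example, `checkBashWrite("echo hi > /etc/passwd", ["out/"], 2)` reaches the third strike and returns a kill verdict, while the same command at zero strikes only produces an error carrying the allowed paths.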
### Nine 2.5 — Self-Healing & Auto-Recovery From 8ee54c17babd306498e32bff98eca5c8d6ca9182 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 11 Mar 2026 06:39:18 +0000 Subject: [PATCH 12/12] docs(reliability): Consistent format across all nines + extract human review - Add summary table to Nine 3 (matching Nine 1/2/2.5 format) - Add detailed explanations for all Nine 1/2/2.5 features - Extract all human review gates to dedicated HUMAN_REVIEW.md - Restructure roadmap into table + explanations https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg --- docs/HUMAN_REVIEW.md | 147 ++++++++++++ docs/reliability.md | 560 +++++++++++++++++++++++++++---------------- 2 files changed, 497 insertions(+), 210 deletions(-) create mode 100644 docs/HUMAN_REVIEW.md diff --git a/docs/HUMAN_REVIEW.md b/docs/HUMAN_REVIEW.md new file mode 100644 index 00000000..a6b14dd9 --- /dev/null +++ b/docs/HUMAN_REVIEW.md @@ -0,0 +1,147 @@ +# Human Review Gates — Reliability + +Every reliability feature includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. + +> **The system does work. The human makes decisions.** + +- Retries within limits → automatic (but visible) +- Recovery, replay, escalation → always human-approved +- Failures → always surfaced with context and options +- No silent state changes that affect mesh behavior + +See [reliability.md](./reliability.md) for feature details. 
+ +--- + +## Nine 1 — Basic Error Handling + +### Worker Retries +- When retries exhaust → DLQ entry created → core presents failure to user +- User decides: retry with variation, recover from checkpoint, or drop + +### Injection Poll Loop +- Stale entries (>5min) are dropped but remain available via `tx inbox` +- If file-based fallback activates, user sees pending messages on next interaction + +### Routing Correction +- When routing retries exhaust → escalated to user with full attempt history +- User sees which targets were tried and picks correct one + +### Usage Policy Errors +- Human chooses: retry, skip, modify prompt, or abort +- Full diagnostic context (triggering prompt, recent history) included in ask-human message + +### Recovery Handler +- First 2 recovery requests: automatic guidance with FSM state and valid routes +- 3rd+ request in 60s: escalated to human — agent is repeatedly stuck + +--- + +## Nine 2 — Validation & Protocol Enforcement + +### Parity Gate +- Violations → reminder injected to agent +- If unresolved after reminder → surfaced to user with pending asks list + +### Identity Gate +- Kill events → logged with full reason (agent ID, expected vs actual `from:` field) +- User can audit identity violations via logs + +### Mesh Validator +- Validation errors → block mesh load, user sees what's wrong and how to fix it +- Warnings → logged but don't block (user can review in logs) + +### Manifest Validator +- Validation failures → surfaced to user with missing/invalid paths and responsible agents + +### Bash Guard +- 1-2 violations → error response with allowed paths shown to agent +- 3+ violations → worker killed, logged for forensics +- User can audit bash guard events in logs + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery + +### Nudge Detector +- Nudges are logged and visible in `tx spy` +- Max 1 nudge per agent prevents recovery loops + +### Deadlock Breaker +- Shallow cycles (depth ≤ 3) → auto-broken, logged +- Deep cycles (depth 5+) 
→ escalated to human with cycle visualization (A→B→C→A) +- User decides which agent's ask to drop + +### Stale Message Cleaner +- Stale messages archived with reason — no silent deletion +- User can audit via `tx spy` and review archived messages + +### Quality Iteration Loops +- Max iterations hit → presents feedback history to user +- User decides: retry, accept current output, or drop +- Each iteration's feedback is visible for review + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ + +### Circuit Breaker +- Circuit open → agent skipped, logged with failure count +- Half-open test spawn → user can monitor via `tx mesh health` +- Circuits don't auto-close silently — health dashboard shows state + +### Heartbeat Monitor +- Warn threshold → logged warning (no action) +- Stale threshold → logged stale warning +- Dead threshold → **worker killed**, failure recorded, routed to DLQ +- All events visible in `tx mesh health` with silence duration + +### Dead Letter Queue +- **Recovery always requires human review** (except crash recovery on restart) +- Core agent diagnoses, presents options (resume vs rewind vs drop), gets explicit confirmation +- Available checkpoints shown before any recovery action +- `tx mesh dlq` shows all pending entries with recovery mode and context + +### Checkpoint Log & Rewind-To +- Checkpoint notification: core can surface "Mesh X completed 'build' — checkpoint saved" +- Before rewind-to replay, core presents: which checkpoint, what replays, what's discarded +- Post-replay: result presented for user approval before mesh continues +- **Replay never happens without user choosing a checkpoint** + +### SLI Tracker +- Threshold alerts: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." 
+- Periodic health summary available via `tx mesh health` +- SLI data always visible — never hidden from user + +### Safe Mode +- **Escalation beyond cautious requires user confirmation** when surfaced via core +- Auto-escalation (if enabled) is logged with reason and SLI data +- **Never auto-de-escalates** — human must clear via `resetMesh()` or `resetAll()` +- Core presents: "SLI recovered to 98%. Clear restricted mode for mesh X?" + +--- + +## Roadmap — Nine 4 + +### Retry-With-Variation +- First failure: core reports "Agent X failed (model_error). Retrying with variation: [description]. Retry 1/3." +- Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context") +- Exhausted retries: core presents full retry history with variations tried — user decides next step +- New variation strategies require review before taking effect + +### Output Schema Validation +- Validation failure: core reports "Agent X output failed validation: missing required field 'summary'." +- Before retry with validation feedback, core presents: "Ask agent X to fix? Validation errors: [list]. Or drop?" +- Schema changes in mesh config: core surfaces impact on existing agents +- Partial pass: core shows what passed/failed, user decides accept partial, retry, or drop + +### Critical/Non-Critical Agent Classification +- On mesh load, core can surface classifications: "critical=[planner, builder], non-critical=[linter]" +- Non-critical failure: "Agent 'linter' failed (timeout). Mesh continues. Output from this step missing." +- Repeated non-critical failures: "Agent 'linter' failed 5 times. Promote to critical or disable?" +- Critical failures always stop the mesh and present recovery options + +### Aggregate Observability Dashboard +- Anomaly alerts: "Mesh X failure rate spiked from 2% to 15% in last hour. Category: model_error." +- Cost review gate: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" 
+- Dashboard is passive — all actions from insights go through standard human review diff --git a/docs/reliability.md b/docs/reliability.md index 55332699..4f5938a5 100644 --- a/docs/reliability.md +++ b/docs/reliability.md @@ -2,6 +2,8 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine requires fundamentally new approaches. +Human review gates for all features are documented in [HUMAN_REVIEW.md](./HUMAN_REVIEW.md). + ## March of Nines — Current Status | Nines | Technique | TX Status | @@ -12,96 +14,295 @@ TX reliability features organized by Karpathy's "March of Nines" — each nine r | **3 (99.9%)** | Monitoring, circuit breaking, DLQ | Circuit breaker, heartbeat monitor, DLQ with session resume, SLI tracker, safe mode, checkpoint log, rate limiter, worker pool backpressure, metrics aggregator, worker lifecycle tracking | | **4 (99.99%)** | [Roadmap] | Retry-with-variation, schema validation, agent classification, observability dashboard | -### Nine 1 — Basic Error Handling (90%) +--- + +## Nine 1 — Basic Error Handling (90%) Foundational durability. Nothing silently drops. 
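As one concrete instance of this layer, the recovery handler's escalation window ("3 requests in 60s", per the table below) can be sketched like this. The class and method names are illustrative assumptions, not the actual `src/core/recovery.ts` API:

```typescript
// Sliding-window escalation: FSM guidance for the first two recovery requests,
// human escalation once an agent asks three or more times inside the window.
class RecoveryTracker {
  private requests = new Map<string, number[]>(); // agentId -> request timestamps (ms)

  constructor(
    private windowMs = 60_000,
    private escalateAt = 3,
  ) {}

  handle(agentId: string, now: number): "guidance" | "escalate" {
    // Drop requests that have aged out of the window, then record this one.
    const recent = (this.requests.get(agentId) ?? []).filter(
      (t) => now - t < this.windowMs,
    );
    recent.push(now);
    this.requests.set(agentId, recent);
    return recent.length >= this.escalateAt ? "escalate" : "guidance";
  }
}
```

Outside the window the counter effectively resets: a request arriving 70 seconds after two earlier ones gets guidance again rather than escalating, matching the "resets counter outside escalation window" behavior.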
| Feature | What It Does | Where | |---------|-------------|-------| -| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` — `journal_mode=WAL` on init | -| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` — configurable via `dlq.maxRetries` | -| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` — leaves message at head of queue for next cycle | -| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` — `handleRoutingError()`, max retries per guardrail config | -| **Graceful worker pool shutdown** | Drains active workers before terminating pool, prevents orphaned workers | `src/server/worker-pool.ts` | -| **Usage policy error handling** | Detects Claude API usage policy errors, captures diagnostic context, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | -| **Recovery handler with escalation** | Tracks recovery requests per agent, provides FSM guidance on first attempt, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | +| **SQLite WAL mode** | Write-ahead logging prevents queue corruption on crash | `src/queue/index.ts` | +| **Worker retries (3x)** | Failed workers retry up to 3 times before DLQ | `src/worker/dispatcher.ts` | +| **Injection poll loop** | Core message injection retries on next poll if Claude is busy | `src/cli/start.ts` | +| **Routing correction injection** | Bad routing target → corrective prompt injected back to sender | `src/worker/dispatcher.ts` | +| **Graceful worker pool shutdown** | Drains active workers before terminating pool | `src/server/worker-pool.ts` | +| **Usage policy error handling** | Detects Claude API usage policy errors, writes ask-human message instead of crashing | `src/worker/usage-policy-error.ts` | +| **Recovery handler with escalation** | 
Tracks recovery requests per agent, escalates to human after 3 requests in 60s | `src/core/recovery.ts` | + +### SQLite WAL Mode + +**What it does**: Prevents queue corruption on crash via Write-Ahead Logging. + +**How it works**: +- Enables WAL mode (`journal_mode=WAL`) on the SQLite message queue at init +- All writes are logged to WAL file before committing to main database +- Guarantees queue state is recoverable even if process crashes mid-write +- Allows concurrent readers while writes are in flight + +### Worker Retries (3x) + +**What it does**: Auto-retries failed workers before routing to DLQ. + +**How it works**: +- Each worker has a state machine tracking retry attempts +- On error, checks `canTransition('retry')` before respawning +- Differentiates retriable errors (crashes, model overload) vs non-retriable (suspension, max-turns, abort) +- After max retries exhausted, routes to Dead Letter Queue for recovery + +### Injection Poll Loop + +**What it does**: Ensures messages reach the core Claude session even when it's busy. + +**How it works**: +- Maintains an in-memory queue of messages waiting for injection into tmux +- Polls every 2s (`INJECTION_POLL_MS`) checking if Claude is idle, then injects +- Drops stale entries pending >5 minutes (they're available via `tx inbox`) +- Falls back to file-based delivery (`pending-for-core.json`) if active injection fails + +### Routing Correction Injection + +**What it does**: Recovers from bad routing by teaching the agent valid targets. 
+ +**How it works**: +- Detects messages targeting non-existent meshes/agents, increments retry counter per sender→target pair +- Injects corrective message back to sender listing valid available targets (up to max retries) +- After max retries exceeded, escalates to human via `ask-human` message +- Supports strict mode (block immediately) and warning mode (allow + notify) per guardrail config + +### Graceful Worker Pool Shutdown + +**What it does**: Prevents orphaned workers on shutdown. + +**How it works**: +- Sets `running = false` to prevent new spawns, stops polling loop +- Collects all active worker promises and awaits completion via `Promise.all()` +- Logs count of in-flight workers being drained + +### Usage Policy Error Handling + +**What it does**: Captures false-positive usage policy errors with full diagnostic context. + +**How it works**: +- Detects usage policy errors from Claude API via pattern matching +- Captures diagnostic context: triggering prompt, recent history, in-progress tool calls, agent/mesh info +- Writes `ask-human` message to core with full context for human decision (retry, skip, modify prompt, abort) +- Preserves session ID for potential resume -**Human review**: When worker retries exhaust → DLQ entry created → core presents failure to user. When routing retries exhaust → escalated to user with full attempt history. Usage policy errors → human chooses retry/skip/modify-prompt/abort. +### Recovery Handler with Escalation -### Nine 2 — Validation & Protocol Enforcement (99%) +**What it does**: Detects repeatedly stuck agents and escalates to human. 
+ +**How it works**: +- Intercepts messages routed to `system/recovery` +- Tracks frequency per agent with time window; resets counter outside escalation window +- First 2 attempts: returns guidance with current FSM state, pending asks, and valid exit routes +- 3rd+ attempt: escalates to `core/core` for human intervention + +--- + +## Nine 2 — Validation & Protocol Enforcement (99%) Catch bad outputs and protocol violations before they propagate. | Feature | What It Does | Where | |---------|-------------|-------| -| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` — tracks `pending_asks` table | -| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` — transition guards + checkpoint persistence | -| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` — errors block load, warnings log | -| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` — blocks/warns per guardrail mode, strike system | -| **Write gate** | Controls which tools agents can use based on safe mode level | `src/worker/guardrail-config.ts` — restricted/lockdown blocks Write/Edit/Bash | -| **Bash guard** | PreToolUse hook intercepts Bash commands with redirects (`>`, `>>`, `tee`), validates target paths against allowed write manifest. 
Strike system: 1-2 violations → error with allowed paths, 3+ → kill worker | `src/worker/write-gate.ts` — `createBashHook()` | -| **Manifest validator** | Validates agent output artifacts against declared manifest paths with template variable resolution (5-pass chained substitution) | `src/worker/manifest-validator.ts` | -| **Guardrail config chain** | Unified strict/warning mode on every guardrail with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | +| **Parity gate** | Ensures completion agents answer all pending asks before completing | `src/worker/dispatcher.ts`, `src/core/consumer.ts` | +| **FSM validation** | State machine meshes enforce valid transitions, prevent skipped/repeated states | `src/state-machine/` | +| **Mesh validator** | Validates mesh config before loading (required fields, types, routing consistency) | `src/worker/mesh-validator.ts` | +| **Identity gate** | PreToolUse hook validates `from:` field matches agent identity | `src/worker/identity-gate.ts` | +| **Write gate** | Controls which paths agents can write to based on manifest | `src/worker/write-gate.ts` | +| **Bash guard** | PreToolUse hook blocks dangerous Bash patterns outside project boundary | `src/worker/bash-guard.ts` | +| **Manifest validator** | Validates agent output artifacts against declared manifest paths | `src/worker/manifest-validator.ts` | +| **Guardrail config chain** | Unified strict/warning mode with override chain: agent > mesh > global > hardcoded | `src/worker/guardrail-config.ts` | + +### Parity Gate + +**What it does**: Prevents agents from completing a mesh while unanswered questions remain. 
+ +**How it works**: +- Tracks pending asks (questions sent to human boundary `core/core`) in SQLite queue +- Validates responses from `core/core` have a matching pending ask by msg-id (fallback to agent-level matching) +- Blocks `task-complete` messages with unresolved asks; deletes offending file and emits `parity-reminder` +- Terminal-by-default: asks to `core/core` require parity; agent-to-agent asks don't trigger tracking + +### FSM Validation + +**What it does**: Enforces state machine rules before message routing. + +**How it works**: +- Type-safe state transitions with guard validation and middleware hooks (pre/post) +- Consumer calls `validateMessageWithFSM()` on all incoming messages BEFORE type-specific routing +- Centralized validation ensures all routing respects mesh-defined FSM rules +- Emits transition history and immutable state snapshots for replay/debugging + +### Mesh Validator -**Human review**: Parity gate violations → reminder injected, if unresolved → surfaced to user. Identity gate kills → logged with reason. Mesh validation errors → block load, user sees what's wrong. Manifest validation failures → surfaced to user with missing/invalid paths. Bash guard violations → logged for forensics, worker killed after 3 strikes. +**What it does**: Catches config errors before a mesh can load. -### Nine 2.5 — Self-Healing & Auto-Recovery +**How it works**: +- Static `validate()` checks mesh config structure, required fields, agent definitions, routing rules, FSM definitions, and manifest entries +- Validates field types, agent presence, entry/exit points, task distribution config, guardrail overrides, and parallelism blocks +- Returns `ValidationResult` with errors and warnings — errors block load, warnings log +- Catches typos early (e.g., agent routing to nonexistent agents) + +### Identity Gate + +**What it does**: Prevents agents from impersonating other agents. 
+ +**How it works**: +- PreToolUse hook intercepts Write tool calls to `.ai/tx/msgs/` +- Extracts `from:` field from message YAML frontmatter, compares against expected agent identity +- Enforces fully-qualified names (rejects bare `worker` when agent is `dev/worker`) to prevent cross-mesh routing leaks +- Strike counter with configurable kill threshold; strict (block) vs warning (allow + feedback) modes + +### Write Gate + +**What it does**: Restricts file writes to declared manifest paths. + +**How it works**: +- PreToolUse hooks intercept Write/Edit/NotebookEdit tools and Bash redirects (`>`, `>>`, `tee`) +- Validates target paths against agent's declared allowed paths from manifest +- Auto-exempts `.ai/tx/msgs/` and `.ai/tx/logs/`; allows `/dev/null` +- Tracks file-tool and bash-redirect strikes separately; kill threshold on accumulated violations + +### Bash Guard + +**What it does**: Docker-like isolation — full Bash inside project, can't escape. + +**How it works**: +- Two security layers: workDir boundary enforcement + catastrophic damage prevention +- Blocks all filesystem operations (read/write/symlink) outside project directory +- Blocks privilege escalation, root destruction, system service manipulation, raw disk ops +- Network access explicitly allowed (Docker parity): curl, wget, ssh, npm publish are safe + +### Manifest Validator + +**What it does**: Validates agent artifacts against declared manifest paths. + +**How it works**: +- Resolves manifest variable references (game-id, campaign-id, etc.) from `session.yaml` with caching +- Builds path context from mesh workspace config (locations, variables, source mappings) +- `validateAgentArtifacts()` checks agent reads/writes against declared manifest entries +- `findWriters()` identifies responsible agents for given file IDs (used in error messages) + +### Guardrail Config Chain + +**What it does**: Unified enforcement with flexible per-agent overrides. 
+ +**How it works**: +- Loads global guardrails from `.ai/tx/data/config.yaml` and mesh-local overrides from mesh config +- Resolution chain: agent-level > mesh-level > global agent > global mesh > global default > hardcoded default +- Each guardrail has `strict` and `warning` flags that resolve independently +- Supports backward-compatible bare numbers or structured `{strict, warning, limit}` objects + +--- + +## Nine 2.5 — Self-Healing & Auto-Recovery Detect stuck states and recover without human intervention where safe. | Feature | What It Does | Where | |---------|-------------|-------| -| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes dead output with Haiku and writes recovery task | `src/worker/nudge-detector.ts` — 15s delay, max 1 nudge/agent | -| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` — scans every 60s, `autoBreakDepth: 3` | -| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` — 30min TTL, warn/archive/delete actions | -| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` — configurable gates, max iterations | -| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery; re-buffers delivered responses on restart | `src/worker/session-manager.ts` — `restoreFromDatabase()` on startup | -| **FSM state persistence + backup** | Saves FSM state with atomic backup-before-update; can restore from latest backup on corruption | `src/mesh/fsm-persistence.ts` | -| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills existing transcripts from filesystem on startup | `src/session/session-store.ts` | +| **Nudge detector** | Detects when a completing agent fails to forward work; summarizes 
and writes recovery task | `src/worker/nudge-detector.ts` | +| **Deadlock breaker** | DFS cycle detection in ask graph; auto-breaks short cycles, escalates deep ones | `src/queue/deadlock-detector.ts` | +| **Stale message cleaner** | TTL-based GC for unprocessed queue entries (missing target, crashed worker) | `src/queue/stale-cleaner.ts` | +| **Quality iteration loops** | Quality hooks evaluate output → inject feedback → agent retries with feedback | `src/hooks/post/quality-evaluate.ts` | +| **Session suspend/resume** | Persists suspended session state to SQLite for crash recovery | `src/worker/session-manager.ts` | +| **FSM state persistence + backup** | Atomic backup-before-update; auto-restores from backup on corruption | `src/mesh/fsm-persistence.ts` | +| **Session store with backfill** | SQLite session persistence with FTS5 search; backfills from filesystem on startup | `src/session/session-store.ts` | -**Human review**: Nudges are logged and visible in `tx spy`. Deadlock cycles deeper than `autoBreakDepth` (default 3) → escalated to human with cycle visualization. Quality exhaustion (max iterations hit) → presents feedback history and asks user: retry, accept, or drop. Stale message cleanup → logged, user can audit via `tx spy`. +### Nudge Detector -## Quick Start +**What it does**: Auto-recovers from missed route transitions. 
-```bash -# View reliability dashboard -tx mesh health +**How it works**: +- Scheduled check runs after agent completion (15s delay), evaluates if routing targets received work +- Resolves expected targets using `DispatchRouter` with agent's declared routing rules (default outcome = `complete`) +- Skips terminal agents (core/core targets) and agents with already-sent messages +- Summarizes dead agent output with Haiku and writes recovery task via SystemMessageWriter +- Limits nudges per agent to prevent loops -# View per-mesh reliability -tx mesh health reliability-test +### Deadlock Breaker -# View dead letter queue -tx mesh dlq +**What it does**: Detects and breaks circular wait loops between agents. -# Recover failed work -tx mesh recover reliability-test -``` +**How it works**: +- Periodic DFS-based cycle detection in pending asks graph (~every 60s) using 3-color marking +- Builds adjacency graph from queue pending asks; identifies circular chains (A→B→C→A) +- Auto-breaks cycles up to `autoBreakDepth` (default 3) +- Escalates deeper cycles (5+) to human via SystemMessageWriter with cycle visualization -## Configuration +### Stale Message Cleaner -Set reliability thresholds in `.ai/tx/data/config.yaml`: +**What it does**: Garbage collects unprocessed messages from crashed workers or typos. 
-```yaml -reliability: - circuitBreaker: - failureThreshold: 3 # Failures before circuit opens - cooldownMs: 30000 # How long circuit stays open - heartbeat: - warnMs: 60000 # Warn after 60s silence - staleMs: 120000 # Stale after 120s - deadMs: 300000 # Kill worker after 300s silence - safeMode: - autoEscalate: true # Auto-restrict on SLI drop - cautiousThreshold: 0.95 - restrictedThreshold: 0.90 - lockdownThreshold: 0.80 - dlq: - maxRetries: 3 -``` +**How it works**: +- Periodic scanner (every 5 minutes) checks queue messages against TTL (30 minutes default) +- Archives stale messages to `stale_messages` table with reason: `ttl_expired`, `no_target_mesh`, or `manual` +- Actions configurable: `warn`, `archive`, or `delete` +- Tracks known meshes to identify messages routed to non-existent targets; preserves audit trail + +### Quality Iteration Loops + +**What it does**: Validates output quality before routing, with iterative refinement. -## Features +**How it works**: +- Post-hook runs quality stack on worker output after message reception +- Runs gates (required + suggested) on output; returns `{passed, feedback}` +- Three failure modes: `halt` (stop), `loop` (retry if under max iterations), `skip` (allow through) +- Injects feedback messages on failure for agent self-correction + +### Session Suspend/Resume -### 1. Circuit Breaker +**What it does**: Non-destructive pause for external input with crash recovery. + +**How it works**: +- Suspends sessions (kills worker, saves state to SQLite) when agent hits ask-human or await-response boundaries +- Buffers incoming responses while awaiting multiple targets (tracks `pendingResponseCount`) +- Persists to `suspended_sessions` table with reason, target agents, and hook context +- Dispatcher handles resume: loading state, creating new runner, wiring event handlers + +### FSM State Persistence + Backup + +**What it does**: Durable state across crashes with automatic corruption recovery. 
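The quality loop's three failure modes reduce to a small decision function; a sketch under assumed types (illustrative, not the real hook API):

```typescript
// Hypothetical sketch of the quality-loop decision: what happens when a gate
// evaluates worker output. Types and names are illustrative.

type FailureMode = "halt" | "loop" | "skip";

interface GateResult { passed: boolean; feedback: string }

type Decision =
  | { action: "route" }                   // gates passed → continue routing
  | { action: "retry"; feedback: string } // loop mode, iteration budget left
  | { action: "halt" }                    // halt mode, or loop budget exhausted
  | { action: "route_anyway" };           // skip mode lets the output through

function decide(
  result: GateResult,
  mode: FailureMode,
  iteration: number,
  maxIterations: number,
): Decision {
  if (result.passed) return { action: "route" };
  switch (mode) {
    case "skip": return { action: "route_anyway" };
    case "halt": return { action: "halt" };
    case "loop":
      return iteration < maxIterations
        ? { action: "retry", feedback: result.feedback } // inject feedback, re-run agent
        : { action: "halt" };                            // budget exhausted → human decides
  }
}
```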
+ +**How it works**: +- SQLite tables: `mesh_state` (current) and `mesh_state_backup` (versioned backups) +- `saveState()` creates backup of previous state before updating (atomic via transaction) +- On corruption (JSON parse error), `loadState()` auto-restores from latest backup +- Indexes on `mesh_name + created_at` for efficient backup lookup + +### Session Store with Backfill + +**What it does**: Persistent session metadata with full-text search. + +**How it works**: +- SQLite `sessions` table stores metadata: agent_id, mesh_id, timestamps, transcript path, message counts, final status +- FTS5 virtual table `sessions_fts` enables full-text search on content, headline, tags +- Prepared statements for fast CRUD; cache for summary types (e.g., `file_changes`, `decisions`) +- Backfills existing sessions from disk on startup (migration-friendly) + +--- + +## Nine 3 — Monitoring, Circuit Breaking, DLQ (99.9%) + +Active monitoring, automatic circuit-breaking, and dead letter recovery. + +| Feature | What It Does | Where | +|---------|-------------|-------| +| **Circuit breaker** | Stops spawning agents that keep failing; auto-recovers after cooldown | `src/reliability/circuit-breaker.ts` | +| **Heartbeat monitor** | Detects stuck workers via silence thresholds; kills dead workers | `src/reliability/heartbeat-monitor.ts` | +| **Dead letter queue** | Captures failed work with session context for recovery | `src/reliability/dead-letter-queue.ts` | +| **SLI tracker** | Measures success rate, failure categories, MTTR, nines level | `src/reliability/sli-tracker.ts` | +| **Safe mode** | Restricts agent capabilities when reliability drops | `src/reliability/safe-mode.ts` | +| **Checkpoint log** | Saves session IDs at FSM transitions; enables rewind-to recovery | `src/reliability/checkpoint-log.ts` | +| **Rate limiter** | Token bucket rate limiting for server endpoints | `src/server/rate-limiter.ts` | +| **Worker pool backpressure** | Adaptive polling with concurrency limits | 
`src/server/worker-pool.ts` | +| **Metrics aggregator** | Per-query metrics with token cost tracking | `src/worker/metrics-aggregator.ts` | +| **Worker lifecycle tracking** | Unique instance IDs for deduplication and debugging | `src/worker/worker-lifecycle.ts` | + +### Circuit Breaker **What it does**: Stops spawning an agent that keeps failing. Prevents cascade failures. @@ -122,7 +323,7 @@ tx mesh health # Shows open/half_open circuits tx spy # Watch for reliability:blocked activity ``` -### 2. Heartbeat Monitor +### Heartbeat Monitor **What it does**: Detects stuck workers and kills them. @@ -142,7 +343,7 @@ tx mesh health # Shows unhealthy agents with silence duration tx logs --component reliability # Heartbeat kill events ``` -### 3. Dead Letter Queue (DLQ) +### Dead Letter Queue (DLQ) **What it does**: Captures failed work with enough context to recover it. @@ -152,22 +353,16 @@ tx logs --component reliability # Heartbeat kill events - `manual`: Retries exhausted → needs human decision. **How entries are created**: -- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with the worker's sessionId, messages sent, and failure category +- Worker exhausts all retries → dispatcher calls `reliability.deadLetter()` with sessionId, messages sent, and failure category - Heartbeat kills a stuck worker → recorded as failure, may generate DLQ entry on next retry exhaustion **How recovery works**: -**Important: Recovery requires human review.** The core agent is instructed to always diagnose, present options (resume vs rewind vs drop), and get explicit user confirmation before triggering recovery. This prevents silent re-execution of bad work. - -1. **Automatic on startup**: When `tx start` runs, the dispatcher calls `recoverAll()` — recovers any pending session_resume and requeue entries from the previous run. (This is the only automatic path — it handles crash recovery between restarts.) - -2. 
**Human-initiated via core agent** (preferred): User asks core to investigate. Core runs `tx mesh health` + `tx mesh dlq`, presents findings with available checkpoints, user picks a recovery strategy, core writes the recovery message. - -3. **CLI**: `tx mesh recover <mesh>` sends a SIGUSR2 signal to the running dispatcher. Shows available checkpoints first. - +1. **Automatic on startup**: `tx start` calls `recoverAll()` — recovers pending session_resume and requeue entries from the previous run (crash recovery only). +2. **Human-initiated via core agent** (preferred): User investigates via `tx mesh health` + `tx mesh dlq`, picks recovery strategy, core writes recovery message. +3. **CLI**: `tx mesh recover <mesh>` sends SIGUSR2 to running dispatcher. Shows available checkpoints first. 4. **Front-matter message**: Core writes a message with `recover: true` (and optionally `rewind-to: <state>`) to trigger DLQ recovery. - -5. **Fallback**: If the dispatcher isn't running, `tx mesh recover` writes a recovery message to the msgs dir that will be processed on next start. +5. **Fallback**: If dispatcher isn't running, `tx mesh recover` writes a recovery message to msgs dir for next start. **Observe it**: ```bash @@ -182,13 +377,13 @@ tx mesh dlq clear # GC recovered entries **What it does**: Saves session IDs at every FSM state transition. Enables rewinding to any completed state instead of just the crash point. **How checkpoints are saved**: -- Every time an FSM mesh transitions states, the completing agent's session ID is saved to SQLite +- Every FSM mesh state transition saves the completing agent's session ID to SQLite - Checkpoint key: `mesh_name + state_name` → `session_id` -- Multiple checkpoints per state are kept (most recent wins on lookup) +- Multiple checkpoints per state kept (most recent wins on lookup) **How rewind-to works**: -When recovering from the DLQ, you can specify `rewind-to: <state>` to use a checkpoint's session ID instead of the crash-point session.
This means the recovered worker resumes from after that state completed — skipping all the bad work that happened after. +When recovering from the DLQ, specify `rewind-to: <state>` to use a checkpoint's session ID instead of the crash-point session. The recovered worker resumes from after that state completed — skipping all bad work that happened after. ``` FSM: analyze → build → verify → complete @@ -232,7 +427,7 @@ Available checkpoints (use --rewind-to=<state>): **When checkpoints are cleared**: On mesh completion (`clearMeshState`). Old checkpoints are garbage collected (keeps last 50 per mesh). -### 4. SLI Tracker +### SLI Tracker **What it does**: Measures success rate, failure categories, MTTR, and nines level. @@ -254,7 +449,7 @@ tx mesh health my-mesh # Per-agent success rates tx mesh health --json # Full snapshot ``` -### 5. Safe Mode +### Safe Mode **What it does**: Restricts agent capabilities when reliability drops. @@ -281,7 +476,7 @@ tx mesh health # Shows current safe mode level tx spy # Watch safe-mode:blocked activity events ``` -### 6. Rate Limiter +### Rate Limiter **What it does**: Token bucket rate limiting for server endpoints. Prevents burst overload. @@ -290,9 +485,7 @@ tx spy # Watch safe-mode:blocked activity events - Automatic bucket cleanup every 5 minutes - Smooth rate limiting (not hard cutoff) -**Source**: `src/server/rate-limiter.ts` - -### 7. Worker Pool Backpressure +### Worker Pool Backpressure **What it does**: Adaptive polling with concurrency limits prevents queue overload. @@ -301,19 +494,18 @@ tx spy # Watch safe-mode:blocked activity events - Respects concurrency limits — won't spawn beyond capacity - Graceful shutdown drains active workers before terminating -**Source**: `src/server/worker-pool.ts` - -### 8. Metrics Aggregator +### Metrics Aggregator **What it does**: Per-query metrics collection with token cost tracking. -**Tracks**: input/output tokens, duration, cost per query, aggregate totals for worker lifetime, tool call counts.
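The token-bucket behavior described for the rate limiter can be sketched as follows (a minimal stand-in, not the actual `src/server/rate-limiter.ts` implementation):

```typescript
// Minimal token-bucket sketch: a burst allowance that refills at a steady rate.
// Illustrative only; the real limiter lives in src/server/rate-limiter.ts.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,     // max burst size
    private refillPerSec: number, // steady-state request rate
    now: number = Date.now(),
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  // Returns true if the request is allowed, false if rate-limited.
  tryRemove(now: number = Date.now()): boolean {
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false; // smooth back-pressure, not a hard cutoff window
  }
}

const bucket = new TokenBucket(2, 1, 0); // burst of 2, then 1 request/sec
bucket.tryRemove(0);    // true
bucket.tryRemove(0);    // true
bucket.tryRemove(0);    // false — burst spent
bucket.tryRemove(1500); // true — 1.5 tokens refilled after 1.5s
```

Because tokens accrue continuously, throughput degrades smoothly under load instead of slamming shut at a window boundary.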
- -**Source**: `src/worker/metrics-aggregator.ts` +**How it works**: +- Tracks input/output tokens, duration, cost per query +- Aggregate totals for worker lifetime +- Tool call counts per worker -### 9. Worker Lifecycle Tracking +### Worker Lifecycle Tracking -**What it does**: Tracks parallel worker execution with unique instance IDs for deduplication and debugging. +**What it does**: Tracks parallel worker execution with unique instance IDs. **How it works**: - Generates unique worker IDs (`agentId-uuid`) @@ -321,27 +513,28 @@ tx spy # Watch safe-mode:blocked activity events - Persists worker state to disk - Tracks nudge counts and completion frontier -**Source**: `src/worker/worker-lifecycle.ts` - -## Test Mesh - -The `reliability-test` mesh is configured with tight thresholds for quick testing: -- Circuit breaker opens after 2 failures (not 3) -- Heartbeat kills after 120s (not 300s) -- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) - -```bash -# Run the test mesh -tx msg "Write a hello world function" --to reliability-test/planner +--- -# Monitor reliability during execution -tx mesh health reliability-test +## Configuration -# If failures occur, check DLQ -tx mesh dlq reliability-test +Set reliability thresholds in `.ai/tx/data/config.yaml`: -# Recover failed work -tx mesh recover reliability-test +```yaml +reliability: + circuitBreaker: + failureThreshold: 3 # Failures before circuit opens + cooldownMs: 30000 # How long circuit stays open + heartbeat: + warnMs: 60000 # Warn after 60s silence + staleMs: 120000 # Stale after 120s + deadMs: 300000 # Kill worker after 300s silence + safeMode: + autoEscalate: true # Auto-restrict on SLI drop + cautiousThreshold: 0.95 + restrictedThreshold: 0.90 + lockdownThreshold: 0.80 + dlq: + maxRetries: 3 ``` ## Front-Matter Options @@ -367,6 +560,27 @@ Agents can interact with reliability features via message front-matter: | `tx mesh recover <mesh> --rewind-to=<state>` | Recover rewinding to a specific FSM state | | `tx mesh
recover --all` | Recover all pending DLQ entries | +## Test Mesh + +The `reliability-test` mesh is configured with tight thresholds for quick testing: +- Circuit breaker opens after 2 failures (not 3) +- Heartbeat kills after 120s (not 300s) +- Safe mode auto-escalates at 80%/50%/25% (not 95%/90%/80%) + +```bash +# Run the test mesh +tx msg "Write a hello world function" --to reliability-test/planner + +# Monitor reliability during execution +tx mesh health reliability-test + +# If failures occur, check DLQ +tx mesh dlq reliability-test + +# Recover failed work +tx mesh recover reliability-test +``` + ## Architecture ``` @@ -395,121 +609,47 @@ Agents can interact with reliability features via message front-matter: └────────────┘ └───────────┘ └───────────┘ ``` -## Reliability Roadmap — Human Review Gates - -Every reliability improvement includes human review steps. The system **never** silently changes behavior, retries destructively, or masks failures. - -### Priority 1: Default-On Checkpoints + Replay - -**Impact**: 10x — turns N-step recovery into 1-step problem -**Effort**: Medium - -**What it does**: Every FSM state transition auto-saves a checkpoint. On failure, the user picks which checkpoint to rewind to and replay from. - -**Human review steps**: -1. **Checkpoint notification**: When a mesh completes a state transition, core can optionally surface it: "Mesh X completed 'build' — checkpoint saved." -2. **Replay approval**: Before any rewind-to replay, core presents: - - Which checkpoint to rewind to - - What work will be replayed (states after the checkpoint) - - What work will be discarded (failed states) -3. **Post-replay review**: After replay completes, core presents the result for user approval before the mesh continues to the next state. - -**Never automatic**: Replay does not happen without the user choosing a checkpoint. 
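The checkpoint scheme described earlier (`mesh_name + state_name` → `session_id`, most recent wins) can be sketched with an in-memory stand-in for the SQLite table (illustrative only; the real store is `src/reliability/checkpoint-log.ts`):

```typescript
// In-memory stand-in for the SQLite checkpoint table. Hypothetical names;
// shows only the save + "most recent wins" lookup behavior.

interface Checkpoint {
  mesh: string;
  state: string;
  sessionId: string;
  createdAt: number;
}

const checkpoints: Checkpoint[] = [];

function saveCheckpoint(mesh: string, state: string, sessionId: string, now: number): void {
  checkpoints.push({ mesh, state, sessionId, createdAt: now });
}

// rewind-to picks the newest checkpoint recorded for mesh + state.
function lookupRewind(mesh: string, state: string): string | undefined {
  return checkpoints
    .filter((c) => c.mesh === mesh && c.state === state)
    .sort((a, b) => b.createdAt - a.createdAt)[0]?.sessionId;
}

saveCheckpoint("reliability-fsm", "build", "sess-001", 1);
saveCheckpoint("reliability-fsm", "build", "sess-002", 2); // retry of the same state
// lookupRewind("reliability-fsm", "build") → "sess-002" (most recent wins)
```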
- ---- +## Roadmap — Nine 4 (99.99%) -### Priority 2: Reliability Metrics Table + Tracking +| Priority | Feature | Impact | Effort | +|----------|---------|--------|--------| +| 1 | Retry-with-variation | 3-5x retry success improvement | Low | +| 2 | Output schema validation | Catches semantic failures early | Medium | +| 3 | Critical/non-critical agent classification | Prevents cascade from optional steps | Low | +| 4 | Aggregate observability dashboard | Finds the long-tail 0.01% | Medium | -**Impact**: Foundation for everything else -**Effort**: Low - -**What it does**: SLI tracker records success rate, failure categories, MTTR, and nines level per mesh and per agent. - -**Human review steps**: -1. **Threshold alerts**: When SLI drops below a configured threshold, core surfaces it: "Mesh X reliability dropped to 94.2% (below 95% cautious threshold). 3 failures in last 10 runs. Categories: 2x model_error, 1x timeout." -2. **Safe mode escalation approval**: Before escalating safe mode (cautious → restricted → lockdown), core presents the SLI data and asks: "Restrict write access for mesh X? Current SLI: 89%." -3. **De-escalation approval**: Safe mode never auto-de-escalates. Core presents current metrics and asks: "SLI recovered to 98%. Clear restricted mode for mesh X?" -4. **Periodic health summary**: On user request (`tx mesh health`), core presents a table of all meshes with SLI, open circuits, DLQ entries, and safe mode level. - -**Never automatic**: Safe mode escalation beyond `cautious` requires user confirmation. SLI data is always visible. - ---- - -### Priority 3: Retry-With-Variation on Routing/Protocol Failures - -**Impact**: 3-5x improvement on retry success -**Effort**: Low +### Retry-With-Variation **What it does**: When a retry fires, it varies the approach — different prompt framing, model fallback, or simplified task scope — instead of repeating the identical failing request. -**Human review steps**: -1. 
**First failure notification**: On first failure, core reports: "Agent X failed (model_error). Retrying with variation: [describe variation]. Retry 1/3." -2. **Variation transparency**: Each retry logs what changed (e.g., "retry 2: simplified prompt, dropped optional context" or "retry 3: fallback model"). -3. **Retry exhaustion review**: When all retries exhaust, core presents the full retry history: "3 retries failed for agent X. Variations tried: [list]. Recommend: [recovery options]." User decides next step. -4. **Variation strategy approval**: If a new variation strategy is added to config, core surfaces it for review before it takes effect. +**How it will work**: +- First failure retries with variation: simplified prompt, dropped optional context, or fallback model +- Each retry logs what changed for transparency +- Exhausted retries present full retry history with variations tried -**Never automatic**: Retries within the configured limit are automatic (they're cheap and fast), but the user sees what's happening. Exhausted retries always stop and ask. - ---- - -### Priority 4: Output Schema Validation - -**Impact**: Catches semantic failures early -**Effort**: Medium +### Output Schema Validation **What it does**: Validates agent outputs against expected schemas (front-matter structure, required fields, output format) before passing results downstream. -**Human review steps**: -1. **Validation failure notification**: When output fails schema validation, core reports: "Agent X output failed validation: missing required field 'summary'. Output was [N] chars." -2. **Correction approval**: Before asking the agent to retry with validation feedback, core presents: "Ask agent X to fix output? Validation errors: [list]. Or drop this output?" -3. **Schema change review**: When a mesh config adds or modifies `output_schema`, core surfaces: "Mesh X now requires 'summary' field in output. Existing agents may need prompt updates." -4. 
**Partial pass handling**: When output partially validates (some fields valid, some not), core presents what passed and what failed. User decides: accept partial, retry, or drop. - -**Never automatic**: Schema validation failures are always surfaced. The system does not silently discard or re-request outputs. - ---- - -### Priority 5: Critical / Non-Critical Agent Classification +**How it will work**: +- Mesh config defines `output_schema` per agent +- Post-completion hook validates output against schema +- Partial pass handling: presents what passed and what failed for human decision -**Impact**: Prevents cascade from optional steps -**Effort**: Low +### Critical/Non-Critical Agent Classification -**What it does**: Agents are classified as `critical` (failure blocks mesh) or `non-critical` (failure is logged but mesh continues). Prevents optional agents from taking down the whole workflow. +**What it does**: Agents classified as `critical` (failure blocks mesh) or `non-critical` (failure logged, mesh continues). Prevents optional agents from taking down the whole workflow. -**Human review steps**: -1. **Classification review**: When a mesh is loaded, core can surface agent classifications: "Mesh X: critical=[planner, builder], non-critical=[linter, formatter]." -2. **Non-critical failure notification**: When a non-critical agent fails, core reports: "Non-critical agent 'linter' failed (timeout). Mesh continues. Output from this step will be missing." -3. **Promotion decision**: If a non-critical agent fails repeatedly, core asks: "Agent 'linter' has failed 5 times. Should it be promoted to critical (failures block mesh) or disabled?" -4. **Critical failure escalation**: Critical agent failures always stop the mesh and present recovery options (Priority 1 checkpoints + Priority 3 retry history). 
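The output-schema check from the bullets above might look like this (hypothetical sketch — the feature is roadmap, not shipped code):

```typescript
// Illustrative required-field validation with partial-pass reporting.
// Names and shapes are hypothetical; output_schema support is roadmap.

interface SchemaReport {
  passed: string[]; // required fields that are present
  failed: string[]; // required fields that are missing
  ok: boolean;
}

function validateOutput(
  output: Record<string, unknown>,
  requiredFields: string[],
): SchemaReport {
  const passed = requiredFields.filter((f) => output[f] !== undefined);
  const failed = requiredFields.filter((f) => output[f] === undefined);
  return { passed, failed, ok: failed.length === 0 };
}

const report = validateOutput(
  { summary: "built the parser", files: ["parser.ts"] },
  ["summary", "files", "tests"],
);
// report.ok === false, report.failed === ["tests"] → surface to the human:
// accept partial, retry with validation feedback, or drop
```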
+**How it will work**: +- Agent config adds `critical: true|false` field (default: true) +- Non-critical failures logged and surfaced but don't block mesh +- Repeated non-critical failures prompt promotion decision -**Never automatic**: Non-critical failures are always reported. The user is never surprised by missing outputs from skipped agents. - ---- - -### Priority 6: Aggregate Observability Dashboard - -**Impact**: Needed to find the long-tail 0.01% -**Effort**: Medium +### Aggregate Observability Dashboard **What it does**: Unified view across all meshes — SLI trends, failure patterns, cost tracking, and anomaly detection. -**Human review steps**: -1. **Anomaly alerts**: When the dashboard detects anomalies (sudden SLI drop, unusual failure pattern, cost spike), core surfaces: "Anomaly detected: mesh X failure rate spiked from 2% to 15% in last hour. Failure category: model_error." -2. **Trend review**: On request, core presents trend data: "Last 24h: 47 mesh runs, 98.3% success, 1 DLQ entry (recovered). Top failure: timeout (3x in mesh Y)." -3. **Cost review gate**: Before approving expensive recovery (multiple retries, large context replay), core presents estimated cost: "Recovering mesh X with rewind-to will replay ~50k tokens. Proceed?" -4. **Weekly digest**: Core can present a weekly reliability summary: nines achieved, worst-performing meshes, recurring failure patterns, DLQ utilization. - -**Never automatic**: The dashboard is passive — it collects and presents. All actions triggered by dashboard insights go through the standard human review workflow (diagnose → present → confirm → execute). - ---- - -### Human Review Principle - -Across all 6 priorities, the same principle applies: - -> **The system does work. 
The human makes decisions.** - -- Retries within limits → automatic (but visible) -- Recovery, replay, escalation → always human-approved -- Failures → always surfaced with context and options -- No silent state changes that affect mesh behavior +**How it will work**: +- Anomaly detection: sudden SLI drops, unusual failure patterns, cost spikes +- Trend data: success rates, DLQ utilization, MTTR over time +- Cost estimation before expensive recovery operations
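For reference, the "nines level" the SLI tracker reports can be derived from a success rate roughly like this (hypothetical helper; the real metric lives in `src/reliability/sli-tracker.ts`):

```typescript
// Illustrative mapping from a success rate to a "nines" level.
// Hypothetical helper, not the actual sli-tracker.ts implementation.

function ninesLevel(successes: number, total: number): number {
  if (total === 0) return 0;             // no data yet
  const failureRate = (total - successes) / total;
  if (failureRate <= 0) return Infinity; // no observed failures yet
  // 0.9 → 1 nine, 0.99 → 2, 0.999 → 3, 0.9999 → 4
  return Math.floor(-Math.log10(failureRate) + 1e-9); // epsilon absorbs float rounding
}

ninesLevel(90, 100);     // 1 — one nine (90%)
ninesLevel(999, 1000);   // 3 — three nines (99.9%)
ninesLevel(9999, 10000); // 4 — four nines, the target of this roadmap
```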