fix(reliability): Address 12 review issues across circuit breaker, DLQ, config merge, and CLI #3

Draft
Copilot wants to merge 2 commits into claude/improve-reliability-four-nines-K38rW from copilot/sub-pr-2


Copilot AI commented Mar 11, 2026

Addresses all items flagged in the PR review of the four-nines reliability module.

Circuit Breaker

  • half_open state now enforces a single in-flight probe via probeInFlight: boolean on CircuitState — subsequent canExecute() calls return false until the probe resolves or fails
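
The single-probe rule could look roughly like the sketch below. Only `probeInFlight`, `CircuitState`, `half_open`, and `canExecute()` come from the description above; the class name, `enterHalfOpen()`, `recordSuccess()`, and `recordFailure()` are hypothetical, and the sketch opens the breaker on a single failure to stay short.

```typescript
// Illustrative sketch only; not the module's actual implementation.
type BreakerStatus = "closed" | "open" | "half_open";

interface CircuitState {
  status: BreakerStatus;
  probeInFlight: boolean; // true while the single half_open probe is outstanding
}

class CircuitBreakerSketch {
  private state: CircuitState = { status: "closed", probeInFlight: false };

  // Hypothetical hook: called once the cooldown has elapsed.
  enterHalfOpen(): void {
    this.state = { status: "half_open", probeInFlight: false };
  }

  canExecute(): boolean {
    if (this.state.status === "closed") return true;
    if (this.state.status === "open") return false;
    // half_open: admit exactly one probe, reject every other caller.
    if (this.state.probeInFlight) return false;
    this.state.probeInFlight = true;
    return true;
  }

  // Probe (or normal call) succeeded: close the circuit and clear the flag.
  recordSuccess(): void {
    this.state = { status: "closed", probeInFlight: false };
  }

  // Simplification: any failure opens the circuit and clears the flag.
  recordFailure(): void {
    this.state = { status: "open", probeInFlight: false };
  }
}
```

Setting the flag inside `canExecute()` itself means the check and the reservation happen in one synchronous step, so two callers cannot both be admitted as the probe.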

DLQ / Recovery

  • SystemMessageWriter.write() now spreads extraFrontmatter into the SQLite queue payload, so session-id and resume-mesh are readable at nextMsg.payload where the dispatcher expects them
  • deadLetter() now accepts and forwards retryCount/maxRetries from machine.currentContext so retry-exhausted entries are correctly classified as manual (not auto-recoverable)
  • Startup recoverAll() is now gated behind WORKER_AUTO_RECOVER_DLQ=true env var; previously ran unconditionally on every dispatcher start
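
A minimal sketch of the retry classification and the startup gate, under stated assumptions: the entry shape, `classifyRecovery()`, and `onDispatcherStart()` are hypothetical names, and the rule "`retryCount >= maxRetries` means manual" is an assumed reading of "retry-exhausted". The environment is passed in as a plain object so the sketch has no runtime dependency.

```typescript
// Illustrative sketch only; types and helper names are assumptions.
type Recovery = "auto" | "manual";

interface DeadLetterEntry {
  messageId: string;
  reason: string;
  retryCount: number;
  maxRetries: number;
  recovery: Recovery;
}

// Assumed rule: an entry that has used up its retries needs a human.
function classifyRecovery(retryCount: number, maxRetries: number): Recovery {
  return retryCount >= maxRetries ? "manual" : "auto";
}

// deadLetter() receives the retry counters instead of guessing them.
function deadLetter(
  messageId: string,
  reason: string,
  retryCount: number,
  maxRetries: number,
): DeadLetterEntry {
  return {
    messageId,
    reason,
    retryCount,
    maxRetries,
    recovery: classifyRecovery(retryCount, maxRetries),
  };
}

// Stand-in for recoverAll(): returns the entries it would re-queue.
function recoverAll(entries: DeadLetterEntry[]): DeadLetterEntry[] {
  return entries.filter((entry) => entry.recovery === "auto");
}

// Startup recovery runs only when the variable is exactly "true".
function onDispatcherStart(
  entries: DeadLetterEntry[],
  env: Record<string, string | undefined>,
): DeadLetterEntry[] {
  return env["WORKER_AUTO_RECOVER_DLQ"] === "true" ? recoverAll(entries) : [];
}
```

Comparing against the literal string `"true"` keeps the gate opt-in: an unset variable, an empty string, or a value such as `"1"` all leave startup recovery disabled in this sketch.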

Config & Safe Mode

  • Reliability config merging is now deep per-section using a mergeSection helper — partial nested overrides (e.g. only circuitBreaker.cooldownMs) no longer silently drop sibling keys
  • evaluateSLI() now uses per-mesh rate (snapshot.byMesh[meshName]?.rate) instead of the global rate to prevent a noisy mesh from escalating safe mode in unrelated meshes
  • SafeMode.getState() now evaluates lastChange?.reason?.startsWith('auto:') ?? false; the previous check crashed when no level change had occurred yet
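
The per-section merge and the per-mesh lookup might be sketched as follows. `mergeSection`, `circuitBreaker.cooldownMs`, and `snapshot.byMesh[meshName]?.rate` are taken from the description; the config fields `failureThreshold` and `dlq.maxRetries`, the type names, and `mergeReliabilityConfig()` are hypothetical, as is the choice to ignore explicit `undefined` overrides.

```typescript
// Illustrative sketch only; config shape is assumed, not the module's schema.
interface CircuitBreakerConfig { failureThreshold: number; cooldownMs: number; }
interface DlqConfig { maxRetries: number; }
interface ReliabilityConfig { circuitBreaker: CircuitBreakerConfig; dlq: DlqConfig; }

// Each section may be overridden partially.
type ReliabilityOverrides = {
  [K in keyof ReliabilityConfig]?: Partial<ReliabilityConfig[K]>;
};

// Copy the defaults, then apply only the keys the override actually sets.
function mergeSection<T extends object>(defaults: T, override?: Partial<T>): T {
  const merged = { ...defaults };
  if (!override) return merged;
  for (const key of Object.keys(override) as (keyof T)[]) {
    const value = override[key];
    if (value !== undefined) merged[key] = value as T[keyof T];
  }
  return merged;
}

function mergeReliabilityConfig(
  defaults: ReliabilityConfig,
  overrides: ReliabilityOverrides = {},
): ReliabilityConfig {
  return {
    circuitBreaker: mergeSection(defaults.circuitBreaker, overrides.circuitBreaker),
    dlq: mergeSection(defaults.dlq, overrides.dlq),
  };
}

// Per-mesh lookup: a mesh with no recorded data yields undefined rather
// than falling back to the global rate.
interface MeshStats { rate: number; }
interface SLISnapshot { rate: number; byMesh: Record<string, MeshStats | undefined>; }

function meshRate(snapshot: SLISnapshot, meshName: string): number | undefined {
  return snapshot.byMesh[meshName]?.rate;
}
```

With a shallow merge such as `{ ...defaults, ...overrides }`, an override of `{ circuitBreaker: { cooldownMs: 1000 } }` would replace the whole `circuitBreaker` object and lose `failureThreshold`; merging section by section keeps the sibling key.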

CLI / Docs

  • tx mesh health now shows a warning when totalEvents === 0 rather than reporting a misleading 100% success rate; persisted circuit breaker and DLQ data are still rendered
  • parseFlags() uses an indexed loop for --rewind-to so the value arg is consumed correctly and duplicate flags can't cause mis-reads
  • Removed unused bucketMs from SLIConfig
  • Updated heartbeat-monitor header comment and guardrails.md DLQ section to match actual behavior (no nudge injection at stale; no exponential backoff in DLQ)
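
The indexed-loop parsing and the zero-event guard could be sketched like this. Only `parseFlags()`, `--rewind-to`, and `totalEvents` appear in the description; the `ParsedFlags` shape, the last-value-wins behavior for duplicates, the error on a missing value, and `formatSuccessRate()` with its message text are assumptions for illustration.

```typescript
// Illustrative sketch only; not the CLI's actual parser or output format.
interface ParsedFlags {
  rewindTo?: string;
  positional: string[];
}

function parseFlags(argv: string[]): ParsedFlags {
  const parsed: ParsedFlags = { positional: [] };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--rewind-to") {
      const value = argv[i + 1];
      // Reject a missing value or another flag in the value position.
      if (value === undefined || value.slice(0, 2) === "--") {
        throw new Error("--rewind-to requires a value");
      }
      parsed.rewindTo = value; // assumed: a later duplicate overrides an earlier one
      i++; // consume the value so it is not re-read as a positional argument
      continue;
    }
    parsed.positional.push(arg);
  }
  return parsed;
}

// With no events there is no meaningful rate, so warn instead of dividing.
function formatSuccessRate(totalEvents: number, successes: number): string {
  if (totalEvents === 0) {
    return "warning: no events recorded; success rate unavailable";
  }
  return `${((successes / totalEvents) * 100).toFixed(1)}% success`;
}
```

Because the loop advances its own index when it consumes a value, each occurrence of the flag is paired with the argument that directly follows it, which a lookup by first position (for example via `indexOf`) cannot guarantee when the flag is repeated.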


claude and others added 2 commits March 11, 2026 06:39
… review

- Add summary table to Nine 3 (matching Nine 1/2/2.5 format)
- Add detailed explanations for all Nine 1/2/2.5 features
- Extract all human review gates to dedicated HUMAN_REVIEW.md
- Restructure roadmap into table + explanations

https://claude.ai/code/session_012PTx7bZNNh74rAshCcuSQg
Copilot AI changed the title from "[WIP] Add four-nines reliability framework with circuit breakers, DLQ, and monitoring" to "fix(reliability): Address 12 review issues across circuit breaker, DLQ, config merge, and CLI" on Mar 11, 2026
@eighteyes force-pushed the claude/improve-reliability-four-nines-K38rW branch from 8ee54c1 to abb0fc8 on March 24, 2026 18:06
3 participants