v2.1.0: Circuit breaker, metrics, event-driven architecture #2

Merged
Will-Luck merged 1 commit into main from v2.1.0-enhancements on Feb 8, 2026

@Will-Luck
Owner

Summary

  • Circuit breaker & restart policy: Exponential backoff, per-container restart budgets, and action labels (autoheal.action = restart/stop/notify/none) to prevent restart storms
  • Prometheus metrics: /metrics endpoint with 9 metric definitions: restart counters, skip counters, notification tracking, event processing histograms, unhealthy/circuit gauges
  • Event-driven architecture: Docker event stream watcher with auto-reconnect and debouncing, replacing polling for instant unhealthy detection (polling fallback preserved for tests)
  • Notification improvements: Rate limiting (configurable window per container), retry with exponential backoff (3 attempts for action events), per-service metrics
  • Testability: Extracted docker.API, clock.Clock, notify.Notifier interfaces with mock implementations. 25+ unit tests achieving 60% coverage on guardian package
  • Bug fixes: WaitGroup goroutine tracking for notifications, Warn-level error logging, empty container names guard, config validation
  • CI hardening: Coverage reporting, govulncheck, Trivy container scanning, 3 new acceptance tests (opt-out, circuit-breaker, custom-label), timeout limits, failure artifact capture
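As a rough illustration of the action labels in the first bullet, the sketch below parses autoheal.action from a container's label map. Only the label key and its four values come from this PR. The names Action and ParseAction, the fallback to restart when the label is absent, the case-insensitive matching, and the error on unknown values are all assumptions made for the example, not the guardian's confirmed behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Action is what the guardian does when a container turns unhealthy.
type Action string

// The four values listed in the PR summary.
const (
	ActionRestart Action = "restart"
	ActionStop    Action = "stop"
	ActionNotify  Action = "notify"
	ActionNone    Action = "none"
)

// actionLabel is the label key named in the PR summary.
const actionLabel = "autoheal.action"

// ParseAction reads the action label from a container's labels.
// Assumptions (not confirmed by the PR): a missing label falls back to
// restart, matching ignores case and surrounding whitespace, and any
// other value is reported as an error.
func ParseAction(labels map[string]string) (Action, error) {
	raw, ok := labels[actionLabel]
	if !ok {
		return ActionRestart, nil
	}
	switch a := Action(strings.ToLower(strings.TrimSpace(raw))); a {
	case ActionRestart, ActionStop, ActionNotify, ActionNone:
		return a, nil
	default:
		return "", fmt.Errorf("unknown %s value %q", actionLabel, raw)
	}
}

func main() {
	a, err := ParseAction(map[string]string{"autoheal.action": "notify"})
	fmt.Println(a, err)
}
```

Rejecting unknown values is one possible design choice; it keeps a typo in a label from silently turning into a restart.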

New Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| AUTOHEAL_BACKOFF_MULTIPLIER | 2 | Backoff multiplier for restart delays |
| AUTOHEAL_BACKOFF_MAX | 300 | Maximum backoff delay (seconds) |
| AUTOHEAL_BACKOFF_RESET_AFTER | 600 | Healthy duration before backoff resets |
| AUTOHEAL_RESTART_BUDGET | 5 | Max restarts per rolling window (0=unlimited) |
| AUTOHEAL_RESTART_WINDOW | 300 | Rolling window for restart budget (seconds) |
| METRICS_PORT | 0 | Prometheus metrics port (0=disabled) |
| NOTIFY_RATE_LIMIT | 60 | Min seconds between notifications per container |
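To make the backoff and budget variables above more concrete, here is a minimal sketch of how a capped exponential delay and a rolling-window restart budget might interact. It is not the guardian's actual code. The 10-second base delay is an assumption (the PR lists no such variable), the one-method Clock shape is a guess at the clock.Clock interface named in the summary, and Backoff, Budget, FakeClock and their methods are hypothetical names. AUTOHEAL_BACKOFF_RESET_AFTER is not modelled.

```go
package main

import (
	"fmt"
	"time"
)

// Clock abstracts time so the budget can be tested without sleeping.
// The PR names a clock.Clock interface; this shape is an assumption.
type Clock interface{ Now() time.Time }

// FakeClock is a hand-advanced clock for tests (hypothetical helper).
type FakeClock struct{ T time.Time }

func NewFakeClock() *FakeClock               { return &FakeClock{T: time.Unix(0, 0)} }
func (c *FakeClock) Now() time.Time          { return c.T }
func (c *FakeClock) Advance(d time.Duration) { c.T = c.T.Add(d) }

// Backoff holds the delay settings. Base is NOT in the PR's table.
type Backoff struct {
	Base       time.Duration // assumed initial delay
	Multiplier float64       // AUTOHEAL_BACKOFF_MULTIPLIER
	Max        time.Duration // AUTOHEAL_BACKOFF_MAX
}

// DefaultBackoff uses the table's defaults plus the assumed 10s base.
func DefaultBackoff() Backoff {
	return Backoff{Base: 10 * time.Second, Multiplier: 2, Max: 300 * time.Second}
}

// Delay returns Base * Multiplier^attempt (attempt is 0-based), capped at Max.
func (b Backoff) Delay(attempt int) time.Duration {
	d := float64(b.Base)
	for i := 0; i < attempt; i++ {
		d *= b.Multiplier
		if d >= float64(b.Max) {
			return b.Max // stop early so large attempts cannot overflow
		}
	}
	if d > float64(b.Max) {
		return b.Max
	}
	return time.Duration(d)
}

// Budget records restart times for one container inside a rolling window.
type Budget struct {
	Limit  int           // AUTOHEAL_RESTART_BUDGET; 0 means unlimited
	Window time.Duration // AUTOHEAL_RESTART_WINDOW
	clock  Clock
	stamps []time.Time
}

func NewBudget(limit int, window time.Duration, c Clock) *Budget {
	return &Budget{Limit: limit, Window: window, clock: c}
}

// Allow reports whether another restart fits in the window and, if it
// does, records it. Refused attempts are not recorded (an assumption).
func (b *Budget) Allow() bool {
	if b.Limit == 0 {
		return true
	}
	now := b.clock.Now()
	cutoff := now.Add(-b.Window)
	kept := b.stamps[:0]
	for _, t := range b.stamps {
		if t.After(cutoff) { // keep only restarts newer than the window
			kept = append(kept, t)
		}
	}
	b.stamps = kept
	if len(b.stamps) >= b.Limit {
		return false
	}
	b.stamps = append(b.stamps, now)
	return true
}

func main() {
	bo := DefaultBackoff()
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, bo.Delay(attempt))
	}
	budget := NewBudget(5, 300*time.Second, NewFakeClock())
	for i := 1; i <= 6; i++ {
		fmt.Printf("restart %d allowed: %v\n", i, budget.Allow())
	}
}
```

Passing the clock in, rather than calling time.Now directly, is what lets a test advance time by hand and check the window without sleeping.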

Test plan

  • go build ./cmd/guardian compiles
  • golangci-lint run ./... lint clean
  • go test -count=1 ./... all unit tests pass
  • docker build image builds successfully
  • Existing 6 acceptance test suites pass in CI
  • New 3 acceptance tests (opt-out, circuit-breaker, custom-label) pass in CI
  • govulncheck and Trivy scans run in CI

🤖 Generated with Claude Code

Major enhancements to the Go rewrite:

- Circuit breaker with exponential backoff and restart budgets to prevent
  restart storms. Per-container action labels (restart/stop/notify/none).
- Prometheus /metrics endpoint with counters, gauges, and histograms for
  restarts, skips, notifications, and event processing.
- Event-driven Docker watcher with auto-reconnect (polling fallback for
  tests). Debouncing and real-time orchestration tracking.
- Notification rate limiting and retry with exponential backoff (3 attempts).
- Testability interfaces (docker.API, clock.Clock, notify.Notifier) with
  mock implementations and 25+ unit tests (60% guardian coverage).
- Config validation, WaitGroup goroutine management, defensive guards.
- CI: coverage reporting, govulncheck, Trivy scanning, 3 new acceptance
  tests (opt-out, circuit-breaker, custom-label), failure artifact capture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Will-Luck Will-Luck merged commit b6ffd99 into main Feb 8, 2026
3 of 4 checks passed
@Will-Luck Will-Luck deleted the v2.1.0-enhancements branch February 8, 2026 20:43
Will-Luck added a commit that referenced this pull request Feb 10, 2026
v2.1.0: Circuit breaker, metrics, event-driven architecture