Prevents restart storms when a container is fundamentally broken:
- Exponential backoff — delays between restarts increase: 10s → 20s → 40s → ... up to a configurable max
- Restart budget — maximum restarts per rolling time window (default: 5 per 300s)
- Circuit open — when budget exhausted, Guardian stops restarting and sends a CRITICAL notification
- Auto-reset — backoff resets after a container stays healthy for a configurable duration
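
The interaction between backoff and budget can be sketched in a few lines of Python. This is only an illustration of the behaviour listed above, not Guardian's actual code: the class name, method names, and the 300s maximum delay are assumptions, while the 10s base delay and the 5-per-300s budget come from the list.

```python
from collections import deque


class RestartBreaker:
    """Per-container backoff plus restart budget (illustrative sketch only)."""

    def __init__(self, base_delay=10.0, max_delay=300.0, budget=5, window=300.0):
        self.base_delay = base_delay   # first delay, doubled on each restart
        self.max_delay = max_delay     # assumed cap; the real max is configurable
        self.budget = budget           # restarts allowed per rolling window
        self.window = window           # rolling window length in seconds
        self.attempts = 0              # restarts since the last backoff reset
        self.history = deque()         # timestamps of restarts inside the window
        self.next_allowed = 0.0        # earliest time the next restart may run

    def decide(self, now):
        """Return 'restart', 'backoff', or 'circuit_open' for a request at `now`."""
        # Drop restarts that have aged out of the rolling window.
        while self.history and now - self.history[0] >= self.window:
            self.history.popleft()
        if len(self.history) >= self.budget:
            return "circuit_open"
        if now < self.next_allowed:
            return "backoff"
        return "restart"

    def record_restart(self, now):
        """Record a restart and return the delay before the next one is allowed."""
        delay = min(self.base_delay * 2 ** self.attempts, self.max_delay)
        self.attempts += 1
        self.history.append(now)
        self.next_allowed = now + delay
        return delay

    def reset(self):
        """Clear the backoff once the container has stayed healthy long enough."""
        self.attempts = 0
        self.next_allowed = 0.0
```

In this sketch `reset()` clears only the backoff; the rolling budget simply ages out with the window.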
Docker-Guardian subscribes to the Docker event stream for real-time detection:
- Reacts to `health_status: unhealthy` events within seconds (no polling delay)
- Detects container `die` events for instant orphan dependency recovery
- Tracks `create`/`destroy` events for orchestration awareness
- Resets backoff when `health_status: healthy` is received
- Auto-reconnects with exponential backoff if the event stream drops
- Falls back to polling if event stream is unavailable
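
As a rough sketch of the routing described above, the function below maps a decoded Docker event to a handler name. The handler names are hypothetical, and the reconnect delays (1s base, 60s cap) are assumed values chosen only for illustration. The `Type` and `Action` fields follow the shape of Docker's event JSON.

```python
def classify_event(event):
    """Map a decoded Docker event dict to a (hypothetical) handler name, or None."""
    if event.get("Type") != "container":
        return None
    action = event.get("Action", "")
    if action == "health_status: unhealthy":
        return "handle_unhealthy"
    if action == "health_status: healthy":
        return "reset_backoff"
    if action == "die":
        return "check_orphans"
    if action in ("create", "destroy"):
        return "record_orchestration"
    return None  # every other lifecycle event is ignored in this sketch


def reconnect_delays(base=1.0, cap=60.0):
    """Yield exponentially growing delays for re-subscribing after a stream drop."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)
```

A caller would sleep for the next value from `reconnect_delays()` after each failed reconnect and start a fresh generator once the stream is back.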
Auto-detects network dependencies via Docker API — no labels needed. On each event or poll cycle:
- Queries exited containers
- Filters to those using `--network=container:X` network mode
- Checks if exit code is 128 (killed by parent exit)
- Verifies parent is running
- Waits configurable delay (parent initialisation time)
- Starts the orphaned dependent
Multi-level dependencies (A→B→C) resolve naturally over multiple cycles.
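
The filtering steps above, and the way a chain resolves over several cycles, can be illustrated with a small pure function. The input shape (a dict keyed by container name with `status`, `exit_code`, and `network_mode` fields) is an assumption made for this sketch. A real implementation would read the equivalent fields from the Docker API, where the parent may be referenced by ID rather than by name.

```python
def find_orphans(containers):
    """Return exited dependents (exit code 128) whose network parent is running.

    `containers` maps name -> {"status", "exit_code", "network_mode"};
    this shape is hypothetical and chosen only for the example.
    """
    prefix = "container:"
    orphans = []
    for name, info in containers.items():
        mode = info.get("network_mode", "")
        if info.get("status") != "exited" or not mode.startswith(prefix):
            continue
        if info.get("exit_code") != 128:
            continue
        parent = containers.get(mode[len(prefix):])
        if parent is not None and parent.get("status") == "running":
            orphans.append(name)
    return orphans


# A -> B -> C chain: only B is startable until B itself is running again.
fleet = {
    "vpn": {"status": "running", "exit_code": 0, "network_mode": "bridge"},
    "proxy": {"status": "exited", "exit_code": 128, "network_mode": "container:vpn"},
    "app": {"status": "exited", "exit_code": 128, "network_mode": "container:proxy"},
}
first_cycle = find_orphans(fleet)
fleet["proxy"]["status"] = "running"  # pretend the first cycle started it
second_cycle = find_orphans(fleet)
```

The configurable start delay and the re-check that the parent is still running are left out here to keep the example short.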
Handles the case where a network parent restarts (not dies) and dependents lose connectivity:
- Cascade restart: when a `--network=container:X` parent restarts, Guardian automatically restarts all dependents after a configurable settle delay (default 15s). This covers planned restarts (Watchtower updates, manual restarts) where dependents don't exit but lose network.
- Network healthcheck: periodic ping check (`docker exec ... ping -c1 -W3 <target>`) on containers sharing a network namespace. If the ping fails, Guardian restarts the container. Acts as a safety net for cases where the cascade didn't fire or connectivity degraded silently.
Both features are enabled by default and require no labels.
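
A minimal sketch of both mechanisms, assuming hypothetical helper names and an injectable `run` callable so the logic can be exercised without a Docker daemon. The ping flags mirror the command shown above; the 10s subprocess timeout is an assumption.

```python
import subprocess


def dependents_of(parent, network_modes):
    """Names of containers whose network mode is container:<parent>."""
    wanted = f"container:{parent}"
    return sorted(name for name, mode in network_modes.items() if mode == wanted)


def ping_command(container, target, count=1, timeout_s=3):
    """Build the `docker exec ... ping` argument list used for the healthcheck."""
    return ["docker", "exec", container, "ping", f"-c{count}", f"-W{timeout_s}", target]


def network_ok(container, target, run=subprocess.run):
    """Return True when the ping inside `container` exits with status 0."""
    result = run(ping_command(container, target), capture_output=True, timeout=10)
    return result.returncode == 0
```

After the settle delay, a cascade restart would iterate over `dependents_of(parent, ...)`. The healthcheck would restart any container for which `network_ok` returns `False`.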
Detects active orchestration (Watchtower, manual docker-compose up, etc.) via Docker events:
- Watches for container `destroy` and `create` events within a configurable cooldown window (default 300s)
- When events are found, pauses all monitoring until the cooldown expires
- Configurable scope: skip all containers (default) or only affected ones
- Configurable events: orchestration only (default, avoids self-triggering) or all lifecycle events
Set `AUTOHEAL_WATCHTOWER_COOLDOWN=0` to disable.
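
The cooldown logic amounts to remembering the most recent `create` or `destroy` event and comparing its age with the window. The class below is a hedged sketch with hypothetical names. It models only the default "skip all containers" scope and treats a cooldown of 0 as disabled, as described above.

```python
class OrchestrationTracker:
    """Remembers the latest create/destroy event (illustrative sketch only)."""

    def __init__(self, cooldown=300.0):
        self.cooldown = cooldown      # seconds; 0 disables the feature
        self.last_activity = None     # timestamp of the latest create/destroy

    def record(self, action, now):
        """Note orchestration activity; other actions are ignored."""
        if action in ("create", "destroy"):
            self.last_activity = now

    def active(self, now):
        """True while monitoring should stay paused."""
        if self.cooldown <= 0 or self.last_activity is None:
            return False
        return now - self.last_activity < self.cooldown
```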
Prevents Docker-Guardian from interfering with backup tools like docker-volume-backup:
- Auto-detects running backup containers by image name
- Skips containers labelled with `docker-volume-backup.stop-during-backup` while backup is active
Skips recently-stopped containers to avoid fighting with:
- Manual stops for maintenance
- Other orchestration tools not covered by Watchtower awareness
Default: 300 seconds. Set to 0 to disable.
Enable with `METRICS_PORT`:

`-e METRICS_PORT=9090 -p 9090:9090`

Exposed metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `docker_guardian_restarts_total` | Counter | `container`, `result` | Restart attempts (success/failure) |
| `docker_guardian_skips_total` | Counter | `container`, `reason` | Skipped containers (orchestration/grace/backup/circuit/backoff) |
| `docker_guardian_notifications_total` | Counter | `service`, `result` | Notification delivery (success/failure per service) |
| `docker_guardian_events_processed_total` | Counter | `action` | Docker events processed by type |
| `docker_guardian_unhealthy_containers` | Gauge | - | Current unhealthy container count |
| `docker_guardian_circuit_open_containers` | Gauge | - | Containers with circuit breaker open |
| `docker_guardian_event_stream_connected` | Gauge | - | Event stream connection status (1/0) |
| `docker_guardian_restart_duration_seconds` | Histogram | `container` | Time taken for restart operations |
| `docker_guardian_event_processing_duration_seconds` | Histogram | - | Time taken to process each event |
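
For a quick check without a full Prometheus setup, an unlabelled gauge can be read straight from the scraped text. The helper below is a minimal sketch that assumes the endpoint serves the standard Prometheus text exposition format. The function name and the sample payload are made up for the example, and labelled series are deliberately not handled.

```python
def read_gauge(exposition_text, metric_name):
    """Return the value of an unlabelled metric, or None if it is absent."""
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, rest = line.partition(" ")
        if name == metric_name and rest:
            return float(rest.split()[0])
    return None


# Hypothetical scrape output, shaped like the gauges in the table above.
sample = """\
# TYPE docker_guardian_unhealthy_containers gauge
docker_guardian_unhealthy_containers 2
# TYPE docker_guardian_circuit_open_containers gauge
docker_guardian_circuit_open_containers 1
docker_guardian_event_stream_connected 1
"""
circuit_open = read_gauge(sample, "docker_guardian_circuit_open_containers")
```

The text itself could be fetched with `urllib.request.urlopen` against the published port. The exact path is not stated here, though `/metrics` is the usual Prometheus convention.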
Container event received
├── health_status: unhealthy
│   ├── autoheal=False or action=none? → IGNORE
│   ├── State = paused? → SKIP
│   ├── State = restarting? → SKIP
│   ├── Below unhealthy threshold? → SKIP (count N/M)
│   ├── Orchestration active (Watchtower)? → SKIP
│   ├── Within grace period? → SKIP
│   ├── Backup-managed + backup running? → SKIP
│   ├── action=notify? → NOTIFY ONLY
│   ├── Circuit breaker open (budget exhausted)? → NOTIFY [CRITICAL]
│   ├── Backoff active? → SKIP (wait for backoff)
│   ├── action=stop? → Stop container (quarantine)
│   └── Restart container
│
├── health_status: healthy
│   └── Reset backoff for container
│
├── die (exit code 128, NetworkMode=container:X)
│   ├── Parent not running? → SKIP
│   ├── Orchestration active? → SKIP
│   ├── Within grace period? → SKIP
│   ├── Backup-managed + backup running? → SKIP
│   ├── Wait start delay...
│   ├── Parent still running? → Start container
│   └── Parent stopped? → SKIP
│
├── start (container is a network parent)
│   ├── CASCADE_RESTART disabled? → SKIP
│   ├── Wait settle delay...
│   └── Restart all dependents using --network=container:X
│
├── periodic scan (network healthcheck)
│   ├── NETWORK_HEALTHCHECK disabled? → SKIP
│   ├── Container uses --network=container:X? → Ping target
│   │   ├── Ping succeeds → OK
│   │   └── Ping fails → Restart container
│   └── Not a network dependent → SKIP
│
└── create/destroy
    └── Record orchestration activity
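
The `health_status: unhealthy` branch of the tree is an ordered series of guards, which can be sketched as a single function. The dataclass fields and return strings are hypothetical names invented for this example. Only the order of the checks is taken from the tree.

```python
from dataclasses import dataclass


@dataclass
class UnhealthyContext:
    """Inputs to the decision; all field names are illustrative."""
    autoheal_enabled: bool = True
    action: str = "restart"        # "restart", "stop", "notify", or "none"
    state: str = "running"
    unhealthy_count: int = 1
    unhealthy_threshold: int = 1
    orchestration_active: bool = False
    in_grace_period: bool = False
    backup_blocked: bool = False   # backup-managed and a backup is running
    circuit_open: bool = False
    backoff_active: bool = False


def decide_unhealthy(c):
    """Walk the guards in the same order as the tree and return the outcome."""
    if not c.autoheal_enabled or c.action == "none":
        return "ignore"
    if c.state in ("paused", "restarting"):
        return "skip"
    if c.unhealthy_count < c.unhealthy_threshold:
        return "skip"
    if c.orchestration_active or c.in_grace_period or c.backup_blocked:
        return "skip"
    if c.action == "notify":
        return "notify"
    if c.circuit_open:
        return "notify_critical"
    if c.backoff_active:
        return "skip"
    if c.action == "stop":
        return "stop"
    return "restart"
```

Because the guards run in order, a container with `action="notify"` is only ever notified about, even when its circuit breaker is open, which matches the ordering in the tree.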