Features

Circuit Breaker & Restart Policy

Prevents restart storms when a container is fundamentally broken:

Exponential backoff: delays between restarts double each time (10s → 20s → 40s → ...) up to a configurable maximum
Restart budget: a maximum number of restarts per rolling time window (default: 5 per 300s)
Circuit open: when the budget is exhausted, Guardian stops restarting the container and sends a CRITICAL notification
Auto-reset: backoff resets after a container stays healthy for a configurable duration
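
The interplay of backoff, budget, and circuit state can be illustrated with a small sketch. This is not Docker-Guardian's actual code: the class name `RestartPolicy`, its method names, and the 300s backoff cap are assumptions made for illustration; only the 10s starting delay and the 5-per-300s budget come from the description above.

```python
from collections import deque


class RestartPolicy:
    """Hypothetical per-container backoff plus rolling restart budget."""

    def __init__(self, base_delay=10.0, max_delay=300.0, budget=5, window=300.0):
        self.base_delay = base_delay  # first backoff step (10s in the text)
        self.max_delay = max_delay    # assumed cap; the real default is not stated
        self.budget = budget          # max restarts per window (documented: 5)
        self.window = window          # rolling window in seconds (documented: 300)
        self.attempts = 0             # restarts since the last reset
        self.history = deque()        # timestamps of restarts inside the window

    def next_delay(self):
        """Delay before the next restart: base * 2^attempts, capped."""
        return min(self.base_delay * (2 ** self.attempts), self.max_delay)

    def circuit_open(self, now):
        """True when the budget for the rolling window is exhausted."""
        while self.history and now - self.history[0] >= self.window:
            self.history.popleft()  # drop restarts that left the window
        return len(self.history) >= self.budget

    def record_restart(self, now):
        self.history.append(now)
        self.attempts += 1

    def reset(self):
        """Called once the container has stayed healthy long enough."""
        self.attempts = 0


policy = RestartPolicy()
delays = []
for t in range(5):
    delays.append(policy.next_delay())
    policy.record_restart(now=float(t))
print(delays)                        # [10.0, 20.0, 40.0, 80.0, 160.0]
print(policy.circuit_open(now=5.0))  # True: 5 restarts inside 300s
```

In this sketch the reset only clears the backoff counter, mirroring the "backoff resets" wording; whether the real implementation also clears the restart budget is not stated.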

Event-Driven Detection

Docker-Guardian subscribes to the Docker event stream for real-time detection:

Reacts to health_status: unhealthy events within seconds (no polling delay)
Detects container die events for instant orphan dependency recovery
Tracks create/destroy events for orchestration awareness
Resets backoff when health_status: healthy is received
Auto-reconnects with exponential backoff if the event stream drops
Falls back to polling if event stream is unavailable
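
As a rough illustration of the dispatch described above, the sketch below classifies decoded events from the Docker event stream and generates reconnect delays. The function names, the returned labels, and the 1s/60s reconnect bounds are assumptions; the `Type`, `Action`, and `Actor` fields follow the shape of Docker's event JSON.

```python
import itertools


def classify_event(event):
    """Map a decoded Docker event dict to a (kind, container_id) pair."""
    if event.get("Type") != "container":
        return ("ignore", None)
    action = event.get("Action", "")
    cid = event.get("Actor", {}).get("ID")
    if action == "health_status: unhealthy":
        return ("unhealthy", cid)
    if action == "health_status: healthy":
        return ("reset_backoff", cid)
    if action == "die":
        return ("died", cid)
    if action in ("create", "destroy"):
        return ("orchestration", cid)
    return ("ignore", cid)


def reconnect_delays(base=1.0, cap=60.0):
    """Yield exponentially growing delays for re-subscribing to the stream.

    The base and cap values are illustrative, not documented defaults.
    """
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


event = {
    "Type": "container",
    "Action": "health_status: unhealthy",
    "Actor": {"ID": "abc123", "Attributes": {"name": "web"}},
}
print(classify_event(event))                          # ('unhealthy', 'abc123')
print(list(itertools.islice(reconnect_delays(), 4)))  # [1.0, 2.0, 4.0, 8.0]
```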

Dependency Monitoring

Auto-detects network dependencies via the Docker API; no labels are needed. On each event or poll cycle:

Queries exited containers
Filters to those using --network=container:X network mode
Checks if exit code is 128 (killed by parent exit)
Verifies parent is running
Waits a configurable delay (parent initialisation time)
Starts the orphaned dependent

Multi-level dependencies (A→B→C) resolve naturally over multiple cycles.
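
The filtering steps, and the way a chain resolves over several cycles, can be sketched as follows. The `find_orphans` function and the simplified container dictionaries are hypothetical stand-ins for Docker inspect data, not Guardian's real data model, and the configurable start delay is left out.

```python
def find_orphans(containers):
    """Return names of exited dependents whose network parent is running.

    `containers` maps name -> {"status", "exit_code", "network_mode"},
    a simplified stand-in for `docker inspect` output.
    """
    orphans = []
    for name, info in containers.items():
        mode = info.get("network_mode", "")
        # Only exited containers that share another container's network.
        if info["status"] != "exited" or not mode.startswith("container:"):
            continue
        # Exit code 128 is treated as "killed because the parent exited".
        if info.get("exit_code") != 128:
            continue
        parent = containers.get(mode.split(":", 1)[1])
        if parent is not None and parent["status"] == "running":
            orphans.append(name)
    return orphans


containers = {
    "A": {"status": "running", "exit_code": 0, "network_mode": "bridge"},
    "B": {"status": "exited", "exit_code": 128, "network_mode": "container:A"},
    "C": {"status": "exited", "exit_code": 128, "network_mode": "container:B"},
}
first = find_orphans(containers)       # only B: C's parent is still exited
containers["B"]["status"] = "running"  # simulate Guardian starting B
second = find_orphans(containers)      # now C qualifies
print(first, second)                   # ['B'] ['C']
```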

Cascade Restart & Network Healthcheck

Handles the case where a network parent restarts (not dies) and dependents lose connectivity:

Cascade restart: when a --network=container:X parent restarts, Guardian automatically restarts all dependents after a configurable settle delay (default 15s). This covers planned restarts (Watchtower updates, manual restarts) in which dependents do not exit but lose network connectivity.
Network healthcheck: a periodic ping check (docker exec ... ping -c1 -W3 <target>) on containers sharing a network namespace. If the ping fails, Guardian restarts the container. This acts as a safety net for cases where the cascade did not fire or connectivity degraded silently.

Both features are enabled by default and require no labels.
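
Selecting which containers a cascade restart would touch might look like the sketch below. The `dependents_of` helper and the container dictionaries are hypothetical; the parent is matched by any of its known references (name or ID) because the network mode string may carry either.

```python
def dependents_of(parent_refs, containers):
    """Names of containers whose network mode points at the given parent.

    `parent_refs` holds every identifier the parent may be referenced by
    (for example its name and its ID).
    """
    targets = {f"container:{ref}" for ref in parent_refs}
    return sorted(
        name
        for name, info in containers.items()
        if info.get("network_mode") in targets
    )


containers = {
    "vpn": {"network_mode": "bridge"},
    "torrent": {"network_mode": "container:vpn"},
    "indexer": {"network_mode": "container:4f2a9c"},
    "web": {"network_mode": "bridge"},
}
# "vpn" restarted; 4f2a9c is its (made-up) container ID.
print(dependents_of({"vpn", "4f2a9c"}, containers))  # ['indexer', 'torrent']
```

In a real cascade the restart of these dependents would be issued only after the settle delay has elapsed.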

Watchtower Awareness

Detects active orchestration (Watchtower, manual docker-compose up, etc.) via Docker events:

Watches for container destroy and create events within a configurable cooldown window (default 300s)
When such events are detected, pauses all monitoring until the cooldown expires
Configurable scope: skip all containers (default) or only affected ones
Configurable events: orchestration only (default, avoids self-triggering) or all lifecycle events

Set AUTOHEAL_WATCHTOWER_COOLDOWN=0 to disable.
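
The cooldown logic can be pictured as a timestamp comparison. The sketch below uses a hypothetical `OrchestrationTracker` class and covers only the default "skip all containers" scope; a cooldown of 0 disables the check, matching the setting above.

```python
class OrchestrationTracker:
    """Hypothetical tracker for recent create/destroy activity."""

    def __init__(self, cooldown=300.0):
        self.cooldown = cooldown  # seconds; 0 disables the feature
        self.last_event = None    # time of the latest create/destroy event

    def record(self, now):
        """Note that a create or destroy event was seen at time `now`."""
        self.last_event = now

    def active(self, now):
        """True while monitoring should stay paused."""
        if self.cooldown <= 0 or self.last_event is None:
            return False
        return now - self.last_event < self.cooldown


tracker = OrchestrationTracker()
tracker.record(now=100.0)
print(tracker.active(now=250.0))  # True: 150s since the last event
print(tracker.active(now=400.0))  # False: the 300s cooldown has expired
```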

Backup Awareness

Prevents Docker-Guardian from interfering with backup tools like docker-volume-backup:

Auto-detects running backup containers by image name
Skips containers labelled with docker-volume-backup.stop-during-backup while backup is active
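
A minimal sketch of the two checks is shown below. The helper names and the substring match on the image name are assumptions; the description above says only that backup containers are detected by image name, not how the match is done.

```python
BACKUP_LABEL = "docker-volume-backup.stop-during-backup"


def backup_running(running_images, patterns=("docker-volume-backup",)):
    """True if any running container's image name matches a backup pattern.

    Substring matching and the default pattern are illustrative guesses.
    """
    return any(p in image for image in running_images for p in patterns)


def skip_for_backup(labels, running_images):
    """Skip a container only if it is backup-managed AND a backup is active."""
    return BACKUP_LABEL in labels and backup_running(running_images)


images = ["nginx:1.27", "offen/docker-volume-backup:v2"]
print(skip_for_backup({BACKUP_LABEL: "app"}, images))          # True
print(skip_for_backup({}, images))                             # False
print(skip_for_backup({BACKUP_LABEL: "app"}, ["nginx:1.27"]))  # False
```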

Grace Period

Skips recently-stopped containers to avoid fighting with:

Manual stops for maintenance
Other orchestration tools not covered by Watchtower awareness

Default: 300 seconds. Set to 0 to disable.

Prometheus Metrics

Enable with METRICS_PORT:

-e METRICS_PORT=9090 -p 9090:9090

Exposed metrics:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `docker_guardian_restarts_total` | Counter | container, result | Restart attempts (success/failure) |
| `docker_guardian_skips_total` | Counter | container, reason | Skipped containers (orchestration/grace/backup/circuit/backoff) |
| `docker_guardian_notifications_total` | Counter | service, result | Notification delivery (success/failure per service) |
| `docker_guardian_events_processed_total` | Counter | action | Docker events processed by type |
| `docker_guardian_unhealthy_containers` | Gauge | - | Current unhealthy container count |
| `docker_guardian_circuit_open_containers` | Gauge | - | Containers with circuit breaker open |
| `docker_guardian_event_stream_connected` | Gauge | - | Event stream connection status (1/0) |
| `docker_guardian_restart_duration_seconds` | Histogram | container | Time taken for restart operations |
| `docker_guardian_event_processing_duration_seconds` | Histogram | - | Time taken to process each event |

Decision Flowchart

Container event received
├── health_status: unhealthy
│   ├── autoheal=False or action=none? → IGNORE
│   ├── State = paused? → SKIP
│   ├── State = restarting? → SKIP
│   ├── Below unhealthy threshold? → SKIP (count N/M)
│   ├── Orchestration active (Watchtower)? → SKIP
│   ├── Within grace period? → SKIP
│   ├── Backup-managed + backup running? → SKIP
│   ├── action=notify? → NOTIFY ONLY
│   ├── Circuit breaker open (budget exhausted)? → NOTIFY [CRITICAL]
│   ├── Backoff active? → SKIP (wait for backoff)
│   ├── action=stop? → Stop container (quarantine)
│   └── Restart container
│
├── health_status: healthy
│   └── Reset backoff for container
│
├── die (exit code 128, NetworkMode=container:X)
│   ├── Parent not running? → SKIP
│   ├── Orchestration active? → SKIP
│   ├── Within grace period? → SKIP
│   ├── Backup-managed + backup running? → SKIP
│   ├── Wait start delay...
│   ├── Parent still running? → Start container
│   └── Parent stopped? → SKIP
│
├── start (container is a network parent)
│   ├── CASCADE_RESTART disabled? → SKIP
│   ├── Wait settle delay...
│   └── Restart all dependents using --network=container:X
│
├── periodic scan (network healthcheck)
│   ├── NETWORK_HEALTHCHECK disabled? → SKIP
│   ├── Container uses --network=container:X? → Ping target
│   │   ├── Ping succeeds → OK
│   │   └── Ping fails → Restart container
│   └── Not a network dependent → SKIP
│
└── create/destroy
    └── Record orchestration activity
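
The `health_status: unhealthy` branch of the flowchart can be read as an ordered chain of checks. The sketch below encodes that order; the `Context` dataclass, its field names, and the returned strings are hypothetical and chosen only to mirror the flowchart labels.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical snapshot of everything the unhealthy branch inspects."""
    autoheal: bool = True
    action: str = "restart"  # "restart", "stop", "notify", or "none"
    state: str = "running"
    unhealthy_count: int = 1
    unhealthy_threshold: int = 1
    orchestration_active: bool = False
    in_grace_period: bool = False
    backup_blocked: bool = False  # backup-managed and a backup is running
    circuit_open: bool = False
    backoff_active: bool = False


def decide_unhealthy(ctx):
    """Walk the checks in flowchart order and return the first match."""
    if not ctx.autoheal or ctx.action == "none":
        return "IGNORE"
    if ctx.state in ("paused", "restarting"):
        return "SKIP"
    if ctx.unhealthy_count < ctx.unhealthy_threshold:
        return "SKIP"
    if ctx.orchestration_active or ctx.in_grace_period or ctx.backup_blocked:
        return "SKIP"
    if ctx.action == "notify":
        return "NOTIFY"
    if ctx.circuit_open:
        return "NOTIFY_CRITICAL"
    if ctx.backoff_active:
        return "SKIP"
    if ctx.action == "stop":
        return "STOP"
    return "RESTART"


print(decide_unhealthy(Context()))                             # RESTART
print(decide_unhealthy(Context(circuit_open=True)))            # NOTIFY_CRITICAL
print(decide_unhealthy(Context(action="stop", state="paused")))  # SKIP
```

Because the checks are ordered, an open circuit takes precedence over an active backoff, and `action=notify` short-circuits before either is consulted.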
