
# Cluster Mode

Will Luck edited this page Apr 13, 2026 · 5 revisions


Cluster mode lets a single Sentinel server manage containers across multiple Docker hosts via lightweight agents. Communication uses gRPC bidirectional streaming over mutual TLS (TLS 1.3 minimum), with protobuf-serialised messages.

## Architecture

```
         ┌──────────────────────────────┐
         │  Server                      │
         │  - Web UI          :8080     │
         │  - gRPC            :9443     │
         │  - Certificate Authority     │
         │  - Host registry (BoltDB)    │
         └──────┬───────────┬───────────┘
                │           │
          gRPC/mTLS    gRPC/mTLS
                │           │
   ┌────────────┘           └────────────┐
   ▼                                     ▼
┌──────────────────┐          ┌──────────────────┐
│  Agent (host-a)  │          │  Agent (host-b)  │
│  Reports state   │          │  Reports state   │
│  Executes cmds   │          │  Executes cmds   │
└──────────────────┘          └──────────────────┘
```
| Component | Role |
|---|---|
| Server | Central dashboard, certificate authority, command dispatcher, host registry |
| Agent | Connects to the server, reports container state, executes commands locally |
| Transport | gRPC with protobuf serialisation, bidirectional streaming |
| Security | Mutual TLS with a built-in CA; the server signs each agent's certificate |

## gRPC Services

Defined in `internal/cluster/proto/sentinel.proto`.

| Service | RPC | Type | Description |
|---|---|---|---|
| `EnrollmentService` | `Enroll` | Unary | Agent presents token + CSR, receives signed certificate |
| `AgentService` | `Channel` | Bidi-stream | Persistent command/event channel |
| `AgentService` | `ReportState` | Unary | Full container state snapshot |

## Message Types

Server to Agent:

| Message | Purpose |
|---|---|
| `ListContainersRequest` | Request a fresh container list |
| `UpdateContainerRequest` | Trigger a container update (pull, stop, recreate, start) |
| `ContainerActionRequest` | Stop, start, or restart a container |
| `FetchLogsRequest` | Retrieve container logs |
| `PullImageRequest` | Pre-pull an image |
| `RunHookRequest` | Execute a command inside a container |
| `PolicySync` | Push policy updates to the agent cache |
| `SettingsSync` | Push operational settings to the agent cache |
| `CertRenewalResponse` | Deliver a renewed certificate |

Agent to Server:

| Message | Purpose |
|---|---|
| `Heartbeat` | Periodic keepalive with version and feature flags |
| `ContainerList` | Container state snapshot |
| `UpdateResult` | Outcome of an update operation |
| `ContainerActionResult` | Outcome of a stop/start/restart |
| `FetchLogsResult` | Container log output |
| `HookResult` | Hook execution result with exit code |
| `RollbackResult` | Outcome of a rollback |
| `OfflineJournal` | Batch replay of actions taken while disconnected |
| `CertRenewalCSR` | Certificate renewal request |

## Server Setup

```bash
docker run -d \
  --name sentinel-server \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v sentinel-data:/data \
  -p 8080:8080 \
  -p 9443:9443 \
  -e SENTINEL_CLUSTER=true \
  ghcr.io/will-luck/docker-sentinel:latest
```

| Port | Purpose |
|---|---|
| 8080 | Web UI and REST API |
| 9443 | gRPC cluster endpoint (mTLS) |

Alternatively, run without `SENTINEL_CLUSTER=true` and select the Server role with cluster mode enabled during the setup wizard at `http://server:8080/setup`.

On first start, the server:

1. Creates a self-signed CA in `SENTINEL_CLUSTER_DIR` (default `/data/cluster`).
2. Generates an HMAC signing key for enrollment tokens.
3. Issues an ephemeral server certificate from the CA.
4. Starts the gRPC listener on `SENTINEL_CLUSTER_PORT`.
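The CA creation in step 1 can be sketched in Go using only the standard library. This is an illustrative reconstruction based on the properties documented on this page (ECDSA P-256, 10-year validity), not Sentinel's actual code; the function name `newClusterCA` and the subject CN are assumptions.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newClusterCA generates a P-256 key and a self-signed CA certificate
// valid for ten years, matching the properties in the mTLS table below.
func newClusterCA() (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "sentinel-cluster-ca"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(10, 0, 0), // 10-year CA validity
		IsCA:                  true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
		BasicConstraintsValid: true,
	}
	// Self-signed: the template is both subject and issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func main() {
	cert, _, err := newClusterCA()
	if err != nil {
		panic(err)
	}
	fmt.Println("CA:", cert.Subject.CommonName, "IsCA:", cert.IsCA)
}
```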

## Agent Enrollment

### Method 1: Setup Wizard

1. On the server, go to the Cluster page and click **Generate Enrollment Token**. Tokens are single-use and expire after the configured duration.
2. Start a plain Sentinel container on the agent host:

   ```bash
   docker run -d \
     --name sentinel-agent \
     --restart unless-stopped \
     -v /var/run/docker.sock:/var/run/docker.sock:ro \
     -v sentinel-agent-data:/data \
     -p 8080:8080 \
     ghcr.io/will-luck/docker-sentinel:latest
   ```

3. Navigate to `http://agent:8080/setup`, select the Agent role, enter the server address and enrollment token, and complete the wizard.

### Method 2: Pre-Configured (Headless)

Set `SENTINEL_ENROLL_TOKEN` and the agent auto-enrolls on startup, with no wizard needed.

```bash
docker run -d \
  --name sentinel-agent \
  --restart unless-stopped \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  -v sentinel-agent-data:/data \
  -e SENTINEL_MODE=agent \
  -e SENTINEL_SERVER_ADDR=10.0.0.10:9443 \
  -e SENTINEL_ENROLL_TOKEN=<token> \
  -e SENTINEL_HOST_NAME=worker-1 \
  ghcr.io/will-luck/docker-sentinel:latest
```

## Enrollment Flow (Internal)

1. A server admin generates a one-time token (`POST /api/cluster/enroll-token`). Only the HMAC-SHA256 hash is stored; the plaintext is shown once.
2. The agent generates an ECDSA P-256 key pair and creates a PKCS#10 CSR.
3. The agent connects to the server without a client certificate (TLS with `InsecureSkipVerify`, since it does not have the CA certificate yet).
4. The agent sends the `Enroll` RPC with the token and CSR.
5. The server validates the token (HMAC comparison), marks it used, and signs the CSR.
6. The server returns the host ID, the CA certificate PEM, and the signed agent certificate PEM.
7. The agent persists `ca.pem`, `agent.pem`, `agent-key.pem`, and `host-id` to `SENTINEL_CLUSTER_DIR`.
8. All subsequent connections use mTLS with the enrolled certificate.

## mTLS Security

| Property | Detail |
|---|---|
| CA algorithm | ECDSA P-256 |
| CA validity | 10 years |
| Certificate validity | 1 year (server and agent) |
| Minimum TLS version | 1.3 |
| Server cert key usage | ServerAuth + ClientAuth |
| Agent cert key usage | ClientAuth only |
| Server SAN | `localhost`, `127.0.0.1`, `::1`, plus all private IPs from host interfaces, plus any entries from `SENTINEL_CLUSTER_ADVERTISE` |
| Token signing | HMAC-SHA256 with a random 32-byte key (persisted to `hmac-key.bin`) |
| Revocation | Certificate serial added to a BoltDB CRL; checked at TLS handshake and per-RPC |
| Certificate renewal | Agent sends a new CSR when its cert approaches expiry; the server signs and delivers it inline |

The enrollment token signing key is a dedicated random secret (not derived from the CA certificate). It is generated on first run and stored with `0600` permissions.
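The token scheme described above (server stores only the HMAC-SHA256 hash, verification by constant-time comparison) can be sketched as follows; `mintToken` and `verifyToken` are illustrative names, not Sentinel's functions:

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// mintToken creates a random 64-hex-character token and returns its
// HMAC-SHA256 under a fresh random 32-byte key. Only mac (and key) would
// be persisted; the plaintext is shown to the admin once.
func mintToken() (plaintext string, mac, key []byte, err error) {
	key = make([]byte, 32) // random 32-byte signing key, as in the table above
	if _, err = rand.Read(key); err != nil {
		return
	}
	raw := make([]byte, 32)
	if _, err = rand.Read(raw); err != nil {
		return
	}
	plaintext = hex.EncodeToString(raw) // 64 hex characters
	h := hmac.New(sha256.New, key)
	h.Write([]byte(plaintext))
	mac = h.Sum(nil)
	return
}

// verifyToken recomputes the HMAC and compares in constant time.
func verifyToken(plaintext string, mac, key []byte) bool {
	h := hmac.New(sha256.New, key)
	h.Write([]byte(plaintext))
	return hmac.Equal(h.Sum(nil), mac)
}

func main() {
	tok, mac, key, err := mintToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(tok), verifyToken(tok, mac, key)) // prints "64 true"
}
```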


## Host Lifecycle

| State | Description |
|---|---|
| Active | Agent connected. Containers visible on the dashboard. Commands dispatched normally. |
| Paused | No new updates dispatched. In-progress operations finish. Agent remains connected. |
| Decommissioned | Certificate revoked. Agent cannot reconnect without re-enrolling. |

### Actions

| Action | API | Effect |
|---|---|---|
| Pause | `POST /api/cluster/hosts/{id}/pause` | Stop scheduling updates on this host |
| Remove | `DELETE /api/cluster/hosts/{id}` | Disconnect the agent, delete it from the registry |
| Revoke | `POST /api/cluster/hosts/{id}/revoke` | Add the cert serial to the CRL, disconnect, delete from the registry |

## Enrolling a New Agent

Click **Generate Token** on the Cluster page to produce a one-time enrollment token, then run the displayed `docker run` command on the new host.

## Connectors

*(Screenshot: Connectors NPM)*

*(Screenshot: Connectors Portainer)*

## `SENTINEL_CLUSTER_ADVERTISE`

By default, the server TLS certificate includes `localhost`, `127.0.0.1`, `::1`, and all private IPs detected on the host's network interfaces. If agents connect via an address that is not in this set (e.g. a Tailscale IP, a DNS name, or a public IP), TLS verification fails because the server address does not match any certificate SAN.

Set `SENTINEL_CLUSTER_ADVERTISE` to a comma-separated list of additional IPs or hostnames to include in the server certificate:

```bash
-e SENTINEL_CLUSTER_ADVERTISE="100.64.0.5,sentinel.example.com"
```

The values are parsed at certificate generation time: IP addresses become IP SANs; hostnames become DNS SANs. If the server certificate already exists, changing this variable takes effect on the next certificate renewal, or after deleting the existing certificate files from `SENTINEL_CLUSTER_DIR`.

This can also be configured at runtime via Settings > Cluster in the web UI (the `advertise_addr` field).
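The IP-vs-DNS split described above is a one-liner per entry in Go: anything `net.ParseIP` accepts becomes an IP SAN, everything else a DNS SAN. A sketch (the function name `splitAdvertise` is hypothetical):

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// splitAdvertise parses a comma-separated advertise list into IP SANs
// and DNS SANs, trimming whitespace and skipping empty entries.
func splitAdvertise(v string) (ips []net.IP, dns []string) {
	for _, part := range strings.Split(v, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		if ip := net.ParseIP(part); ip != nil {
			ips = append(ips, ip) // IP address -> IP SAN
		} else {
			dns = append(dns, part) // anything else -> DNS SAN
		}
	}
	return
}

func main() {
	ips, dns := splitAdvertise("100.64.0.5,sentinel.example.com")
	fmt.Println(len(ips), dns[0]) // prints "1 sentinel.example.com"
}
```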


## Engine ID Deduplication

When multiple sources monitor the same Docker daemon (e.g. the local socket, a Portainer endpoint, and a cluster agent all pointing at the same host), Sentinel can detect the overlap and automatically prevent duplicate container entries.

Each Docker daemon has a unique Engine ID. Sentinel collects this ID from:

- The local Docker socket on startup (stored as `local_engine_id` in the database).
- Each cluster agent, which reports its Engine ID during heartbeats (stored in the host registry).
- Each Portainer endpoint, which exposes the Engine ID via the Portainer API.

When a cluster agent reports its Engine ID, the server compares it against all configured Portainer endpoints. If a Portainer endpoint has the same Engine ID as a connected agent, the endpoint is automatically flagged to prevent scanning the same containers twice. A `source_overlap` event is emitted via SSE to notify the dashboard.

This is fully automatic and requires no configuration. The deduplication check runs whenever an agent reports its Engine ID or a Portainer endpoint is added.
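The overlap check above reduces to comparing Engine IDs across sources. A minimal sketch, with an illustrative `endpoint` type and `flagOverlaps` function that are not Sentinel's actual data model:

```go
package main

import "fmt"

// endpoint is a simplified stand-in for a configured Portainer endpoint.
type endpoint struct {
	Name     string
	EngineID string
	Overlap  bool
}

// flagOverlaps marks any endpoint whose Engine ID matches a connected
// agent's daemon, and returns the names of the flagged endpoints.
func flagOverlaps(agentEngineID string, endpoints []endpoint) []string {
	var flagged []string
	for i := range endpoints {
		if endpoints[i].EngineID == agentEngineID {
			endpoints[i].Overlap = true // skip this endpoint when scanning
			flagged = append(flagged, endpoints[i].Name)
		}
	}
	return flagged
}

func main() {
	eps := []endpoint{
		{Name: "portainer-local", EngineID: "engine-aaa"},
		{Name: "portainer-remote", EngineID: "engine-bbb"},
	}
	fmt.Println(flagOverlaps("engine-aaa", eps)) // prints "[portainer-local]"
}
```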


## Remote Container Management

From the server dashboard, operators can manage containers on any connected agent host:

| Operation | Description |
|---|---|
| List containers | Real-time container list with state, image, ports |
| Update | Pull the new image, stop, remove, recreate with the same config |
| Stop / Start / Restart | Container lifecycle actions |
| View logs | Fetch the last N lines of container output (max 500) |
| Run hooks | Execute commands inside containers |

All operations use synchronous request/response over the bidirectional gRPC stream. The server registers a response channel before sending a command and blocks until the agent replies or a timeout occurs.

### Update Flow (Remote)

1. The server sends an `UpdateContainerRequest` with the container name, target image, and optional digest.
2. The agent inspects the running container to capture its full configuration.
3. The agent pulls the target image.
4. The agent stops, removes, and recreates the container with the new image (preserving env vars, volumes, ports, and networks).
5. The agent starts the new container.
6. The agent pushes a fresh container list followed by the `UpdateResult`.

## Autonomous Mode

If the agent loses connectivity for longer than `SENTINEL_GRACE_PERIOD_OFFLINE` (default `30m`), it enters autonomous mode:

- Monitors containers locally using its cached copy of the server's policies and settings.
- Does not attempt container updates (registry checks require the server).
- Journals all observed state changes to a JSON file on disk.
- Reconnects automatically when the server is reachable again, using exponential backoff (1s, 2s, 4s, ..., capped at 30s).

On reconnection:

1. The agent sends its full state report.
2. The agent replays the offline journal to the server via the `OfflineJournal` message.
3. The agent clears the local journal.
4. The normal heartbeat/command loop resumes.

## Policy Cache

The agent caches policies and settings pushed by the server via `PolicySync` and `SettingsSync` messages. The cache is persisted to `policy_cache.json` in the agent's data directory so it survives agent restarts.

Policy resolution order (highest priority first):

1. Container labels (`sentinel.policy`)
2. Server-pushed per-container overrides
3. Server-pushed default policy
4. Hardcoded fallback: `manual`
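The resolution order above is a straightforward first-non-empty-wins chain. A sketch with illustrative parameter names (Sentinel's actual signature may differ):

```go
package main

import "fmt"

// resolvePolicy returns the effective policy for a container, trying
// each source in priority order and falling back to "manual".
func resolvePolicy(label, override, serverDefault string) string {
	switch {
	case label != "": // 1. container label (sentinel.policy)
		return label
	case override != "": // 2. server-pushed per-container override
		return override
	case serverDefault != "": // 3. server-pushed default policy
		return serverDefault
	default: // 4. hardcoded fallback
		return "manual"
	}
}

func main() {
	fmt.Println(resolvePolicy("", "auto", "notify")) // override beats default: prints "auto"
	fmt.Println(resolvePolicy("", "", ""))           // nothing set: prints "manual"
}
```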

## Agent Auto-Update

The server automatically updates agents running a different version. On each poll cycle, it compares its own version against the version reported in each agent's heartbeat. If they differ, the server:

1. Finds the container with the `sentinel.self=true` label on the agent host.
2. Sends an `UpdateContainerRequest` with the server's version tag.
3. The agent pulls the new image, recreates its own container, and the new process reconnects.

Dev builds (version `dev` or empty) are skipped.


## Network Requirements

| Direction | Port | Protocol | Purpose |
|---|---|---|---|
| Agent to Server | 9443 (configurable) | TCP/TLS 1.3 | gRPC cluster communication |
| Server to Agent | None | | The server does not initiate connections; agents connect outbound |

The gRPC connection is persistent (a long-lived bidirectional stream). Agents reconnect automatically on any interruption. Firewalls must allow agents to reach the server on `SENTINEL_CLUSTER_PORT`.


## Cluster Settings in the Web UI

The Cluster page in the web UI provides:

- Host list with connection status, last-seen time, agent version, and container counts.
- A **Generate Enrollment Token** button for adding new agents.
- Per-host actions: pause, remove, revoke.
- Host grouping in the dashboard, with containers visually grouped by their host.
- Connection status indicators: connected (green), or disconnected with a timestamp and disconnect reason (network, cert, server).

## Supported Agent Features

Agents advertise their capabilities during heartbeat. Current feature set:

| Feature | Description |
|---|---|
| `update` | Container update lifecycle (pull, stop, recreate, start) |
| `hooks` | Execute commands inside containers |
| `pull` | Pre-pull images |
| `list` | List containers |
| `logs` | Fetch container logs |

## Environment Variables

### Server

| Variable | Default | Description |
|---|---|---|
| `SENTINEL_CLUSTER` | `false` | Enable the gRPC cluster listener |
| `SENTINEL_CLUSTER_PORT` | `9443` | gRPC listen port |
| `SENTINEL_CLUSTER_DIR` | `/data/cluster` | CA, certificates, and HMAC key storage |
| `SENTINEL_CLUSTER_ADVERTISE` | (empty) | Extra IPs or hostnames added to the server TLS certificate as Subject Alternative Names (comma-separated) |

### Agent

| Variable | Default | Description |
|---|---|---|
| `SENTINEL_MODE` | (auto) | Set to `agent` for agent-only mode |
| `SENTINEL_SERVER_ADDR` | (none) | Server `host:port` for the gRPC connection |
| `SENTINEL_ENROLL_TOKEN` | (none) | One-time enrollment token (consumed on first start) |
| `SENTINEL_HOST_NAME` | (none) | Human-readable agent name displayed in the dashboard |
| `SENTINEL_GRACE_PERIOD_OFFLINE` | `30m` | Time offline before autonomous mode activates |
| `SENTINEL_CLUSTER_DIR` | `/data/cluster` | Local certificate and state storage |

## Troubleshooting

### Agent cannot connect

| Symptom | Cause | Fix |
|---|---|---|
| `connection refused` | Server not listening or wrong port | Verify `SENTINEL_CLUSTER=true` on the server; check that `SENTINEL_CLUSTER_PORT` matches |
| `certificate has been revoked` | Agent cert was revoked via the UI | Re-enroll with a new token |
| `host not registered` | Agent's host ID not in the server registry (data loss or manual deletion) | Re-enroll with a new token and a fresh data volume |
| `transport is closing` / repeated reconnects | Network instability | Check firewalls, MTU, and connectivity between agent and server |
| TLS handshake failure | Clock skew or expired certificate | Verify system clocks are synchronised (NTP); check cert validity dates |

### Agent shows as disconnected

The server marks an agent as disconnected when its gRPC stream ends. Disconnect reasons are classified:

| Category | Meaning |
|---|---|
| `network` | Connection lost (EOF, timeout, reset, transport closed) |
| `cert` | Permission denied or host not registered (certificate issue) |
| `server` | Stream cancelled by the server (e.g. host revoked or replaced) |

### Stale agent stream

If an agent reconnects while the server still has an old stream open (e.g. after a network partition), the server automatically cancels the old stream and registers the new one. A "replaced stale stream" log entry is emitted.

### Enrollment token issues

| Problem | Fix |
|---|---|
| Token expired | Generate a new token from the Cluster page |
| Token already used | Each token is single-use; generate a new one |
| Token too short | Tokens are 64 hex characters; verify the full token was copied |

### CA certificate mismatch

If the server's CA was regenerated (e.g. the cluster data volume was deleted, or the server was migrated to a new host), agents that still have the old CA certificate cached locally will fail to connect. The agent logs this once:

```
server CA mismatch -- the server's TLS certificate has changed since this agent enrolled
```

The log entry includes a `fix` field with the resolution steps, and a `data_dir` field showing where the agent stores its certificate data.

Resolution:

1. Stop the agent container.
2. Delete the agent's cluster data directory (default: `/data/cluster` inside the container, or the path set by `SENTINEL_CLUSTER_DIR`).
3. Generate a new enrollment token on the server via the Cluster page.
4. Restart the agent with `SENTINEL_ENROLL_TOKEN` set to the new token.

The agent will re-enroll with the server's new CA and receive a fresh certificate. This message is logged once per agent lifecycle to avoid log spam.

### Agent version mismatch

If the server and agent are running different versions, the server automatically triggers an update. Check the server logs for "agent version mismatch" entries. The agent container must have the `sentinel.self=true` label for auto-update to locate it.
