feat: Fleet Message Bus (ADR-005) — MQTT coordination for hatchlings #24
Conversation
Pull request overview
This PR adds control plane functionality for managing a fleet of OpenClaw hatchling instances, including health monitoring, state snapshots, and deployment documentation. The changes appear to be derived from real deployment experience (Todd/CHC deployment in February 2026) and aim to make fleet management more robust.
Changes:
- Added fleet monitoring command with health checks and JSON output support
- Added snapshot command to capture container state from running instances
- Added `openclaw doctor --fix` self-healing to container entrypoint
- Switched Docker base image from `node:22-alpine` to `node:22-slim` for better compatibility
- Added comprehensive deployment SOP documentation based on real-world deployment
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| template/Dockerfile | Base image switched from alpine to slim, added ARG for version pinning |
| template/entrypoint.sh | Added self-healing via openclaw doctor --fix before startup |
| scripts/fleet.sh | Added monitor and snapshot commands for fleet health and state management |
| docs/DEPLOYMENT-SOP.md | New comprehensive deployment guide based on February 2026 deployment experience |
```shell
snapshot)
  name="${2:?Usage: fleet.sh snapshot <instance-name>}"
  container="hatchery-${name}"
  timestamp=$(date -u +%Y%m%dT%H%M%SZ)
  tmp_dir="/tmp/snapshot-${name}-${timestamp}"
  snapshot_dir="$ROOT_DIR/snapshots"
  snapshot_path="${snapshot_dir}/${name}-${timestamp}.tar.gz"

  # Use NAS Docker env vars
  export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
  export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
  export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"

  echo "📸 Snapshotting $name ($container)..."

  # Copy openclaw state from container
  docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"
  if [[ $? -ne 0 ]]; then
    echo "Error: Failed to copy from container '$container'" >&2
    exit 1
  fi

  # Tar it up
  mkdir -p "$snapshot_dir"
  tar -czf "$snapshot_path" -C "/tmp" "snapshot-${name}-${timestamp}"
  rm -rf "$tmp_dir"

  echo "✅ Snapshot saved: $snapshot_path"
  ;;
```
The new snapshot command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.
scripts/fleet.sh
Outdated
```shell
# Use NAS Docker env vars
export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"
```
The hardcoded IP address "192.168.1.2" appears to be environment-specific and should not be committed. This value would fail for any deployment not using this exact NAS configuration. Consider using environment variables without defaults, or documenting that users must set DOCKER_HOST, DOCKER_TLS_VERIFY, and DOCKER_CERT_PATH before using the snapshot command.
Suggested change:

```diff
-# Use NAS Docker env vars
-export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
-export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
-export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"
+# Require Docker environment variables for snapshot operations
+if [[ -z "${DOCKER_HOST:-}" || -z "${DOCKER_TLS_VERIFY:-}" || -z "${DOCKER_CERT_PATH:-}" ]]; then
+  echo "Error: DOCKER_HOST, DOCKER_TLS_VERIFY, and DOCKER_CERT_PATH must be set before using the snapshot command." >&2
+  exit 1
+fi
```
```shell
echo "📸 Snapshotting $name ($container)..."

# Copy openclaw state from container
docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"
```
The snapshot command does not verify that the container exists before attempting to copy from it. Consider adding a check similar to the one in the logs command (lines 132-135 of fleet.sh) to provide a clearer error message if the container doesn't exist.
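A minimal sketch of such a guard, using `docker inspect`'s exit status as the existence test (the helper names here are illustrative, not part of the PR):

```shell
# Hypothetical pre-flight check before docker cp: fail fast with a clear
# message if the container does not exist (helper names are illustrative).
container_exists() {
  docker inspect "$1" >/dev/null 2>&1
}

ensure_container() {
  local container="$1"
  if ! container_exists "$container"; then
    echo "Error: container '$container' not found. Is the instance running?" >&2
    return 1
  fi
}
```

Called as `ensure_container "$container" || exit 1` just before the `docker cp`, this turns an opaque copy failure into an actionable error.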
docs/DEPLOYMENT-SOP.md
Outdated
```shell
ssh zeph@revere
mkdir -p /share/Container/<hatchling-name>
```
This documentation contains hardcoded environment-specific information (`ssh zeph@revere`, `/share/Container/`) that would not work in other deployments. It should either use placeholder values (like `<user>@<host>` and `<path>`) or be clearly marked as example-specific.
```shell
monitor)
  # Monitor fleet health — similar to health but with pass/fail logic
  FLEET_HOST="${FLEET_HOST:-localhost}"
  JSON_OUTPUT=false
  shift
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --json) JSON_OUTPUT=true; shift ;;
      *) shift ;;
    esac
  done

  if [[ ! -f "$FLEET_REGISTRY" ]]; then
    if [[ "$JSON_OUTPUT" == "true" ]]; then
      echo '{"status":"error","message":"no fleet.json found"}'
    else
      echo "Error: no fleet.json found" >&2
    fi
    exit 1
  fi

  instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)
  if [[ -z "$instances" ]]; then
    if [[ "$JSON_OUTPUT" == "true" ]]; then
      echo '{"status":"ok","message":"no running instances","results":[]}'
    else
      echo "No running instances to monitor."
    fi
    exit 0
  fi

  all_healthy=true
  json_results="[]"

  while IFS= read -r name; do
    port=$(jq -r ".instances[\"$name\"].port" "$FLEET_REGISTRY")
    host=$(jq -r ".instances[\"$name\"].host // \"$FLEET_HOST\"" "$FLEET_REGISTRY")
    url="http://${host}:${port}/health"

    http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || echo "000")

    if [[ "$http_code" == "200" ]]; then
      healthy=true
    else
      healthy=false
      all_healthy=false
    fi

    if [[ "$JSON_OUTPUT" == "true" ]]; then
      json_results=$(echo "$json_results" | jq --arg n "$name" --arg h "$healthy" --arg c "$http_code" \
        '. + [{"name":$n,"healthy":($h == "true"),"http_status":($c | tonumber)}]')
    else
      if [[ "$healthy" == "false" ]]; then
        echo "$name UNHEALTHY (HTTP $http_code)"
      fi
    fi
  done <<< "$instances"

  if [[ "$JSON_OUTPUT" == "true" ]]; then
    jq -n --argjson results "$json_results" --arg healthy "$all_healthy" \
      '{"status":(if $healthy == "true" then "healthy" else "unhealthy" end),"results":$results}'
  else
    if [[ "$all_healthy" == "true" ]]; then
      echo "All instances healthy"
    fi
  fi
  ;;
```
The new monitor command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.
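As a sketch of what such coverage could look like, the health probe can be factored into a small function and exercised with a stubbed `curl`, so the test needs no network or running instance (the function name and test structure are assumptions, not taken from `scripts/test-fleet.sh`):

```shell
# Illustrative: extract the monitor command's probe into a testable helper.
check_health() {
  local url="$1" code
  code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || echo "000")
  [[ "$code" == "200" ]]
}

# In a test, shadow curl with a shell function so no real instance is needed.
curl() { echo "200"; }
check_health "http://localhost:18789/health" && echo "healthy-path ok"

curl() { echo "503"; }
check_health "http://localhost:18789/health" || echo "unhealthy-path ok"
```

Because shell function lookup happens at call time, redefining `curl` between assertions lets one test file cover both the healthy and unhealthy branches.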
```shell
  exit 1
fi

instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)
```
According to the PR description, "fleet.json schema update - Added missing fields: name, host, status, version to all instances" but there are no corresponding changes in the hatch.sh script or any other visible code that would add these fields to the fleet.json registry when instances are created or managed. This is a discrepancy between the PR description and the actual code changes.
template/Dockerfile
Outdated
```diff
 FROM node:22-slim

-RUN apk add --no-cache git curl bash gettext
+RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates && rm -rf /var/lib/apt/lists/*
```
The package gettext (which provides envsubst) is missing from the installation command. The envsubst command is required by the entrypoint.sh script (line 89) to process the openclaw.template.json configuration file. Without it, the container will fail at runtime when trying to generate the configuration.
Suggested change:

```diff
-RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates gettext && rm -rf /var/lib/apt/lists/*
```
docs/DEPLOYMENT-SOP.md
Outdated
```
Port allocation:
18789 — Zephyr (primary)
18790 — Seraphel
18791+ — next hatchling
```
The hardcoded port numbers (18789, 18790, 18791) and instance names (Zephyr, Seraphel) are environment-specific and should be presented as examples rather than as prescriptive values. Consider adding a note that these are example values from a specific deployment.
docs/DEPLOYMENT-SOP.md
Outdated
```markdown
## 4. Post-deployment

- [ ] **Create `#<client>` channel** in BCM server for operator visibility/monitoring
```
The reference to "BCM server" appears to be organization-specific and may not be clear to other users. Consider either removing this reference, explaining what BCM stands for, or making it clear this is an example from a specific deployment.
Suggested change:

```diff
-- [ ] **Create `#<client>` channel** in BCM server for operator visibility/monitoring
+- [ ] **Create `#<client>` channel** in your internal operator/monitoring Discord server for visibility/monitoring
```
docs/DEPLOYMENT-SOP.md
Outdated
```markdown
2. **Credential transfer:** Apple Notes via iCloud sharing works well when Tailscale MacBook→mini is broken. Never send tokens in Discord.
3. **Anthropic API key is shared** across all hatchlings — same subscription, one key.
4. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
5. **Port allocation:** Start at `18789`, increment per instance. Track in `fleet.json` to avoid conflicts.
6. **Docker on QNAP:** Volume mounts resolve on the NAS filesystem. Use `docker cp` if injecting files from the Mac.
7. **Container user is UID 1001.** If permission errors: `docker exec -u root <container> chown -R 1001:1001 /app/workspace`
```
The "Lessons Learned" section contains references to specific people and tools that are environment-specific (Tailscale MacBook→mini, Apple Notes, iCloud). While these may be valuable for the original deployment context, they should be clearly labeled as context-specific examples or generalized for broader use.
Suggested change:

```markdown
2. **Credential transfer:** Always use a secure, out-of-band channel for sharing tokens or secrets. Never send tokens in Discord or other chat. _Example (Todd/CHC deployment): Apple Notes via iCloud sharing worked well when a Tailscale MacBook→mini link was unavailable._
3. **Anthropic API key is shared** across all hatchlings — same subscription, one key.
4. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
5. **Port allocation:** Start at `18789`, increment per instance. Track in `fleet.json` to avoid conflicts.
6. **Docker on QNAP (example environment):** In the Todd/CHC deployment, volume mounts resolved on the NAS filesystem, and `docker cp` was used to inject files from a Mac. Behavior may differ on other hosts; adjust to your storage platform.
7. **Container user UID may vary by environment.** In the Todd/CHC deployment, the container ran as UID `1001`; if you hit permission errors there, you can run: `docker exec -u root <container> chown -R 1001:1001 /app/workspace`. In other environments, first inspect the container's user (`docker exec <container> id`) and adjust the UID/GID accordingly.
```
@copilot Two issues:

1. **SECURITY:** Remove all personal/infrastructure details. This is a PUBLIC repo. The following must be generic/placeholder, not real: the DEPLOYMENT-SOP.md and any hardcoded references in scripts must use placeholder values only.
2. **BUILD FAILURE:** E2E test failed — same openclaw-not-on-npm issue. Make sure the e2e test and any new CI steps handle this gracefully (continue-on-error or graceful exit like the existing e2e-test.sh pattern).

No separate validation report docs. Keep docs minimal.
Force-pushed from b5328e9 to 439edc7.
- Install taskwarrior in Dockerfile (apt-get)
- Add .taskrc template with PARA project conventions
- data.location → workspace/.task (persists across restarts)
- Priority scheme: H/M/L
- Custom reports: next (urgency-sorted), by_project
- Common project/tag vocabulary documented
- entrypoint.sh: wire workspace/.taskrc → ~/.taskrc at startup
- hatch.sh: copy .taskrc into new instance workspace
- AGENTS.md: document task CLI usage for all hatchlings

Every hatchling now has its own local Taskwarrior with task data persisting in the workspace volume.
- Bake weasyprint + cups-client into Dockerfile for PDF generation
- Add scripts/ COPY step so print-doc lands in hatchling workspace
- Hatchlings on NAS can now print to Brother HL-2140 via CUPS at 192.168.1.2:631
- print-doc converts markdown → styled PDF → lpr (clean or discord style)
- Supports --preview, --force, --json, --copies, --style flags
- Page-count gate: >2 pages requires --force (configurable via --pages-ok)
…for PDF tools only
- Copy kokoro CLI script into template/scripts/
- Set KOKORO_HOST env var pointing to Mac mini API (port 18795)
- Set default voice to af_heart
- Hatchlings can now: `kokoro say 'text' -o output.wav`
- Add mosquitto-clients to Dockerfile
- Add MQTT_HOST/HATCHLING_NAME env vars
- Add fleet-bus network (mosquitto_default) to docker-compose
- Add bus-publish and bus-heartbeat scripts to template
- Auto-set HATCHLING_NAME in hatch.sh scaffold
- Update .env.example with MQTT config
- Update DEPLOYMENT-SOP with fleet bus section
- bus-command-handler: subscribes to fleet/{name}/command + zephyr/broadcast
- Validates commands against fixed vocabulary (health-check, version-report, config-reload, status-report)
- entrypoint.sh starts handler as background process when MQTT_HOST + HATCHLING_NAME are set
- Tested: broadcast version-report → both hatchlings responded with OC versions
- Add cron to apt-get install in Dockerfile
- Add bus-command-handler-keepalive to template scripts
- Entrypoint sets up cron keepalive for command handler
- Ensures handler survives crashes without container restart
- Stable client ID (fleet-handler-{name}) + --disable-clean-session + QoS 1
- Broker queues commands while handler is offline, delivers on reconnect
- Verified: kill handler → send command → restart → queued delivery works
- Fixes fire-and-forget gap identified during AFGE restart incident (Mar 19)
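The persistent-session subscription described above can be sketched with standard `mosquitto_sub` flags; the broker host and topic names follow the PR's conventions but are assumptions here:

```shell
# Sketch of the command handler's subscription (flag values illustrative).
name="${HATCHLING_NAME:-todd}"           # "todd" is a placeholder
sub_cmd=(mosquitto_sub
  -h "${MQTT_HOST:-mosquitto}"           # broker via Docker DNS
  -i "fleet-handler-${name}"             # stable client ID so the broker keeps session state
  -c                                     # disable clean session: broker queues while offline
  -q 1                                   # QoS 1: at-least-once redelivery on reconnect
  -t "fleet/${name}/command"
  -t "zephyr/broadcast")
# The handler would then run: "${sub_cmd[@]}" | while read -r msg; do ... done
```

With `-c` and QoS 1 together, the broker holds messages published while the subscriber is down and delivers them on reconnect, which is exactly the fire-and-forget gap the commit describes.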
dahifi
left a comment
Code Review — PR #24: Fleet Message Bus (ADR-005)
Overall: Ship it. Solid work. Tini, persistent MQTT sessions, fixed command vocabulary, deployment SOP — all the right calls. A few nits below.
1. bus-publish default MQTT_HOST mismatch
template/scripts/bus-publish line 11:
```shell
MQTT_HOST="${MQTT_HOST:-192.168.1.2}"
```

`bus-command-handler` and `bus-heartbeat` both default to `mosquitto` (Docker DNS). The Dockerfile ENV also sets `MQTT_HOST=mosquitto`. This script is the outlier — it should be `mosquitto` for consistency, since hatchlings run on the fleet-bus network.
2. Cron keepalive may silently fail (permissions)
template/entrypoint.sh:
```shell
echo "*/5 * * * * openclaw ..." > /etc/cron.d/bus-handler 2>/dev/null || true
cron 2>/dev/null || true
```

The entrypoint runs as UID 1001 (`USER openclaw` in Dockerfile), but writing to `/etc/cron.d/` requires root. The `2>/dev/null || true` swallows the permission error — the keepalive cron never actually installs. Options:

- Move the cron setup before `USER openclaw` (run as root during build)
- Use `crontab -l | ... | crontab -` (user-level crontab)
- Accept it and rely on tini + the handler's own resilience (it's a `while read` loop that only dies on broker disconnect)
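The user-level crontab option might look like the following sketch; the handler path, log location, and function name are assumptions:

```shell
# Install a keepalive entry via `crontab -`, which works as UID 1001
# (writing to /etc/cron.d requires root). Paths are illustrative.
keepalive='*/5 * * * * pgrep -f bus-command-handler >/dev/null || /app/scripts/bus-command-handler >>/tmp/bus-handler.log 2>&1'
install_keepalive() {
  # Merge with any existing user crontab, de-duplicating our own line.
  { crontab -l 2>/dev/null | grep -vF 'bus-command-handler' || true
    echo "$keepalive"
  } | crontab -
}
```

Note this still depends on the cron daemon running inside the container, so the tini-plus-handler-resilience option may end up simpler in practice.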
3. stat -f%z in fleet.sh snapshot is macOS-only
```shell
file_size=$(stat -f%z "$tmpfile" 2>/dev/null || stat -c%s "$tmpfile" 2>/dev/null)
```

Since this runs on Linux, the macOS syntax will always fail first. Not a bug (the fallback works), but you could swap the order or just use `stat -c%s` since the target is Linux. The same pattern appears in kokoro and print-doc.
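A portable helper along those lines, trying the GNU form first since the target is Linux, with `wc -c` as a universal last resort (the function name is illustrative):

```shell
# Portable file size in bytes: GNU stat first (Linux target), then the
# BSD/macOS form, then wc -c as a fallback that works everywhere.
file_size() {
  stat -c%s "$1" 2>/dev/null \
    || stat -f%z "$1" 2>/dev/null \
    || wc -c < "$1" | tr -d '[:space:]'
}
```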
4. Scope note
PR includes Taskwarrior, kokoro TTS, print-doc, hledger, weasyprint, and pipx alongside the bus work. Branch is feat/control-plane — might be worth noting in the PR description that this is "control plane + accumulated template improvements" so the commit history makes sense later.
5. alpine → slim base image
Good call for apt package availability (mosquitto-clients, weasyprint, cron). Image size will roughly double — acceptable tradeoff for the tooling.
No blockers. All nits are minor.
- bus-publish: default MQTT_HOST to 'mosquitto' (Docker DNS) not IP address, consistent with bus-command-handler and bus-heartbeat
- entrypoint.sh: use 'crontab -' for keepalive cron instead of writing to /etc/cron.d/ which requires root; entrypoint runs as UID 1001

Addresses review feedback on PR #24.
dahifi
left a comment
Addressed in latest commit (767c408):
- **bus-publish MQTT_HOST** — changed default from `192.168.1.2` to `mosquitto` for Docker DNS consistency
- **Cron keepalive permissions** — switched from `/etc/cron.d/bus-handler` (requires root) to `crontab -` (user-level, works as UID 1001)
The stat -f%z order issue isn't present in this branch — the snapshot command uses tar directly without stat. That note may have been about a different file. If you're seeing it elsewhere, point me at the line.
PR #25 (Copilot's security scrub) will clean up the infra details separately.
Summary
Deploy Eclipse Mosquitto as a fleet message bus for hatchling coordination. Hub-and-spoke topology with Zephyr as fleet manager.
What's included
- Mosquitto broker (`bus.wade.internal`)
- `mosquitto-clients` baked into Dockerfile
- `bus-publish` / `bus-heartbeat` scripts in template
- `bus-command-handler` — hatchlings subscribe to commands and respond
- `.env.example` includes `MQTT_HOST` and `HATCHLING_NAME`
- `hatch.sh` auto-sets `HATCHLING_NAME` during scaffold

ADR-005: Key decisions
Verified
Not in this PR (future)