
feat: Fleet Message Bus (ADR-005) — MQTT coordination for hatchlings#24

Merged
dahifi merged 11 commits into main from feat/control-plane
Apr 4, 2026

Conversation

Owner

dahifi commented Feb 21, 2026

Summary

Deploy Eclipse Mosquitto as a fleet message bus for hatchling coordination. Hub-and-spoke topology with Zephyr as fleet manager.

What's included

  • Mosquitto broker deployed on NAS (port 1883, bus.wade.internal)
  • mosquitto-clients baked into Dockerfile
  • bus-publish / bus-heartbeat scripts in template
  • bus-command-handler — hatchlings subscribe to commands and respond
  • Docker networking — fleet-bus network (mosquitto_default) for DNS resolution
  • Entrypoint auto-starts command handler when MQTT_HOST is configured
  • .env.example includes MQTT_HOST and HATCHLING_NAME
  • hatch.sh auto-sets HATCHLING_NAME during scaffold
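
A minimal sketch of how the two variables might appear in `.env.example` (the `mosquitto` default matches the Docker DNS name used elsewhere in this PR; the hatchling name is a placeholder):

```shell
# MQTT fleet bus — leave MQTT_HOST unset to disable the command handler.
MQTT_HOST=mosquitto          # Docker DNS name of the broker on the fleet-bus network
HATCHLING_NAME=my-hatchling  # set automatically by hatch.sh during scaffold
```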

ADR-005: Key decisions

  • Partially supersedes ADR-004 (total isolation → data isolation with operational telemetry)
  • No personal data, vault contents, CRM, or session transcripts on the bus
  • Fixed command vocabulary: health-check, version-report, config-reload, status-report
  • Hub-and-spoke only — no lateral hatchling-to-hatchling traffic
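
The topic layout and fixed vocabulary can be sketched as a pair of shell helpers — a sketch only, with topic names taken from the `bus-command-handler` commit message (`fleet/{name}/command`, `zephyr/broadcast`) and an illustrative publish that assumes a reachable broker:

```shell
# Hub-and-spoke topics: the fleet manager publishes to fleet/{name}/command
# (one hatchling) or zephyr/broadcast (all hatchlings). Names are examples.
command_topic() { printf 'fleet/%s/command' "$1"; }
broadcast_topic() { printf 'zephyr/broadcast'; }

# Only commands in the fixed vocabulary are valid on the bus.
is_valid_command() {
  case "$1" in
    health-check|version-report|config-reload|status-report) return 0 ;;
    *) return 1 ;;
  esac
}

# Illustrative publish (requires a reachable broker):
#   mosquitto_pub -h "$MQTT_HOST" -t "$(command_topic rosie)" -m health-check
```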

Verified

  • Bidirectional pub/sub from both active hatchlings (Mason + Rosie)
  • Broadcast version-report → both responded with OC versions
  • Retained heartbeats with system metrics (uptime, memory, load, OC health)
  • Docker DNS resolution (mosquitto hostname) from containers

Not in this PR (future)

  • Per-hatchling ACL auth (Phase 2)
  • Prometheus MQTT exporter bridge
  • Bus traffic audit automation


Copilot AI left a comment


Pull request overview

This PR adds control plane functionality for managing a fleet of OpenClaw hatchling instances, including health monitoring, state snapshots, and deployment documentation. The changes appear to be derived from real deployment experience (Todd/CHC deployment in February 2026) and aim to make fleet management more robust.

Changes:

  • Added fleet monitoring command with health checks and JSON output support
  • Added snapshot command to capture container state from running instances
  • Added openclaw doctor --fix self-healing to container entrypoint
  • Switched Docker base image from node:22-alpine to node:22-slim for better compatibility
  • Added comprehensive deployment SOP documentation based on real-world deployment

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File Description
template/Dockerfile Base image switched from alpine to slim, added ARG for version pinning
template/entrypoint.sh Added self-healing via openclaw doctor --fix before startup
scripts/fleet.sh Added monitor and snapshot commands for fleet health and state management
docs/DEPLOYMENT-SOP.md New comprehensive deployment guide based on February 2026 deployment experience


Comment on lines +362 to +390
snapshot)
name="${2:?Usage: fleet.sh snapshot <instance-name>}"
container="hatchery-${name}"
timestamp=$(date -u +%Y%m%dT%H%M%SZ)
tmp_dir="/tmp/snapshot-${name}-${timestamp}"
snapshot_dir="$ROOT_DIR/snapshots"
snapshot_path="${snapshot_dir}/${name}-${timestamp}.tar.gz"

# Use NAS Docker env vars
export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"

echo "📸 Snapshotting $name ($container)..."

# Copy openclaw state from container
docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"
if [[ $? -ne 0 ]]; then
echo "Error: Failed to copy from container '$container'" >&2
exit 1
fi

# Tar it up
mkdir -p "$snapshot_dir"
tar -czf "$snapshot_path" -C "/tmp" "snapshot-${name}-${timestamp}"
rm -rf "$tmp_dir"

echo "✅ Snapshot saved: $snapshot_path"
;;

Copilot AI Feb 21, 2026


The new snapshot command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

scripts/fleet.sh Outdated
Comment on lines +370 to +373
# Use NAS Docker env vars
export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"

Copilot AI Feb 21, 2026


The hardcoded IP address "192.168.1.2" appears to be environment-specific and should not be committed. This value would fail for any deployment not using this exact NAS configuration. Consider using environment variables without defaults, or documenting that users must set DOCKER_HOST, DOCKER_TLS_VERIFY, and DOCKER_CERT_PATH before using the snapshot command.

Suggested change
# Use NAS Docker env vars
export DOCKER_HOST="${DOCKER_HOST:-tcp://192.168.1.2:2376}"
export DOCKER_TLS_VERIFY="${DOCKER_TLS_VERIFY:-1}"
export DOCKER_CERT_PATH="${DOCKER_CERT_PATH:-$HOME/.docker/nas}"
# Require Docker environment variables for snapshot operations
if [[ -z "${DOCKER_HOST:-}" || -z "${DOCKER_TLS_VERIFY:-}" || -z "${DOCKER_CERT_PATH:-}" ]]; then
echo "Error: DOCKER_HOST, DOCKER_TLS_VERIFY, and DOCKER_CERT_PATH must be set before using the snapshot command." >&2
exit 1
fi

echo "📸 Snapshotting $name ($container)..."

# Copy openclaw state from container
docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"

Copilot AI Feb 21, 2026


The snapshot command does not verify that the container exists before attempting to copy from it. Consider adding a check similar to the one in the logs command (lines 132-135 of fleet.sh) to provide a clearer error message if the container doesn't exist.
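
A sketch of such a guard, using the same error style as the surrounding script (`docker container inspect` exits non-zero when the container does not exist; the referenced lines 132-135 of fleet.sh are not shown here):

```shell
# Fail fast with a clear message if the target container doesn't exist,
# before attempting docker cp.
require_container() {
  local container="$1"
  if ! docker container inspect "$container" >/dev/null 2>&1; then
    echo "Error: container '$container' not found" >&2
    return 1
  fi
}
```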

Comment on lines +40 to +41
ssh zeph@revere
mkdir -p /share/Container/<hatchling-name>

Copilot AI Feb 21, 2026


This documentation snippet contains hardcoded environment-specific information (ssh zeph@revere, /share/Container/) that would not work in other deployments. It should either use placeholder values (such as user@host and a generic container path) or be clearly marked as example-specific.

Comment on lines +294 to +360
monitor)
# Monitor fleet health — similar to health but with pass/fail logic
FLEET_HOST="${FLEET_HOST:-localhost}"
JSON_OUTPUT=false
shift
while [[ $# -gt 0 ]]; do
case "$1" in
--json) JSON_OUTPUT=true; shift ;;
*) shift ;;
esac
done

if [[ ! -f "$FLEET_REGISTRY" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"error","message":"no fleet.json found"}'
else
echo "Error: no fleet.json found" >&2
fi
exit 1
fi

instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)
if [[ -z "$instances" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"ok","message":"no running instances","results":[]}'
else
echo "No running instances to monitor."
fi
exit 0
fi

all_healthy=true
json_results="[]"

while IFS= read -r name; do
port=$(jq -r ".instances[\"$name\"].port" "$FLEET_REGISTRY")
host=$(jq -r ".instances[\"$name\"].host // \"$FLEET_HOST\"" "$FLEET_REGISTRY")
url="http://${host}:${port}/health"

http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || echo "000")

if [[ "$http_code" == "200" ]]; then
healthy=true
else
healthy=false
all_healthy=false
fi

if [[ "$JSON_OUTPUT" == "true" ]]; then
json_results=$(echo "$json_results" | jq --arg n "$name" --arg h "$healthy" --arg c "$http_code" \
'. + [{"name":$n,"healthy":($h == "true"),"http_status":($c | tonumber)}]')
else
if [[ "$healthy" == "false" ]]; then
echo "$name UNHEALTHY (HTTP $http_code)"
fi
fi
done <<< "$instances"

if [[ "$JSON_OUTPUT" == "true" ]]; then
jq -n --argjson results "$json_results" --arg healthy "$all_healthy" \
'{"status":(if $healthy == "true" then "healthy" else "unhealthy" end),"results":$results}'
else
if [[ "$all_healthy" == "true" ]]; then
echo "All instances healthy"
fi
fi
;;

Copilot AI Feb 21, 2026


The new monitor command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

exit 1
fi

instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)

Copilot AI Feb 21, 2026


According to the PR description, "fleet.json schema update - Added missing fields: name, host, status, version to all instances" but there are no corresponding changes in the hatch.sh script or any other visible code that would add these fields to the fleet.json registry when instances are created or managed. This is a discrepancy between the PR description and the actual code changes.

FROM node:22-slim

-RUN apk add --no-cache git curl bash gettext
+RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates && rm -rf /var/lib/apt/lists/*

Copilot AI Feb 21, 2026


The package gettext (which provides envsubst) is missing from the installation command. The envsubst command is required by the entrypoint.sh script (line 89) to process the openclaw.template.json configuration file. Without it, the container will fail at runtime when trying to generate the configuration.

Suggested change
RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates && rm -rf /var/lib/apt/lists/*
RUN apt-get update && apt-get install -y --no-install-recommends git curl bash ca-certificates gettext && rm -rf /var/lib/apt/lists/*

Comment on lines +95 to +99
```
Port allocation:
18789 — Zephyr (primary)
18790 — Seraphel
18791+ — next hatchling
```

Copilot AI Feb 21, 2026


The hardcoded port numbers (18789, 18790, 18791) and instance names (Zephyr, Seraphel) are environment-specific and should be presented as examples rather than as prescriptive values. Consider adding a note that these are example values from a specific deployment.


## 4. Post-deployment

- [ ] **Create `#<client>` channel** in BCM server for operator visibility/monitoring

Copilot AI Feb 21, 2026


The reference to "BCM server" appears to be organization-specific and may not be clear to other users. Consider either removing this reference, explaining what BCM stands for, or making it clear this is an example from a specific deployment.

Suggested change
- [ ] **Create `#<client>` channel** in BCM server for operator visibility/monitoring
- [ ] **Create `#<client>` channel** in your internal operator/monitoring Discord server for visibility/monitoring

Comment on lines +198 to +203
2. **Credential transfer:** Apple Notes via iCloud sharing works well when Tailscale MacBook→mini is broken. Never send tokens in Discord.
3. **Anthropic API key is shared** across all hatchlings — same subscription, one key.
4. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
5. **Port allocation:** Start at `18789`, increment per instance. Track in `fleet.json` to avoid conflicts.
6. **Docker on QNAP:** Volume mounts resolve on the NAS filesystem. Use `docker cp` if injecting files from the Mac.
7. **Container user is UID 1001.** If permission errors: `docker exec -u root <container> chown -R 1001:1001 /app/workspace`

Copilot AI Feb 21, 2026


The "Lessons Learned" section contains references to specific people and tools that are environment-specific (Tailscale MacBook→mini, Apple Notes, iCloud). While these may be valuable for the original deployment context, they should be clearly labeled as context-specific examples or generalized for broader use.

Suggested change
2. **Credential transfer:** Apple Notes via iCloud sharing works well when Tailscale MacBook→mini is broken. Never send tokens in Discord.
3. **Anthropic API key is shared** across all hatchlings — same subscription, one key.
4. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
5. **Port allocation:** Start at `18789`, increment per instance. Track in `fleet.json` to avoid conflicts.
6. **Docker on QNAP:** Volume mounts resolve on the NAS filesystem. Use `docker cp` if injecting files from the Mac.
7. **Container user is UID 1001.** If permission errors: `docker exec -u root <container> chown -R 1001:1001 /app/workspace`
2. **Credential transfer:** Always use a secure, out‑of‑band channel for sharing tokens or secrets. Never send tokens in Discord or other chat. _Example (Todd/CHC deployment): Apple Notes via iCloud sharing worked well when a Tailscale MacBook→mini link was unavailable._
3. **Anthropic API key is shared** across all hatchlings — same subscription, one key.
4. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
5. **Port allocation:** Start at `18789`, increment per instance. Track in `fleet.json` to avoid conflicts.
6. **Docker on QNAP (example environment):** In the Todd/CHC deployment, volume mounts resolved on the NAS filesystem, and `docker cp` was used to inject files from a Mac. Behavior may differ on other hosts; adjust to your storage platform.
7. **Container user UID may vary by environment.** In the Todd/CHC deployment, the container ran as UID `1001`; if you hit permission errors there, you can run: `docker exec -u root <container> chown -R 1001:1001 /app/workspace`. In other environments, first inspect the container's user (`docker exec <container> id`) and adjust the UID/GID accordingly.

Owner Author

dahifi commented Feb 23, 2026

@copilot Two issues:

1. SECURITY: Remove all personal/infrastructure details. This is a PUBLIC repo. The following must be generic/placeholder, not real:

  • ssh zeph@revere → use ssh user@your-host
  • 192.168.1.2:2376 → use your-docker-host:2376
  • $HOME/.docker/nas → use $HOME/.docker/remote
  • References to "Seraphel", "Todd", "CHC", "Certainty Home Consulting" → use generic example names like "my-advisor", "my-instance"
  • "Apple Notes via iCloud" → remove entirely, just say "secure channel"
  • "QNAP", "NAS", "Tailscale", "MacBook→mini" → use generic terms like "remote Docker host"
  • Port 18790/Seraphel, port 18791 → use example ports

The DEPLOYMENT-SOP.md and any hardcoded references in scripts must use placeholder values only.

2. BUILD FAILURE: E2E test failed — same openclaw-not-on-npm issue. Make sure the e2e test and any new CI steps handle this gracefully (continue-on-error or graceful exit like the existing e2e-test.sh pattern).

No separate validation report docs. Keep docs minimal.

Contributor

Copilot AI commented Feb 23, 2026

@dahifi I've opened a new pull request, #25, to work on those changes. Once the pull request is ready, I'll request review from you.

dahifi force-pushed the feat/control-plane branch from b5328e9 to 439edc7 on February 23, 2026
dahifi added 7 commits March 5, 2026 20:07
- Install taskwarrior in Dockerfile (apt-get)
- Add .taskrc template with PARA project conventions
  - data.location → workspace/.task (persists across restarts)
  - Priority scheme: H/M/L
  - Custom reports: next (urgency-sorted), by_project
  - Common project/tag vocabulary documented
- entrypoint.sh: wire workspace/.taskrc → ~/.taskrc at startup
- hatch.sh: copy .taskrc into new instance workspace
- AGENTS.md: document task CLI usage for all hatchlings

Every hatchling now has its own local Taskwarrior with
task data persisting in the workspace volume.
- Bake weasyprint + cups-client into Dockerfile for PDF generation
- Add scripts/ COPY step so print-doc lands in hatchling workspace
- Hatchlings on NAS can now print to Brother HL-2140 via CUPS at 192.168.1.2:631
- print-doc converts markdown → styled PDF → lpr (clean or discord style)
- Supports --preview, --force, --json, --copies, --style flags
- Page-count gate: >2 pages requires --force (configurable via --pages-ok)
- Copy kokoro CLI script into template/scripts/
- Set KOKORO_HOST env var pointing to Mac mini API (port 18795)
- Set default voice to af_heart
- Hatchlings can now: kokoro say 'text' -o output.wav
- Add mosquitto-clients to Dockerfile
- Add MQTT_HOST/HATCHLING_NAME env vars
- Add fleet-bus network (mosquitto_default) to docker-compose
- Add bus-publish and bus-heartbeat scripts to template
- Auto-set HATCHLING_NAME in hatch.sh scaffold
- Update .env.example with MQTT config
- Update DEPLOYMENT-SOP with fleet bus section
- bus-command-handler: subscribes to fleet/{name}/command + zephyr/broadcast
- Validates commands against fixed vocabulary (health-check, version-report, config-reload, status-report)
- entrypoint.sh starts handler as background process when MQTT_HOST + HATCHLING_NAME are set
- Tested: broadcast version-report → both hatchlings responded with OC versions
dahifi changed the title from "feat: control plane for hatchling fleet management" to "feat: Fleet Message Bus (ADR-005) — MQTT coordination for hatchlings" on Mar 19, 2026
dahifi added 2 commits March 18, 2026 22:25
- Add cron to apt-get install in Dockerfile
- Add bus-command-handler-keepalive to template scripts
- Entrypoint sets up cron keepalive for command handler
- Ensures handler survives crashes without container restart
- Stable client ID (fleet-handler-{name}) + --disable-clean-session + QoS 1
- Broker queues commands while handler is offline, delivers on reconnect
- Verified: kill handler → send command → restart → queued delivery works
- Fixes fire-and-forget gap identified during AFGE restart incident (Mar 19)
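
The persistent-session setup described in these commits can be sketched as follows (the flags are real mosquitto_sub options; the handler loop itself is omitted and the invocation assumes a reachable broker):

```shell
# Stable client ID + persistent session: with QoS 1 and --disable-clean-session,
# the broker queues messages for this client ID while the handler is offline
# and delivers them on reconnect. ID scheme is from the commit message.
handler_client_id() { printf 'fleet-handler-%s' "$1"; }

# Illustrative subscriber invocation (requires a reachable broker):
#   mosquitto_sub -h "$MQTT_HOST" -q 1 \
#     -i "$(handler_client_id "$HATCHLING_NAME")" \
#     --disable-clean-session \
#     -t "fleet/${HATCHLING_NAME}/command"
```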
Owner Author

dahifi left a comment


Code Review — PR #24: Fleet Message Bus (ADR-005)

Overall: Ship it. Solid work. Tini, persistent MQTT sessions, fixed command vocabulary, deployment SOP — all the right calls. A few nits below.


1. bus-publish default MQTT_HOST mismatch

template/scripts/bus-publish line 11:

MQTT_HOST="${MQTT_HOST:-192.168.1.2}"

bus-command-handler and bus-heartbeat both default to mosquitto (Docker DNS). The Dockerfile ENV also sets MQTT_HOST=mosquitto. This script is the outlier — should be mosquitto for consistency since hatchlings run on the fleet-bus network.

2. Cron keepalive may silently fail (permissions)

template/entrypoint.sh:

echo "*/5 * * * * openclaw ..." > /etc/cron.d/bus-handler 2>/dev/null || true
cron 2>/dev/null || true

The entrypoint runs as UID 1001 (USER openclaw in Dockerfile), but writing to /etc/cron.d/ requires root. The 2>/dev/null || true swallows the permission error — the keepalive cron never actually installs. Options:

  • Move the cron setup before USER openclaw (run as root during build)
  • Use crontab -l | ... | crontab - (user-level crontab)
  • Accept it and rely on tini + the handler's own resilience (it's a while read loop that only dies on broker disconnect)
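
The user-level crontab option could look like this sketch (the keepalive script path is hypothetical; the helper only computes the merged crontab text, so the actual install stays a one-liner):

```shell
# Emit the existing crontab (read from stdin) with the keepalive entry
# appended exactly once. The script path is a placeholder.
merged_crontab() {
  local entry='*/5 * * * * /app/scripts/bus-command-handler-keepalive'
  grep -vF "$entry" || true   # drop any stale copy of the entry
  echo "$entry"
}

# On the container (works as UID 1001 — no /etc/cron.d write needed):
#   crontab -l 2>/dev/null | merged_crontab | crontab -
```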

3. stat -f%z in fleet.sh snapshot is macOS-only

file_size=$(stat -f%z "$tmpfile" 2>/dev/null || stat -c%s "$tmpfile" 2>/dev/null)

Since this runs on Linux, the macOS syntax will always fail first. Not a bug (fallback works), but you could swap the order or just use stat -c%s since the target is Linux. Same pattern appears in kokoro and print-doc.
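
Swapping the order as suggested might look like this (a sketch: `stat -c%s` is the GNU/Linux form, `stat -f%z` the BSD/macOS fallback):

```shell
# Linux-first file size: try GNU stat, fall back to BSD/macOS syntax
# for local runs on a Mac.
file_size() {
  stat -c%s "$1" 2>/dev/null || stat -f%z "$1" 2>/dev/null
}
```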

4. Scope note

PR includes Taskwarrior, kokoro TTS, print-doc, hledger, weasyprint, and pipx alongside the bus work. Branch is feat/control-plane — might be worth noting in the PR description that this is "control plane + accumulated template improvements" so the commit history makes sense later.

5. alpine → slim base image

Good call for apt package availability (mosquitto-clients, weasyprint, cron). Image size will roughly double — acceptable tradeoff for the tooling.


No blockers. All nits are minor.

- bus-publish: default MQTT_HOST to 'mosquitto' (Docker DNS) not IP
  address, consistent with bus-command-handler and bus-heartbeat
- entrypoint.sh: use 'crontab -' for keepalive cron instead of writing
  to /etc/cron.d/ which requires root; entrypoint runs as UID 1001

Addresses review feedback on PR #24.
Owner Author

dahifi left a comment


Addressed in latest commit (767c408):

  1. bus-publish MQTT_HOST — changed default from 192.168.1.2 to mosquitto for Docker DNS consistency
  2. Cron keepalive permissions — switched from /etc/cron.d/bus-handler (requires root) to crontab - (user-level, works as UID 1001)

The stat -f%z order issue isn't present in this branch — the snapshot command uses tar directly without stat. That note may have been about a different file. If you're seeing it elsewhere, point me at the line.

PR #25 (Copilot's security scrub) will clean up the infra details separately.

dahifi merged commit 2d459d2 into main on Apr 4, 2026 — 5 checks passed
dahifi deleted the feat/control-plane branch April 4, 2026 00:42
