feat: Fleet Message Bus (ADR-005) — MQTT coordination for hatchlings #24
Changes from all commits
439edc7
dbcd23a
51dcbf2
7d1b020
e33180c
83a8435
17e4a9e
bbd930d
b7536e6
d20cb0e
767c408
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -17,3 +17,4 @@ fleet.json | |
| .DS_Store | ||
| instances/ | ||
| archives/ | ||
| .idea/ | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,259 @@ | ||
| # Hatchery Deployment SOP | ||
|
|
||
| Standard Operating Procedure for deploying a new OpenClaw hatchling instance. | ||
|
|
||
| --- | ||
|
|
||
| ## 1. Pre-deployment — Client Side | ||
|
|
||
| The client completes these steps. Send them this section. | ||
|
|
||
| - [ ] **Create a Discord server** for their business | ||
| - [ ] **Create a Discord Application** at https://discord.com/developers/applications | ||
| - Click "New Application", name it after the bot | ||
| - [ ] **Create a Bot** under the application | ||
| - Go to Bot settings → click "Add Bot" | ||
| - [ ] **Enable MESSAGE CONTENT INTENT** | ||
| - Bot settings → Privileged Gateway Intents → toggle **Message Content Intent** ON | ||
| - ⚠️ **This is the #1 deployment failure.** Without it, the bot gets error `4014` and cannot read messages. | ||
| - [ ] **Copy the bot token** — Bot settings → "Reset Token" → copy | ||
| - [ ] **Note the Client ID** — OAuth2 → General → "Client ID" | ||
| - [ ] **Invite the bot to their server** | ||
| - OAuth2 → URL Generator | ||
| - Scopes: `bot`, `applications.commands` | ||
| - Permissions: Send Messages, Read Message History, Embed Links, Attach Files, Add Reactions, Use Slash Commands | ||
| - Copy generated URL, open in browser, select server | ||
| - [ ] **Send credentials securely** | ||
| - ✅ Secure messaging (Signal, iMessage, or shared notes) | ||
| - ❌ **NEVER** plaintext in Discord channels | ||
| - Send: bot token, client ID, server ID, target channel ID(s) | ||
|
|
||
| --- | ||
|
|
||
| ## 2. Pre-deployment — Operator Side | ||
|
|
||
| ### 2.1 Prepare the container directory | ||
|
|
||
| ```bash | ||
| ssh user@your-docker-host | ||
| mkdir -p /path/to/instances/<hatchling-name> | ||
| ``` | ||
|
|
||
| ### 2.2 Copy template files | ||
|
|
||
| Use `hatch.sh` (recommended): | ||
| ```bash | ||
| ./scripts/hatch.sh <hatchling-name> [--port PORT] | ||
| ``` | ||
|
|
||
| Or manually copy from the template directory. | ||
|
|
||
| ### 2.3 Seed workspace files | ||
|
|
||
| Ensure these exist in the workspace directory: | ||
| - `SOUL.md` — personality, role, tone | ||
| - `USER.md` — client profile | ||
| - `AGENTS.md` — session behavior | ||
| - `IDENTITY.md` — name, purpose | ||
| - `HEARTBEAT.md` — periodic check instructions | ||
| - `MEMORY.md` — initial context | ||
| - `TOOLS.md` — available tools/integrations | ||
|
|
||
| Customize each for the client. At minimum, edit `SOUL.md`, `USER.md`, and `IDENTITY.md`. | ||
|
|
||
| ### 2.4 Create `.env` | ||
|
|
||
| ```bash | ||
| cp .env.example .env | ||
| # Edit .env and add: | ||
| # - LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY) | ||
| # - DISCORD_BOT_TOKEN | ||
| # - DISCORD_CLIENT_ID | ||
| ``` | ||
|
|
||
| - LLM API keys can be shared across hatchlings (same subscription). | ||
| - Each hatchling gets its own Discord bot token and client ID. | ||
|
|
||
| ### 2.5 Configure `openclaw.template.json` | ||
|
|
||
| Edit the template to set: | ||
| - `model` — e.g., `anthropic/claude-sonnet-4-20250514` | ||
| - `agent.name` — the hatchling's name | ||
| - Discord channel bindings — map channel IDs to behaviors | ||
| - Tool permissions / allowlists | ||
|
|
||
| ### 2.6 Verify port assignment | ||
|
|
||
| `hatch.sh` auto-assigns ports from `fleet.json`. To specify manually: | ||
|
|
||
| ```bash | ||
| ./scripts/hatch.sh <name> --port <port> | ||
| ``` | ||
|
|
||
| Check existing allocations: `cat fleet.json` | ||
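The allocation logic inside `hatch.sh` is not shown in this diff. As a rough sketch of the idea, assuming `fleet.json` stores one `port` per entry under `.instances` (as the `monitor` command's `jq` queries suggest), the hypothetical helper `next_free_port` below picks the first unused port at or above a starting value. The function name and the starting port are illustrative assumptions, not part of the repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of hatch.sh): print the first port >= $1
# that does not appear in the newline-separated port list read from stdin.
next_free_port() {
  local candidate="$1"
  local used=" " line
  while IFS= read -r line; do
    # Keep numeric entries only; skips blanks and jq's "null".
    [[ "$line" =~ ^[0-9]+$ ]] && used+="$line "
  done
  # Walk upward until the candidate is not in the used list.
  while [[ "$used" == *" $candidate "* ]]; do
    candidate=$((candidate + 1))
  done
  echo "$candidate"
}
```

It could be fed from the registry with something like `jq -r '.instances[].port' fleet.json | next_free_port 18789`, where `18789` is only an example starting point. Reading the port list from stdin keeps the helper independent of the registry format.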
|
|
||
| --- | ||
|
|
||
| ## 3. Deployment | ||
|
|
||
| ```bash | ||
| cd instances/<hatchling-name> | ||
|
|
||
| # Build and start | ||
| docker compose up -d --build | ||
|
|
||
| # Wait for startup | ||
| sleep 30 | ||
|
|
||
| # Health check | ||
| curl http://localhost:<port>/health | ||
| ``` | ||
|
|
||
| - [ ] Health endpoint returns OK | ||
| - [ ] Bot appears online in Discord (green dot) | ||
| - [ ] Send a test message in the bound channel | ||
| - [ ] Bot responds correctly | ||
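A fixed `sleep 30` either waits too long or not long enough, depending on the host. One possible alternative is to poll the health endpoint until it answers. The helper below is a minimal sketch; `wait_until` is a hypothetical name, and the attempt count and delay in the usage note are arbitrary.

```bash
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # No need to sleep after the final failed attempt.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `wait_until 15 2 curl -fsS "http://localhost:<port>/health"` would try for up to about 30 seconds and return as soon as the endpoint responds with a success status.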
|
|
||
| ### Troubleshooting: Error 4014 | ||
|
|
||
| ``` | ||
| Error: Used disallowed intents | ||
| ``` | ||
|
|
||
| **Fix:** | ||
| 1. Go to https://discord.com/developers/applications | ||
| 2. Select the application → Bot → Privileged Gateway Intents | ||
| 3. Enable **Message Content Intent** | ||
| 4. Restart the container: `docker compose restart` | ||
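To confirm that a failed start is this specific problem, the container logs can be searched for the error text quoted above. This is a small sketch; `has_intent_error` is a hypothetical helper, and it assumes the log line contains the exact phrase `Used disallowed intents`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the log text on stdin contains the
# disallowed-intents error message.
has_intent_error() {
  grep -q "Used disallowed intents"
}
```

A possible use from the instance directory: `docker compose logs --no-color | has_intent_error && echo "Enable Message Content Intent"`.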
|
|
||
| --- | ||
|
|
||
| ## 4. Post-deployment | ||
|
|
||
| - [ ] **Add to `fleet.json`** — happens automatically with `hatch.sh` | ||
| - [ ] **Update monitoring** — `fleet.sh health` to verify endpoint | ||
| - [ ] **Schedule follow-up checks:** | ||
| - 24 hours — confirm stable operation, check logs | ||
| - 1 week — review usage, adjust personality/config if needed | ||
|
|
||
| ### 4.1 HTTPS + Dashy access pattern (REPEATABLE) | ||
|
|
||
| Use this for each new hatchling (AFGE, CHC, Ali, etc.). | ||
|
|
||
| 1) **Add Caddy route** | ||
| ```caddy | ||
| <slug>.wade.internal { | ||
| reverse_proxy <container-name>:18789 | ||
| } | ||
| ``` | ||
| Reload Caddy after editing: | ||
| ```bash | ||
| docker exec caddy caddy reload --config /etc/caddy/Caddyfile | ||
| ``` | ||
|
|
||
| 2) **Ensure container can be resolved by Caddy** | ||
| - Container must be on `wade_internal` network (same as Caddy) | ||
| ```bash | ||
| docker network connect wade_internal <container-name> # errors if already connected; that error is safe to ignore | ||
| ``` | ||
|
|
||
| 3) **Set proxy-safe Control UI config in hatchling** | ||
| In hatchling `openclaw.json`: | ||
| ```json | ||
| "gateway": { | ||
| "trustedProxies": ["172.29.0.0/16"], | ||
| "controlUi": { | ||
| "allowedOrigins": ["https://<slug>.wade.internal"] | ||
| } | ||
| } | ||
| ``` | ||
| Restart hatchling after config change. | ||
|
|
||
| 4) **Use gateway token in Dashy URL (FRAGMENT, not query)** | ||
| - Read token from hatchling `openclaw.json` (`gateway.auth.token`) | ||
| - Set Dashy tile URL to: | ||
| ```text | ||
| https://<slug>.wade.internal/#token=<gateway-token> | ||
| ``` | ||
| - ⚠️ Do **not** use `?token=`. The Control UI bootstrap expects the token in the URL fragment (`#token=`). | ||
| - Restart Dashy to force refresh if UI cache is stale. | ||
|
|
||
| 5) **Approve device pairing once from operator browser** | ||
| When first opening from a new browser/device, OpenClaw may show `pairing required`. | ||
| Approve inside the hatchling: | ||
| ```bash | ||
| docker exec <container-name> openclaw devices list | ||
| docker exec <container-name> openclaw devices approve <requestId> | ||
| ``` | ||
|
|
||
| 6) **Validation checks** | ||
| - `https://<slug>.wade.internal/health` returns JSON `{"ok":true,"status":"live"}` | ||
| - Dashy tile opens gateway directly | ||
| - No `502` from Caddy | ||
| - No websocket `pairing required` loop after approval | ||
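The fragment-versus-query rule from step 4 is easy to get wrong when tiles are added by hand. The sketch below builds the tile URL and checks an existing URL for the mistake. Both function names are hypothetical, and the sketch assumes the token needs no URL encoding.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the Dashy tile URL with the token in the fragment.
dashy_tile_url() {
  local slug="$1" token="$2"
  printf 'https://%s.wade.internal/#token=%s\n' "$slug" "$token"
}

# Hypothetical helper: succeed only if the URL carries the token as a
# fragment (#token=) and not as a query parameter (?token=).
uses_fragment_token() {
  local url="$1"
  [[ "$url" == *"#token="* && "$url" != *"?token="* ]]
}
```

For instance, `dashy_tile_url afge "$TOKEN"` prints a URL of the form shown in step 4, and `uses_fragment_token "$url" || echo "fix the tile URL"` flags a tile that was written with `?token=`.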
|
|
||
| --- | ||
|
|
||
| ## 5. Maintenance | ||
|
|
||
| ### OpenClaw (OC) Updates | ||
| ```bash | ||
| cd instances/<hatchling-name> | ||
| docker compose pull | ||
| docker compose up -d --force-recreate | ||
| ``` | ||
|
|
||
| ### Config Changes | ||
| Edit workspace files (`SOUL.md`, `TOOLS.md`, etc.) then restart: | ||
| ```bash | ||
| docker compose restart | ||
| ``` | ||
|
|
||
| ### Monitoring | ||
| ```bash | ||
| ./scripts/fleet.sh status # all instances | ||
| ./scripts/fleet.sh health # endpoint health | ||
| ./scripts/fleet.sh logs <name> # container logs | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Lessons Learned | ||
|
|
||
| 1. **Message Content Intent is the #1 gotcha.** Every deployment hits this if forgotten. Emphasize it in client instructions. | ||
| 2. **Never send credentials via Discord.** Use secure messaging or shared notes. | ||
| 3. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server. | ||
| 4. **Port allocation is tracked in `fleet.json`** — use `hatch.sh` to avoid conflicts. | ||
| 5. **Container user is UID 1001.** If you hit permission errors, run: `docker exec -u root <container> chown -R 1001:1001 /path` | ||
|
|
||
| ## Config Change Workflow (IMPORTANT — do not skip) | ||
|
|
||
| **Always** edit the hatchery config on the host. Never change it inside the container with `docker exec` or `docker cp`: | ||
|
|
||
| ```bash | ||
| # 1. Edit the config | ||
| nano ~/projects/oc-hatchery/instances/<name>/data/openclaw.json | ||
|
|
||
| # 2. Deploy via fleet (NOT docker exec/cp) | ||
| ./scripts/fleet.sh update <name> | ||
| ``` | ||
|
|
||
| **Why this matters:** Direct container edits are wiped on any `fleet.sh update` or container recreate. The hatchery `instances/<name>/data/openclaw.json` is the source of truth. | ||
|
|
||
| **Note on gitignore:** `instances/` is gitignored (contains tokens). The config file is NOT version-controlled — it lives only on the mini. Back it up manually if making major changes. | ||
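Since the config is not version-controlled, a timestamped copy before each major edit is a cheap safeguard. The following is a minimal sketch of such a manual backup; `backup_config` and the `.bak-<timestamp>` naming are assumptions for illustration, not an existing `fleet.sh` feature.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a config file to a timestamped sibling file
# and print the path of the backup.
backup_config() {
  local src="$1"
  if [[ ! -f "$src" ]]; then
    echo "backup_config: no such file: $src" >&2
    return 1
  fi
  local dest
  dest="${src}.bak-$(date +%Y%m%d-%H%M%S)"
  # -p preserves mode and timestamps of the original.
  cp -p "$src" "$dest" && echo "$dest"
}
```

Calling `backup_config instances/<name>/data/openclaw.json` before editing leaves the copy next to the original, so it stays inside the gitignored `instances/` tree along with the tokens it contains.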
|
|
||
| ## Channel Allowlist Schema | ||
|
|
||
| With `groupPolicy: "allowlist"`, a guild entry with no `channels` key means Mason responds in every channel in that guild. To restrict it to specific channels, add a `channels` key: | ||
|
|
||
| ```json | ||
| "guilds": { | ||
| "<guild_id>": { | ||
| "slug": "my-server", | ||
| "requireMention": false, | ||
| "channels": { | ||
| "<channel_id>": { "allow": true, "requireMention": false } | ||
| } | ||
| } | ||
| } | ||
| ``` |
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -291,8 +291,106 @@ case "$CMD" in | |||||||||||||||||
| echo "✓ Instance '$name' destroyed." | ||||||||||||||||||
| ;; | ||||||||||||||||||
|
|
||||||||||||||||||
| monitor) | ||||||||||||||||||
| # Monitor fleet health — similar to health but with pass/fail logic | ||||||||||||||||||
| FLEET_HOST="${FLEET_HOST:-localhost}" | ||||||||||||||||||
| JSON_OUTPUT=false | ||||||||||||||||||
| shift | ||||||||||||||||||
| while [[ $# -gt 0 ]]; do | ||||||||||||||||||
| case "$1" in | ||||||||||||||||||
| --json) JSON_OUTPUT=true; shift ;; | ||||||||||||||||||
| *) shift ;; | ||||||||||||||||||
| esac | ||||||||||||||||||
| done | ||||||||||||||||||
|
|
||||||||||||||||||
| if [[ ! -f "$FLEET_REGISTRY" ]]; then | ||||||||||||||||||
| if [[ "$JSON_OUTPUT" == "true" ]]; then | ||||||||||||||||||
| echo '{"status":"error","message":"no fleet.json found"}' | ||||||||||||||||||
| else | ||||||||||||||||||
| echo "Error: no fleet.json found" >&2 | ||||||||||||||||||
| fi | ||||||||||||||||||
| exit 1 | ||||||||||||||||||
| fi | ||||||||||||||||||
|
|
||||||||||||||||||
| instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true) | ||||||||||||||||||
| if [[ -z "$instances" ]]; then | ||||||||||||||||||
| if [[ "$JSON_OUTPUT" == "true" ]]; then | ||||||||||||||||||
| echo '{"status":"ok","message":"no running instances","results":[]}' | ||||||||||||||||||
| else | ||||||||||||||||||
| echo "No running instances to monitor." | ||||||||||||||||||
| fi | ||||||||||||||||||
| exit 0 | ||||||||||||||||||
| fi | ||||||||||||||||||
|
|
||||||||||||||||||
| all_healthy=true | ||||||||||||||||||
| json_results="[]" | ||||||||||||||||||
|
|
||||||||||||||||||
| while IFS= read -r name; do | ||||||||||||||||||
| port=$(jq -r ".instances[\"$name\"].port" "$FLEET_REGISTRY") | ||||||||||||||||||
| host=$(jq -r ".instances[\"$name\"].host // \"$FLEET_HOST\"" "$FLEET_REGISTRY") | ||||||||||||||||||
| url="http://${host}:${port}/health" | ||||||||||||||||||
|
|
||||||||||||||||||
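| # curl already prints 000 via -w when the request fails, so do not echo a second 000 | ||||||||||||||||||
| http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || true) | ||||||||||||||||||
| http_code="${http_code:-000}" | ||||||||||||||||||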
|
|
||||||||||||||||||
| if [[ "$http_code" == "200" ]]; then | ||||||||||||||||||
| healthy=true | ||||||||||||||||||
| else | ||||||||||||||||||
| healthy=false | ||||||||||||||||||
| all_healthy=false | ||||||||||||||||||
| fi | ||||||||||||||||||
|
|
||||||||||||||||||
| if [[ "$JSON_OUTPUT" == "true" ]]; then | ||||||||||||||||||
| json_results=$(echo "$json_results" | jq --arg n "$name" --arg h "$healthy" --arg c "$http_code" \ | ||||||||||||||||||
| '. + [{"name":$n,"healthy":($h == "true"),"http_status":($c | tonumber)}]') | ||||||||||||||||||
| else | ||||||||||||||||||
| if [[ "$healthy" == "false" ]]; then | ||||||||||||||||||
| echo "$name UNHEALTHY (HTTP $http_code)" | ||||||||||||||||||
| fi | ||||||||||||||||||
| fi | ||||||||||||||||||
| done <<< "$instances" | ||||||||||||||||||
|
|
||||||||||||||||||
| if [[ "$JSON_OUTPUT" == "true" ]]; then | ||||||||||||||||||
| jq -n --argjson results "$json_results" --arg healthy "$all_healthy" \ | ||||||||||||||||||
| '{"status":(if $healthy == "true" then "healthy" else "unhealthy" end),"results":$results}' | ||||||||||||||||||
| else | ||||||||||||||||||
| if [[ "$all_healthy" == "true" ]]; then | ||||||||||||||||||
| echo "All instances healthy" | ||||||||||||||||||
| fi | ||||||||||||||||||
| fi | ||||||||||||||||||
|
||||||||||||||||||
| if [[ "$all_healthy" == "true" ]]; then | ||||||||||||||||||
| exit 0 | ||||||||||||||||||
| else | ||||||||||||||||||
| exit 1 | ||||||||||||||||||
| fi | ||||||||||||||||||
Copilot
AI
Feb 21, 2026
The new monitor command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.
Copilot
AI
Feb 21, 2026
The snapshot command does not verify that the container exists before attempting to copy from it. Consider adding a check similar to the one in the logs command (lines 132-135 of fleet.sh) to provide a clearer error message if the container doesn't exist.
Copilot
AI
Feb 21, 2026
The new snapshot command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.
According to the PR description, "fleet.json schema update - Added missing fields: name, host, status, version to all instances" but there are no corresponding changes in the hatch.sh script or any other visible code that would add these fields to the fleet.json registry when instances are created or managed. This is a discrepancy between the PR description and the actual code changes.