1 change: 1 addition & 0 deletions .gitignore
@@ -17,3 +17,4 @@ fleet.json
.DS_Store
instances/
archives/
.idea/
259 changes: 259 additions & 0 deletions docs/DEPLOYMENT-SOP.md
@@ -0,0 +1,259 @@
# Hatchery Deployment SOP

Standard Operating Procedure for deploying a new OpenClaw hatchling instance.

---

## 1. Pre-deployment — Client Side

The client completes these steps. Send them this section.

- [ ] **Create a Discord server** for their business
- [ ] **Create a Discord Application** at https://discord.com/developers/applications
- Click "New Application", name it after the bot
- [ ] **Create a Bot** under the application
- Go to Bot settings → click "Add Bot"
- [ ] **Enable MESSAGE CONTENT INTENT**
- Bot settings → Privileged Gateway Intents → toggle **Message Content Intent** ON
- ⚠️ **This is the #1 deployment failure.** Without it, the bot gets error `4014` and cannot read messages.
- [ ] **Copy the bot token** — Bot settings → "Reset Token" → copy
- [ ] **Note the Client ID** — OAuth2 → General → "Client ID"
- [ ] **Invite the bot to their server**
- OAuth2 → URL Generator
- Scopes: `bot`, `applications.commands`
- Permissions: Send Messages, Read Message History, Embed Links, Attach Files, Add Reactions, Use Slash Commands
- Copy generated URL, open in browser, select server
- [ ] **Send credentials securely**
- ✅ Secure messaging (Signal, iMessage, or shared notes)
- ❌ **NEVER** plaintext in Discord channels
- Send: bot token, client ID, server ID, target channel ID(s)
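
The URL Generator output follows a fixed pattern; a sketch (the `permissions` value is an integer bitmask that the generator computes for the boxes ticked, so copy it from the generator rather than hand-computing it):

```text
https://discord.com/api/oauth2/authorize?client_id=<CLIENT_ID>&permissions=<PERMISSIONS_INT>&scope=bot%20applications.commands
```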

---

## 2. Pre-deployment — Operator Side

### 2.1 Prepare the container directory

```bash
ssh user@your-docker-host
mkdir -p /path/to/instances/<hatchling-name>
```

### 2.2 Copy template files

Use `hatch.sh` (recommended):
```bash
./scripts/hatch.sh <hatchling-name> [--port PORT]
```

Or manually copy from the template directory.

### 2.3 Seed workspace files

Ensure these exist in the workspace directory:
- `SOUL.md` — personality, role, tone
- `USER.md` — client profile
- `AGENTS.md` — session behavior
- `IDENTITY.md` — name, purpose
- `HEARTBEAT.md` — periodic check instructions
- `MEMORY.md` — initial context
- `TOOLS.md` — available tools/integrations

Customize each for the client. At minimum, edit `SOUL.md`, `USER.md`, and `IDENTITY.md`.

### 2.4 Create `.env`

```bash
cp .env.example .env
# Edit .env and add:
# - LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY)
# - DISCORD_BOT_TOKEN
# - DISCORD_CLIENT_ID
```

- LLM API keys can be shared across hatchlings (same subscription).
- Each hatchling gets its own Discord bot token and client ID.
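
A minimal `.env` sketch (values are placeholders, not real credentials):

```bash
# Shared across the fleet (same LLM subscription)
ANTHROPIC_API_KEY=sk-ant-xxxxxxxx
# Unique per hatchling
DISCORD_BOT_TOKEN=xxxxxxxx
DISCORD_CLIENT_ID=000000000000000000
```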

### 2.5 Configure `openclaw.template.json`

Edit the template to set:
- `model` — e.g., `anthropic/claude-sonnet-4-20250514`
- `agent.name` — the hatchling's name
- Discord channel bindings — map channel IDs to behaviors
- Tool permissions / allowlists
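
Putting those fields together, the template might look roughly like this. This is a sketch only; key names beyond `model`, `agent.name`, and the `guilds`/`gateway` blocks shown elsewhere in this SOP are assumptions, so check the actual template for the exact schema:

```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "agent": { "name": "<hatchling-name>" },
  "guilds": {
    "<guild_id>": {
      "channels": { "<channel_id>": { "allow": true, "requireMention": false } }
    }
  }
}
```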

### 2.6 Verify port assignment

`hatch.sh` auto-assigns ports from `fleet.json`. To specify manually:

```bash
./scripts/hatch.sh <name> --port <port>
```

Check existing allocations: `cat fleet.json`
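
To see which ports are taken at a glance, a jq one-liner over the registry (assuming the `instances`/`port` shape that `fleet.sh` itself reads):

```bash
# Name and port of every registered instance, sorted by port;
# silent if fleet.json is missing
jq -r '.instances | to_entries[] | "\(.key)\t\(.value.port)"' fleet.json 2>/dev/null | sort -k2 -n || true
```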

---

## 3. Deployment

```bash
cd instances/<hatchling-name>

# Build and start
docker compose up -d --build

# Wait for startup
sleep 30

# Health check
curl http://localhost:<port>/health
```

- [ ] Health endpoint returns OK
- [ ] Bot appears online in Discord (green dot)
- [ ] Send a test message in the bound channel
- [ ] Bot responds correctly
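
The fixed `sleep 30` works, but polling is less fragile on slow builds. A sketch of a retry loop you could use instead:

```bash
# Poll the health endpoint until it answers, up to `tries` seconds
wait_for_health() {
  local url="$1" tries="${2:-60}"
  for ((i = 0; i < tries; i++)); do
    curl -fsS -o /dev/null --connect-timeout 2 "$url" && return 0
    sleep 1
  done
  echo "health check timed out: $url" >&2
  return 1
}
```

Then `wait_for_health "http://localhost:<port>/health"` replaces the `sleep 30` + `curl` pair above.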

### Troubleshooting: Error 4014

```
Error: Used disallowed intents
```

**Fix:**
1. Go to https://discord.com/developers/applications
2. Select the application → Bot → Privileged Gateway Intents
3. Enable **Message Content Intent**
4. Restart the container: `docker compose restart`

---

## 4. Post-deployment

- [ ] **Add to `fleet.json`** — happens automatically with `hatch.sh`
- [ ] **Update monitoring** — `fleet.sh health` to verify endpoint
- [ ] **Schedule follow-up checks:**
- 24 hours — confirm stable operation, check logs
- 1 week — review usage, adjust personality/config if needed

### 4.1 HTTPS + Dashy access pattern (REPEATABLE)

Use this for each new hatchling (AFGE, CHC, Ali, etc.).

1) **Add Caddy route**
```caddy
<slug>.wade.internal {
    reverse_proxy <container-name>:18789
}
```
Reload Caddy after editing:
```bash
docker exec caddy caddy reload --config /etc/caddy/Caddyfile
```

2) **Ensure container can be resolved by Caddy**
- Container must be on `wade_internal` network (same as Caddy)
```bash
docker network connect wade_internal <container-name> 2>/dev/null || true  # safe to re-run; errors only if already connected
```

3) **Set proxy-safe Control UI config in hatchling**
In hatchling `openclaw.json`:
```json
"gateway": {
  "trustedProxies": ["172.29.0.0/16"],
  "controlUi": {
    "allowedOrigins": ["https://<slug>.wade.internal"]
  }
}
```
Restart hatchling after config change.

4) **Use gateway token in Dashy URL (FRAGMENT, not query)**
- Read token from hatchling `openclaw.json` (`gateway.auth.token`)
- Set Dashy tile URL to:
```text
https://<slug>.wade.internal/#token=<gateway-token>
```
- ⚠️ Do **not** use `?token=`. Control UI bootstrap expects fragment token (`#token=`).
- Restart Dashy to force refresh if UI cache is stale.
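
Steps 3 and 4 can be scripted; a sketch (paths and slugs are placeholders, `.gateway.auth.token` is the key named above):

```bash
# Read the gateway token from the hatchling config and print the Dashy
# tile URL with the token in the URL fragment (never the query string)
cfg="instances/<name>/data/openclaw.json"   # hatchery path, placeholder
token=$(jq -r '.gateway.auth.token // empty' "$cfg" 2>/dev/null || true)
echo "https://<slug>.wade.internal/#token=${token}"
```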

5) **Approve device pairing once from operator browser**
When first opening from a new browser/device, OpenClaw may show `pairing required`.
Approve inside the hatchling:
```bash
docker exec <container-name> openclaw devices list
docker exec <container-name> openclaw devices approve <requestId>
```

6) **Validation checks**
- `https://<slug>.wade.internal/health` returns JSON `{"ok":true,"status":"live"}`
- Dashy tile opens gateway directly
- No `502` from Caddy
- No websocket `pairing required` loop after approval
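
The first check is scriptable; a hedged one-liner version (assumes `jq` is available on the operator box):

```bash
# Print "healthy"/"unhealthy" based on the /health JSON shape above
health=$(curl -fsS "https://<slug>.wade.internal/health" 2>/dev/null || true)
echo "$health" | jq -e '.ok == true and .status == "live"' >/dev/null 2>&1 \
  && echo "healthy" || echo "unhealthy"
```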

---

## 5. Maintenance

### OC Updates
```bash
cd instances/<hatchling-name>
docker compose pull
docker compose up -d --force-recreate
```

### Config Changes
Edit workspace files (`SOUL.md`, `TOOLS.md`, etc.) then restart:
```bash
docker compose restart
```

### Monitoring
```bash
./scripts/fleet.sh status # all instances
./scripts/fleet.sh health # endpoint health
./scripts/fleet.sh logs <name> # container logs
```

---

## Lessons Learned

1. **Message Content Intent is the #1 gotcha.** Every deployment hits this if forgotten. Emphasize it in client instructions.
2. **Never send credentials via Discord.** Use secure messaging or shared notes.
3. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
4. **Port allocation is tracked in `fleet.json`** — use `hatch.sh` to avoid conflicts.
5. **Container user is UID 1001.** If permission errors: `docker exec -u root <container> chown -R 1001:1001 /path`

## Config Change Workflow (IMPORTANT — do not skip)

**Always** edit the hatchery copy of the config; never edit inside the container via `docker exec`/`docker cp`:

```bash
# 1. Edit the config
nano ~/projects/oc-hatchery/instances/<name>/data/openclaw.json

# 2. Deploy via fleet (NOT docker exec/cp)
./scripts/fleet.sh update <name>
```

**Why this matters:** Direct container edits are wiped on any `fleet.sh update` or container recreate. The hatchery `instances/<name>/data/openclaw.json` is the source of truth.

**Note on gitignore:** `instances/` is gitignored (contains tokens). The config file is NOT version-controlled — it lives only on the mini. Back it up manually if making major changes.
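
Since `instances/` lives only on the mini, a manual backup before major changes can be a dated tarball (a sketch; adjust the path to your checkout, and keep the tarball off any shared or synced location since it contains tokens):

```bash
# Archive every instance config/workspace into a timestamped tarball
ts=$(date -u +%Y%m%dT%H%M%SZ)
tar -czf "hatchery-backup-${ts}.tar.gz" -C instances . 2>/dev/null || true
```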

## Channel Allowlist Schema

With `groupPolicy: "allowlist"`, a guild entry with no `channels` key means the bot responds in every channel of that guild. Add a `channels` key to restrict it to specific channels:

```json
"guilds": {
  "<guild_id>": {
    "slug": "my-server",
    "requireMention": false,
    "channels": {
      "<channel_id>": { "allow": true, "requireMention": false }
    }
  }
}
```
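
For contrast, the respond-in-all-channels case described above is just the guild entry without `channels`:

```json
"guilds": {
  "<guild_id>": { "slug": "my-server", "requireMention": false }
}
```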
100 changes: 99 additions & 1 deletion scripts/fleet.sh
@@ -291,8 +291,106 @@ case "$CMD" in
echo "✓ Instance '$name' destroyed."
;;

monitor)
# Monitor fleet health — similar to health but with pass/fail logic
FLEET_HOST="${FLEET_HOST:-localhost}"
JSON_OUTPUT=false
shift
while [[ $# -gt 0 ]]; do
case "$1" in
--json) JSON_OUTPUT=true; shift ;;
*) shift ;;
esac
done

if [[ ! -f "$FLEET_REGISTRY" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"error","message":"no fleet.json found"}'
else
echo "Error: no fleet.json found" >&2
fi
exit 1
fi

instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)
> **Copilot AI** (Feb 21, 2026): According to the PR description, "fleet.json schema update - Added missing fields: name, host, status, version to all instances", but there are no corresponding changes in the hatch.sh script or any other visible code that would add these fields to the fleet.json registry when instances are created or managed. This is a discrepancy between the PR description and the actual code changes.
if [[ -z "$instances" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"ok","message":"no running instances","results":[]}'
else
echo "No running instances to monitor."
fi
exit 0
fi

all_healthy=true
json_results="[]"

while IFS= read -r name; do
port=$(jq -r ".instances[\"$name\"].port" "$FLEET_REGISTRY")
host=$(jq -r ".instances[\"$name\"].host // \"$FLEET_HOST\"" "$FLEET_REGISTRY")
url="http://${host}:${port}/health"

http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || echo "000")

if [[ "$http_code" == "200" ]]; then
healthy=true
else
healthy=false
all_healthy=false
fi

if [[ "$JSON_OUTPUT" == "true" ]]; then
json_results=$(echo "$json_results" | jq --arg n "$name" --arg h "$healthy" --arg c "$http_code" \
'. + [{"name":$n,"healthy":($h == "true"),"http_status":($c | tonumber)}]')
else
if [[ "$healthy" == "false" ]]; then
echo "$name UNHEALTHY (HTTP $http_code)"
fi
fi
done <<< "$instances"

if [[ "$JSON_OUTPUT" == "true" ]]; then
jq -n --argjson results "$json_results" --arg healthy "$all_healthy" \
'{"status":(if $healthy == "true" then "healthy" else "unhealthy" end),"results":$results}'
else
if [[ "$all_healthy" == "true" ]]; then
echo "All instances healthy"
fi
fi

# Exit non-zero when any instance is unhealthy so monitoring systems can alert
if [[ "$all_healthy" == "true" ]]; then
exit 0
else
exit 1
fi
> **Copilot AI** (Feb 21, 2026): The monitor command does not exit with a non-zero status when instances are unhealthy. For monitoring purposes, the command should exit with code 1 when `all_healthy` is false, so that monitoring systems can detect failures. This is especially important given the PR description mentions "pass/fail logic". Suggested change: after the final `fi`, exit 0 when `all_healthy` is true and exit 1 otherwise.
;;
> **Copilot AI** (Feb 21, 2026), on lines +294 to +360: The new monitor command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

snapshot)
name="${2:?Usage: fleet.sh snapshot <instance-name>}"
container="hatchery-${name}"
timestamp=$(date -u +%Y%m%dT%H%M%SZ)
tmp_dir="/tmp/snapshot-${name}-${timestamp}"
snapshot_dir="$ROOT_DIR/snapshots"
snapshot_path="${snapshot_dir}/${name}-${timestamp}.tar.gz"

# Use remote Docker env vars (set these in your environment or .env)
# export DOCKER_HOST="tcp://your-docker-host:2376"
# export DOCKER_TLS_VERIFY=1
# export DOCKER_CERT_PATH="$HOME/.docker/remote"

echo "📸 Snapshotting $name ($container)..."

# Copy openclaw state from container
if ! docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"; then
echo "Error: Failed to copy from container '$container'" >&2
exit 1
fi

> **Copilot AI** (Feb 21, 2026): The snapshot command does not verify that the container exists before attempting to copy from it. Consider adding a check similar to the one in the logs command (lines 132-135 of fleet.sh) to provide a clearer error message if the container doesn't exist.

# Tar it up
mkdir -p "$snapshot_dir"
tar -czf "$snapshot_path" -C "/tmp" "snapshot-${name}-${timestamp}"
rm -rf "$tmp_dir"

echo "✅ Snapshot saved: $snapshot_path"
;;
> **Copilot AI** (Feb 21, 2026), on lines +362 to +390: The new snapshot command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

*)
echo "Usage: fleet.sh {status|health|start|stop|logs|update|destroy|monitor|snapshot}" >&2
exit 1
;;
esac