1 change: 1 addition & 0 deletions .gitignore
@@ -17,3 +17,4 @@ fleet.json
.DS_Store
instances/
archives/
.idea/
259 changes: 259 additions & 0 deletions docs/DEPLOYMENT-SOP.md
@@ -0,0 +1,259 @@
# Hatchery Deployment SOP

Standard Operating Procedure for deploying a new OpenClaw hatchling instance.

---

## 1. Pre-deployment — Client Side

The client completes these steps. Send them this section.

- [ ] **Create a Discord server** for their business
- [ ] **Create a Discord Application** at https://discord.com/developers/applications
- Click "New Application", name it after the bot
- [ ] **Create a Bot** under the application
- Go to Bot settings → click "Add Bot"
- [ ] **Enable MESSAGE CONTENT INTENT**
- Bot settings → Privileged Gateway Intents → toggle **Message Content Intent** ON
- ⚠️ **This is the #1 deployment failure.** Without it, the bot gets error `4014` and cannot read messages.
- [ ] **Copy the bot token** — Bot settings → "Reset Token" → copy
- [ ] **Note the Client ID** — OAuth2 → General → "Client ID"
- [ ] **Invite the bot to their server**
- OAuth2 → URL Generator
- Scopes: `bot`, `applications.commands`
- Permissions: Send Messages, Read Message History, Embed Links, Attach Files, Add Reactions, Use Slash Commands
- Copy generated URL, open in browser, select server
- [ ] **Send credentials securely**
- ✅ Secure messaging (Signal, iMessage, or shared notes)
- ❌ **NEVER** plaintext in Discord channels
- Send: bot token, client ID, server ID, target channel ID(s)
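
The URL Generator output follows a fixed pattern; a sketch (the `permissions` value is an integer bitmask that the generator computes for the boxes ticked, so copy it from the generator rather than hand-computing it):

```text
https://discord.com/api/oauth2/authorize?client_id=<CLIENT_ID>&permissions=<PERMISSIONS_INT>&scope=bot%20applications.commands
```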

---

## 2. Pre-deployment — Operator Side

### 2.1 Prepare the container directory

```bash
ssh user@your-docker-host
mkdir -p /path/to/instances/<hatchling-name>
```

### 2.2 Copy template files

Use `hatch.sh` (recommended):
```bash
./scripts/hatch.sh <hatchling-name> [--port PORT]
```

Or manually copy from the template directory.

### 2.3 Seed workspace files

Ensure these exist in the workspace directory:
- `SOUL.md` — personality, role, tone
- `USER.md` — client profile
- `AGENTS.md` — session behavior
- `IDENTITY.md` — name, purpose
- `HEARTBEAT.md` — periodic check instructions
- `MEMORY.md` — initial context
- `TOOLS.md` — available tools/integrations

Customize each for the client. At minimum, edit `SOUL.md`, `USER.md`, and `IDENTITY.md`.

### 2.4 Create `.env`

```bash
cp .env.example .env
# Edit .env and add:
# - LLM API key (ANTHROPIC_API_KEY or OPENAI_API_KEY)
# - DISCORD_BOT_TOKEN
# - DISCORD_CLIENT_ID
```

- LLM API keys can be shared across hatchlings (same subscription).
- Each hatchling gets its own Discord bot token and client ID.
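
A minimal `.env` sketch (values are placeholders, not real credentials):

```bash
# Shared across the fleet (same LLM subscription)
ANTHROPIC_API_KEY=sk-ant-xxxxxxxx
# Unique per hatchling
DISCORD_BOT_TOKEN=xxxxxxxx
DISCORD_CLIENT_ID=000000000000000000
```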

### 2.5 Configure `openclaw.template.json`

Edit the template to set:
- `model` — e.g., `anthropic/claude-sonnet-4-20250514`
- `agent.name` — the hatchling's name
- Discord channel bindings — map channel IDs to behaviors
- Tool permissions / allowlists
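
Putting those fields together, the template might look roughly like this. This is a sketch only; key names beyond `model`, `agent.name`, and the `guilds`/`gateway` blocks shown elsewhere in this SOP are assumptions, so check the actual template for the exact schema:

```json
{
  "model": "anthropic/claude-sonnet-4-20250514",
  "agent": { "name": "<hatchling-name>" },
  "guilds": {
    "<guild_id>": {
      "channels": { "<channel_id>": { "allow": true, "requireMention": false } }
    }
  }
}
```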

### 2.6 Verify port assignment

`hatch.sh` auto-assigns ports from `fleet.json`. To specify manually:

```bash
./scripts/hatch.sh <name> --port <port>
```

Check existing allocations: `cat fleet.json`
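
To see which ports are taken at a glance, a jq one-liner over the registry (assuming the `instances`/`port` shape that `fleet.sh` itself reads):

```bash
# Name and port of every registered instance, sorted by port;
# silent if fleet.json is missing
jq -r '.instances | to_entries[] | "\(.key)\t\(.value.port)"' fleet.json 2>/dev/null | sort -k2 -n || true
```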

---

## 3. Deployment

```bash
cd instances/<hatchling-name>

# Build and start
docker compose up -d --build

# Wait for startup
sleep 30

# Health check
curl http://localhost:<port>/health
```

- [ ] Health endpoint returns OK
- [ ] Bot appears online in Discord (green dot)
- [ ] Send a test message in the bound channel
- [ ] Bot responds correctly
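
The fixed `sleep 30` works, but polling is less fragile on slow builds. A sketch of a retry loop you could use instead:

```bash
# Poll the health endpoint until it answers, up to `tries` seconds
wait_for_health() {
  local url="$1" tries="${2:-60}"
  for ((i = 0; i < tries; i++)); do
    curl -fsS -o /dev/null --connect-timeout 2 "$url" && return 0
    sleep 1
  done
  echo "health check timed out: $url" >&2
  return 1
}
```

Then `wait_for_health "http://localhost:<port>/health"` replaces the `sleep 30` + `curl` pair above.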

### Troubleshooting: Error 4014

```
Error: Used disallowed intents
```

**Fix:**
1. Go to https://discord.com/developers/applications
2. Select the application → Bot → Privileged Gateway Intents
3. Enable **Message Content Intent**
4. Restart the container: `docker compose restart`

---

## 4. Post-deployment

- [ ] **Add to `fleet.json`** — happens automatically with `hatch.sh`
- [ ] **Update monitoring** — `fleet.sh health` to verify endpoint
- [ ] **Schedule follow-up checks:**
- 24 hours — confirm stable operation, check logs
- 1 week — review usage, adjust personality/config if needed

### 4.1 HTTPS + Dashy access pattern (REPEATABLE)

Use this for each new hatchling (AFGE, CHC, Ali, etc.).

1) **Add Caddy route**
```caddy
<slug>.wade.internal {
    reverse_proxy <container-name>:18789
}
```
Reload Caddy after editing:
```bash
docker exec caddy caddy reload --config /etc/caddy/Caddyfile
```

2) **Ensure container can be resolved by Caddy**
- Container must be on `wade_internal` network (same as Caddy)
```bash
docker network connect wade_internal <container-name> 2>/dev/null || true  # safe to re-run; errors only if already connected
```

3) **Set proxy-safe Control UI config in hatchling**
In hatchling `openclaw.json`:
```json
"gateway": {
  "trustedProxies": ["172.29.0.0/16"],
  "controlUi": {
    "allowedOrigins": ["https://<slug>.wade.internal"]
  }
}
```
Restart hatchling after config change.

4) **Use gateway token in Dashy URL (FRAGMENT, not query)**
- Read token from hatchling `openclaw.json` (`gateway.auth.token`)
- Set Dashy tile URL to:
```text
https://<slug>.wade.internal/#token=<gateway-token>
```
- ⚠️ Do **not** use `?token=`. Control UI bootstrap expects fragment token (`#token=`).
- Restart Dashy to force refresh if UI cache is stale.
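
Steps 3 and 4 can be scripted; a sketch (paths and slugs are placeholders, `.gateway.auth.token` is the key named above):

```bash
# Read the gateway token from the hatchling config and print the Dashy
# tile URL with the token in the URL fragment (never the query string)
cfg="instances/<name>/data/openclaw.json"   # hatchery path, placeholder
token=$(jq -r '.gateway.auth.token // empty' "$cfg" 2>/dev/null || true)
echo "https://<slug>.wade.internal/#token=${token}"
```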

5) **Approve device pairing once from operator browser**
When first opening from a new browser/device, OpenClaw may show `pairing required`.
Approve inside the hatchling:
```bash
docker exec <container-name> openclaw devices list
docker exec <container-name> openclaw devices approve <requestId>
```

6) **Validation checks**
- `https://<slug>.wade.internal/health` returns JSON `{"ok":true,"status":"live"}`
- Dashy tile opens gateway directly
- No `502` from Caddy
- No websocket `pairing required` loop after approval
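
The first check is scriptable; a hedged one-liner version (assumes `jq` is available on the operator box):

```bash
# Print "healthy"/"unhealthy" based on the /health JSON shape above
health=$(curl -fsS "https://<slug>.wade.internal/health" 2>/dev/null || true)
echo "$health" | jq -e '.ok == true and .status == "live"' >/dev/null 2>&1 \
  && echo "healthy" || echo "unhealthy"
```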

---

## 5. Maintenance

### OC Updates
```bash
cd instances/<hatchling-name>
docker compose pull
docker compose up -d --force-recreate
```

### Config Changes
Edit workspace files (`SOUL.md`, `TOOLS.md`, etc.) then restart:
```bash
docker compose restart
```

### Monitoring
```bash
./scripts/fleet.sh status # all instances
./scripts/fleet.sh health # endpoint health
./scripts/fleet.sh logs <name> # container logs
```

---

## Lessons Learned

1. **Message Content Intent is the #1 gotcha.** Every deployment hits this if forgotten. Emphasize it in client instructions.
2. **Never send credentials via Discord.** Use secure messaging or shared notes.
3. **Each client gets their own Discord server** — this is the security boundary. Never put two clients' bots in the same server.
4. **Port allocation is tracked in `fleet.json`** — use `hatch.sh` to avoid conflicts.
5. **Container user is UID 1001.** If permission errors: `docker exec -u root <container> chown -R 1001:1001 /path`

## Config Change Workflow (IMPORTANT — do not skip)

**Always** edit the hatchery copy of the config; never edit inside the container via `docker exec`/`docker cp`:

```bash
# 1. Edit the config
nano ~/projects/oc-hatchery/instances/<name>/data/openclaw.json

# 2. Deploy via fleet (NOT docker exec/cp)
./scripts/fleet.sh update <name>
```

**Why this matters:** Direct container edits are wiped on any `fleet.sh update` or container recreate. The hatchery `instances/<name>/data/openclaw.json` is the source of truth.

**Note on gitignore:** `instances/` is gitignored (contains tokens). The config file is NOT version-controlled — it lives only on the mini. Back it up manually if making major changes.
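
Since `instances/` lives only on the mini, a manual backup before major changes can be a dated tarball (a sketch; adjust the path to your checkout, and keep the tarball off any shared or synced location since it contains tokens):

```bash
# Archive every instance config/workspace into a timestamped tarball
ts=$(date -u +%Y%m%dT%H%M%SZ)
tar -czf "hatchery-backup-${ts}.tar.gz" -C instances . 2>/dev/null || true
```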

## Channel Allowlist Schema

With `groupPolicy: "allowlist"`, a guild entry with no `channels` key means the bot responds in every channel of that guild. Add a `channels` key to restrict it to specific channels:

```json
"guilds": {
  "<guild_id>": {
    "slug": "my-server",
    "requireMention": false,
    "channels": {
      "<channel_id>": { "allow": true, "requireMention": false }
    }
  }
}
```
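
For contrast, the respond-in-all-channels case described above is just the guild entry without `channels`:

```json
"guilds": {
  "<guild_id>": { "slug": "my-server", "requireMention": false }
}
```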
100 changes: 99 additions & 1 deletion scripts/fleet.sh
@@ -291,8 +291,106 @@ case "$CMD" in
echo "✓ Instance '$name' destroyed."
;;

monitor)
# Monitor fleet health — similar to health but with pass/fail logic
FLEET_HOST="${FLEET_HOST:-localhost}"
JSON_OUTPUT=false
shift
while [[ $# -gt 0 ]]; do
case "$1" in
--json) JSON_OUTPUT=true; shift ;;
*) shift ;;
esac
done

if [[ ! -f "$FLEET_REGISTRY" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"error","message":"no fleet.json found"}'
else
echo "Error: no fleet.json found" >&2
fi
exit 1
fi

instances=$(jq -r '.instances | to_entries[] | select(.value.status == "running") | .key' "$FLEET_REGISTRY" 2>/dev/null || true)
> **Copilot AI** (Feb 21, 2026): According to the PR description, "fleet.json schema update - Added missing fields: name, host, status, version to all instances", but there are no corresponding changes in the hatch.sh script or any other visible code that would add these fields to the fleet.json registry when instances are created or managed. This is a discrepancy between the PR description and the actual code changes.
if [[ -z "$instances" ]]; then
if [[ "$JSON_OUTPUT" == "true" ]]; then
echo '{"status":"ok","message":"no running instances","results":[]}'
else
echo "No running instances to monitor."
fi
exit 0
fi

all_healthy=true
json_results="[]"

while IFS= read -r name; do
port=$(jq -r ".instances[\"$name\"].port" "$FLEET_REGISTRY")
host=$(jq -r ".instances[\"$name\"].host // \"$FLEET_HOST\"" "$FLEET_REGISTRY")
url="http://${host}:${port}/health"

http_code=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 --max-time 10 "$url" 2>/dev/null || echo "000")

if [[ "$http_code" == "200" ]]; then
healthy=true
else
healthy=false
all_healthy=false
fi

if [[ "$JSON_OUTPUT" == "true" ]]; then
json_results=$(echo "$json_results" | jq --arg n "$name" --arg h "$healthy" --arg c "$http_code" \
'. + [{"name":$n,"healthy":($h == "true"),"http_status":($c | tonumber)}]')
else
if [[ "$healthy" == "false" ]]; then
echo "$name UNHEALTHY (HTTP $http_code)"
fi
fi
done <<< "$instances"

if [[ "$JSON_OUTPUT" == "true" ]]; then
jq -n --argjson results "$json_results" --arg healthy "$all_healthy" \
'{"status":(if $healthy == "true" then "healthy" else "unhealthy" end),"results":$results}'
else
if [[ "$all_healthy" == "true" ]]; then
echo "All instances healthy"
fi
fi

# Exit non-zero when any instance is unhealthy so monitoring systems can alert
if [[ "$all_healthy" == "true" ]]; then
exit 0
else
exit 1
fi
> **Copilot AI** (Feb 21, 2026): The monitor command does not exit with a non-zero status when instances are unhealthy. For monitoring purposes, the command should exit with code 1 when `all_healthy` is false, so that monitoring systems can detect failures. This is especially important given the PR description mentions "pass/fail logic". Suggested change: after the final `fi`, exit 0 when `all_healthy` is true and exit 1 otherwise.
;;
> **Copilot AI** (Feb 21, 2026), on lines +294 to +360: The new monitor command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

snapshot)
name="${2:?Usage: fleet.sh snapshot <instance-name>}"
container="hatchery-${name}"
timestamp=$(date -u +%Y%m%dT%H%M%SZ)
tmp_dir="/tmp/snapshot-${name}-${timestamp}"
snapshot_dir="$ROOT_DIR/snapshots"
snapshot_path="${snapshot_dir}/${name}-${timestamp}.tar.gz"

# Use remote Docker env vars (set these in your environment or .env)
# export DOCKER_HOST="tcp://your-docker-host:2376"
# export DOCKER_TLS_VERIFY=1
# export DOCKER_CERT_PATH="$HOME/.docker/remote"

echo "📸 Snapshotting $name ($container)..."

# Copy openclaw state from container
if ! docker cp "${container}:/home/openclaw/.openclaw" "$tmp_dir"; then
echo "Error: Failed to copy from container '$container'" >&2
exit 1
fi

> **Copilot AI** (Feb 21, 2026): The snapshot command does not verify that the container exists before attempting to copy from it. Consider adding a check similar to the one in the logs command (lines 132-135 of fleet.sh) to provide a clearer error message if the container doesn't exist.

# Tar it up
mkdir -p "$snapshot_dir"
tar -czf "$snapshot_path" -C "/tmp" "snapshot-${name}-${timestamp}"
rm -rf "$tmp_dir"

echo "✅ Snapshot saved: $snapshot_path"
;;
> **Copilot AI** (Feb 21, 2026), on lines +362 to +390: The new snapshot command is not covered by tests. The codebase has comprehensive test coverage for fleet commands in scripts/test-fleet.sh, and this new functionality should have corresponding tests to ensure it works correctly.

*)
echo "Usage: fleet.sh {status|health|start|stop|logs|update|destroy|monitor|snapshot}" >&2
exit 1
;;
esac