
feat: gatekeeper sidecar — Rust implementation + design doc + CI#9

Open
thepagent wants to merge 3 commits into main from docs/gatekeeper-design

Conversation

@thepagent
Owner

@thepagent thepagent commented Mar 14, 2026

Problem

OpenClaw is an AI agent gateway — the agent can execute arbitrary bash, Python, and HTTP tools as part of its reasoning loop. This creates a fundamental security tension:

Giving the agent access to its own secrets means a compromised or misbehaving agent can exfiltrate them.

With naive approaches (env vars, mounted K8s Secrets), the agent and its secrets share the same trust boundary. There is no audit trail and no human in the loop.

Solution

Move secrets entirely out of the agent's trust boundary using a Gatekeeper sidecar.

Pod-level view (container isolation):

┌─────────────────── K8s Pod ───────────────────────┐
│                                                   │
│  ┌─────────────────┐    ┌──────────────────────┐  │
│  │  main (OpenClaw) │    │  gatekeeper (sidecar) │  │
│  │                 │    │                      │  │
│  │  ❌ no secrets   │───►│  ✅ holds secrets     │  │
│  │  ❌ no IAM       │    │  ✅ IAM Role (AWS SM) │  │
│  │                 │◄───│  ✅ Telegram approval │  │
│  └─────────────────┘    └──────────────────────┘  │
│         Unix socket on shared emptyDir             │
└───────────────────────────────────────────────────┘

Full 3-tier view (end-to-end flow):

┌─── Tier 1 ───────┐   ┌─── Tier 2 ──────────────────────────┐   ┌─── Tier 3 ───┐
│  AWS Secrets     │   │  K8s Pod                            │   │  Operator    │
│  Manager         │   │                                     │   │  📱 Telegram │
│                  │   │  ┌────────────┐  ┌───────────────┐  │   │              │
│  openclaw/tokens │   │  │    main    │  │  gatekeeper   │  │   │  [✅ Approve] │
│  - TG_TOKEN      │   │  │ (OpenClaw) │  │   (sidecar)   │  │   │  [❌ Deny]   │
│  - GW_TOKEN      │   │  │            │  │               │  │   │              │
│                  │   │  │ 1. request ──► 2. notify ──────────►  3. operator  │
│                  │   │  │            │  │               │  │   │     taps     │
│                  │◄──────────────────── 5. fetch  ◄──────────── 4. approved  │
│                  │   │  │ 6. secret ◄── 6. return  │    │   │              │
│                  │   │  │  in memory │  │               │  │   │              │
│                  │   │  └────────────┘  └───────────────┘  │   └──────────────┘
│  CloudTrail logs │   │  no IAM, no secrets                 │
│  every access    │   └─────────────────────────────────────┘
└──────────────────┘

Three layers of protection:

  1. AWS Secrets Manager — secrets never stored in K8s etcd or on disk; every access logged in CloudTrail
  2. Gatekeeper sidecar — only container with IAM access; agent cannot exec into it; filesystem isolated
  3. Telegram approval gate — operator must tap Approve on their phone before any secret is returned
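The transport in layer 2 (the Unix socket on the shared emptyDir) can be sketched as a toy round trip. The newline-delimited JSON wire format and field names below are assumptions for illustration, not the PR's actual protocol:

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::thread;

// Hypothetical one-line JSON request; the real gatekeeper protocol may differ.
fn encode_request(name: &str) -> String {
    format!("{{\"name\":\"{}\"}}", name)
}

// Client side (main container): connect, send one request, read one reply.
fn request_secret(socket_path: &std::path::Path, name: &str) -> std::io::Result<String> {
    let mut stream = UnixStream::connect(socket_path)?;
    writeln!(stream, "{}", encode_request(name))?;
    let mut line = String::new();
    BufReader::new(stream).read_line(&mut line)?;
    Ok(line.trim_end().to_string())
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("gatekeeper-demo.sock");
    let _ = std::fs::remove_file(&path);
    let listener = UnixListener::bind(&path)?;

    // Stand-in for the gatekeeper side: read one request, reply with a dummy value.
    let server = thread::spawn(move || {
        let (stream, _) = listener.accept().unwrap();
        let mut reader = BufReader::new(stream.try_clone().unwrap());
        let mut req = String::new();
        reader.read_line(&mut req).unwrap();
        let mut writer = stream;
        writeln!(writer, "{{\"ok\":true,\"value\":\"dummy\"}}").unwrap();
    });

    let resp = request_secret(&path, "GW_TOKEN")?;
    println!("{resp}"); // prints the stand-in reply
    server.join().unwrap();
    Ok(())
}
```

Because the socket lives on a shared emptyDir, the main container needs no network access or IAM to reach the gatekeeper, only a volume mount.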

Changes

  • docs/gatekeeper.md — full design doc: problem statement, architecture, threat model, Helm values design
  • gatekeeper/src/main.rs — Rust implementation: Unix socket server, Telegram approval gate, AWS Secrets Manager fetch, zeroize memory cleanup, rate limiting
  • gatekeeper/Cargo.toml — dependencies + release profile (size-optimized)
  • gatekeeper/Dockerfile — musl static build → Alpine (~10MB image)
  • .github/workflows/gatekeeper-image.yml — CI: build & push to ghcr.io on changes to gatekeeper/**

@thepagent thepagent changed the title docs: add gatekeeper sidecar design doc feat: gatekeeper sidecar — Rust implementation + design doc + CI Mar 14, 2026

@masami-agent masami-agent left a comment


Bug: Mutex held during Telegram approval (up to 60s)

fetch_with_approval is called while holding the state Mutex lock (line ~78: let mut s = state.lock().await). Inside fetch_with_approval, request_approval polls Telegram for up to 60 seconds — meaning the lock is held for the entire approval window.

During this time, every other incoming socket connection blocks on state.lock().await, making the gatekeeper effectively single-threaded and unresponsive.

Fix: release the lock before waiting for Telegram approval. Extract the fields needed (bot token, chat id, rate limit state) before dropping the lock, then re-acquire only to update last_request after the approval resolves:

async fn handle(stream: UnixStream, state: Arc<Mutex<State>>) -> Result<()> {
    // ...parse req...

    // 1. check rate limit and extract config — hold lock briefly
    let (bot_token, chat_id, secret_name) = {
        let mut s = state.lock().await;
        if let Some(last) = s.last_request {
            if last.elapsed() < Duration::from_secs(RATE_LIMIT_SECS) {
                // write error response and return early
            }
        }
        s.last_request = Some(Instant::now());
        (s.tg_bot_token.clone(), s.tg_chat_id.clone(), req.name.clone())
    }; // lock released here

    // 2. wait for Telegram approval — no lock held
    let approved = request_approval(&bot_token, &chat_id, &secret_name).await?;

    // 3. fetch from AWS if approved — lock not needed (sm_client is Send+Sync)
    // ...
}

This also requires making sm_client accessible without the Mutex (it's already Clone + Send + Sync), or wrapping it in its own Arc.
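A minimal, runnable illustration of the same scoping pattern, using std::sync::Mutex and a sleep in place of tokio and the Telegram poll (field names follow the snippet above; the timings are stand-ins):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

struct State {
    tg_bot_token: String,
    tg_chat_id: String,
    last_request: Option<Instant>,
}

// Clone what you need inside a short block, let the guard drop at the
// closing brace, then do the slow work with no lock held.
fn handle(state: &Arc<Mutex<State>>, secret_name: &str) -> String {
    let (bot_token, chat_id) = {
        let mut s = state.lock().unwrap();
        s.last_request = Some(Instant::now());
        (s.tg_bot_token.clone(), s.tg_chat_id.clone())
    }; // guard dropped here: other handlers can take the lock now

    // Stands in for the up-to-60s Telegram poll; the mutex is NOT held,
    // so concurrent handlers are no longer serialized behind it.
    thread::sleep(Duration::from_millis(50));
    format!("approved {secret_name} via bot {bot_token} in chat {chat_id}")
}

fn main() {
    let state = Arc::new(Mutex::new(State {
        tg_bot_token: "bot123".into(),
        tg_chat_id: "chat456".into(),
        last_request: None,
    }));
    // Two handlers run concurrently; with the lock held across the wait,
    // the second would block for the first one's entire approval window.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let st = Arc::clone(&state);
            thread::spawn(move || handle(&st, "GW_TOKEN"))
        })
        .collect();
    for h in handles {
        println!("{}", h.join().unwrap());
    }
}
```

The same brace-scoped guard works with tokio::sync::Mutex; the key point is that the guard must not live across the long await.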

@auto-machine
Contributor

Code Review

The sidecar code, design doc, and CI workflow are all solid, but this PR only contains the gatekeeper itself; Helm chart integration is still missing before it can actually be deployed. Suggested additions below:

1. Missing Cargo.lock

Line 4 of the Dockerfile runs COPY Cargo.toml Cargo.lock ./, but the PR does not include Cargo.lock, so the build will fail. Run cargo generate-lockfile first and commit it alongside.

2. values.yaml — add a gatekeeper block

gatekeeper:
  enabled: false                  # opt-in
  image:
    repository: ghcr.io/thepagent/openclaw-gatekeeper
    tag: latest
    pullPolicy: IfNotPresent
  aws:
    region: ap-northeast-1
    secretsManagerPath: openclaw/tokens
  telegram:
    approvalTimeoutSeconds: 60
    rateLimitMinutes: 5
  # secrets the gatekeeper itself needs (TG bot token, chat ID);
  # can be injected from an existing K8s Secret, or set directly in values
  secretRef: ""                   # reference an existing K8s Secret
  resources:
    limits:
      cpu: 100m
      memory: 64Mi
    requests:
      cpu: 50m
      memory: 32Mi

3. deployment.yaml — conditionally inject the sidecar

When gatekeeper.enabled=true:

  • Add the gatekeeper sidecar container with its env vars (GATEKEEPER_TG_BOT_TOKEN and GATEKEEPER_TG_CHAT_ID)
  • Add a shared emptyDir volume gatekeeper-sock, mounted at a dedicated path (see the first caveat below)
  • Mount the same volume in the main container as well
{{- if .Values.gatekeeper.enabled }}
- name: gatekeeper
  image: "{{ .Values.gatekeeper.image.repository }}:{{ .Values.gatekeeper.image.tag }}"
  volumeMounts:
  - name: gatekeeper-sock
    mountPath: /var/run/gatekeeper
  envFrom:
  {{- if .Values.gatekeeper.secretRef }}
  - secretRef:
      name: {{ .Values.gatekeeper.secretRef }}
  {{- end }}
  resources:
    {{- toYaml .Values.gatekeeper.resources | nindent 10 }}
{{- end }}

4. AWS credential injection

The design doc assumes IRSA (EKS), but not everyone deploys on EKS. Suggest supporting several options:

| Environment | Approach |
| --- | --- |
| EKS | IRSA or EKS Pod Identity (SA annotation) |
| Non-EKS K8s / K3s | Inject AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY env vars via gatekeeper.secretRef |
| No AWS SM | Other backends (Vault, GCP SM) can be added later; for now inject all required env vars via secretRef |

A generic serviceAccount block can be added to values so users decide whether to attach an annotation:

gatekeeper:
  serviceAccount:
    create: false                 # not created by default; use the Pod's SA
    name: ""
    annotations: {}               # EKS users can add eks.amazonaws.com/role-arn

This way EKS IRSA, static keys, and other clouds' workload identity mechanisms are all supported.

5. Caveats

  • Socket path conflict: the main container already mounts an emptyDir at /tmp, and the gatekeeper also writes its socket to /tmp/gatekeeper.sock. Suggest moving the socket to /var/run/gatekeeper/gatekeeper.sock, with the main container and the sidecar sharing that emptyDir, to keep it separate from the main container's tmp. The SOCKET_PATH constant in main.rs must be updated to match.
  • Telegram token source: the gatekeeper needs GATEKEEPER_TG_BOT_TOKEN and GATEKEEPER_TG_CHAT_ID. These are bootstrap secrets (they must exist before the gatekeeper starts) and therefore cannot be fetched from AWS SM; they have to be injected from a K8s Secret via gatekeeper.secretRef.
  • The SA is Pod-scoped: Kubernetes serviceAccountName applies to the Pod, not to individual containers, so the main container and the sidecar cannot use different SAs. For IAM isolation, either (1) on non-EKS clusters, use secretRef to inject credentials only into the sidecar container's env so the main container never sees them, or (2) on EKS, use Pod Identity's container-level credentials.
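For the socket-path caveat, the hard-coded constant could instead become an environment lookup with the proposed path as the default. A minimal sketch; GATEKEEPER_SOCKET_PATH is a suggested variable name, not something the PR defines:

```rust
use std::env;

// Reads the socket path from the environment, falling back to the proposed
// default under /var/run/gatekeeper. The variable name is hypothetical.
fn socket_path() -> String {
    env::var("GATEKEEPER_SOCKET_PATH")
        .unwrap_or_else(|_| "/var/run/gatekeeper/gatekeeper.sock".to_string())
}

fn main() {
    println!("socket path: {}", socket_path());
}
```

This lets the Helm chart pin the path in one place (the container env) instead of keeping main.rs and deployment.yaml in sync by hand.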

Recommendation

Either keep adding Helm-integration commits to this PR, or merge the current code first and handle the chart integration in a follow-up PR.

