From 26237b065462d839d842842a2022b3e06b4888e2 Mon Sep 17 00:00:00 2001 From: johnxie Date: Wed, 18 Mar 2026 01:56:36 -0700 Subject: [PATCH] =?UTF-8?q?docs:=20add=20tutorial=20foundation=20=E2=80=94?= =?UTF-8?q?=20architecture,=20browse=20engine,=20snapshots,=20commands?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add comprehensive onboarding documentation covering gstack's core infrastructure. Five files forming a self-contained tutorial that takes readers from "what is gstack?" to understanding every browse command. - index.md: Tutorial entry point with Mermaid architecture flowchart - 01_architecture.md: Three-layer design, virtual team, project structure - 02_browse_engine.md: Client-server model, lifecycle, security, buffers - 03_snapshot_and_refs.md: Accessibility tree, @ref system, staleness - 04_command_system.md: All 52 commands (read/write/meta), error handling All technical claims verified against source code (commands.ts, snapshot.ts, server.ts, browser-manager.ts, buffers.ts). --- docs/01_architecture.md | 205 ++++++++++++++++++ docs/02_browse_engine.md | 301 +++++++++++++++++++++++++++ docs/03_snapshot_and_refs.md | 304 +++++++++++++++++++++++++++ docs/04_command_system.md | 389 +++++++++++++++++++++++++++++++++++ docs/index.md | 69 +++++++ 5 files changed, 1268 insertions(+) create mode 100644 docs/01_architecture.md create mode 100644 docs/02_browse_engine.md create mode 100644 docs/03_snapshot_and_refs.md create mode 100644 docs/04_command_system.md create mode 100644 docs/index.md diff --git a/docs/01_architecture.md b/docs/01_architecture.md new file mode 100644 index 00000000..b476383a --- /dev/null +++ b/docs/01_architecture.md @@ -0,0 +1,205 @@ +--- +layout: default +title: "Chapter 1: Architecture Overview" +parent: "gstack" +nav_order: 1 +--- + +# Chapter 1: Architecture Overview + +Welcome to the gstack tutorial! 
In this first chapter, we'll explore the big picture — what gstack is, why it exists, and how its pieces fit together. By the end, you'll have a mental map of the entire system. + +## What Problem Does gstack Solve? + +Imagine you're building a web application. You need someone to plan the architecture, someone to review the design, someone to write tests, someone to QA the live site, someone to review the code, and someone to ship it. That's a whole team — and coordinating them takes time. + +gstack turns **one Claude Code session** into that entire team. Each "team member" is a **skill** — a Markdown-based workflow prompt that gives Claude a specific role, personality, and checklist. And they all share a secret weapon: a persistent headless browser that can click buttons, fill forms, and take screenshots in ~100ms. + +## The Virtual Team + +Here's who's on your team: + +| Role | Skill | What They Do | +|------|-------|-------------| +| CEO/Founder | `/plan-ceo-review` | Rethink the problem, find 10-star products | +| Eng Manager | `/plan-eng-review` | Lock architecture, edge cases, test plans | +| Senior Designer | `/plan-design-review` | 80-item design audit, AI slop detection | +| QA Engineer | `/browse` | Headless browser: real clicks, real screenshots | +| QA Lead | `/qa`, `/qa-only` | Find bugs, generate regression tests | +| Designer Who Codes | `/design-review` | Visual QA + atomic CSS fixes | +| Staff Engineer | `/review` | Find bugs that pass CI | +| Release Engineer | `/ship` | Tests → version bump → CHANGELOG → PR | +| Technical Writer | `/document-release` | Update all docs post-ship | +| Eng Manager (Retro) | `/retro` | Weekly retro with trends and streaks | + +## The Three Layers + +gstack has three architectural layers, each building on the one below: + +```mermaid +flowchart TB + subgraph Layer3["Layer 3: Skills (The Team)"] + direction LR + S1["Plan Skills"] + S2["Build Skills"] + S3["Ship Skills"] + end + + subgraph Layer2["Layer 2: Template 
Engine (The Factory)"] + direction LR + TMPL[".tmpl Templates"] + GEN["gen-skill-docs"] + MD["Generated SKILL.md"] + TMPL --> GEN --> MD + end + + subgraph Layer1["Layer 1: Browse Engine (The Eyes & Hands)"] + direction LR + CLI["CLI Client"] + SRV["HTTP Server"] + PW["Playwright + Chromium"] + CLI --> SRV --> PW + end + + Layer3 --> Layer2 + Layer2 --> Layer1 +``` + +### Layer 1: Browse Engine (The Eyes & Hands) + +At the foundation is a **persistent headless browser** — a Chromium instance managed by Playwright, exposed as a CLI tool. When a skill needs to visit a page, click a button, or take a screenshot, it calls the browse binary (`$B`). + +The key insight is **persistence**. The browser stays running between commands. First call takes ~3 seconds (startup); every subsequent call takes ~100-200ms. This makes real browser testing practical inside AI workflows. + +**Key files:** +- `browse/src/cli.ts` — CLI wrapper that talks to the server +- `browse/src/server.ts` — HTTP daemon that manages Chromium +- `browse/src/browser-manager.ts` — Lifecycle, tabs, refs, dialogs +- `browse/src/snapshot.ts` — Accessibility tree extraction + +→ Deep dive: [Chapter 2: Browse Engine](02_browse_engine.md) + +### Layer 2: Template Engine (The Factory) + +Skills are written as `.tmpl` template files containing Markdown with `{{PLACEHOLDER}}` tokens. At build time, `gen-skill-docs.ts` resolves these placeholders — pulling command references from `commands.ts`, snapshot flags from `snapshot.ts`, and shared methodology blocks from the generator itself. + +This means skill documentation is **always in sync** with the source code. Add a new browse command → rebuild → every skill that references commands gets updated. 
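To make the placeholder step concrete, here is a minimal sketch of how `{{PLACEHOLDER}}` tokens could be resolved. The `resolvePlaceholders` function, the `Resolver` type, and the `COMMAND_REFERENCE` placeholder name are illustrative assumptions, not the actual `gen-skill-docs.ts` implementation.

```typescript
// Hypothetical sketch: names below are illustrative, not taken from gen-skill-docs.ts.
type Resolver = () => string;

function resolvePlaceholders(template: string, resolvers: Record<string, Resolver>): string {
  // Match tokens such as {{COMMAND_REFERENCE}}: uppercase letters, digits, underscores.
  return template.replace(/\{\{([A-Z0-9_]+)\}\}/g, (_match: string, name: string) => {
    const resolver = resolvers[name];
    if (!resolver) {
      // Fail loudly so a typo in a template never ships as literal "{{...}}" text.
      throw new Error(`Unknown placeholder: {{${name}}}`);
    }
    return resolver();
  });
}

// Assumed example resolver: renders a short command list as Markdown bullets.
const resolvers: Record<string, Resolver> = {
  COMMAND_REFERENCE: () => ["goto", "click", "snapshot"].map((c) => `- \`${c}\``).join("\n"),
};

const rendered = resolvePlaceholders("## Commands\n{{COMMAND_REFERENCE}}", resolvers);
console.log(rendered);
```

Throwing on an unknown placeholder is one possible design choice: it turns a template typo into a build failure instead of a silently broken `SKILL.md`.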
+ +**Key files:** +- `scripts/gen-skill-docs.ts` — Template compiler +- `SKILL.md.tmpl` — Root skill template +- `{skill-dir}/SKILL.md.tmpl` — Per-skill templates + +→ Deep dive: [Chapter 6: Template Engine](06_template_engine.md) + +### Layer 3: Skills (The Team) + +Each skill is a generated `SKILL.md` file that Claude reads when you invoke a slash command. Skills define: +- **Who** Claude is (role, personality, cognitive patterns) +- **What** Claude should do (step-by-step workflow) +- **How** Claude should decide (decision frameworks, AskUserQuestion format) +- **When** to stop (completion criteria, risk heuristics) + +→ Deep dive: [Chapter 5: Skill System](05_skill_system.md) + +## The Development Workflow + +A typical gstack-powered development cycle looks like this: + +```mermaid +sequenceDiagram + participant User + participant Plan as Planning Phase + participant Build as Build Phase + participant Ship as Ship Phase + + User->>Plan: /plan-ceo-review + Plan-->>User: Scope decisions, 10-star vision + User->>Plan: /plan-eng-review + Plan-->>User: Architecture, test plan, edge cases + User->>Plan: /plan-design-review + Plan-->>User: Design audit, interaction states + + User->>Build: Implement features + User->>Build: /qa (or /qa-only) + Build-->>User: Bug reports + fixes + tests + User->>Build: /design-review + Build-->>User: Visual fixes + before/after screenshots + + User->>Ship: /ship + Ship-->>User: Tests ✓ → Version bump → CHANGELOG → PR + User->>Ship: /document-release + Ship-->>User: Docs updated, PR body enriched +``` + +## Project Structure + +Here's how the codebase is organized: + +``` +gstack/ +├── browse/ # Layer 1: Headless browser engine +│ ├── src/ # 14 TypeScript source files +│ │ ├── cli.ts # CLI client +│ │ ├── server.ts # HTTP daemon +│ │ ├── commands.ts # Command registry (single source of truth) +│ │ └── snapshot.ts # Accessibility tree + @ref system +│ ├── test/ # Browser integration tests +│ └── dist/ # Compiled ~58MB binary +│ +├── 
scripts/ # Layer 2: Build tooling +│ ├── gen-skill-docs.ts # Template → SKILL.md compiler +│ ├── skill-check.ts # Health dashboard +│ └── dev-skill.ts # Watch mode for template development +│ +├── {14 skill dirs}/ # Layer 3: One directory per skill +│ ├── SKILL.md.tmpl # Template (human-edited) +│ └── SKILL.md # Generated (committed, never hand-edited) +│ +├── test/ # 3-tier test infrastructure +│ ├── helpers/ # Parser, runner, judge, eval store +│ └── fixtures/ # Ground truth, planted bugs, HTML fixtures +│ +├── bin/ # Helper scripts (update check, config, etc.) +├── CLAUDE.md # Development instructions +└── package.json # Build scripts + dependencies +``` + +## Key Design Decisions + +### Why Bun? + +gstack compiles to a single ~58MB binary using `bun build --compile`. No `node_modules` at runtime. Bun also provides: +- Native SQLite (for cookie decryption) +- Native TypeScript (no compilation step) +- ~1ms startup (vs ~100ms for Node) +- Built-in HTTP server + +### Why Markdown Skills? + +Skills are Markdown because Claude reads Markdown. There's no runtime, no SDK, no framework — just a `.md` file that tells Claude what to do. This makes skills: +- **Readable** by humans and AI alike +- **Versionable** in git +- **Testable** via static validation, E2E sessions, and LLM judges +- **Composable** via template placeholders + +### Why a Persistent Server? + +Most browser automation tools start and stop a browser for each test. gstack keeps one Chromium instance running, with an HTTP server in front of it. This gives you: +- **Sub-second commands** after the initial ~3s startup +- **Shared state** (cookies, tabs, localStorage) across commands +- **Crash recovery** — the CLI detects a dead server and auto-restarts + +### Why Committed Generated Files? + +The generated `SKILL.md` files are committed to git (not `.gitignore`d) because: +1. Claude reads `SKILL.md` at skill load time — no build step needed +2. CI can validate freshness (`gen:skill-docs --dry-run`) +3. 
`git blame` works on the generated output + +## What's Next? + +Now that you have the big picture, let's dive into the foundation — the browse engine that gives gstack its "eyes and hands." + +→ Next: [Chapter 2: Browse Engine](02_browse_engine.md) + diff --git a/docs/02_browse_engine.md b/docs/02_browse_engine.md new file mode 100644 index 00000000..2a494fb7 --- /dev/null +++ b/docs/02_browse_engine.md @@ -0,0 +1,301 @@ +--- +layout: default +title: "Chapter 2: Browse Engine" +parent: "gstack" +nav_order: 2 +--- + +# Chapter 2: Browse Engine + +Welcome to the browse engine — the foundation that gives gstack its ability to see and interact with real web pages. If the skills are the "brains" of your virtual team, the browse engine is the "eyes and hands." + +## What Problem Does This Solve? + +When an AI agent needs to test a web application, it typically has two options: read the source code and guess what the UI looks like, or use slow, fragile browser tools that take seconds per interaction. + +gstack's browse engine solves this with a **persistent Chromium daemon**. The browser starts once and stays running. Every subsequent command — clicking a button, reading text, taking a screenshot — completes in ~100-200ms. This makes real browser testing practical inside AI workflows where you might need dozens of interactions. + +Think of it like the difference between starting your car's engine every time you want to move vs. leaving it idling. The first approach wastes time; the second lets you react instantly. + +## Client-Server Architecture + +The browse engine uses a classic client-server pattern: + +```mermaid +sequenceDiagram + participant Skill as Skill ($B goto ...) 
+ participant CLI as CLI (cli.ts) + participant State as .gstack/browse.json + participant Server as HTTP Server (server.ts) + participant Browser as Playwright + Chromium + + Skill->>CLI: $B goto https://example.com + CLI->>State: Read pid, port, token + CLI->>Server: POST /command {cmd: "goto", args: ["https://example.com"]} + Server->>Browser: page.goto("https://example.com") + Browser-->>Server: Page loaded + Server-->>CLI: {output: "Navigated to https://example.com"} + CLI-->>Skill: Navigated to https://example.com +``` + +### The CLI (`browse/src/cli.ts`) + +The CLI is a **thin wrapper** — it doesn't touch the browser directly. Instead, it: + +1. Reads the state file (`.gstack/browse.json`) to find the server +2. Starts the server if it's not running +3. Sends an HTTP POST to `/command` +4. Prints the response + +```typescript +// Simplified from cli.ts +async function sendCommand(port: number, token: string, cmd: string, args: string[]) { + const response = await fetch(`http://127.0.0.1:${port}/command`, { + method: 'POST', + headers: { + 'Content-Type': 'application/json', + 'Authorization': `Bearer ${token}`, + }, + body: JSON.stringify({ command: cmd, args }), + }); + return response.json(); +} +``` + +The state file looks like this: + +```json +{ + "pid": 12345, + "port": 34567, + "token": "a1b2c3d4-e5f6-...", + "startedAt": "2026-03-18T10:00:00Z", + "binaryVersion": "716e4c9" +} +``` + +### The Server (`browse/src/server.ts`) + +The server is a `Bun.serve()` HTTP daemon that: +- Listens on a random port (10000-60000) +- Authenticates requests via Bearer token +- Dispatches commands to the appropriate handler +- Auto-shuts down after 30 minutes of idle time + +```typescript +// Simplified from server.ts +const server = Bun.serve({ + port: randomPort(), + async fetch(req) { + if (new URL(req.url).pathname === '/health') { + return Response.json({ status: 'ok', uptime, tabs, currentUrl }); + } + + if (!validateAuth(req)) { + return new 
Response('Unauthorized', { status: 401 }); + } + + const { command, args } = await req.json(); + + if (READ_COMMANDS.has(command)) return handleReadCommand(command, args); + if (WRITE_COMMANDS.has(command)) return handleWriteCommand(command, args); + if (META_COMMANDS.has(command)) return handleMetaCommand(command, args); + + return Response.json({ error: `Unknown command: ${command}` }); + }, +}); +``` + +### The Browser Manager (`browse/src/browser-manager.ts`) + +The `BrowserManager` class wraps Playwright's browser context and adds: + +- **Tab management** — open, close, switch between tabs +- **Ref map** — maps `@e1`, `@e2` references to Playwright Locators (see [Chapter 3](03_snapshot_and_refs.md)) +- **Dialog handling** — auto-accepts `alert()`/`confirm()`/`prompt()` to prevent lockups +- **Event wiring** — captures console logs, network requests, and navigation events + +```mermaid +flowchart TD + BM["BrowserManager"] + BM --> CTX["Browser Context\n(viewport 1280x720)"] + CTX --> T1["Tab 1 (Page)"] + CTX --> T2["Tab 2 (Page)"] + CTX --> T3["Tab N (Page)"] + + BM --> REF["Ref Map\n@e1 → Locator\n@e2 → Locator"] + BM --> BUF["Buffers\n(console, network, dialog)"] + BM --> DLG["Dialog Handler\n(auto-accept)"] + + T1 --> |"framenavigated"| REF + T1 --> |"console"| BUF + T1 --> |"request/response"| BUF + T1 --> |"dialog"| DLG +``` + +## Lifecycle: Start → Command → Idle Shutdown + +Here's how the entire lifecycle works: + +### 1. 
First Command (~3 seconds) + +When no server is running, the CLI starts one: + +```mermaid +sequenceDiagram + participant CLI + participant Server + participant Chromium + + CLI->>CLI: Read .gstack/browse.json → not found + CLI->>Server: Spawn as detached background process + Server->>Chromium: Launch headless Chromium + Chromium-->>Server: Ready + Server->>Server: Write browse.json (pid, port, token) + CLI->>Server: Health check (poll until ready, max 8s) + Server-->>CLI: { status: "ok" } + CLI->>Server: POST /command + Server-->>CLI: Result +``` + +### 2. Subsequent Commands (~100-200ms) + +The server is already running. The CLI reads the state file, verifies the process is alive, and sends the command directly. + +### 3. Idle Shutdown (30 minutes) + +After 30 minutes with no commands, the server shuts itself down. The next command will trigger a fresh start. + +### 4. Crash Recovery + +If Chromium crashes, the server exits immediately (`process.exit(1)`). The CLI detects the dead process on the next command and auto-restarts. + +```typescript +// From browser-manager.ts — fail-fast, no self-healing +browser.on('disconnected', () => { + process.exit(1); +}); +``` + +This **fail-fast** design is intentional. Rather than trying to recover from a corrupted browser state, gstack exits cleanly and lets the CLI start fresh. + +### 5. Binary Version Mismatch + +When the browse binary is rebuilt, the CLI detects the version mismatch (git SHA in `.version` file) and auto-restarts the server with the new binary. 
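As a rough sketch of the crash and version checks described above, the CLI could compare the state file against the running binary before sending a command. The `ServerState` type mirrors the `browse.json` fields shown earlier; `needsRestart` and its `isAlive` callback are hypothetical names used only for illustration.

```typescript
// Hypothetical sketch: `needsRestart` is an illustrative helper, not a function from cli.ts.
interface ServerState {
  pid: number;
  port: number;
  token: string;
  startedAt: string;
  binaryVersion: string;
}

function needsRestart(
  state: ServerState | null,
  currentVersion: string,
  isAlive: (pid: number) => boolean,
): boolean {
  if (state === null) return true;               // no state file: nothing is running yet
  if (!isAlive(state.pid)) return true;          // server exited (e.g. after a Chromium crash)
  return state.binaryVersion !== currentVersion; // binary was rebuilt: restart on the new one
}

const exampleState: ServerState = {
  pid: 12345,
  port: 34567,
  token: "example-token",
  startedAt: "2026-03-18T10:00:00Z",
  binaryVersion: "716e4c9",
};

console.log(needsRestart(exampleState, "716e4c9", () => true)); // false: same version, process alive
console.log(needsRestart(exampleState, "abc1234", () => true)); // true: version mismatch
```

Passing `isAlive` as a callback keeps the sketch free of platform-specific process checks; how the real CLI probes the process is not shown here.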
+ +## Circular Buffers: Console, Network, Dialog + +The server captures three streams of browser events in **circular buffers**, fixed-size ring buffers that overwrite the oldest entries when full: + +```typescript +// Simplified from buffers.ts +class CircularBuffer<T> { + private items: (T | undefined)[]; + private head = 0; // index of the next slot to write + private _size = 0; + private _totalAdded = 0; + + // Assumption: capacity is supplied at construction time + constructor(private readonly capacity: number) { + this.items = new Array(capacity); + } + + push(entry: T) { + this.items[this.head] = entry; + this.head = (this.head + 1) % this.capacity; + if (this._size < this.capacity) this._size++; + this._totalAdded++; + } + + last(n: number): T[] { + // Walk from the oldest of the last n entries to the newest + const count = Math.min(n, this._size); + const out: T[] = []; + for (let i = count; i >= 1; i--) { + out.push(this.items[(this.head - i + this.capacity) % this.capacity] as T); + } + return out; + } +} +``` + +Each buffer holds up to **50,000 entries** and flushes to disk every second: + +| Buffer | File | Entry Type | +|--------|------|-----------| +| Console | `.gstack/browse-console.log` | `{ timestamp, level, text }` | +| Network | `.gstack/browse-network.log` | `{ method, url, status, duration, size }` | +| Dialog | `.gstack/browse-dialog.log` | `{ type, message, action, response }` | + +Flush failures are **non-fatal**: the buffers persist in memory even if disk writes fail. This ensures that browser commands never hang because of a logging issue. + +## Security Model + +The browse engine has several security layers: + +| Layer | What It Protects | +|-------|-----------------| +| **Localhost only** | Server binds to `127.0.0.1`, never `0.0.0.0` | +| **Bearer token** | Random UUID in state file (mode `0o600`, owner-only) | +| **Path validation** | File I/O restricted to `/tmp` and `process.cwd()`; no `..` traversal | +| **No shell injection** | Hardcoded browser paths; `Bun.spawn()` with argument arrays | +| **Cookie security** | Values truncated in all logs; keychain keys cached per-session only | + +## Configuration (`browse/src/config.ts`) + +The config system resolves paths in this priority: + +1. `BROWSE_STATE_FILE` environment variable (set by CLI for server) +2. Git root → `/.gstack/` +3. 
Current working directory fallback (non-git environments) + +```typescript +// Simplified from config.ts +function resolveConfig(): BrowseConfig { + const gitRoot = getGitRoot(); // git rev-parse --show-toplevel + const projectDir = gitRoot ?? process.cwd(); + const stateDir = path.join(projectDir, '.gstack'); + + return { + projectDir, + stateDir, + stateFile: path.join(stateDir, 'browse.json'), + consoleLog: path.join(stateDir, 'browse-console.log'), + networkLog: path.join(stateDir, 'browse-network.log'), + dialogLog: path.join(stateDir, 'browse-dialog.log'), + }; +} +``` + +## How It Works Under the Hood + +Let's trace a complete command — from skill invocation to browser action and back: + +```mermaid +sequenceDiagram + participant Skill as /qa skill + participant Shell as Bash Shell + participant CLI as browse CLI + participant HTTP as HTTP Server + participant Dispatch as Command Dispatch + participant Handler as Write Handler + participant PW as Playwright + participant Chrome as Chromium + + Skill->>Shell: $B click @e3 + Shell->>CLI: ./browse click @e3 + CLI->>CLI: Read .gstack/browse.json + CLI->>HTTP: POST /command {"command":"click","args":["@e3"]} + HTTP->>HTTP: Validate Bearer token + HTTP->>Dispatch: Route to WRITE_COMMANDS + Dispatch->>Handler: handleWriteCommand("click", ["@e3"]) + Handler->>Handler: resolveRef("@e3") → Locator + Handler->>PW: locator.click({timeout: 15000}) + PW->>Chrome: CDP click event + Chrome-->>PW: Click completed + PW-->>Handler: Success + Handler-->>HTTP: {output: "Clicked @e3"} + HTTP-->>CLI: 200 OK + CLI-->>Shell: Clicked @e3 + Shell-->>Skill: Clicked @e3 +``` + +Key things to notice: +1. **The `$B` alias** — skills invoke the browse binary via `$B`, which is resolved to the actual binary path at skill load time +2. **Ref resolution** — `@e3` is looked up in the ref map (set by the last `snapshot` command) to get a Playwright Locator +3. 
**Timeout** — all interactions have a 15-second timeout with AI-friendly error messages +4. **Error wrapping** — Playwright errors are translated into messages like "Ref @e3 is stale. Run `snapshot` to get fresh refs." + +## What's Next? + +Now that you understand the browse engine's architecture, let's look at how the snapshot system gives AI agents a structured view of web pages. + +→ Next: [Chapter 3: Snapshot & Ref System](03_snapshot_and_refs.md) + diff --git a/docs/03_snapshot_and_refs.md b/docs/03_snapshot_and_refs.md new file mode 100644 index 00000000..277db71b --- /dev/null +++ b/docs/03_snapshot_and_refs.md @@ -0,0 +1,304 @@ +--- +layout: default +title: "Chapter 3: Snapshot & Ref System" +parent: "gstack" +nav_order: 3 +--- + +# Chapter 3: Snapshot & Ref System + +Welcome to the snapshot system — the mechanism that lets AI agents "see" a web page's structure and interact with specific elements by name. If the browse engine is gstack's "eyes and hands," the snapshot system is the "map" that tells the hands where to reach. + +## What Problem Does This Solve? + +When a human looks at a web page, they immediately see buttons, links, form fields, and headings. They can point and say "click that blue button." An AI agent, on the other hand, gets raw HTML — thousands of nested `
<div>` tags with no obvious way to identify what matters. + +The snapshot system solves this by extracting the page's **accessibility tree** (the same structure screen readers use) and assigning **numbered references** like `@e1`, `@e2`, `@e3` to each interactive or meaningful element. The agent can then say `click @e3` instead of writing a fragile CSS selector. + +Think of it like a theater program. Instead of saying "the person in the red shirt, third from the left," you can say "Actor #3." Simple, unambiguous, and fast. + +## A Simple Example + +Here's what happens when you run `$B snapshot` on a login page: + +``` +- heading "Sign In" [level=1] +- textbox "Email" @e1 +- textbox "Password" @e2 +- button "Sign In" @e3 +- link "Forgot password?" @e4 +- paragraph: Don't have an account? + - link "Sign up" @e5 +``` + +Now the agent can: +```bash +$B fill @e1 "user@example.com" +$B fill @e2 "hunter2" +$B click @e3 +``` + +No CSS selectors. No XPath. No fragile DOM queries. Just meaningful references that map directly to what a user would see. + +## How Snapshots Work + +The snapshot system is built on Playwright's `ariaSnapshot()`, which returns a text representation of the browser's accessibility tree, the same tree that assistive technologies use. + +```mermaid +flowchart LR + PAGE["Web Page\n(HTML + DOM)"] + A11Y["Accessibility Tree\n(Playwright)"] + PARSE["Parse & Assign\n@e1, @e2, ..."] + OUTPUT["YAML-like\nSnapshot Text"] + REFS["Ref Map\n@e1 → Locator\n@e2 → Locator"] + + PAGE --> A11Y --> PARSE + PARSE --> OUTPUT + PARSE --> REFS +``` + +### Step 1: Extract the Accessibility Tree + +Playwright's `ariaSnapshot()` method returns a YAML-like representation of the page's semantic structure: + +```typescript +// From snapshot.ts +const tree = await page.locator(scope).ariaSnapshot({ interestingOnly }); +``` + +The `interestingOnly` parameter (toggled by the `-i` flag) controls whether to include all elements or just interactive ones (buttons, links, inputs, etc.). 
+ +### Step 2: Parse and Assign Refs + +Each line of the tree is parsed to identify its role and name. Interactive elements get sequential `@e` references: + +```typescript +// Simplified from snapshot.ts +let refCounter = 0; +const refMap = new Map(); + +for (const line of treeLines) { + const { role, name, depth } = parseLine(line); + + if (isInteractive(role)) { + refCounter++; + const ref = `@e${refCounter}`; + const locator = buildLocator(page, role, name); + refMap.set(ref, { locator, role, name }); + // Append ref to the output line + } +} +``` + +### Step 3: Build Playwright Locators + +For each ref, the system builds a Playwright Locator using `getByRole()`. When multiple elements have the same role and name, it uses `.nth()` for disambiguation: + +```typescript +// Build a locator for "button named Submit" +const locator = page.getByRole('button', { name: 'Submit' }); + +// If there are two "Submit" buttons, use nth() +const locator = page.getByRole('button', { name: 'Submit' }).nth(1); +``` + +This approach is **external to the DOM** — no JavaScript is injected into the page, no CSP issues, no framework conflicts. 
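One way to decide when `.nth()` is needed is to count how often each role and name pair appears in the tree. The sketch below does this in two passes; `TreeEntry`, `IndexedEntry`, and `assignNthIndexes` are hypothetical names, and the real `snapshot.ts` may organize this differently.

```typescript
// Hypothetical sketch: these names are illustrative, not taken from snapshot.ts.
interface TreeEntry {
  role: string;
  name: string;
}

interface IndexedEntry extends TreeEntry {
  // Zero-based position among entries with the same role and name,
  // or null when the pair is unique and no .nth() call is needed.
  nth: number | null;
}

function assignNthIndexes(entries: TreeEntry[]): IndexedEntry[] {
  const keyOf = (e: TreeEntry): string => JSON.stringify([e.role, e.name]);

  // First pass: count how often each role+name pair occurs.
  const totals = new Map<string, number>();
  for (const e of entries) {
    totals.set(keyOf(e), (totals.get(keyOf(e)) ?? 0) + 1);
  }

  // Second pass: hand out running indexes only to duplicated pairs.
  const seen = new Map<string, number>();
  return entries.map((e) => {
    const key = keyOf(e);
    const index = seen.get(key) ?? 0;
    seen.set(key, index + 1);
    return { ...e, nth: (totals.get(key) ?? 0) > 1 ? index : null };
  });
}

const indexed = assignNthIndexes([
  { role: "button", name: "Submit" },
  { role: "link", name: "Home" },
  { role: "button", name: "Submit" },
]);
console.log(indexed.map((e) => e.nth)); // nth values: 0, null, 1
```

An entry with a non-null `nth` would then become `page.getByRole(role, { name }).nth(nth)`, while unique entries can use the plain `getByRole()` locator.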
+ +### Step 4: Store in Ref Map + +The ref map is stored in the `BrowserManager` and persists until the next navigation event: + +```mermaid +sequenceDiagram + participant Agent as AI Agent + participant Snap as snapshot command + participant BM as BrowserManager + participant Page as Chromium Page + + Agent->>Snap: $B snapshot + Snap->>Page: ariaSnapshot() + Page-->>Snap: YAML tree + Snap->>Snap: Parse, assign @e1..@eN + Snap->>BM: setRefMap(refs) + Snap-->>Agent: YAML output with @refs + + Note over BM: Refs persist until navigation + + Agent->>BM: click @e3 + BM->>BM: resolveRef("@e3") → Locator + BM->>Page: locator.click() +``` + +## Ref Resolution and Staleness + +When a command uses an `@e` or `@c` reference, the browser manager resolves it: + +```typescript +// Simplified from browser-manager.ts +// Must be async: the staleness check awaits locator.count() +async resolveRef(selector: string): Promise<Locator> { + if (selector.startsWith('@e') || selector.startsWith('@c')) { + const entry = this.refMap.get(selector); + if (!entry) throw new Error(`Unknown ref: ${selector}`); + + // Staleness check: is the element still in the DOM? + const count = await entry.locator.count(); + if (count === 0) { + throw new Error(`Ref ${selector} is stale. Run \`snapshot\` to get fresh refs.`); + } + + return entry.locator; + } + + // Fall back to CSS selector (`page` is the active tab's Playwright Page) + return page.locator(selector); +} +``` + +Refs become **stale** when the page navigates (the `framenavigated` event clears the ref map). This is intentional: after navigation, the DOM has changed, so old refs might point to the wrong elements. + +**The rule is simple:** after any navigation (clicking a link, submitting a form, calling `goto`), run `snapshot` again to get fresh refs. 
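The "snapshot again after navigation" rule follows directly from how the ref map is cleared. Here is a minimal sketch of that behavior, using plain strings in place of Playwright Locators; `RefStore` and its method names are hypothetical stand-ins for the ref handling in `BrowserManager`, and the error text is illustrative.

```typescript
// Hypothetical sketch: `RefStore` is an illustrative stand-in, not code from browser-manager.ts.
class RefStore<T> {
  private refs = new Map<string, T>();

  // Called by `snapshot`: replaces all refs with the freshly assigned ones.
  set(refs: Map<string, T>): void {
    this.refs = new Map(refs);
  }

  // Called from a navigation handler (the tutorial names the `framenavigated` event).
  onNavigated(): void {
    this.refs.clear();
  }

  resolve(ref: string): T {
    const entry = this.refs.get(ref);
    if (entry === undefined) {
      throw new Error(`Unknown ref: ${ref}. Run \`snapshot\` to get fresh refs.`);
    }
    return entry;
  }
}

const store = new RefStore<string>();
store.set(new Map([["@e3", "button 'Sign In'"]]));
console.log(store.resolve("@e3")); // button 'Sign In'

store.onNavigated(); // e.g. after `click @e3` loads a new page
try {
  store.resolve("@e3");
} catch (err) {
  console.log((err as Error).message); // the old ref is gone until the next snapshot
}
```

Clearing the whole map on navigation is the conservative choice: a ref can never silently point at an element from the previous page.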
+ +## Snapshot Flags + +The snapshot command supports several flags that control what you see: + +| Flag | Short | Description | Example | +|------|-------|------------|---------| +| `--interactive` | `-i` | Only show interactive elements (buttons, links, inputs) | `$B snapshot -i` | +| `--compact` | `-c` | Remove empty structural nodes | `$B snapshot -c` | +| `--depth` | `-d N` | Limit tree depth | `$B snapshot -d 3` | +| `--selector` | `-s sel` | Scope to a CSS selector | `$B snapshot -s "#main"` | +| `--diff` | `-D` | Show unified diff vs. previous snapshot | `$B snapshot -D` | +| `--annotate` | `-a` | Take screenshot with red overlay boxes at each ref | `$B snapshot -a` | +| `--output` | `-o path` | Save annotated screenshot to file | `$B snapshot -a -o /tmp/snap.png` | +| `--cursor-interactive` | `-C` | Include `@c` refs for cursor:pointer/onclick elements | `$B snapshot -C` | + +These flags are defined in the `SNAPSHOT_FLAGS` metadata array in `browse/src/snapshot.ts` — the single source of truth used by both the CLI parser and the documentation generator. + +### Interactive-Only Mode (`-i`) + +For complex pages with hundreds of elements, `-i` filters down to just the interactive ones: + +``` +Full snapshot (150 elements): +- banner + - navigation "Main" + - list + - listitem + - link "Home" @e1 + - link "Products" @e2 + ... + +Interactive-only (12 elements): +- link "Home" @e1 +- link "Products" @e2 +- textbox "Search" @e3 +- button "Search" @e4 +... +``` + +### Diff Mode (`-D`) + +After making a change (clicking a button, filling a form), you can see exactly what changed: + +```bash +$B snapshot # Baseline +$B click @e3 # Make a change +$B snapshot -D # See what's different +``` + +Output: +```diff +- button "Submit" @e3 ++ paragraph: "Form submitted successfully!" 
++ link "Back to home" @e4 +``` + +### Annotated Screenshots (`-a`) + +The `-a` flag takes a screenshot and overlays red boxes at each ref's location — perfect for visual debugging: + +```bash +$B snapshot -a -o /tmp/annotated.png +``` + +This uses Playwright's `boundingBox()` to locate each ref on screen, then draws the overlay. + +## Cursor-Interactive Refs (`@c`) + +Some elements are interactive via CSS (`cursor: pointer`) or JavaScript (`onclick`) but don't have proper ARIA roles. The `-C` flag scans for these and assigns separate `@c` references: + +``` +- button "Menu" @e1 +- [cursor-interactive] div.card @c1 +- [cursor-interactive] span.tag @c2 +- link "More" @e2 +``` + +This catches the "invisible interactive" elements that the accessibility tree misses — common in custom components and SPA frameworks. + +## How It Works Under the Hood + +Let's trace the full flow of a snapshot command: + +```mermaid +flowchart TD + CMD["$B snapshot -i -s '#main'"] + PARSE["parseSnapshotArgs()\n→ interactive: true, selector: '#main'"] + ARIA["page.locator('#main')\n.ariaSnapshot({interestingOnly: true})"] + TREE["Raw YAML tree\n(from Playwright)"] + LOOP["For each line:\n1. parseLine() → role, name\n2. Assign @eN if interactive\n3. Build Locator via getByRole()"] + MAP["Store in refMap:\n@e1 → Locator\n@e2 → Locator\n..."] + OUT["Return formatted tree\nwith @refs inline"] + + CMD --> PARSE --> ARIA --> TREE --> LOOP + LOOP --> MAP + LOOP --> OUT +``` + +The key insight is that Playwright's Locators are **not CSS selectors** — they're live references that Playwright resolves at interaction time. This means they work even when: +- The element's position in the DOM changes (dynamic layouts) +- CSS classes are renamed (style refactors) +- The page uses Shadow DOM (web components) + +They only break when the element is **removed from the DOM entirely** — which the staleness check catches. 
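Going back to the first box in the flow above, flag parsing can be sketched as a small loop over the argument list. The option names and parsing logic below are assumptions based on the flags table in this chapter; `parseSnapshotFlags` and `SnapshotOptions` are hypothetical names, not the actual `parseSnapshotArgs()` from `snapshot.ts`.

```typescript
// Hypothetical sketch: derived from the flags table, not from the real parser.
interface SnapshotOptions {
  interactive: boolean;
  compact: boolean;
  diff: boolean;
  annotate: boolean;
  cursorInteractive: boolean;
  depth?: number;
  selector?: string;
  output?: string;
}

function parseSnapshotFlags(args: string[]): SnapshotOptions {
  const opts: SnapshotOptions = {
    interactive: false, compact: false, diff: false, annotate: false, cursorInteractive: false,
  };
  const queue = [...args];

  // Flags such as -d, -s and -o consume the following argument as their value.
  const takeValue = (flag: string): string => {
    const value = queue.shift();
    if (value === undefined) throw new Error(`Flag ${flag} requires a value`);
    return value;
  };

  while (queue.length > 0) {
    const flag = queue.shift() as string;
    switch (flag) {
      case "-i": case "--interactive": opts.interactive = true; break;
      case "-c": case "--compact": opts.compact = true; break;
      case "-D": case "--diff": opts.diff = true; break;
      case "-a": case "--annotate": opts.annotate = true; break;
      case "-C": case "--cursor-interactive": opts.cursorInteractive = true; break;
      case "-d": case "--depth": {
        const raw = takeValue(flag);
        const depth = Number.parseInt(raw, 10);
        if (Number.isNaN(depth) || depth < 0) throw new Error(`Invalid depth: ${raw}`);
        opts.depth = depth;
        break;
      }
      case "-s": case "--selector": opts.selector = takeValue(flag); break;
      case "-o": case "--output": opts.output = takeValue(flag); break;
      default: throw new Error(`Unknown snapshot flag: ${flag}`);
    }
  }
  return opts;
}

console.log(parseSnapshotFlags(["-i", "-s", "#main"]));
```

Note that `-c`/`-C` and `-d`/`-D` differ only by case, so a parser like this has to match flags case-sensitively.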
+ +## Practical Patterns + +### Pattern 1: Navigate and Explore +```bash +$B goto https://myapp.com +$B snapshot -i # See interactive elements +$B click @e3 # Click a nav link +$B snapshot -i # Fresh refs after navigation +``` + +### Pattern 2: Fill a Form +```bash +$B snapshot -s "form" # Scope to the form +$B fill @e1 "John Doe" +$B fill @e2 "john@example.com" +$B click @e3 # Submit +$B snapshot -D # See what changed +``` + +### Pattern 3: Visual Verification +```bash +$B snapshot -a -o /tmp/before.png # Annotated screenshot +$B click @e5 # Make a change +$B snapshot -a -o /tmp/after.png # Compare visually +``` + +### Pattern 4: Complex Page Debugging +```bash +$B snapshot -d 2 # Shallow view (top 2 levels) +$B snapshot -s "#sidebar" -c # Compact sidebar view +$B snapshot -C # Include cursor-interactive elements +``` + +## What's Next? + +Now that you understand how gstack sees and references page elements, let's explore the full command system — all the read, write, and meta commands available. + +→ Next: [Chapter 4: Command System](04_command_system.md) + diff --git a/docs/04_command_system.md b/docs/04_command_system.md new file mode 100644 index 00000000..3cdb95dc --- /dev/null +++ b/docs/04_command_system.md @@ -0,0 +1,389 @@ +--- +layout: default +title: "Chapter 4: Command System" +parent: "gstack" +nav_order: 4 +--- + +# Chapter 4: Command System + +Welcome to the command system — the complete set of operations you can perform with the browse engine. In the previous chapters, you learned how the [browse engine](02_browse_engine.md) works and how the [snapshot system](03_snapshot_and_refs.md) identifies page elements. Now let's explore every command available. + +## What Problem Does This Solve? + +An AI agent testing a web application needs to do three things: **read** page state (is the text correct? 
what links exist?), **write** to the page (click buttons, fill forms, navigate), and perform **meta** operations (manage tabs, take screenshots, chain commands). The command system provides a clean, categorized interface for all of these. + +## The Command Registry + +All commands are defined in a single file: `browse/src/commands.ts`. This is the **single source of truth**: it's used by the CLI help text, the documentation generator, the test validator, and the server's dispatch logic. + +```typescript +// From commands.ts: three exhaustive sets +export const READ_COMMANDS = new Set([ + 'text', 'html', 'links', 'forms', 'accessibility', 'js', 'eval', + 'css', 'attrs', 'console', 'network', 'cookies', 'storage', + 'perf', 'dialog', 'is' +]); + +export const WRITE_COMMANDS = new Set([ + 'goto', 'back', 'forward', 'reload', 'click', 'fill', 'select', + 'hover', 'type', 'press', 'scroll', 'wait', 'viewport', 'cookie', + 'cookie-import', 'cookie-import-browser', 'header', 'useragent', + 'upload', 'dialog-accept', 'dialog-dismiss' +]); + +export const META_COMMANDS = new Set([ + 'tabs', 'tab', 'newtab', 'closetab', 'status', 'stop', 'restart', + 'screenshot', 'pdf', 'responsive', 'chain', 'diff', 'url', 'snapshot' +]); +``` + +Each command also has metadata in `COMMAND_DESCRIPTIONS`: + +```typescript +// Value type inferred from the fields shown below +export const COMMAND_DESCRIPTIONS: Record<string, { category: string; description: string; usage: string }> = { + goto: { + category: 'navigation', + description: 'Navigate to a URL', + usage: '$B goto <url>', + }, + click: { + category: 'interaction', + description: 'Click an element', + usage: '$B click <@ref|selector>', + }, + // ... 50+ commands +}; +``` + +A **load-time validation** ensures that every command in the three sets has a description, and vice versa. If you add a command but forget its description, the binary won't start. + +## Read Commands + +Read commands extract data from the page without side effects. They never change anything. 
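Before looking at individual read commands, it may help to see how the registry's load-time validation could work. The sketch below uses a tiny registry with made-up contents; `validateRegistry`, the `READ`/`WRITE`/`META` sets, and the description strings are all illustrative assumptions rather than the actual code in `commands.ts`.

```typescript
// Hypothetical sketch: a toy registry showing one way a load-time consistency check could work.
const READ = new Set<string>(["text", "html"]);
const WRITE = new Set<string>(["goto", "click"]);
const META = new Set<string>(["snapshot"]);

const DESCRIPTIONS: Record<string, { description: string }> = {
  text: { description: "Clean page text" },
  html: { description: "Page HTML" },
  goto: { description: "Navigate to a URL" },
  click: { description: "Click an element" },
  snapshot: { description: "Accessibility tree with refs" },
};

function validateRegistry(sets: Set<string>[], descriptions: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const all = new Set<string>();

  // Every command must live in exactly one set and have a description.
  for (const set of sets) {
    for (const cmd of Array.from(set)) {
      if (all.has(cmd)) problems.push(`Command in more than one set: ${cmd}`);
      all.add(cmd);
      if (!Object.prototype.hasOwnProperty.call(descriptions, cmd)) {
        problems.push(`Missing description: ${cmd}`);
      }
    }
  }

  // "And vice versa": every description must belong to a registered command.
  for (const cmd of Object.keys(descriptions)) {
    if (!all.has(cmd)) problems.push(`Description without command: ${cmd}`);
  }
  return problems;
}

const problems = validateRegistry([READ, WRITE, META], DESCRIPTIONS);
if (problems.length > 0) throw new Error(problems.join("\n")); // refuse to start on mismatch
console.log("registry is consistent");
```

Collecting all problems before throwing means a single failed start reports every mismatch at once, rather than one per rebuild.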
+ +### Text and HTML + +```bash +$B text # Clean page text (strips script/style/noscript/svg) +$B text "#main" # Text from a specific element +$B html # Full page HTML +$B html ".card" # HTML of a specific element +``` + +The `text` command uses a custom `getCleanText()` function that strips `