ChromePilot uses a dual-LLM architecture to separate high-level reasoning from low-level execution:
- Role: High-level task planning
- Input: Screenshot + HTML + User request
- Output: Array of plain English step descriptions
- Example Output:
{ "needs_steps": true, "steps": [ "Open a new tab with YouTube.com", "Click on the search box visible in the page", "Type 'cats' into the search box" ], "message": "I'll open YouTube and search for cats." }
- Role: Translate steps into tool calls
- Input: Step description + Execution history (previous inputs/outputs)
- Output: Tool name + Parameters
- Example:
  - Input: "Click the first link from the search results"
  - Has access to: Previous step outputs showing search results
  - Output:
{ "tool": "clickElement", "inputs": { "selector": ".result:first-child a" } }
Steps can reference previous outputs:
- "Open the URL shown in the previous step"
- "Click the element that was highlighted"
- "Use the tab ID from step 1"
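References like these can only be resolved if the executor receives the earlier outputs together with the current step text. The following is a minimal sketch of how that input might be assembled. The helper name `buildExecutorPrompt` and the history entry shape (`step`, `tool`, `output`) are assumptions made for illustration, not ChromePilot's actual code.

```javascript
// Hypothetical helper: formats the current step plus all prior outputs
// into a single text block for the (text-only) executor model.
function buildExecutorPrompt(stepDescription, history) {
  const previous =
    history.length === 0
      ? "No previous steps."
      : history
          .map(
            (entry, i) =>
              `Step ${i + 1}: ${entry.step}\n` +
              `  tool: ${entry.tool}\n` +
              `  output: ${JSON.stringify(entry.output)}`
          )
          .join("\n");
  return `Previous steps:\n${previous}\n\nCurrent step: ${stepDescription}`;
}

// Example history entry, using the openTab output shape described in this document.
const executionHistory = [
  {
    step: "Open a new tab with Google.com",
    tool: "openTab",
    output: { tabId: 123, url: "https://www.google.com" },
  },
];

const executorPrompt = buildExecutorPrompt("Use the tab ID from step 1", executionHistory);
console.log(executorPrompt);
```

With the tab ID present in the serialized history, a step such as "Use the tab ID from step 1" gives the executor a concrete value to copy into its tool call.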
- Orchestrator: Focuses on "what to do" without worrying about tool syntax
- Executor: Focuses on "how to do it" with full execution context
- Steps are human-readable plain English
- Easy to debug (see exactly what the orchestrator planned)
- Executor can adapt to different page states using execution history
- Orchestrator runs once with vision (expensive)
- Executor runs per-step without vision (fast, llama3.1-8b-32k:latest is lightweight)
User Request
↓
[Orchestrator: qwen3-vl-32k]
↓
Plain English Steps
↓
User Approves Plan
↓
For each step:
↓
[Executor: llama3.1-8b-32k:latest] ← Previous step outputs
↓
Tool Call (name + params)
↓
Execute Tool
↓
Store Output
↓
Next Step
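The per-step portion of this flow can be sketched as a small loop. This is an illustrative sketch under stated assumptions: `runPlan`, `callExecutor`, and `runTool` are hypothetical names, and the model call and tool runner are injected as stubs so the example runs without a model or browser.

```javascript
// Hypothetical execution loop. callExecutor stands in for the executor model
// call and runTool for the tool runner; both are supplied by the caller.
async function runPlan(steps, callExecutor, runTool) {
  const history = [];
  for (const step of steps) {
    // The executor sees the step text plus every earlier output.
    const toolCall = await callExecutor(step, history);
    const output = await runTool(toolCall.tool, toolCall.inputs);
    // Store the output so later steps can reference it.
    history.push({ step, tool: toolCall.tool, inputs: toolCall.inputs, output });
  }
  return history;
}

// Stubs that replay the tool calls from the Google walkthrough in this document.
const stubExecutor = async (step, history) =>
  history.length === 0
    ? { tool: "openTab", inputs: { url: "https://www.google.com" } }
    : { tool: "clickElement", inputs: { selector: "input[name='q']" } };

const stubTool = async (tool, inputs) =>
  tool === "openTab"
    ? { tabId: 123, url: inputs.url }
    : { success: true, elementText: "" };

const planRun = runPlan(
  ["Open a new tab with Google.com", "Click on the search input box"],
  stubExecutor,
  stubTool
);
planRun.then((history) => console.log(JSON.stringify(history, null, 2)));
```

The sketch leaves out plan approval and error handling; a real loop would likely need to decide what happens when a tool call fails partway through the plan.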
User: "Search for cats on Google"
{
"needs_steps": true,
"steps": [
"Open a new tab with Google.com",
"Click on the search input box",
"Type 'cats' in the search box",
"Click the search button or press Enter"
],
"message": "I'll search for cats on Google for you."
}
Step 1: "Open a new tab with Google.com"
- Executor receives: No previous context
- Executor output:
{ "tool": "openTab", "inputs": { "url": "https://www.google.com" } }
- Tool execution: Opens tab
- Stored output:
{ "tabId": 123, "url": "https://www.google.com" }
Step 2: "Click on the search input box"
- Executor receives: Step 1 outputs (tabId: 123)
- Executor output:
{ "tool": "clickElement", "inputs": { "selector": "input[name='q']" } }
- Tool execution: Clicks element
- Stored output:
{ "success": true, "elementText": "" }
Step 3: "Type 'cats' in the search box"
- Executor receives: Step 1 & 2 outputs
- Executor can see the search box was successfully clicked
- And so on...
Tools are defined with input/output specifications:
{
name: "openTab",
description: "Opens a new browser tab with the specified URL",
inputs: ["url"],
outputs: ["tabId", "url"],
inputDescription: "url: The URL to open in the new tab",
outputDescription: "Returns the tab ID and confirmed URL of the opened tab"
}
This allows:
- Orchestrator to understand what tools can do
- Executor to know what parameters are needed
- Executor to know what outputs will be available for next steps
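Because each definition lists its input names, an executor's tool call can be checked before it is run. The sketch below illustrates that idea only. `validateToolCall` is a hypothetical helper, and it assumes every name listed in `inputs` is required, which the definition format above does not state.

```javascript
// Trimmed copy of the openTab definition shown above.
const toolDefinitions = [
  {
    name: "openTab",
    description: "Opens a new browser tab with the specified URL",
    inputs: ["url"],
    outputs: ["tabId", "url"],
  },
];

// Hypothetical check: returns a list of problems, empty when the call looks valid.
// Assumption: every name in `inputs` is required and no other names are allowed.
function validateToolCall(call, definitions) {
  const def = definitions.find((d) => d.name === call.tool);
  if (!def) return [`unknown tool: ${call.tool}`];
  const given = call.inputs || {};
  const missing = def.inputs.filter((name) => !(name in given));
  const extra = Object.keys(given).filter((name) => !def.inputs.includes(name));
  return [
    ...missing.map((name) => `missing input: ${name}`),
    ...extra.map((name) => `unexpected input: ${name}`),
  ];
}

console.log(
  validateToolCall({ tool: "openTab", inputs: { url: "https://www.google.com" } }, toolDefinitions)
);
console.log(validateToolCall({ tool: "openTab", inputs: {} }, toolDefinitions));
```

A check of this kind catches malformed executor output early, before it reaches the browser.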
- Description: Click on buttons, links, or any interactive element using accessibility tree ID
- Inputs:
  - `a11yId` (required): Accessibility tree element ID from getSchema output
  - `clickType` (optional): 'single' (default), 'double', or 'right'
- Outputs: `success`, `elementText`, `elementClicked`
- Use Cases: Click buttons, links, expand dropdowns, trigger UI actions
- Note: Must call getSchema first to get a11yId values
- Description: Enter text into input fields, textareas, or contenteditable elements
- Inputs:
  - `a11yId` (required): Accessibility tree element ID from getSchema output
  - `text` (required): Text to type
  - `mode` (optional): 'replace' (default) or 'append'
  - `submit` (optional): true to press Enter after typing
- Outputs: `success`, `finalValue`
- Use Cases: Fill forms, search bars, comment boxes, login fields
- Note: Must call getSchema first to get a11yId values
- Description: Select an option from dropdown menus
- Inputs:
  - `a11yId` (required): Accessibility tree element ID from getSchema output
  - `option` (required): Value to select
  - `by` (optional): 'value' (default), 'text', or 'index'
- Outputs: `success`, `selectedValue`, `selectedText`
- Use Cases: Country selectors, filters, form dropdowns
- Note: Must call getSchema first to get a11yId values
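The three `by` modes for selecting a dropdown option can be illustrated over plain data. This is a sketch, not the tool's implementation: `findOptionIndex` is a hypothetical helper, options are modeled as `{ value, text }` records, and exact, case-sensitive matching is an assumption.

```javascript
// Hypothetical option picker over plain { value, text } records.
// Returns the index of the matching option, or -1 when nothing matches.
function findOptionIndex(options, option, by = "value") {
  if (by === "index") {
    const s = String(option);
    if (!/^\d+$/.test(s)) return -1; // only non-negative integers are valid indexes
    const i = Number(s);
    return i < options.length ? i : -1;
  }
  if (by === "text") return options.findIndex((o) => o.text === String(option));
  if (by === "value") return options.findIndex((o) => o.value === String(option));
  throw new Error(`unsupported "by" mode: ${by}`);
}

const countries = [
  { value: "us", text: "United States" },
  { value: "de", text: "Germany" },
];

console.log(findOptionIndex(countries, "de"));              // match by value (default)
console.log(findOptionIndex(countries, "Germany", "text")); // match by visible text
console.log(findOptionIndex(countries, 0, "index"));        // match by position
```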
- Description: Simulate keyboard key presses including shortcuts
- Inputs:
  - `key` (required): Key name (Enter, Tab, Escape, ArrowUp/Down, PageUp/Down, etc.) or shortcut (Ctrl+A, Ctrl+F)
  - `selector` (optional): Element to focus before pressing key
- Outputs: `success`, `keyPressed`
- Use Cases: Submit forms (Enter), navigate (Tab), close modals (Escape), shortcuts (Ctrl+F)
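A `key` value such as `Ctrl+F` has to be split into a base key and its modifiers before a key press can be simulated. The sketch below shows one possible way to do that. `parseKeyCombo` is a hypothetical helper, and the returned field names follow the standard `KeyboardEvent` init options (`key`, `ctrlKey`, `shiftKey`, `altKey`, `metaKey`); how ChromePilot actually parses shortcuts is not stated in this document.

```javascript
// Modifier names (lowercased) mapped to KeyboardEvent init option names.
const MODIFIER_FLAGS = new Map([
  ["ctrl", "ctrlKey"],
  ["shift", "shiftKey"],
  ["alt", "altKey"],
  ["meta", "metaKey"],
]);

// Hypothetical parser: "Ctrl+Shift+F" -> { key: "F", ctrlKey: true, shiftKey: true, ... }
function parseKeyCombo(combo) {
  const parts = combo.split("+").map((p) => p.trim()).filter(Boolean);
  const result = { key: "", ctrlKey: false, shiftKey: false, altKey: false, metaKey: false };
  for (const part of parts) {
    const flag = MODIFIER_FLAGS.get(part.toLowerCase());
    if (flag) result[flag] = true;
    else result.key = part; // the last non-modifier part is the base key
  }
  if (!result.key) throw new Error(`no key in combination: ${combo}`);
  return result;
}

console.log(parseKeyCombo("Ctrl+A"));
console.log(parseKeyCombo("Enter"));
```

The sketch does not handle a literal `+` key, since it uses `+` as the separator.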
- Description: Scroll the page or specific scrollable elements
- Inputs:
  - `target` (optional): CSS selector for element to scroll (empty = page)
  - `direction` (required): 'up', 'down', 'top', 'bottom', or 'toElement'
  - `amount` (optional): Pixels to scroll for up/down (default 500)
- Outputs: `success`, `scrollPosition`
- Use Cases: Load more content, scroll to sections, navigate long pages
- Description: Navigate to URLs or control browser history
- Inputs:
  - `action` (required): 'goto', 'back', 'forward', or 'reload'
  - `url` (required for 'goto'): URL to navigate to
- Outputs: `success`, `currentUrl`, `title`
- Use Cases: Open websites, go back/forward, refresh pages
- Description: Open, close, switch, or list browser tabs
- Inputs:
  - `action` (required): 'open', 'close', 'switch', or 'list'
  - `tabId` (required for close/switch): Tab ID
  - `url` (required for open): URL for new tab
- Outputs: `success`, `tabs`, `activeTabId`
- Use Cases: Multi-tab workflows, organize browsing, compare pages
- Description: Wait for elements to appear, page to load, or network to idle
- Inputs:
  - `waitType` (required): 'time', 'element', 'navigation', or 'networkIdle'
  - `value` (required for 'time'): Milliseconds to wait
  - `selector` (required for 'element'): CSS selector to wait for
  - `timeout` (optional): Max wait time in ms (default 5000)
- Outputs: `success`, `elementFound`, `timeWaited`
- Use Cases: Handle dynamic content, wait for page loads, avoid race conditions, delay between actions
- Description: Extract the page's accessibility tree with interactive elements
- Inputs: None
- Outputs: `schema` (array of elements with id, type, role, label, placeholder, text, location)
- Element Properties:
  - `id`: Unique identifier (a11yId) used for click/type/select tools
  - `type`: HTML element type (button, input, a, etc.)
  - `role`: ARIA role or computed semantic role
  - `label`: Accessible name from aria-label, labels, or text content
  - `placeholder`: Placeholder text for inputs
  - `text`: Text content for links/buttons
  - `location`: Bounding box coordinates
- Smart Filtering: Only returns meaningful, identifiable interactive elements
  - Skips elements with no label, placeholder, or text
  - Reduces ~387 raw elements to ~100-150 actionable elements
  - Filters out decorative icons, structural divs, and noise
- Use Cases:
  - REQUIRED before click/type/select to get a11yId values
  - Find buttons, links, inputs by their accessible labels
  - Understand page structure and available interactions
- Example Output:
{ "success": true, "schema": [ {"id": 1, "type": "input", "role": "combobox", "label": "Search", "placeholder": "Search YouTube"}, {"id": 2, "type": "button", "role": "button", "label": "Search", "text": "Search"}, {"id": 18, "type": "a", "role": "link", "label": "Rick Astley - Never Gonna Give You Up", "text": "Rick Astley - Never Gonna Give You Up"} ] }
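The filtering rule described for getSchema (skip elements with no label, placeholder, or text) can be expressed as a small predicate. The sketch below is an illustration only: `filterSchema` and `hasText` are hypothetical names, and treating whitespace-only strings as empty is an assumption.

```javascript
// True only for strings that contain something other than whitespace.
function hasText(value) {
  return typeof value === "string" && value.trim().length > 0;
}

// Hypothetical filter: keep an element only if it has a usable
// label, placeholder, or text to identify it by.
function filterSchema(elements) {
  return elements.filter(
    (el) => hasText(el.label) || hasText(el.placeholder) || hasText(el.text)
  );
}

const rawElements = [
  { id: 1, type: "input", role: "combobox", label: "Search", placeholder: "Search YouTube" },
  { id: 2, type: "div", role: "button", label: null },
  { id: 3, type: "a", role: "link", label: "  ", text: "" },
  { id: 18, type: "a", role: "link", label: "Rick Astley - Never Gonna Give You Up" },
];

// Elements 2 and 3 have nothing to identify them by, so only ids 1 and 18 remain.
console.log(filterSchema(rawElements).map((el) => el.id));
```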
- Description: Get HTML content of the entire page or specific elements
- Inputs:
  - `selector` (optional): CSS selector for specific element (empty = full page)
- Outputs: `html`, `success`
- Use Cases: Extract data, analyze page structure, scrape content
ChromePilot uses an accessibility tree extraction system instead of raw DOM selectors:
- Framework-Agnostic: Works with React, Vue, Angular, etc. that obfuscate IDs/classes
- Semantic Understanding: Uses ARIA roles and labels, matching how screen readers work
- Noise Reduction: Filters out ~70% of elements, keeping only meaningful interactive items
- Stable Selection: Based on accessible names, not fragile CSS selectors
- Extraction (`content.js::extractAccessibilityTree()`):
  - Queries interactive elements: `button, a, input, textarea, select, [role], [onclick], [tabindex]`
  - Computes accessible name from: aria-label, aria-labelledby, label[for], placeholder, text content
  - Computes role from: ARIA roles or semantic HTML tags
  - Critical Filter: Skips elements with no label AND no placeholder
  - Tags each element with `data-agent-id` attribute in DOM
  - Returns array of ~100-150 meaningful elements
- Element Selection (`content.js`):
  - Executor specifies `a11yId` from getSchema output
  - Content script finds element by `data-agent-id="{a11yId}"`
  - Fallback to in-memory `a11yTreeElements` map
  - Performs action on the matched element
- Smart Filtering Example:
  - Before filtering: 387 elements
    - Many with label: null (YouTube logo, decorative icons, structural divs)
    - Noise confuses executor's element selection
  - After filtering: ~100-150 elements
    - Only elements with label OR placeholder
    - All actionable, identifiable elements
    - Clear, unambiguous for executor to match
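The element selection path (look up by `data-agent-id`, then fall back to the in-memory map) might look roughly like the sketch below. `findElementByA11yId` is a hypothetical name, `doc` is anything exposing `querySelector` (the real `document` inside a content script), and keying the fallback map by string ID is an assumption.

```javascript
// Hypothetical lookup mirroring the two-stage selection described above.
// `fallbackMap` plays the role of the in-memory a11yTreeElements map.
function findElementByA11yId(a11yId, doc, fallbackMap) {
  const id = String(a11yId);
  // Reject IDs that would produce a malformed attribute selector.
  if (!/^[\w-]+$/.test(id)) return null;
  const fromDom = doc.querySelector(`[data-agent-id="${id}"]`);
  if (fromDom) return fromDom;
  // The DOM attribute can be lost if the page re-renders; try the map next.
  return fallbackMap.get(id) ?? null;
}

// Stubbed usage: a fake document that only knows element 2,
// and a fallback map that only knows element 18.
const fakeDoc = {
  querySelector: (selector) =>
    selector === '[data-agent-id="2"]' ? { tag: "button", label: "Search" } : null,
};
const fallbackElements = new Map([["18", { tag: "a", label: "Rick Astley" }]]);

console.log(findElementByA11yId(2, fakeDoc, fallbackElements));
console.log(findElementByA11yId(18, fakeDoc, fallbackElements));
console.log(findElementByA11yId(99, fakeDoc, fallbackElements));
```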
The executor uses partial string matching to find elements:
- Step: "Click the fullscreen button" → Extract: "fullscreen" → Find: label contains "Full screen"
- Step: "Type in search box" → Extract: "search" → Find: role="combobox" AND label contains "Search"
- Step: "Click Rick Astley video" → Extract: "Rick Astley" → Find: label contains "Rick Astley"
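The matching examples above can be approximated with a deterministic helper. This is a sketch under stated assumptions, not a description of how the executor model matches internally: `findByLabel` and `normalize` are hypothetical names, and normalization (lowercasing and dropping non-alphanumeric characters) is added here because a plain substring test would not match "fullscreen" against "Full screen".

```javascript
// Lowercase and strip everything except a-z and 0-9, so that
// "fullscreen" and "Full screen" compare equal. ASCII-only: a simplification.
function normalize(text) {
  return String(text || "").toLowerCase().replace(/[^a-z0-9]/g, "");
}

// Hypothetical matcher: elements whose label contains the keyword,
// optionally restricted to a given role.
function findByLabel(schema, keyword, role) {
  const needle = normalize(keyword);
  if (!needle) return [];
  return schema.filter(
    (el) =>
      normalize(el.label).includes(needle) &&
      (role === undefined || el.role === role)
  );
}

const pageSchema = [
  { id: 1, type: "input", role: "combobox", label: "Search" },
  { id: 2, type: "button", role: "button", label: "Search" },
  { id: 7, type: "button", role: "button", label: "Full screen (f)" },
  { id: 18, type: "a", role: "link", label: "Rick Astley - Never Gonna Give You Up" },
];

console.log(findByLabel(pageSchema, "fullscreen").map((el) => el.id));
console.log(findByLabel(pageSchema, "search", "combobox").map((el) => el.id));
console.log(findByLabel(pageSchema, "Rick Astley").map((el) => el.id));
```

The role filter matters for the search example: without it, both the search input and the search button match the keyword "search".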
- Screenshot Capture: Static capture at start of each message (if toggle enabled)
  - Captured once per user message and sent to orchestrator (vision model)
  - NOT available as a tool since executor model (llama3.1-8b-32k:latest) is text-only
  - Only the CURRENT screenshot is sent, not historical ones
- HTML Capture:
  - Static capture at start if toggle enabled (sent to orchestrator)
  - Also available as `getHTML` tool during execution (text-based, works with executor)
  - Can target specific elements with CSS selector
- No Redundancy: Previous screenshots/HTML are NOT carried in conversation history
- `ORCHESTRATOR_PROMPT`: System prompt for plan generation
- `executeStep()`: Calls executor model with full history
- `executeToolCall()`: Actually runs the tool
- `handlePlanApproval()`: Manages execution loop with history tracking
- `handleStreamOllama()`: Streams orchestrator responses
- `handleExecuteWithModel()`: Non-streaming executor calls
- Shows plain English steps (easy to understand)
- Displays execution status per step
- Shows tool calls and outputs after execution
- Collapsible for clean interface