This directory contains mocked frontend websites designed to test browser-operating AI agents' capabilities.
- URL: `/gbr/` - Difficulty: Easy
- Purpose: Test AI Agent's page navigation, clicking, and information gathering capabilities
- Features:
- Multi-page news website
- Header navigation
- Search functionality
- Article cards with click tracking
- Subscribe/Sign-in buttons
- URL: `/techforum/` - Difficulty: Medium
- Purpose: Test AI Agent's ability to interact with Q&A forum websites
- Features:
- Question/Answer cards
- Like/Collect/Comment/Share buttons
- Comment modal with text input
- Topic navigation
- Sidebar navigation
- Search functionality
- URL: `/cloudstack/` (legacy: `/aliyun/`) - Difficulty: Hard
- Purpose: Test AI Agent's ability to handle complex enterprise consoles with distractions
- Features:
- Complex dashboard layout
- Instance management table
- Filter and search functionality
- Create instance modal with multi-step form
- Spam popups that appear at intervals (promotions, security alerts, notifications)
- Notification panel
- Multiple action buttons per row
- URL: `/dataflow/` - Difficulty: Medium
- Purpose: Test visual understanding through dashboard interactions
- Features:
- Settings panel with toggle switches
- Revenue chart with interactive elements
- Tab navigation (Revenue, Settings, Reports)
- Quarterly data visualization
- URL: `/finviz/` - Difficulty: Medium
- Purpose: Test complex filter interactions with financial data
- Features:
- 27 dropdown filter options
- Sortable data table (40 stocks)
- Pagination controls
- Multiple view modes (Overview, Valuation, Financial, etc.)
- Dark theme matching original finviz.com
- URL: `/bluebook/` - Difficulty: Hard
- Purpose: Test Xiaohongshu-style visual browsing, search, dense card layouts, modal note reading, and comment interactions
- Features:
- Dense masonry feed with 70+ mocked posts
- Search bar with separate clear/search icon buttons
- Floating "graphic only" and "reload" buttons
- Note detail modal with left media area and right comment panel
- Comment like / reply interactions with author-specific tracking
- Shared tracker integration plus site-specific events
- URL: `/mapquest/` - Difficulty: Hard
- Purpose: Test panel-state navigation, search autocomplete, icon-only transport buttons, and spatial pin interaction on a map canvas
- Features:
- Left search panel with state machine: search → results → place-detail → directions
- Autocomplete dropdown that appears as the user types (no page reload)
- Horizontally-scrollable category chip bar ("Restaurants", "Gas", "Coffee"…) duplicated inside place-detail so it stays reachable after drilling in
- Icon-only transport mode bar (drive / transit / walk / bike) above directions
- Route result cards with duration + distance; shortest route is pre-selected and emits a `route_select` event on panel entry
- Spatial pin clicks on the canvas with hover → label reveal
- Test cases: `mapquest_navigate`, `mapquest_nearby_pins`
- Main challenges:
- Autocomplete timing — suggestions only appear mid-typing, not on submit
- Icon-only transport buttons with no text labels
- Ambiguous pin targets on a dense map
- Panel-state (not page) navigation — back/forward is stateful, not a route change
- Default-selection events: the shortest route and "drive" mode are pre-selected; the site emits synthetic `*_select` events on entry so criteria don't require a redundant click
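The panel flow described in the challenges above can be sketched as a small transition table. The four states come from this README; the allowed back-transitions and the function name are illustrative assumptions, not the site's actual JavaScript:

```python
# Hypothetical sketch of the mapquest left-panel state machine.
# States are from the README; the transition edges are assumed.
TRANSITIONS = {
    "search": {"results"},
    "results": {"place-detail", "search"},
    "place-detail": {"directions", "results"},
    "directions": {"place-detail"},
}

def advance(state: str, target: str) -> str:
    """Move to `target` only if the panel allows it from `state`."""
    if target not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {target}")
    return target

state = "search"
for target in ["results", "place-detail", "directions"]:
    state = advance(state, target)
print(state)  # directions
```

The point of the sketch is the main challenge above: back/forward is a state change within the panel, not a URL route change, so an agent cannot rely on browser history.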
- URL: `/staybnb/` - Difficulty: Hard → Very Hard
- Purpose: Test segmented search pill, date-range calendar, dual-handle price slider drag, multi-amenity filter modal, gallery viewer, and two-step reserve → confirm booking
- Features:
- Home page with real Lorem Picsum destination cards
- Top search pill with Where / Check-in / Check-out / Who popovers that share a dismiss backdrop
- Guest stepper (adults / children / infants) with per-row +/- buttons
- Results view reachable via `#results` deep link (pre-loaded with Tokyo listings for tests)
- Card carousel (left/right arrows) plus card-hover → map-pin highlight sync
- Filter modal with dual-handle price slider (drag both ends), amenity checkboxes, and an "Instant Book" toggle
- "Show N stays" apply button whose label updates live with the filtered count
- Detail page with 5-photo grid, fullscreen gallery (scrollable), and a Reserve → Confirm-and-pay two-step checkout
- Test cases: `staybnb_search`, `staybnb_book`
- Main challenges:
- Drag interaction on a dual-handle slider — min and max handles must be dragged independently
- Segmented popover stacking: the header must sit above the dismiss backdrop or the popover inputs become click-dead (header `z-index` raised to 300 to resolve)
- Sequential checkout with hidden second step
- Amenity events normalized to lowercase so checker criteria match regardless of display casing
- Live apply-button label ("Show N stays") — test instructions and criteria reference the actual label rather than a generic "Apply"
- URL: `/taskflow/` - Difficulty: Medium → Hard
- Purpose: Test drag-and-drop, hover-reveal controls, inline editing, color-only label selection, and board-menu space theft
- Features:
- Kanban board with 4 columns and draggable cards (HTML5 drag-and-drop)
- Hover-reveal pencil icon on cards for inline rename
- Card detail modal with description, labels, checklist, comments
- Color-swatch label picker (no text labels — hue only)
- Due-date picker and member assignment
- Right-side board menu that pushes board content left when opened
- Test cases: `taskflow_drag_and_edit`, `taskflow_full_workflow`
- Main challenges:
- True drag-and-drop between columns (not click-to-move)
- Hover-to-reveal pencil — the affordance is invisible until the pointer enters the card
- Color-only label picker requires visual discrimination without text
- Board menu steals horizontal space on open, shifting downstream click targets
- Inline editing: click to enter edit mode, Enter/blur to commit
- URL: `/vidhub/` - Difficulty: Medium → Hard
- Purpose: Test auto-hide player controls, thin-bar timeline scrub, nested settings popup, hover-reveal volume slider, and nested comment reply threads
- Features:
- Home feed grid with thumbnail + title + channel cards
- Search box in masthead with result filter chips
- Video watch page with poster + overlaid control bar that auto-hides after 3s of inactivity
- Icon-only player controls: play/pause, volume, CC, settings, theater, miniplayer, fullscreen
- Thin progress bar (~4px, thickens to ~8px on hover) for timeline seek
- Volume icon with hover-reveal horizontal slider (two-step hover-then-drag)
- Nested settings popup: gear → Playback speed → submenu (0.25x … 2x)
- Like/dislike, red → gray Subscribe toggle, expandable description
- Comment section with sort dropdown and nested reply threads
- Test cases: `vidhub_player`, `vidhub_comment`
- Main challenges:
- Control bar auto-hide forces repeated hover-to-reveal between actions
- Thin-bar scrub precision for percentage-based seek targets
- Two-step hover-reveal volume slider
- Nested popup navigation — settings opens a popup, submenu opens inside the same popup
- Scroll-to-comments below fold followed by scroll-back-up to player area
All websites include comprehensive event tracking that records:
- Clicks (element, position, text)
- Scrolls (position, max scroll)
- Input (field, value length)
- Hovers (element, selector)
- Navigation (page changes)
- Form submissions
- Site-specific actions (upvote, comment, instance operations, etc.)
- Events are stored in browser localStorage
- Events are also sent to the server via the `/api/track` endpoint
- Server maintains an in-memory event store
```shell
curl http://localhost:PORT/api/events
curl http://localhost:PORT/api/events/clear
curl http://localhost:PORT/api/sites
curl http://localhost:PORT/api/help
curl -X POST http://localhost:PORT/api/track \
  -H "Content-Type: application/json" \
  -d '{"eventType": "click", "site": "globalbusinessreview.com", ...}'
```

To start the server:

```shell
cd eval
python server.py
```

The server will:
- Automatically find an available port
- Start serving the websites
- Print URLs for all sites and API endpoints
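Automatic port discovery is commonly implemented by binding to port 0 and letting the OS assign an unused port; a minimal sketch of the idea (an illustration, not necessarily how `server.py` does it):

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS to hand back any unused TCP port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(f"Serving on http://localhost:{find_free_port()}/")
```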
Automated evaluation now requires a browser UUID capability token copied from the Chrome extension UUID page.
Quick start:
```shell
python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default
```

Recommended options:

```shell
export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
python eval/evaluate_browser_agent.py --test techforum --model-alias default
python eval/evaluate_browser_agent.py --test techforum --model-alias plus
python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
python eval/evaluate_browser_agent.py --list
python eval/evaluate_browser_agent.py --manual --test techforum
```

Notes:
- `--chrome-uuid` is required for automated runs that call the OpenBrowser browser-control APIs.
- Automated evaluation also requires at least one `--model-alias`, which must match a configured LLM alias in the OpenBrowser web UI.
- `--manual` and `--list` do not require a browser UUID.
- `OPENBROWSER_CHROME_UUID` is the equivalent environment variable for scripting and CI-style usage.
After an AI agent interacts with the websites, you can:
- Export events: `GET /api/events` returns all tracked events in JSON format
- Analyze behavior: events include timestamps, element selectors, and action types
- Compare sessions: Each session has a unique ID for comparison
- Clear and reset: use `/api/events/clear` to reset between tests
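Since every session carries a unique ID, comparing sessions reduces to grouping the exported events. A small sketch over an inline sample (field names follow the event format shown in this README):

```python
from collections import defaultdict

def split_by_session(events):
    """Group tracked events by their sessionId for side-by-side comparison."""
    sessions = defaultdict(list)
    for event in events:
        sessions[event["sessionId"]].append(event)
    return dict(sessions)

sample = [
    {"sessionId": "session_a", "eventType": "click"},
    {"sessionId": "session_a", "eventType": "scroll"},
    {"sessionId": "session_b", "eventType": "click"},
]
by_session = split_by_session(sample)
print({sid: len(evts) for sid, evts in by_session.items()})
# {'session_a': 2, 'session_b': 1}
```

In a live run the sample list would come from `GET /api/events` instead of being inlined.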
```json
{
  "timestamp": 1710234567890,
  "sessionId": "session_1710234567890_abc123",
  "site": "techforum.com",
  "difficulty": "medium",
  "page": "/techforum/",
  "eventType": "click",
  "element": "BUTTON",
  "elementId": null,
  "elementClass": "action-btn upvote",
  "elementText": "👍 2,341",
  "selector": "button.action-btn.upvote",
  "x": 450,
  "y": 320
}
```

```
eval/
├── server.py                    # Python server with tracking API
├── evaluate_browser_agent.py    # Evaluation runner
├── dataset/                     # YAML test case definitions
│   ├── gbr.yaml
│   ├── gbr_detailed.yaml
│   ├── techforum.yaml
│   ├── techforum_reply.yaml
│   ├── cloudstack.yaml
│   ├── cloudstack_interactive.yaml
│   ├── finviz_simple.yaml
│   ├── finviz_complex.yaml
│   ├── dataflow.yaml
│   ├── mapquest_navigate.yaml
│   ├── mapquest_nearby_pins.yaml
│   ├── staybnb_search.yaml
│   ├── staybnb_book.yaml
│   ├── taskflow_drag_and_edit.yaml
│   ├── taskflow_full_workflow.yaml
│   ├── vidhub_player.yaml
│   └── vidhub_comment.yaml
├── css/
│   ├── gbr.css                  # GBR styles
│   ├── techforum.css            # TechForum styles
│   ├── aliyun.css               # Aliyun styles
│   └── finviz.css               # Finviz styles
├── js/
│   ├── tracker.js               # Shared tracking library
│   ├── gbr.js                   # GBR interactions
│   ├── techforum.js             # TechForum interactions
│   ├── aliyun.js                # Aliyun interactions
│   └── finviz.js                # Finviz interactions
├── gbr/                         # News website
│   ├── index.html
│   └── articles/
├── techforum/                   # Q&A forum
│   └── index.html
├── cloudstack/                  # Enterprise console (aliyun clone)
│   └── *.html
├── dataflow/                    # Dashboard visualization
│   └── index.html
├── finviz/                      # Stock screener
│   └── index.html
├── mapquest/                    # Google Maps mock (panel state machine)
│   ├── index.html
│   ├── css/mapquest.css
│   └── js/mapquest.js
├── staybnb/                     # Airbnb mock (filter modal, gallery, booking)
│   ├── index.html
│   ├── css/staybnb.css
│   └── js/staybnb.js
├── taskflow/                    # Trello mock (drag-and-drop board)
│   ├── index.html
│   ├── css/taskflow.css
│   └── js/taskflow.js
└── vidhub/                      # YouTube mock (player, comments)
    ├── index.html
    ├── css/vidhub.css
    └── js/vidhub.js
```
To manually test the websites:
- Start the server: `python server.py`
- Open a browser to the displayed URL (e.g., `http://localhost:11826/ws/`)
- Interact with the website (click, scroll, input)
- Check events: `curl http://localhost:11826/api/events`
After an AI agent interacts with a website, you can analyze the tracked events to evaluate its performance. Here are some example evaluation criteria:
- Navigation: Did the agent navigate between pages (Home, World, Business, Markets, etc.)?
- Information gathering: Did the agent click on article links to read content?
- Search: Did the agent use the search functionality?
- Subscription: Did the agent attempt to subscribe or sign in?
- Button distinction: Did the agent correctly distinguish between like, collect, comment, and share buttons?
- Comment placement: Did the agent open the comment modal and submit a comment on the correct answer?
- Scrolling: Did the agent scroll through the feed to view more content?
- Navigation: Did the agent use sidebar and header navigation?
- Popup handling: Did the agent close spam popups (promotions, security alerts, etc.)?
- Complex UI interaction: Did the agent interact with the instance table, filters, and pagination?
- Multi-step process: Did the agent initiate and progress through the "Create Instance" modal?
- Action selection: Did the agent perform appropriate instance actions (start, restart, connect, etc.)?
- Settings interaction: Did the agent enable the weekly reports feature?
- Chart interaction: Did the agent click on the quarter with highest revenue?
- Tab navigation: Did the agent navigate to the Revenue tab?
- Filter application: Did the agent apply the correct market cap filter?
- Multi-filter combination: Did the agent apply multiple filters correctly?
- Data interpretation: Did the agent understand the filter results?
- Event completeness: Did the agent trigger expected event types (clicks, scrolls, inputs)?
- Session duration: How long did the agent spend on the task?
- Error rate: Did the agent trigger any error events or fail to complete key actions?
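Criteria like these can be checked mechanically against the exported events. For example, a sketch of a "did the agent use search?" predicate; the `selector` substring match is an assumption about how the mock sites name their search inputs, so adjust it per site:

```python
def used_search(events) -> bool:
    """Criterion check: did the agent type into a search field?"""
    return any(
        e["eventType"] == "input" and "search" in (e.get("selector") or "")
        for e in events
    )

sample = [
    {"eventType": "click", "selector": "button.action-btn"},
    {"eventType": "input", "selector": "input.search-box"},
]
print(used_search(sample))  # True
```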
Use the `/api/events` endpoint to retrieve JSON data. You can write scripts to compute metrics such as:
- Total number of events per type
- Sequence of navigation events
- Time between key actions
- Completion of predefined task flows
Example analysis script:
```python
import requests

events = requests.get('http://localhost:PORT/api/events').json()
clicks = [e for e in events['events'] if e['eventType'] == 'click']
print(f"Total clicks: {len(clicks)}")
```
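The other metrics listed above follow the same pattern. Here is a sketch of per-type counts and time-between-actions over an inline sample (a live run would iterate over the `/api/events` response instead):

```python
from collections import Counter

events = [
    {"eventType": "click", "timestamp": 1710234567890},
    {"eventType": "scroll", "timestamp": 1710234569890},
    {"eventType": "click", "timestamp": 1710234572890},
]

# Total number of events per type
counts = Counter(e["eventType"] for e in events)
print(counts)  # Counter({'click': 2, 'scroll': 1})

# Time between key actions (timestamps are in milliseconds)
ordered = sorted(events, key=lambda e: e["timestamp"])
gaps_ms = [b["timestamp"] - a["timestamp"] for a, b in zip(ordered, ordered[1:])]
print(gaps_ms)  # [2000, 3000]
```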