Mock Websites for AI Agent Evaluation

This directory contains mock frontend websites designed to test the capabilities of browser-operating AI agents.

Websites

1. GBR.com (Easy)

  • URL: /gbr/
  • Difficulty: Easy
  • Purpose: Test an AI agent's page navigation, clicking, and information-gathering capabilities
  • Features:
    • Multi-page news website
    • Header navigation
    • Search functionality
    • Article cards with click tracking
    • Subscribe/Sign-in buttons

2. TechForum.com (Medium)

  • URL: /techforum/
  • Difficulty: Medium
  • Purpose: Test an AI agent's ability to interact with Q&A forum websites
  • Features:
    • Question/Answer cards
    • Like/Collect/Comment/Share buttons
    • Comment modal with text input
    • Topic navigation
    • Sidebar navigation
    • Search functionality

3. CloudStack.com Console (Hard)

  • URL: /cloudstack/ (legacy: /aliyun/)
  • Difficulty: Hard
  • Purpose: Test an AI agent's ability to handle complex enterprise consoles with distractions
  • Features:
    • Complex dashboard layout
    • Instance management table
    • Filter and search functionality
    • Create instance modal with multi-step form
    • Spam popups that appear at intervals (promotions, security alerts, notifications)
    • Notification panel
    • Multiple action buttons per row

4. DataFlow Dashboard (Medium)

  • URL: /dataflow/
  • Difficulty: Medium
  • Purpose: Test visual understanding through dashboard interactions
  • Features:
    • Settings panel with toggle switches
    • Revenue chart with interactive elements
    • Tab navigation (Revenue, Settings, Reports)
    • Quarterly data visualization

5. Finviz Stock Screener (Medium)

  • URL: /finviz/
  • Difficulty: Medium
  • Purpose: Test complex filter interactions with financial data
  • Features:
    • 27 dropdown filter options
    • Sortable data table (40 stocks)
    • Pagination controls
    • Multiple view modes (Overview, Valuation, Financial, etc.)
    • Dark theme matching original finviz.com

6. BlueBook Feed (Hard)

  • URL: /bluebook/
  • Difficulty: Hard
  • Purpose: Test Xiaohongshu-style visual browsing, search, dense card layouts, modal note reading, and comment interactions
  • Features:
    • Dense masonry feed with 70+ mocked posts
    • Search bar with separate clear/search icon buttons
    • Floating "graphic only" and "reload" buttons
    • Note detail modal with left media area and right comment panel
    • Comment like / reply interactions with author-specific tracking
    • Shared tracker integration plus site-specific events

7. MapQuest (Hard) — Google Maps mock

  • URL: /mapquest/
  • Difficulty: Hard
  • Purpose: Test panel-state navigation, search autocomplete, icon-only transport buttons, and spatial pin interaction on a map canvas
  • Features:
    • Left search panel with state machine: search → results → place-detail → directions
    • Autocomplete dropdown that appears as the user types (no page reload)
    • Horizontally-scrollable category chip bar ("Restaurants", "Gas", "Coffee"…) duplicated inside place-detail so it stays reachable after drilling in
    • Icon-only transport mode bar (drive / transit / walk / bike) above directions
    • Route result cards with duration + distance; shortest route is pre-selected and emits a route_select event on panel entry
    • Spatial pin clicks on the canvas with hover → label reveal
    • Test cases: mapquest_navigate, mapquest_nearby_pins
  • Main challenges:
    • Autocomplete timing — suggestions only appear mid-typing, not on submit
    • Icon-only transport buttons with no text labels
    • Ambiguous pin targets on a dense map
    • Panel-state (not page) navigation — back/forward is stateful, not a route change
    • Default-selection events: the shortest route and "drive" mode are pre-selected; the site emits synthetic *_select events on entry so criteria don't require a redundant click

8. StayBnB (Hard → Very Hard) — Airbnb mock

  • URL: /staybnb/
  • Difficulty: Hard → Very Hard
  • Purpose: Test segmented search pill, date-range calendar, dual-handle price slider drag, multi-amenity filter modal, gallery viewer, and two-step reserve → confirm booking
  • Features:
    • Home page with real Lorem Picsum destination cards
    • Top search pill with Where / Check-in / Check-out / Who popovers that share a dismiss backdrop
    • Guest stepper (adults / children / infants) with per-row +/- buttons
    • Results view reachable via #results deep link (pre-loaded with Tokyo listings for tests)
    • Card carousel (left/right arrows) plus card-hover → map-pin highlight sync
    • Filter modal with dual-handle price slider (drag both ends), amenity checkboxes, and an "Instant Book" toggle
    • "Show N stays" apply button whose label updates live with the filtered count
    • Detail page with 5-photo grid, fullscreen gallery (scrollable), and a Reserve → Confirm-and-pay two-step checkout
    • Test cases: staybnb_search, staybnb_book
  • Main challenges:
    • Drag interaction on a dual-handle slider — min and max handles must be dragged independently
    • Segmented popover stacking: header must sit above the dismiss backdrop or the popover inputs become click-dead (header z-index raised to 300 to resolve)
    • Sequential checkout with hidden second step
    • Amenity events normalized to lowercase so checker criteria match regardless of display casing
    • Live apply-button label ("Show N stays") — test instructions and criteria reference the actual label rather than a generic "Apply"

9. TaskFlow (Medium → Hard) — Trello mock

  • URL: /taskflow/
  • Difficulty: Medium → Hard
  • Purpose: Test drag-and-drop, hover-reveal controls, inline editing, color-only label selection, and board-menu space theft
  • Features:
    • Kanban board with 4 columns and draggable cards (HTML5 drag-and-drop)
    • Hover-reveal pencil icon on cards for inline rename
    • Card detail modal with description, labels, checklist, comments
    • Color-swatch label picker (no text labels — hue only)
    • Due-date picker and member assignment
    • Right-side board menu that pushes board content left when opened
    • Test cases: taskflow_drag_and_edit, taskflow_full_workflow
  • Main challenges:
    • True drag-and-drop between columns (not click-to-move)
    • Hover-to-reveal pencil — the affordance is invisible until the pointer enters the card
    • Color-only label picker requires visual discrimination without text
    • Board menu steals horizontal space on open, shifting downstream click targets
    • Inline editing: click to enter edit mode, Enter/blur to commit

10. VidHub (Medium → Hard) — YouTube mock

  • URL: /vidhub/
  • Difficulty: Medium → Hard
  • Purpose: Test auto-hide player controls, thin-bar timeline scrub, nested settings popup, hover-reveal volume slider, and nested comment reply threads
  • Features:
    • Home feed grid with thumbnail + title + channel cards
    • Search box in masthead with result filter chips
    • Video watch page with poster + overlaid control bar that auto-hides after 3s of inactivity
    • Icon-only player controls: play/pause, volume, CC, settings, theater, miniplayer, fullscreen
    • Thin progress bar (~4px, thickens to ~8px on hover) for timeline seek
    • Volume icon with hover-reveal horizontal slider (two-step hover-then-drag)
    • Nested settings popup: gear → Playback speed → submenu (0.25x … 2x)
    • Like/dislike, red → gray Subscribe toggle, expandable description
    • Comment section with sort dropdown and nested reply threads
    • Test cases: vidhub_player, vidhub_comment
  • Main challenges:
    • Control bar auto-hide forces repeated hover-to-reveal between actions
    • Thin-bar scrub precision for percentage-based seek targets
    • Two-step hover-reveal volume slider
    • Nested popup navigation — settings opens a popup, submenu opens inside the same popup
    • Scroll-to-comments below fold followed by scroll-back-up to player area

Event Tracking

All websites include comprehensive event tracking that records:

  • Clicks (element, position, text)
  • Scrolls (position, max scroll)
  • Input (field, value length)
  • Hovers (element, selector)
  • Navigation (page changes)
  • Form submissions
  • Site-specific actions (upvote, comment, instance operations, etc.)

Tracking Data Storage

  • Events are stored in browser localStorage
  • Events are also sent to server via /api/track endpoint
  • Server maintains in-memory event store

API Endpoints

Get All Events

curl http://localhost:PORT/api/events

Clear All Events

curl http://localhost:PORT/api/events/clear

List Available Sites

curl http://localhost:PORT/api/sites

API Help

curl http://localhost:PORT/api/help

Submit Tracking Event (from browser)

curl -X POST http://localhost:PORT/api/track \
  -H "Content-Type: application/json" \
  -d '{"eventType": "click", "site": "globalbusinessreview.com", ...}'
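The same event can be submitted from a script. A minimal sketch, assuming the payload shape shown in the "Example Event Structure" section below (the authoritative schema is whatever server.py's /api/track handler accepts):

```python
import json
import time

def build_track_event(event_type, site, page, **details):
    # Assemble a payload in the shape of the example event structure;
    # extra fields (selector, x, y, ...) are passed as keyword arguments.
    event = {
        "timestamp": int(time.time() * 1000),  # milliseconds, as in tracked events
        "eventType": event_type,
        "site": site,
        "page": page,
    }
    event.update(details)
    return event

payload = build_track_event("click", "techforum.com", "/techforum/",
                            selector="button.action-btn.upvote")
print(json.dumps(payload))
# To submit it, POST to the tracking endpoint, e.g.:
#   requests.post("http://localhost:PORT/api/track", json=payload)
```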

Starting the Server

cd eval
python server.py

The server will:

  1. Automatically find an available port
  2. Start serving the websites
  3. Print URLs for all sites and API endpoints

Running OpenBrowser Evaluation

Automated evaluation now requires a browser UUID capability token copied from the Chrome extension UUID page.

Quick start:

python eval/evaluate_browser_agent.py --test techforum --chrome-uuid YOUR_BROWSER_UUID --model-alias default

Recommended options:

export OPENBROWSER_CHROME_UUID=YOUR_BROWSER_UUID
python eval/evaluate_browser_agent.py --test techforum --model-alias default
python eval/evaluate_browser_agent.py --test techforum --model-alias plus
python eval/evaluate_browser_agent.py --model-alias plus --model-alias flash
python eval/evaluate_browser_agent.py --list
python eval/evaluate_browser_agent.py --manual --test techforum

Notes:

  1. --chrome-uuid is required for automated runs that call the OpenBrowser browser-control APIs.
  2. Automated evaluation also requires at least one --model-alias, which must match a configured LLM alias in the OpenBrowser web UI.
  3. --manual and --list do not require a browser UUID.
  4. OPENBROWSER_CHROME_UUID is the equivalent environment variable for scripting and CI-style usage.

Evaluating AI Agent Behavior

After an AI agent interacts with the websites, you can:

  1. Export events: GET /api/events returns all tracked events in JSON format
  2. Analyze behavior: Events include timestamps, element selectors, action types
  3. Compare sessions: Each session has a unique ID for comparison
  4. Clear and reset: Use /api/events/clear to reset between tests
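For step 3, sessions can be grouped by their sessionId and compared. A minimal sketch over events in the shape returned by GET /api/events (the sample data here is illustrative):

```python
from collections import defaultdict

# Illustrative events in the tracked-event shape (timestamps in ms).
events = [
    {"sessionId": "session_a", "eventType": "click",  "timestamp": 1000},
    {"sessionId": "session_a", "eventType": "scroll", "timestamp": 4500},
    {"sessionId": "session_b", "eventType": "click",  "timestamp": 2000},
]

# Group events by session, preserving arrival order within each session.
sessions = defaultdict(list)
for e in events:
    sessions[e["sessionId"]].append(e)

for sid, evs in sessions.items():
    duration_ms = evs[-1]["timestamp"] - evs[0]["timestamp"]
    print(f"{sid}: {len(evs)} events over {duration_ms}ms")
```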

Example Event Structure

{
  "timestamp": 1710234567890,
  "sessionId": "session_1710234567890_abc123",
  "site": "techforum.com",
  "difficulty": "medium",
  "page": "/techforum/",
  "eventType": "click",
  "element": "BUTTON",
  "elementId": null,
  "elementClass": "action-btn upvote",
  "elementText": "👍 2,341",
  "selector": "button.action-btn.upvote",
  "x": 450,
  "y": 320
}
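Because every event carries a millisecond timestamp, the time between key actions reduces to differencing consecutive events. A sketch with illustrative data:

```python
# Illustrative events carrying millisecond timestamps, as in the example above.
events = [
    {"eventType": "click", "timestamp": 1710234567890},
    {"eventType": "input", "timestamp": 1710234569100},
    {"eventType": "click", "timestamp": 1710234572400},
]

# Gap between each pair of consecutive events, in milliseconds.
gaps_ms = [b["timestamp"] - a["timestamp"]
           for a, b in zip(events, events[1:])]
print(gaps_ms)  # [1210, 3300]
```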

Directory Structure

eval/
├── server.py              # Python server with tracking API
├── evaluate_browser_agent.py  # Evaluation runner
├── dataset/               # YAML test case definitions
│   ├── gbr.yaml
│   ├── gbr_detailed.yaml
│   ├── techforum.yaml
│   ├── techforum_reply.yaml
│   ├── cloudstack.yaml
│   ├── cloudstack_interactive.yaml
│   ├── finviz_simple.yaml
│   ├── finviz_complex.yaml
│   ├── dataflow.yaml
│   ├── mapquest_navigate.yaml
│   ├── mapquest_nearby_pins.yaml
│   ├── staybnb_search.yaml
│   ├── staybnb_book.yaml
│   ├── taskflow_drag_and_edit.yaml
│   ├── taskflow_full_workflow.yaml
│   ├── vidhub_player.yaml
│   └── vidhub_comment.yaml
├── css/
│   ├── gbr.css           # GBR styles
│   ├── techforum.css     # TechForum styles
│   ├── aliyun.css        # Aliyun styles
│   └── finviz.css        # Finviz styles
├── js/
│   ├── tracker.js        # Shared tracking library
│   ├── gbr.js            # GBR interactions
│   ├── techforum.js      # TechForum interactions
│   ├── aliyun.js         # Aliyun interactions
│   └── finviz.js         # Finviz interactions
├── gbr/                   # News website
│   ├── index.html
│   └── articles/
├── techforum/            # Q&A forum
│   └── index.html
├── cloudstack/           # Enterprise console (aliyun clone)
│   └── *.html
├── dataflow/             # Dashboard visualization
│   └── index.html
├── finviz/               # Stock screener
│   └── index.html
├── mapquest/             # Google Maps mock (panel state machine)
│   ├── index.html
│   ├── css/mapquest.css
│   └── js/mapquest.js
├── staybnb/              # Airbnb mock (filter modal, gallery, booking)
│   ├── index.html
│   ├── css/staybnb.css
│   └── js/staybnb.js
├── taskflow/             # Trello mock (drag-and-drop board)
│   ├── index.html
│   ├── css/taskflow.css
│   └── js/taskflow.js
└── vidhub/               # YouTube mock (player, comments)
    ├── index.html
    ├── css/vidhub.css
    └── js/vidhub.js

Testing

To manually test the websites:

  1. Start the server: python server.py
  2. Open browser to the displayed URL (e.g., http://localhost:11826/ws/)
  3. Interact with the website (click, scroll, input)
  4. Check events: curl http://localhost:11826/api/events

Evaluating AI Agent Performance

After an AI agent interacts with a website, you can analyze the tracked events to evaluate its performance. Here are some example evaluation criteria:

GBR (Easy Level)

  • Navigation: Did the agent navigate between pages (Home, World, Business, Markets, etc.)?
  • Information gathering: Did the agent click on article links to read content?
  • Search: Did the agent use the search functionality?
  • Subscription: Did the agent attempt to subscribe or sign in?

TechForum (Medium Level)

  • Button distinction: Did the agent correctly distinguish between like, collect, comment, and share buttons?
  • Comment placement: Did the agent open the comment modal and submit a comment on the correct answer?
  • Scrolling: Did the agent scroll through the feed to view more content?
  • Navigation: Did the agent use sidebar and header navigation?

CloudStack (Hard Level)

  • Popup handling: Did the agent close spam popups (promotions, security alerts, etc.)?
  • Complex UI interaction: Did the agent interact with the instance table, filters, and pagination?
  • Multi-step process: Did the agent initiate and progress through the "Create Instance" modal?
  • Action selection: Did the agent perform appropriate instance actions (start, restart, connect, etc.)?

DataFlow (Medium Level)

  • Settings interaction: Did the agent enable the weekly reports feature?
  • Chart interaction: Did the agent click on the quarter with highest revenue?
  • Tab navigation: Did the agent navigate to the Revenue tab?

Finviz (Medium Level)

  • Filter application: Did the agent apply the correct market cap filter?
  • Multi-filter combination: Did the agent apply multiple filters correctly?
  • Data interpretation: Did the agent understand the filter results?

General Metrics

  • Event completeness: Did the agent trigger expected event types (clicks, scrolls, inputs)?
  • Session duration: How long did the agent spend on the task?
  • Error rate: Did the agent trigger any error events or fail to complete key actions?
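A "did the agent complete the key actions" check can be phrased as an in-order subsequence match over event types. A sketch, with illustrative event names:

```python
def flow_completed(events, expected_types):
    # True if expected_types occurs as an in-order subsequence of the
    # event stream; unrelated events may appear in between.
    stream = iter(e["eventType"] for e in events)
    return all(t in stream for t in expected_types)

trace = [{"eventType": t} for t in
         ["navigation", "click", "input", "click", "submit"]]
print(flow_completed(trace, ["navigation", "input", "submit"]))  # True
print(flow_completed(trace, ["submit", "navigation"]))           # False
```

The `t in stream` test consumes the iterator as it searches, which is what enforces ordering.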

Analyzing Event Data

Use the /api/events endpoint to retrieve JSON data. You can write scripts to compute metrics such as:

  • Total number of events per type
  • Sequence of navigation events
  • Time between key actions
  • Completion of predefined task flows

Example analysis script:

import requests
from collections import Counter

data = requests.get('http://localhost:PORT/api/events').json()
events = data['events']
print(Counter(e['eventType'] for e in events))  # events per type
clicks = [e for e in events if e['eventType'] == 'click']
print(f"Total clicks: {len(clicks)}")