
Add 4 new mock eval sites + full-eval observation report #55

Closed
softpudding wants to merge 5 commits into main from codex/new-mock-eval-sites


@softpudding
Owner

Summary

  • Adds 4 new mock eval sites (MapQuest, StayBnB, TaskFlow, VidHub) with 8 test-case YAMLs targeting interaction primitives not covered by the existing suite (panel-state navigation, dual-handle price slider, HTML5 drag-and-drop, auto-hide player controls).
  • Adds eval/AGENTS.md with non-obvious implementation gotchas (stacking contexts, default-state event anti-pattern, tracker case normalization).
  • Adds full-eval observation report from a 105-run benchmark pass across qwen3.5-flash, qwen3.5-plus, and qwen3.6-plus. Documents 7 recurring OpenBrowser agent issues with per-test evidence and proposed fixes.
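
The tracker case normalization gotcha can be illustrated with a minimal sketch. It assumes tests are scored by checking that each expected event name appears in the tracker log, and that names are compared case-insensitively. The helper names and event names below are hypothetical and are not taken from eval/AGENTS.md or the tracker code.

```python
from typing import Iterable


def normalize_event(name: str) -> str:
    """Trim and lower-case an event name so comparisons ignore case."""
    return name.strip().lower()


def missing_events(expected: Iterable[str], tracked: Iterable[str]) -> list[str]:
    """Return the expected events that never appear in the tracker log."""
    seen = {normalize_event(e) for e in tracked}
    return [e for e in expected if normalize_event(e) not in seen]


# Hypothetical event names, for illustration only.
expected = ["price_filter_applied", "booking_confirmed"]
tracked = ["Price_Filter_Applied", "BOOKING_CONFIRMED", "page_view"]
print(missing_events(expected, tracked))  # prints []
```

Without the normalization step, both expected events in this example would be reported as missing even though the agent performed the actions.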

Test plan

  • All 35 tests load without schema errors
  • Full 105-run eval completed (~3h, 82.9% avg pass rate across models)
  • Reviewer sanity-checks site implementations and the observation report
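
As a quick consistency check on the numbers above, the sketch below assumes each of the 35 tests was run exactly once per model, so the average across models equals the pooled pass rate. Under that assumption it recovers the run count and lists the pass totals that round to 82.9%. The per-model breakdown is not stated here, so this is only an arithmetic cross-check.

```python
tests = 35
models = ["qwen3.5-flash", "qwen3.5-plus", "qwen3.6-plus"]

# Assumption: one run per (test, model) pair.
total_runs = tests * len(models)
print(total_runs)  # prints 105

# Pass counts whose pooled rate rounds to the reported 82.9%.
consistent = [
    passes
    for passes in range(total_runs + 1)
    if round(100 * passes / total_runs, 1) == 82.9
]
print(consistent)  # prints [87]
```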

🤖 Generated with Claude Code

softpudding and others added 5 commits April 11, 2026 15:21
Adds four new browser-agent eval sites targeting interaction primitives
not yet covered by the existing suite: panel-state navigation and icon-only
transport buttons (MapQuest); dual-handle price slider drag, segmented
search popovers, and two-step booking checkout (StayBnB); HTML5 drag-and-drop
with hover-reveal inline editing (TaskFlow); auto-hide player controls,
thin-bar timeline scrub, nested settings popup, and hover-reveal volume
slider (VidHub).

Each site ships with 2 test-case YAMLs under eval/dataset/ scored against
tracker events, for 8 new tests total. eval/README.md documents each site
with main challenges. eval/AGENTS.md captures non-obvious implementation
gotchas learned during the build (stacking context with header popovers,
default-state event anti-pattern, tracker case normalization, dual-handle
drag, deep-link entry points). eval/SPEC_NEW_SITES.md is the design brief
the sites were generated from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Captures findings from a 105-run benchmark pass across qwen3.5-flash,
qwen3.5-plus, and qwen3.6-plus on the full 35-test dataset. Documents
seven recurring OpenBrowser agent issues (missing drag primitive, stale
DOM disorientation, terminal error interpretation, feedback-loop
blindness, instruction-precision drift, missing completion signal,
8765 HTTP channel fragility) with per-test evidence and proposed fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Brings in the 6 hard mock eval sites and supporting infrastructure
from the codex/mock-eval-sites branch (Amazon, Booking, Drive, GitHub,
Gmail, plus 12 test YAMLs and tracker/server plumbing) so the full
eval PR includes both rounds of codex-authored mock sites alongside
the follow-up 4 sites (MapQuest, StayBnB, TaskFlow, VidHub).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	eval/server.py
@softpudding
Owner Author

Closing — re-opening from /Users/yangxiao/git/OpenBrowser on a proper branch that matches the exact state the eval was run against.

