Add 4 new mock eval sites + full-eval observation report by softpudding · Pull Request #55 · softpudding/OpenBrowser

softpudding · 2026-04-11T13:25:49Z

Summary

Adds 4 new mock eval sites (MapQuest, StayBnB, TaskFlow, VidHub) with 8 test-case YAMLs targeting interaction primitives not covered by the existing suite (panel-state navigation, dual-handle price slider, HTML5 drag-and-drop, auto-hide player controls).
Adds eval/AGENTS.md with non-obvious implementation gotchas (stacking contexts, default-state event anti-pattern, tracker case normalization).
Adds full-eval observation report from a 105-run benchmark pass across qwen3.5-flash, qwen3.5-plus, and qwen3.6-plus. Documents 7 recurring OpenBrowser agent issues with per-test evidence and proposed fixes.

Test plan

All 35 tests load without schema errors
Full 105-run eval completed (~3h, 82.9% avg pass rate across models)
Reviewer sanity-checks site implementations and the observation report
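
The run count in the test plan follows from 35 tests × 3 models = 105 runs. As a rough sketch of how an "avg pass rate across models" figure could be computed, assuming it is the unweighted mean of per-model pass rates (the report does not spell this out), the helper names and the outcome data below are hypothetical placeholders, not the real eval results:

```python
from statistics import mean

TESTS = 35
MODELS = ["qwen3.5-flash", "qwen3.5-plus", "qwen3.6-plus"]
TOTAL_RUNS = TESTS * len(MODELS)  # one run per (test, model) pair


def pass_rate(results):
    """Fraction of runs that passed; `results` is a list of booleans."""
    return sum(results) / len(results)


def average_pass_rate(results_by_model):
    """Unweighted mean of the per-model pass rates (an assumed definition)."""
    return mean(pass_rate(r) for r in results_by_model.values())


# Placeholder outcomes for illustration only, NOT the actual eval data.
placeholder_results = {
    "model_a": [True] * 21 + [False] * 14,
    "model_b": [True] * 28 + [False] * 7,
    "model_c": [True] * 35,
}

if __name__ == "__main__":
    print(f"total runs: {TOTAL_RUNS}")
    print(f"average pass rate: {average_pass_rate(placeholder_results):.1%}")
```

Because every model runs the same 35 tests, the unweighted mean of per-model rates equals the overall fraction of passing runs, so either definition would give the same headline number.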

🤖 Generated with Claude Code

Adds four new browser-agent eval sites targeting interaction primitives not yet covered by the existing suite: panel-state navigation and icon-only transport buttons (MapQuest); dual-handle price slider drag, segmented search popovers, and two-step booking checkout (StayBnB); HTML5 drag-and-drop with hover-reveal inline editing (TaskFlow); auto-hide player controls, thin-bar timeline scrub, nested settings popup, and hover-reveal volume slider (VidHub).

Each site ships with 2 test-case YAMLs under eval/dataset/ scored against tracker events, for 8 new tests total.

eval/README.md documents each site with main challenges. eval/AGENTS.md captures non-obvious implementation gotchas learned during the build (stacking context with header popovers, default-state event anti-pattern, tracker case normalization, dual-handle drag, deep-link entry points). eval/SPEC_NEW_SITES.md is the design brief the sites were generated from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
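
To make "scored against tracker events" and "tracker case normalization" more concrete, here is a minimal sketch of one plausible reading: expected events from a test case are compared against events recorded by the site tracker after lower-casing keys and string values. The event schema (`type`, `column`), the function names, and the fraction-based score are all assumptions for illustration; the actual format lives in eval/dataset/ and the tracker code.

```python
def normalize(event):
    """Lower-case keys and strip/lower-case string values so 'Done ' matches 'done'."""
    out = {}
    for key, value in event.items():
        out[key.lower()] = value.strip().lower() if isinstance(value, str) else value
    return out


def matches(expected, actual):
    """True if every expected field appears in the actual event with an equal value."""
    return all(actual.get(k) == v for k, v in expected.items())


def score(expected_events, tracked_events):
    """Fraction of expected events found among tracked events (order ignored)."""
    if not expected_events:
        return 0.0
    tracked = [normalize(e) for e in tracked_events]
    hits = sum(
        any(matches(normalize(exp), act) for act in tracked)
        for exp in expected_events
    )
    return hits / len(expected_events)


# Hypothetical events, loosely modeled on a TaskFlow-style drag-and-drop test.
expected = [
    {"type": "card_moved", "column": "Done"},
    {"type": "card_renamed"},
]
tracked = [
    {"type": "CARD_MOVED", "column": "done ", "card_id": 3},
]

if __name__ == "__main__":
    print(score(expected, tracked))
```

Normalizing both sides before comparison means a test author can write `Done` in the YAML while the site emits `done`, without the mismatch being counted as an agent failure.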

Captures findings from a 105-run benchmark pass across qwen3.5-flash, qwen3.5-plus, and qwen3.6-plus on the full 35-test dataset. Documents seven recurring OpenBrowser agent issues (missing drag primitive, stale DOM disorientation, terminal error interpretation, feedback-loop blindness, instruction-precision drift, missing completion signal, 8765 HTTP channel fragility) with per-test evidence and proposed fixes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Brings in the 6 hard mock eval sites and supporting infrastructure from the codex/mock-eval-sites branch (Amazon, Booking, Drive, GitHub, Gmail, plus 12 test YAMLs and tracker/server plumbing) so the full eval PR includes both rounds of codex-authored mock sites alongside the follow-up 4 sites (MapQuest, StayBnB, TaskFlow, VidHub).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

# Conflicts:
#	eval/server.py

softpudding · 2026-04-11T13:52:25Z

Closing; re-opening from /Users/yangxiao/git/OpenBrowser on a proper branch that matches the exact state the eval was run against.

softpudding and others added 5 commits April 11, 2026 15:21

Add hard mock evaluation sites

c451049

Fix review findings in eval mocks

cdf21cf

softpudding closed this Apr 11, 2026

softpudding wants to merge 5 commits into main from codex/new-mock-eval-sites
