Eval: 10 mock sites + routine-compile eval + full-eval benchmark report #56
Merged

softpudding merged 14 commits into main on Apr 11, 2026
Conversation
- Fix _format_agent_event to match actual SSE event types (ActionEvent/ObservationEvent/MessageEvent) and use frontend-matching labels (STEP/ASK/RESULT/AGENT)
- Collapse multiline event bodies to single lines with "|" separator for readable terminal output
- Add --full-events flag for no-truncation mode (default 500 chars)
- Add _log() helper for immediate flushed stderr output at milestones
- Replace synthetic fixtures with real browser recordings
- Delete unused fixtures (techforum_upvote_clear, finviz_threshold_ambiguous)
- Update finviz_filter_clear expectations: require asking about sort direction (user's real goal is finding 20% monthly drops)
- Update replay YAML and golden routine for multi-filter Finviz workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Save two compiler-produced routines as static test assets and wire them into evaluate_browser_agent.py via replay YAML test cases:

- techforum_search_upvote_agents.md: search "AI", upvote + collect agent posts, open comments (6 criteria, 10 pts)
- finviz_filter_sort_open.md: 5 filters, Performance view, sort by Perf Month ascending, open top 3 losers (8 criteria, 12 pts)

Both run with:

    uv run python eval/evaluate_browser_agent.py \
      --test replay_techforum_upvote --test replay_finviz_filter_simple

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix UnicodeEncodeError in eval server when tracker events contain emoji surrogates (e.g. 👍 from TechForum upvote buttons)
- Fix search_for_ai criterion: search event fires on /techforum/, not /techforum/search.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
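For context on the surrogate fix: an emoji such as 👍 can survive lenient JSON decoding as an unpaired UTF-16 surrogate sequence inside a Python str, and such a string raises UnicodeEncodeError the moment it is encoded to UTF-8. A standard repair is to round-trip through UTF-16 with the surrogatepass error handler, which rejoins surrogate pairs. This is a sketch of one possible fix (the name `fix_surrogates` is hypothetical), not necessarily what eval/server.py does.

```python
def fix_surrogates(s: str) -> str:
    # Strings decoded leniently from JSON can contain lone UTF-16
    # surrogate pairs (e.g. "\ud83d\udc4d" for 👍) that str.encode("utf-8")
    # rejects. Re-encoding through UTF-16 with "surrogatepass" rejoins
    # the pair into a single code point.
    try:
        s.encode("utf-8")
        return s  # already clean
    except UnicodeEncodeError:
        return s.encode("utf-16", "surrogatepass").decode("utf-16")
```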
…tions

Bump agent-sdk to b7f39b82, which reframes the sorted-list asking rule around position-based vs identity-based replay divergence with a worked example, making the compiler reliably ask clarification questions.

Also fix the finviz_filter_clear fixture: remove the sort-direction question from required expectations, since the sort state is observable from the trace (element class + keyframe values). Update intent_summary and raw_intention to match.

Eval results: 2/2 pass, mean asking_behavior 1.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds four new browser-agent eval sites targeting interaction primitives not yet covered by the existing suite:

- MapQuest: panel-state navigation and icon-only transport buttons
- StayBnB: dual-handle price slider drag, segmented search popovers, and two-step booking checkout
- TaskFlow: HTML5 drag-and-drop with hover-reveal inline editing
- VidHub: auto-hide player controls, thin-bar timeline scrub, nested settings popup, and hover-reveal volume slider

Each site ships with 2 test-case YAMLs under eval/dataset/, scored against tracker events, for 8 new tests total. eval/README.md documents each site with its main challenges. eval/AGENTS.md captures non-obvious implementation gotchas learned during the build (stacking context with header popovers, default-state event anti-pattern, tracker case normalization, dual-handle drag, deep-link entry points). eval/SPEC_NEW_SITES.md is the design brief the sites were generated from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…outine

Brings in 4 additional mock eval sites (MapQuest, StayBnB, TaskFlow, VidHub) from the parallel new-mock-eval-sites worktree, alongside the 5 sites already merged from codex/mock-eval-sites (gmail, drive, booking, github, amazon).

Conflict resolution in eval/server.py: kept the generic file-serving refactor and DEFAULT_PORT / configurable-port main() from the target branch, and folded the 4 new site entries into SITE_NAME_TO_BUCKET, /api/sites, /api/help, and print_startup_info alongside the existing codex entries. The 4 new URL_MAPPINGS entries for /mapquest/, /staybnb/, /taskflow/, /vidhub/ are redundant with send_file's directory→index.html fallback but are kept for consistency with the legacy dataflow/finviz/bluebook/northstar entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
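The directory→index.html fallback that makes those URL_MAPPINGS entries redundant can be illustrated with a small path-resolution helper. This is a hypothetical sketch under assumed behavior, not the actual eval/server.py code; the function name and the traversal guard are illustrations.

```python
from pathlib import Path
from typing import Optional


def resolve_request_path(root: Path, url_path: str) -> Optional[Path]:
    # Generic static file serving: map a URL path under the site root,
    # falling back to index.html when the path names a directory. With
    # this fallback, per-site entries like "/mapquest/" need no explicit
    # mapping.
    candidate = (root / url_path.lstrip("/")).resolve()
    resolved_root = root.resolve()
    if resolved_root not in candidate.parents and candidate != resolved_root:
        return None  # reject path traversal outside the root
    if candidate.is_dir():
        candidate = candidate / "index.html"
    return candidate if candidate.is_file() else None
```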
- eval/reports/20260411_full_eval.md: 105-run benchmark across qwen3.5-flash, qwen3.5-plus, and qwen3.6-plus on the full 35-test dataset (pre-existing + codex mock sites + follow-up mock sites). Documents seven recurring OpenBrowser agent issues with per-test evidence and proposed fixes.
- eval/evaluation_report.json: refreshed with the 20260411 run (105 tests, 82.86% pass rate).
- eval/routine_eval/evaluate_routine_compile.py: factor out the main-report build/write helpers so partial runs still emit a valid report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
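The refactor in the last bullet (separate build/write helpers that also run on partial results) can be sketched as follows. All function names here are hypothetical and the report schema is assumed; this is not the actual evaluate_routine_compile.py code.

```python
import json
from pathlib import Path


def build_report(results: list) -> dict:
    # Aggregate whatever results exist so far into a serializable report.
    total = len(results)
    passed = sum(1 for r in results if r.get("passed"))
    return {
        "total": total,
        "passed": passed,
        "pass_rate": round(100 * passed / total, 2) if total else 0.0,
        "results": results,
    }


def write_report(report: dict, path: str) -> None:
    Path(path).write_text(json.dumps(report, indent=2))


def run_eval(test_cases, run_one, report_path: str) -> list:
    # Because build/write are factored out, the finally block can emit a
    # valid report even when a later test case raises mid-run.
    results = []
    try:
        for case in test_cases:
            results.append(run_one(case))
    finally:
        write_report(build_report(results), report_path)
    return results
```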
Satisfies the pre-commit black hook on files touched by earlier mock-site and routine-eval commits on this branch. No logic changes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Snapshots the exact state of feat/optimize-browser-routine that the 20260411 full-eval benchmark was run against. This PR is a superset of the eval work currently landed between main and this branch.

Noteworthy contents (13 commits total):

Mock eval sites (10 new hard sites):

- c451049 / cdf21cf — 6 codex-authored hard mock sites: Amazon, Booking, Drive, GitHub, Gmail + tracker/server plumbing + 13 test-case YAMLs.
- 895c376 — 4 follow-up mock sites: MapQuest, StayBnB, TaskFlow, VidHub + 8 test-case YAMLs + eval/AGENTS.md gotchas doc.

Routine compile/replay eval infra:

- 5d82823, 45c07e3, 12ba733, 3483695, 25b3a2e, c1f9f50, cc0f520 — routine record/replay eval support, TechForum interaction expansion, compiled-replay fixtures, eval server surrogate fix, record-compile-replay workflow docs, compiler ask_user prompting + Finviz fixture fixes.

Full-eval 20260411 benchmark (this commit, 37a38f1):

- eval/reports/20260411_full_eval.md — 105-run benchmark across qwen3.5-flash, qwen3.5-plus, and qwen3.6-plus on the full 35-test dataset. Documents seven recurring OpenBrowser agent issues (missing drag primitive, stale-DOM disorientation, terminal error interpretation, feedback-loop blindness, instruction-precision drift, missing completion signal, 8765 HTTP channel fragility) with per-test evidence and proposed fixes.
- eval/evaluation_report.json refreshed with 20260411 numbers (105 tests, 82.86% pass rate).
- eval/routine_eval/evaluate_routine_compile.py main-report build/write helpers factored out so partial runs still emit a valid report.

Test plan

- eval/reports/20260411_full_eval.md

🤖 Generated with Claude Code