
Add drag-and-drop, set_slider, hover persistence, and scroll-clip fix #58

Merged
softpudding merged 9 commits into main from fix/eval-t7-t1-drag-and-timeout on Apr 13, 2026

@softpudding
Owner

Summary

  • drag_and_drop_element: 2-phase commit interaction for dragging elements between containers (TaskFlow +10.5 score improvement)
  • set_slider: Universal slider control via slidable interaction hints (VidHub 15.0/15.0 all models)
  • Hover persistence: Maintains hover state for elements revealed by mouseover
  • Scroll-container clipping: isElementVisibleInScrollParent() filters out elements scrolled outside overflow:auto/scroll containers, preventing phantom highlight labels (fixes Drive Bulk Release flash regression)
  • API-stall retry: Auto-retries eval tests when API response gaps exceed 60s threshold
  • Dev reload: Vite watch mode with WebSocket auto-reload for Chrome extension development
  • qwen3.6-plus: Added to large model profile
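The API-stall retry above can be pictured as a gap check over response timestamps followed by a bounded re-run. The sketch below is only an illustration in TypeScript: the actual logic lives in the Python eval script, and `hasApiStall`, `runWithStallRetry`, the result shape, and the `maxRetries` default are all hypothetical names and values. Only the 60s threshold comes from this PR.

```typescript
// Hypothetical sketch; the real implementation is in the Python eval script.
const STALL_THRESHOLD_MS = 60000; // 60s threshold, taken from the PR summary

interface RunResult {
  passed: boolean;
  eventTimesMs: number[]; // timestamps of API responses seen during the run
}

/** True if any gap between consecutive response timestamps exceeds the threshold. */
function hasApiStall(
  eventTimesMs: number[],
  thresholdMs: number = STALL_THRESHOLD_MS
): boolean {
  let prev: number | undefined;
  for (const t of eventTimesMs) {
    if (prev !== undefined && t - prev > thresholdMs) return true;
    prev = t;
  }
  return false;
}

/** Re-runs a test while it stalls, up to maxRetries extra attempts (assumed limit). */
function runWithStallRetry(
  runOnce: () => RunResult,
  maxRetries: number = 2
): RunResult & { attempts: number } {
  let attempts = 1;
  let result = runOnce();
  while (hasApiStall(result.eventTimesMs) && attempts <= maxRetries) {
    result = runOnce();
    attempts++;
  }
  return { ...result, attempts };
}
```

A run whose responses arrive at 0 ms and 70000 ms would be treated as stalled and re-queued; a gap of exactly 60000 ms would not, since the threshold must be exceeded.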

Eval Results (84.76% vs 82.86% main baseline)

| Model | Pass | Rate |
|---|---|---|
| qwen3.5-flash | 27/35 | 77.1% |
| qwen3.5-plus | 31/35 | 88.6% |
| qwen3.6-plus | 31/35 | 88.6% |
| Total | 89/105 | 84.76% |

Test plan

  • Full 35-test eval pass for all 3 models
  • Drive Bulk Release Assets flash: FAIL 7.8 → PASS 10.0 after scroll-clip fix
  • TaskFlow Full Workflow: qwen3.5-plus and qwen3.6-plus both PASS 13.0 (previously FAIL 2.5 and 3.0)
  • VidHub Comment slider: 15.0/15.0 all models
  • Extension dev build succeeds with watch mode
  • Pre-commit passes (black + prettier)
  • Pytest: 464 passed, 4 skipped
  • Extension tests: 191 passed, 0 failed

🤖 Generated with Claude Code

softpudding and others added 9 commits April 12, 2026 19:50
Implement end-to-end drag-and-drop support: element discovery via
draggable/droppable element types and interaction hints, 2PC
confirmation flow with container preview, and precise drop placement
using relative_to/position. This addresses the 0% pass rate on
taskflow_drag_and_edit eval tasks where the agent had no way to
discover or execute DnD operations.

Key changes:
- Add draggable/droppable as element_types and interactionHints
- Detection heuristics: explicit attrs, cursor:grab, parent-of-draggable
- HighlightDropPreviewCommand: crops container, highlights inner elements
- confirm_drag_and_drop with relative_to/position for precise placement
- Fix drag script rAF hang on hidden tabs (setTimeout fallback)
- Fix post-drag occlusion false positive (DOM mutates during drag)
- Remove misleading offset_x/offset_y from LLM-facing tool schema
- Harden eval SSE streaming with retry and configurable timeouts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
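One plausible reading of the relative_to/position placement in the commit above is that the drop target resolves to an insertion index inside the droppable container. The sketch below assumes `position` takes the values `"before"` and `"after"` and that an omitted `relative_to` means "append"; the PR does not list the accepted values, and `resolveDropIndex` is a hypothetical helper, not a function from this change.

```typescript
// Assumed values; the PR does not enumerate what `position` accepts.
type DropPosition = "before" | "after";

/**
 * Resolves where a dragged item should land among a container's children.
 * childIds:   ids of the container's current children, in visual order
 * relativeTo: id of the sibling to place next to (undefined = append at end)
 */
function resolveDropIndex(
  childIds: string[],
  relativeTo: string | undefined,
  position: DropPosition = "after"
): number {
  if (relativeTo === undefined) return childIds.length;
  const anchor = childIds.indexOf(relativeTo);
  if (anchor === -1) {
    throw new Error(`relative_to element not found: ${relativeTo}`);
  }
  return position === "before" ? anchor : anchor + 1;
}
```

Failing loudly on an unknown `relative_to` id, rather than silently appending, keeps a stale reference from turning into a wrong-but-plausible drop.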
…lider

Three features addressing eval failures and interaction gaps:

1. Hover persistence: store last-hovered element per conversation/tab
   and replay hover events before confirmation screenshots, so
   hover-revealed UI (video controls) stays visible during click
   confirmation.

2. Slidable interaction hint: detect slider-like elements across three
   tiers (native range, ARIA role=slider, custom progress bars via
   structural heuristics) and annotate them with a "slidable" hint.
   Ancestor walk ensures leaf elements inside slider containers inherit
   the hint.

3. Universal set_slider: extend from native-only to three paths —
   native range (write value), ARIA slider (position click via
   aria-valuemin/max), and generic custom sliders (percentage-based
   position click with ancestor walk to find full-width container).

Also fixes: ancestor opacity walk for visibility detection, large
interactive region detection (video players), and slider/draggable
exclusion to prevent conflicting hints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
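The "position click" paths in item 3 come down to mapping a target value onto a point along the slider track. A minimal sketch of that arithmetic follows, assuming a horizontal, left-to-right slider; `sliderClickX` and `TrackRect` are hypothetical names, and in the extension the track geometry would presumably come from the element's bounding rect.

```typescript
// Horizontal extent of the slider track in viewport pixels (hypothetical type).
interface TrackRect {
  left: number;
  width: number;
}

/**
 * Maps a target value to the x coordinate to click on the track.
 * For an ARIA slider, min/max would be read from aria-valuemin/aria-valuemax;
 * for a generic percentage-based slider, min = 0 and max = 100.
 * Assumes a horizontal, left-to-right track.
 */
function sliderClickX(
  value: number,
  min: number,
  max: number,
  track: TrackRect
): number {
  if (max <= min) throw new Error("max must be greater than min");
  const clamped = Math.min(max, Math.max(min, value)); // keep the click on the track
  const fraction = (clamped - min) / (max - min);
  return track.left + fraction * track.width;
}
```

For example, a value of 50 on a 0 to 100 slider whose track starts at x = 100 and is 200 px wide maps to x = 200, the midpoint of the track.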
Updates VISUAL_GROUNDING, INTERACTION_MODEL, and DISCOVERY_STRATEGY
sections in the system prompt to cover the new DnD 2PC flow, slidable
interaction hints, and droppable element discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a Vite plugin that starts a WebSocket server (port 8767) during
`npm run dev`. The extension's background script connects to it in dev
builds and calls chrome.runtime.reload() on each rebuild, eliminating
the need to manually reload on chrome://extensions. The reload code is
tree-shaken out of production builds via a __DEV__ compile-time constant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The dynamic import via Vite's __vitePreload polyfill silently failed in
MV3 service workers (polyfill references `document` which doesn't exist).
Switch to static import and disable the modulepreload polyfill. Also
change `npm run dev` from watch mode to one-shot build+reload.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The instruction said "board-packet" (hyphenated) but the mock thread
subject uses "Board Packet" (space-separated), causing the agent to
use a search term that doesn't match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…eval results

- Add isElementVisibleInScrollParent() to filter elements scrolled out of
  view inside overflow containers, preventing phantom labels from confusing
  the agent (fixes Drive Bulk Release Assets flash regression)
- Add API-stall detection and retry logic to evaluate_browser_agent.py
  so transient API timeouts trigger automatic re-queue
- Add qwen3.6-plus to the large model profile
- Update evaluation_report.json with merged results: 89/105 (84.76%)
  flash 27/35, plus 31/35, 3.6-plus 31/35

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
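The scroll-clip filter in this commit is, at its core, a rectangle-overlap test between an element and its scrolling ancestor. The sketch below shows only that geometric step under stated assumptions: `Rect` and `isRectVisibleInScrollParent` are hypothetical names, the body of the real isElementVisibleInScrollParent() is not shown in this PR, and finding the nearest overflow:auto/scroll ancestor and reading both rects from the DOM is left out.

```typescript
// Viewport-relative box, shaped like the fields of a DOM bounding rect.
interface Rect {
  top: number;
  left: number;
  bottom: number;
  right: number;
}

/**
 * True if any part of the element's box lies inside the scroll parent's box.
 * An element scrolled fully outside the container has no positive overlap,
 * so it would be skipped when drawing highlight labels.
 */
function isRectVisibleInScrollParent(el: Rect, parent: Rect): boolean {
  const overlapW = Math.min(el.right, parent.right) - Math.max(el.left, parent.left);
  const overlapH = Math.min(el.bottom, parent.bottom) - Math.max(el.top, parent.top);
  return overlapW > 0 && overlapH > 0;
}
```

Requiring strictly positive overlap treats an element that merely touches the container's edge as hidden. Whether partially visible elements should count as visible is a design choice; this sketch counts them.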
The agent-sdk prompt changed from "Only click, keyboard_input, and select
use the YELLOW stage" to "click, keyboard_input, and select return a YELLOW
preview screenshot before execution." Update the test assertion to match.
Also includes auto-formatting from black and prettier.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softpudding merged commit bb93236 into main on Apr 13, 2026
3 checks passed