73 changes: 73 additions & 0 deletions eval/AGENTS.md
@@ -0,0 +1,73 @@
# eval/ — Agent Notes

Implementation knowledge for the mock sites under `eval/`. Repo-wide conventions live in `../AGENTS.md`. Read **both** before editing sites or test cases.

## Source of truth

- **`SPEC_NEW_SITES.md`** is the design brief that generated the four post-2026-03 sites: `mapquest/`, `staybnb/`, `taskflow/`, `vidhub/`. When regenerating or extending a site, re-read the relevant section there instead of reverse-engineering from the rendered HTML. The spec pins the **intended behaviours**; the HTML only reflects what actually shipped.
- Test-case YAMLs in `dataset/` are the second source of truth — a site change that breaks a criterion's expected event shape is a site bug, not a criterion bug, unless the spec disagrees.

## Tracker contract

- Every site uses the shared `eval/js/tracker.js` — `window.tracker = new AgentTracker('<site>', '<difficulty>')`. Do not create site-local trackers.
- Emitted event values must match YAML criteria **exactly**, including case. Normalize at the emit site, not in the criterion. Concrete case: StayBnB amenity checkboxes render "Wifi"/"Kitchen" but `staybnb_book.yaml` expects `amenity: "wifi"`. The fix is `.toLowerCase()` in the tracker call, not capitalizing the YAML.
- When instruction text references a button label, the label must be the literal DOM text. StayBnB's filter apply button reads "Show N stays" (live count), so the instruction and criterion description both say "Show N stays" — never a generic "Apply".
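
A minimal sketch of emit-site normalization. `amenityPayload` is a hypothetical helper, not part of `tracker.js`; only the `amenity` field and its lowercase expected value come from the StayBnB case above.

```javascript
// Hypothetical helper: builds the payload for an amenity event.
// Only the `amenity` key and the lowercase value are taken from the
// staybnb_book.yaml expectation described above.
function amenityPayload(domLabel) {
  // Normalize the rendered label ("Wifi", " Kitchen ") to the criterion's form.
  return { amenity: domLabel.trim().toLowerCase() };
}
```

The result would be handed to whatever emit method the shared tracker exposes, so the checkbox keeps its display label while the event carries the value the YAML expects.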

## Default-state events — do not auto-credit

Anti-pattern: if a criterion expects `route_select` / `transport_mode_select` but the UI **pre-selects** the default option (shortest route, drive mode, etc.), a user who agrees with the default never clicks, so no event fires and the criterion fails. The tempting "fix" is to emit a synthetic `*_select` on state entry with `defaultSelected: true`.

**Don't do this.** The synthetic event also fires when the agent does nothing, so the criterion passes for a no-op run — and in tests like `mapquest_nearby_pins` where "Choose driving mode" is scored, the agent gets credit without ever identifying the icon. It silently dilutes the test signal.

**Rules:**

1. Criteria must match only on explicit user interaction events. No state-entry auto-emits for `*_select` style events.
2. If a test asks the agent to click a pre-selected default, it is the test author's job to make the target unambiguous. Either:
   - Change the task to select a **non-default** option (e.g., `mode: "walk"` instead of `"drive"`), or
   - Pin the criterion to specific field values (e.g., `routeIndex: 0`) so explicit clicks still match, and accept that re-clicking the default is part of the task.
3. When naming the criterion "Select the shortest route", pin `routeIndex: 0` in the YAML so the scorer distinguishes shortest from non-shortest.

History: `mapquest.js` briefly emitted both `route_select` and `transport_mode_select` as state-entry defaults; the pattern was removed after a review showed `mapquest_nearby_pins` was auto-crediting `transport_mode_select: drive` on directions-panel entry.
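
Rule 1 can be sketched as follows. This is an illustration under assumptions, not the code in `mapquest.js`: `emit` and the in-memory `emitted` array stand in for the shared tracker, and the function names are hypothetical. Only the event name `transport_mode_select` and the `mode` field come from the notes above.

```javascript
// Stand-in for the shared tracker: records events in memory (hypothetical shape).
const emitted = [];
function emit(type, data) {
  emitted.push({ type, data });
}

// Entering the directions panel sets the visual default but emits nothing.
function enterDirectionsPanel(state) {
  state.mode = "drive";
  return state;
}

// Only an explicit click emits, even when it re-selects the default.
function onModeClick(state, mode) {
  state.mode = mode;
  emit("transport_mode_select", { mode });
  return state;
}
```

With this split, a no-op run leaves the event log empty, while an agent that deliberately clicks the pre-selected drive icon still produces a matching event.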

## Panel-state machines vs page routes

MapQuest and StayBnB both have panels whose views swap in place rather than navigating. Keep state transitions **stateful** (class toggles on panel containers, `switchPanelState(...)`), not URL-based — the spec intentionally tests panel-state navigation.

Consequence: if two tests need the same control in different panel states, duplicate the DOM rather than routing between states. Concrete case: `mapquest_navigate` wants the Directions flow inside `place-detail`, but `mapquest_nearby_pins` wants the category chip bar visible from inside `place-detail` too. The chip bar is duplicated inside the place-detail state; active-class sync is done across both copies via `document.querySelectorAll('.chip[data-category="..."]')`.
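
A minimal sketch of the active-class sync across duplicated chips. The function name and the `active` class name are assumptions; the `data-category` attribute comes from the selector quoted above.

```javascript
// Toggle the active class on every copy of a category chip.
// `chips` is any iterable of elements, e.g. document.querySelectorAll('.chip').
// The class name "active" is an assumption for illustration.
function syncActiveChips(chips, category) {
  for (const chip of chips) {
    // The second argument forces the class on or off, so every copy of the
    // selected chip is activated and every other chip is cleared.
    chip.classList.toggle("active", chip.dataset.category === category);
  }
}
```

Calling it with all chips from both panel states keeps the copies in lockstep regardless of which copy was clicked.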

## Deep-link entry points

Some tests start pre-loaded into a non-home view so the agent doesn't have to traverse navigation it isn't being scored on. StayBnB supports `#results` in `staybnb/js/staybnb.js#init()` to jump straight into results with Tokyo listings rendered. The test YAML's `start_url` ends with `/staybnb/#results`. Add a similar hash handler whenever a new test needs a non-home starting view.
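
A hash handler of this kind might look like the sketch below. `resolveStartView`, the view names, and the `"home"` fallback are assumptions for illustration; only the `#results` hash is described in these notes.

```javascript
// Map a location hash to a starting view name.
// Only "#results" is documented; the view names and fallback are hypothetical.
const START_VIEWS = new Map([["#results", "results"]]);

function resolveStartView(hash) {
  // Unknown or empty hashes fall back to the home view.
  return START_VIEWS.get(hash) ?? "home";
}
```

Inside a site's `init()`, the result of `resolveStartView(window.location.hash)` would decide which view to render first; a new deep link is then one extra map entry.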

## Stacking contexts — popovers and headers

Popovers anchored inside a header are confined to the header's position in the z-order, because a positioned header with its own `z-index` creates a stacking context: a descendant's `z-index` only competes inside that context, however large it is. If a full-screen dismiss backdrop sits above the header's z-index, clicks on the popover input land on the backdrop instead and close the popover immediately.

Concrete case: StayBnB's search pill popovers were unclickable until `.topbar` was raised from `z-index: 100` to `300`, above the `.popover-backdrop` at `150`. When adding any overlay that uses a backdrop pattern, verify the anchoring element's stacking context dominates the backdrop.

## Real images

Agents occasionally need real visual content (thumbnails, destination cards, gallery photos). Use **Lorem Picsum seeded URLs** so images are deterministic across runs:

```
https://picsum.photos/seed/<stable-seed>/<W>/<H>
```

Seeds used so far live in `staybnb/index.html` (home cards) and `staybnb/js/staybnb.js` (results + detail + gallery — seeds of form `staybnb-<listingId>-<k>`). Prefer `background-image: url(...)` with `background-size: cover` so the layout tolerates any aspect ratio.
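
A small sketch of building these URLs. The helper names are hypothetical; the URL pattern and the `staybnb-<listingId>-<k>` seed form are the ones given above.

```javascript
// Build a seed of the form staybnb-<listingId>-<k>.
function listingSeed(listingId, k) {
  return `staybnb-${listingId}-${k}`;
}

// Build a deterministic Lorem Picsum URL for a seed and size.
function picsumUrl(seed, width, height) {
  // encodeURIComponent guards against seeds containing "/" or spaces.
  return `https://picsum.photos/seed/${encodeURIComponent(seed)}/${width}/${height}`;
}
```

For example, `picsumUrl(listingSeed(12, 0), 400, 300)` yields `https://picsum.photos/seed/staybnb-12-0/400/300`, which can be dropped into a `background-image: url(...)` declaration.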

## Drag-and-drop primitives

- **StayBnB price slider** is dual-handle. Each handle drag emits `price_slider_change` with `handle: "min"` or `handle: "max"`. The two criteria are independent, so the handles must be draggable independently — don't couple them.
- **TaskFlow cards** use HTML5 drag-and-drop, not click-to-move. Criteria expect `card_drop` with source and destination column IDs.
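
The independent-handle requirement for the price slider can be sketched as a pure state update. This is an illustration under assumptions: the function name, the `floor`/`ceil` bounds, the `value` field, and the clamping rule are hypothetical; only the event name `price_slider_change` and the `handle` field come from the notes above.

```javascript
// Move one handle of a dual-handle range without touching the other.
// `range` has the shape { floor, ceil, min, max } (assumed for this sketch).
function dragHandle(range, handle, value) {
  const next = { ...range };
  if (handle === "min") {
    // Clamp to the track floor and stop at the max handle instead of pushing it.
    next.min = Math.min(Math.max(value, range.floor), range.max);
  } else {
    // Clamp to the track ceiling and stop at the min handle instead of pushing it.
    next.max = Math.max(Math.min(value, range.ceil), range.min);
  }
  // One event per drag, tagged with the handle that actually moved.
  const event = { type: "price_slider_change", handle, value: next[handle] };
  return { range: next, event };
}
```

Because each call changes exactly one of `min` or `max`, a drag on one handle can never satisfy the other handle's criterion by side effect.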

## Manual-test harness workflow

When validating a new or changed test via `evaluate_browser_agent.py --manual`:

1. `mkfifo /tmp/eval_in` and hold the write end open with a background `sleep` so the harness's `stdin` stays open across multiple tests.
2. The harness clears events at instruction display time — timing starts then, not at page load.
3. Sending `ok\n` to the FIFO completes a test and prints the score breakdown. A criterion that reports 0 despite visibly correct behaviour is almost always an event-shape mismatch (case, missing field, or default-state pattern above).

## Scope discipline

Keep site code minimal and aligned with the spec's stated challenges. Do **not** add extra animations, fallbacks, or abstractions beyond what a criterion exercises. The sites exist to probe specific agent weaknesses; incidental complexity dilutes the signal and creates spurious event noise.