Full-page screenshots with Browser-use

How to add custom Browser-use actions and take full-page screenshots

🛑 TL;DR just show me the code: https://github.com/focused-dot-io/browser-use-screenshotter-example/blob/main/main.py

See What Your Agent Sees

Web scraping is a powerful but notoriously brittle tool in a technologist's toolkit. Agentic web scraping with tools like Browser-use lets us hand-wave away some of that brittleness and accomplish what I call "fuzzy" tasks, i.e., tasks that traditional scraping tools (Beautiful Soup, Playwright, etc.) cannot handle on their own. Granted, that "fuzziness" comes with many, many tradeoffs.

The biggest tradeoff is that agentic web scrapers like Browser-use add an unfortunate element of non-determinism to our tasks because they are powered by LLMs. If you ask an agent to navigate to a particular URL and extract the price of a particular item and it returns $19.99, how do you know if it scrolled far enough to see all products? Did it take into account the discount code next to the item price? Did it even load the correct page?

It's these questions that have made me appreciate taking screenshots during each run: they bridge the gap between what the agent thinks happened and what actually happened.

Browser-use ships with a built-in screenshot tool, but I recently found myself wanting to take full-page screenshots. If you're only interested in the code snippet, check out the GitHub repo linked at the top of this post. Happy coding!

Enter Browser-use and the Chrome DevTools Protocol

Browser-use lets you drive a real browser session from inside an agentic workflow. Under the hood it uses the Chrome DevTools Protocol (CDP), the same protocol that Chrome's own DevTools use to inspect and control the browser.

The tool has dozens of built-in actions, including "click", "scroll", "input", and "screenshot". The built-in screenshot action is limited to the size of the browser's viewport, which is what blocked me when I wanted full-page screenshots. Fortunately, you can write your own custom actions in Browser-use!
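For context, a CDP command is just a JSON message sent over the browser's debugging WebSocket. The sketch below builds the message for `Page.captureScreenshot` with `captureBeyondViewport` enabled, the flag that lets a capture extend past the visible viewport. The helper name `build_capture_command` and the message `id` are made up for illustration, and no browser connection is opened here.

```python
import json


def build_capture_command(message_id: int, full_page: bool = True) -> str:
    """Serialize a Page.captureScreenshot command as a CDP JSON message."""
    command = {
        "id": message_id,  # client-chosen id, echoed back in the response
        "method": "Page.captureScreenshot",
        "params": {
            "format": "png",
            # Allow the capture to extend beyond the visible viewport when True.
            "captureBeyondViewport": full_page,
        },
    }
    return json.dumps(command)


if __name__ == "__main__":
    print(build_capture_command(1))
```

Browser-use's CDP client builds and sends this kind of message for you; seeing the raw shape mostly helps when reading the protocol documentation.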

Custom Tools: Taking a Full-Page Screenshot

Making a custom tool involves registering a new action on the tools registry. From there you fill in the implementation of the action and pass the tools registry to your browser agent. In my case, I get the current page from the browser session, generate a timestamped filename, take a full-page screenshot with CDP, and return the result.
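If the decorator-based registration pattern is unfamiliar, here is a toy version of the idea using only the standard library. `ToyRegistry` is a simplified stand-in written for this post, not Browser-use's actual implementation: it only shows how a decorator can store a function under its name so that something else (here, a `run` method) can look it up later.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class ToyRegistry:
    """Simplified stand-in for an action registry (not Browser-use's code)."""

    def __init__(self) -> None:
        self.actions: Dict[str, Callable[..., Awaitable[str]]] = {}
        self.descriptions: Dict[str, str] = {}

    def action(self, description: str):
        """Return a decorator that stores the function under its own name."""
        def decorator(func):
            self.actions[func.__name__] = func
            self.descriptions[func.__name__] = description
            return func
        return decorator

    async def run(self, name: str, **kwargs) -> str:
        """Look up an action by name and await it, as an agent loop might."""
        return await self.actions[name](**kwargs)


registry = ToyRegistry()


@registry.action(description="Pretend to take a full-page screenshot")
async def take_full_page_screenshot(url: str) -> str:
    return f"Saved screenshot of {url}"


if __name__ == "__main__":
    print(asyncio.run(registry.run("take_full_page_screenshot", url="https://example.com")))
```

The key point is that the function's name becomes the handle the agent uses, which is why the name you pick for a custom action matters.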

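import base64
from datetime import datetime

# Import path assumed for recent Browser-use releases; adjust to your installed version.
from browser_use import ActionResult, BrowserSession

# `tools`, `ScreenshotParams`, and `SCREENSHOTS_DIR` (a pathlib.Path) are assumed to be defined earlier in the file.


@tools.registry.action(description="Full-page screenshot via Chrome DevTools Protocol", param_model=ScreenshotParams)
async def take_full_page_screenshot(params: ScreenshotParams, browser_session: BrowserSession) -> ActionResult:
    page = await browser_session.get_current_page()
    if not page:
        return ActionResult(error="No current page", extracted_content="")

    # Timestamped filename so successive screenshots sort chronologically.
    filename = f"{datetime.now().strftime('%Y%m%d_%H%M%S')}.png"
    SCREENSHOTS_DIR.mkdir(parents=True, exist_ok=True)  # make sure the target directory exists
    filepath = SCREENSHOTS_DIR / filename

    # Talk to the page's CDP session directly (these are private Browser-use attributes).
    session_id = await page._ensure_session()
    cdp_client = page._client
    result = await cdp_client.send.Page.captureScreenshot(
        # captureBeyondViewport lets the capture extend past the visible viewport.
        params={"format": "png", "captureBeyondViewport": True},
        session_id=session_id,
    )

    # CDP returns the image as a base64-encoded string.
    filepath.write_bytes(base64.b64decode(result["data"]))

    return ActionResult(extracted_content=f"Saved {filename}")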
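One cheap sanity check on the saved file is to read the image dimensions straight from the PNG header and compare the height against the viewport. The sketch below does that with the standard library only. The function names `png_dimensions` and `looks_full_page` are hypothetical helpers, not part of Browser-use, and the viewport height is something you would supply yourself.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from a PNG's IHDR chunk."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # IHDR data starts at byte 16: 4-byte big-endian width, then height.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def looks_full_page(data: bytes, viewport_height: int) -> bool:
    """True when the image is taller than the viewport it was captured from."""
    _, height = png_dimensions(data)
    return height > viewport_height
```

Treat this as a heuristic: a short page can legitimately fit inside the viewport, and on high-DPI displays the image height is in device pixels, so compare like with like.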

To use this action, reference it by name in the prompt you pass to your browser agent.

For example, because my custom action is named take_full_page_screenshot, the prompt might read: "navigate to ${URL}, find product X and call 'take_full_page_screenshot'."
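A small helper keeps the action name and the prompt in sync so a rename does not silently break the reference. This is only a sketch: `build_task` and `ACTION_NAME` are names invented for this example, and the resulting string is what you would hand to your agent as its task.

```python
ACTION_NAME = "take_full_page_screenshot"  # must match the registered function name


def build_task(url: str, product: str, action_name: str = ACTION_NAME) -> str:
    """Compose a task prompt that names the custom action explicitly."""
    return f"Navigate to {url}, find {product}, and call '{action_name}'."


if __name__ == "__main__":
    print(build_task("https://example.com/shop", "product X"))
```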

Pro tip: Even if you don't add custom tools, it is best practice to reference Browser-use's built-in tools by name in your prompt.

Conclusion

If I ask an agentic scraper to output text, I don't take it at face value, and neither should you. "Trust, but verify." A screenshot gives us immediate context: the agent's run might be inconclusive, but we can at least see what the agent was looking at at that point in time. Screenshots enable grounded debugging: “Oh, it says the item is $0.00 because the price element hadn’t loaded yet.” They can even be curated into offline eval datasets.

In the case of agentic web browsing, screenshots help narrow the gap between "It worked???" and "It worked."
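As a starting point for the offline eval idea, each run can append one record to a JSONL manifest that pairs the screenshot with what the agent claimed. The sketch below assumes a manifest layout of my own choosing; `append_eval_record` and its field names are hypothetical, so rename them to fit your pipeline.

```python
import json
from pathlib import Path


def append_eval_record(manifest: Path, screenshot: str, task: str, agent_output: str) -> dict:
    """Append one screenshot/output pair to a JSONL manifest and return it."""
    record = {
        "screenshot": screenshot,      # path to the saved PNG
        "task": task,                  # prompt given to the agent
        "agent_output": agent_output,  # what the agent claimed
        "label": None,                 # filled in later by a human reviewer
    }
    with manifest.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A reviewer can then open each screenshot, fill in `label`, and you have a small ground-truth set to replay future agent versions against.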