A benchmark for evaluating web agents on everyday tasks directly on real websites.
Dataset: https://huggingface.co/datasets/yutori-ai/navi-bench
Blog post: https://yutori.com/blog/introducing-navigator
Want to understand what the benchmark tasks look like? You can run them manually using our human-in-the-loop demo:

```bash
python -m demo
```

We recommend installing with uv:

```bash
uv sync --extra eval
source .venv/bin/activate
python -m playwright install chromium webkit
```

Or, using raw pip:

```bash
pip install -e ".[eval]"
python -m playwright install chromium webkit
```

```python
from datasets import load_dataset

from navi_bench.base import DatasetItem, instantiate

# Load dataset from HF
dataset = load_dataset("yutori-ai/navi-bench")

# Load a task from the dataset
task_item = DatasetItem.model_validate(dataset[0])

# Generate the task configuration
task_config = task_item.generate_task_config()

# Access task details
print(f"Task: {task_config.task}")
print(f"URL: {task_config.url}")
print(f"Evaluation Config: {task_config.eval_config}")

# Instantiate the evaluator when starting the agent task
agent = ...
evaluator = instantiate(task_config.eval_config)

for _ in range(max_steps):
    # Agent takes a step
    ...
    # Update the evaluator
    await evaluator.update(...)

# Get the final evaluation result
eval_result = await evaluator.compute()
```

Note: most evaluators rely on site state for verification, so make sure the verifier runs before the browser window is closed.
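One way to guarantee the verifier runs before the browser goes away is to wrap the step loop in `try`/`finally`. The sketch below uses a stand-in evaluator and an event log instead of the benchmark's real classes, purely to illustrate the ordering:

```python
import asyncio

log = []


class StubEvaluator:
    """Stand-in for an evaluator returned by instantiate(); real ones inspect live site state."""

    async def update(self, observation):
        log.append(("update", observation))

    async def compute(self):
        log.append(("compute", None))
        return 1.0


async def run_task(evaluator, max_steps=3):
    try:
        for step in range(max_steps):
            # Agent acts here; the evaluator then inspects the still-open site.
            await evaluator.update(f"state-{step}")
        # Score while the browser and site state are still available.
        return await evaluator.compute()
    finally:
        # Tear the browser down only after compute() has run.
        log.append(("close_browser", None))


score = asyncio.run(run_task(StubEvaluator()))
```

Because the `return` expression is evaluated before the `finally` block runs, `compute()` always sees the site state before teardown, even if a step raises.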
We provide an evaluation script for the Yutori n1 model. You can use it as a reference for evaluating your own agents. The reference evaluator preserves the full saved trajectory for visualization while using the SDK's native n1 payload and coordinate helpers to keep the request history bounded and action execution aligned.
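A "bounded request history" can be approximated with a fixed-length `collections.deque`: old steps fall off as new ones arrive, so each model request stays a constant size. This is only a sketch of the idea; the n1 SDK's own helpers may work differently:

```python
from collections import deque

# Keep only the most recent N (screenshot, action) pairs in the request payload.
MAX_HISTORY = 5
history = deque(maxlen=MAX_HISTORY)

for step in range(12):
    history.append({"step": step, "action": f"click-{step}"})

# Only the last MAX_HISTORY steps survive.
recent_steps = [h["step"] for h in history]
```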
- Authenticate with Yutori:

  ```bash
  yutori auth login
  ```

  This will open Yutori in your browser and save your API key locally to `~/.yutori/config.json`.

  Or, set the API key manually:

  ```bash
  export YUTORI_API_KEY=yt-...
  ```

  If both are present, the environment variable takes precedence over saved credentials.

- (Optional, but recommended) Use a remote browser provider (such as BrightData) to avoid getting blocked by certain websites.
  - By default, the eval script uses a remote browser connected via the `BROWSER_CDP_URL` environment variable for sites that tend to block automated browsers (apartments.com, resy.com).
  - If `BROWSER_CDP_URL` is not set, it falls back to a local browser, which may get blocked and lead to crashes. In that case, you can re-run the eval script after the first run with `--eval_concurrency 2` to retry the crashed tasks.
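The remote-vs-local fallback described above can be sketched with Playwright's async API. `chromium.connect_over_cdp` and `chromium.launch` are real Playwright methods, but the selection policy here is illustrative, not the eval script's exact code:

```python
import os


async def launch_browser(pw):
    """Connect via BROWSER_CDP_URL when set; otherwise launch a local browser.

    `pw` is a started Playwright instance (from playwright.async_api);
    chromium.connect_over_cdp and chromium.launch are Playwright methods.
    """
    cdp_url = os.environ.get("BROWSER_CDP_URL")
    if cdp_url:
        # Remote providers (e.g. BrightData) are less likely to be blocked.
        return await pw.chromium.connect_over_cdp(cdp_url)
    # Local fallback: some sites may block automated local browsers.
    return await pw.chromium.launch(headless=True)
```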
Evaluate on a single sample:

```bash
python -m evaluation.eval_n1 \
    --dataset_include_domains 'craigslist' \
    --dataset_max_samples 1
```

Run a single custom example without verification, which is useful for testing (see `examples/example.json`):

```bash
python -m evaluation.eval_n1 \
    --dataset_item_json examples/example.json
```

The JSON file can include a readable `task_generation_config` object instead of the serialized `task_generation_config_json` field. For custom tasks, set `"use_cdp": true` inside the task config and set the `BROWSER_CDP_URL` env var if you want that task to use the remote browser.
Evaluate on the full dataset (we recommend setting `BROWSER_CDP_URL` to avoid being blocked by certain websites):

```bash
BROWSER_CDP_URL=... \
python -m evaluation.eval_n1
```

Optionally, evaluate on other datasets that share the same schema (e.g., Halluminate Westworld):

```bash
HALLUMINATE_API_KEY=... \
python -m evaluation.eval_n1 \
    --dataset_name 'Halluminate/westworld'
```

At the end of a full Navi-Bench run, the script prints the number of finished/crashed tasks and three scores:
- Lower Bound: treat crashed tasks as score = 0, then average over all tasks
- Excl. Crashed: exclude crashed tasks, then average over the remaining tasks
- Upper Bound: treat crashed tasks as score = 1, then average over all tasks
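The three aggregates can be reproduced from per-task scores. A minimal sketch, assuming crashed tasks are represented as `None`:

```python
def summarize(scores):
    """scores: per-task scores in [0, 1], with None marking a crashed task."""
    n = len(scores)
    crashed = sum(1 for s in scores if s is None)
    finished = [s for s in scores if s is not None]
    return {
        "lower_bound": sum(finished) / n,                 # crashed count as 0
        "excl_crashed": sum(finished) / len(finished),    # crashed dropped
        "upper_bound": (sum(finished) + crashed) / n,     # crashed count as 1
    }


# 4 tasks: two successes, one failure, one crash.
summary = summarize([1.0, 0.0, None, 1.0])
```

The three numbers bracket the "true" score: the real performance on crashed tasks lies somewhere between the pessimistic and optimistic treatments.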
Results are saved to `results_n1/` by default. The script automatically resumes from existing results, so you can re-run it to retry any crashed tasks. To start fresh, delete the directory or pass a different `--eval_save_dir`.

Each task gets its own sub-directory containing a `visualization.html` file that lets you step through the agent's trajectory with annotated screenshots.
The Navi-Bench dataset is available on HuggingFace. It consists of 100 tasks from five real websites: Apartments, Craigslist, OpenTable, Resy, and Google Flights.

Optionally, you may check out Westworld, a benchmark from Halluminate featuring five simulated environments for e-commerce and travel tasks. Both datasets share the same schema and can thus be directly concatenated for joint evaluation.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
If you use Yutori Navi-Bench in your research, please cite:
```bibtex
@misc{yutori2025navigator,
  author = {Yutori},
  title = {Introducing Navigator},
  howpublished = {\url{https://yutori.com/blog/introducing-navigator}},
  note = {Yutori Blog},
  year = {2025},
}
```

Contributions are welcome! Please feel free to submit a Pull Request.
For questions or issues, please open an issue on GitHub.
