This repo features code for the paper RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users.
It contains:
- Instructions for using the benchmark for evaluation
- Implementations of the baseline experiments
To use our benchmark, first clone the repository:
git clone https://github.com/SCAI-JHU/RealWebAssist.git
Then access our dataset on Hugging Face: https://huggingface.co/datasets/stonehj05/RealWebAssist
You can download the dataset as follows:
Install the Hugging Face CLI:
pip install huggingface_hub
Download the dataset:
huggingface-cli download stonehj05/RealWebAssist --repo-type dataset --local-dir ./RealWebAssist
Our benchmark evaluates correctness by checking whether the model's output coordinate falls within one of the ground-truth bounding boxes.
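For reference, here is a minimal sketch of that correctness check, assuming boxes in (x_min, y_min, x_max, y_max) pixel format (an assumption for illustration; evaluate.py implements the actual check):

```python
from typing import Iterable, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) pixel format

def is_correct(x: float, y: float, boxes: Iterable[Box]) -> bool:
    """Return True if the predicted point (x, y) falls inside any ground-truth box."""
    return any(x_min <= x <= x_max and y_min <= y <= y_max
               for x_min, y_min, x_max, y_max in boxes)

# Example: a prediction at (120, 45) checked against two candidate boxes.
print(is_correct(120, 45, [(100, 30, 200, 60), (300, 400, 350, 450)]))  # True
```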
After downloading the dataset, you will see folders 1-10 (one per human participant). Create a new folder named full_human_dataset and move these folders into it, for example with the Python sketch below.
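A short sketch for this step, assuming the dataset was downloaded into the cloned repository directory via the --local-dir flag shown above (adjust the paths if your layout differs):

```python
import shutil
from pathlib import Path

repo_root = Path("./RealWebAssist")            # cloned repo, also used as --local-dir above (assumption)
target_dir = repo_root / "full_human_dataset"  # destination expected by the evaluation scripts
target_dir.mkdir(exist_ok=True)

# Move the ten participant folders (1-10) under full_human_dataset/.
for participant in map(str, range(1, 11)):
    src = repo_root / participant
    if src.is_dir():
        shutil.move(str(src), str(target_dir / participant))
```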
The final file structure should be as follows:
RealWebAssist/
├── model_scripts/ # 🔧 Inference and baseline model implementations
├── output_files/ # 📦 Precomputed outputs and results
│ └── reasoning_results/ # Reasoning results from baseline models (e.g., GPT-4o, Claude)
├── full_human_dataset/ # 👥 Human interaction data (10 participants)
│ ├── 1/ # Participant 1 data
│ ├── 2/ # Participant 2 data
│ ├── ... # More participants (3–10)
├── evaluate.py # 🚀 Entry point for running all evaluations
├── environment.yaml # 🧪 Conda environment specification (not recommended; follow the setup instructions below instead)
└── README.md # 📖 Project overview and usage instructions
Each folder for human data should have the following structure:
1/
├── answer_images/ # Images showing the ground-truth bounding boxes
├── audio/ # Audio clips of participant speech
├── images/ # Screenshots of the webpages
├── extracted_actions.json # GPT-4o extracted actions (given as history for baseline evaluations)
├── extracted_actions_gt.json # GPT-4o extracted actions with ground-truth captions
├── questions_gt.json # Question data in several versions; the second file (questions_with_task_updated.json), which also includes the task goal labels, is sufficient on its own
├── questions_with_task_updated.json
├── questions_with_task.json
├── questions_Wlarge.json
├── transcriptions_GT.json # Ground-truth transcriptions of the speech; the files below are transcriptions from different audio models
├── transcriptions_Wlarge.json
├── transcriptions_Wturbo.json
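As a sanity check after arranging the data, here is a minimal sketch that loads the per-participant JSON files listed above (the paths assume you run it from the repository root; the JSON schemas are not described here, so the sketch only loads and counts entries):

```python
import json
from pathlib import Path

dataset_root = Path("full_human_dataset")  # assumed location from the setup steps above

for i in range(1, 11):  # participants 1-10
    participant_dir = dataset_root / str(i)
    # Ground-truth questions (questions_with_task_updated.json adds task goal labels).
    with open(participant_dir / "questions_gt.json") as f:
        questions = json.load(f)
    # Ground-truth speech transcriptions.
    with open(participant_dir / "transcriptions_GT.json") as f:
        transcriptions = json.load(f)
    print(f"Participant {i}: {len(questions)} question entries, {len(transcriptions)} transcription entries")
```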
To reproduce the baseline results, first set up a conda environment and install the necessary packages:
conda create -n realwebassist python=3.9
conda activate realwebassist
pip install transformers==4.48.0
pip install qwen-vl-utils
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install accelerate
For reproducibility and to save cost, we provide the reasoning results for GPT-4o, Claude 3.7 Sonnet, and OpenAI o1 (running o1 costs around $200 USD for the whole benchmark).
Note: the baseline experiments require a GPU to run.
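Before launching the baselines, you can confirm that PyTorch sees your GPU with standard PyTorch calls (not part of the repo's scripts):

```python
import torch

# Check that CUDA is available and report the device the baselines would run on.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```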
To run the baseline experiments without a reasoning model:
For OS-Atlas:
python evaluate.py --model_to_run os_atlas
For UGround:
python evaluate.py --model_to_run uground
For Aria-UI:
python evaluate.py --model_to_run aria_ui
To run the baseline experiments with the saved reasoning results:
For GPT-4o + OS-Atlas:
python evaluate.py --model_to_run os_atlas_gpt
For GPT-4o + UGround:
python evaluate.py --model_to_run uground_gpt
For GPT-4o + Aria-UI:
python evaluate.py --model_to_run aria_ui_gpt
For OpenAI o1 and Claude 3.7 Sonnet, we only provide evaluation with UGround.
For OpenAI o1 + UGround:
python evaluate.py --model_to_run o1
For Claude 3.7 Sonnet + UGround:
python evaluate.py --model_to_run claude
If you want to run the reasoning instead of using the existing results, first install these additional packages for the OpenAI API:
pip install openai
pip install python-dotenv
And install this additional package for the Claude API:
pip install anthropic
Then, run the scripts:
For gpt-4o:
python evaluate.py --model_to_run gpt_reasoning
For o1:
python evaluate.py --model_to_run o1_reasoning
For Claude:
python evaluate.py --model_to_run claude_reasoning
The script will save the reasoning results to the location expected by the evaluation scripts.
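The reasoning scripts need API credentials at runtime. Here is a minimal sketch of providing them through a .env file, assuming the standard OPENAI_API_KEY and ANTHROPIC_API_KEY variable names that the openai and anthropic Python clients read by default (check the scripts under model_scripts/ for how keys are actually loaded):

```python
# .env file in the repo root (values are placeholders):
#   OPENAI_API_KEY=...
#   ANTHROPIC_API_KEY=...

from dotenv import load_dotenv
from openai import OpenAI
from anthropic import Anthropic

load_dotenv()                # loads the variables from .env into the environment
openai_client = OpenAI()     # picks up OPENAI_API_KEY automatically
claude_client = Anthropic()  # picks up ANTHROPIC_API_KEY automatically
```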
To evaluate your own model on our benchmark, follow these steps:
- Add your model file (e.g., mymodel.py) under model_scripts/
- Create a get_coordinate(config_data, history, base_dir, output_dir) function that returns the coordinate predicted by your model (see the sketch after this list)
- Add an import to evaluate.py (e.g., from model_scripts import mymodel)
- Change the line that calls the get_coordinate function to match your model file (e.g., x, y = mymodel.get_coordinate(config_data, history_string, base_dir, image_output_dir))
- Run evaluate.py
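A minimal sketch of such a model file; the get_coordinate signature comes from the steps above, but the body is a hypothetical placeholder, since how you query your model (and what config_data contains) depends on your setup and on evaluate.py:

```python
# model_scripts/mymodel.py
import os

def get_coordinate(config_data, history, base_dir, output_dir):
    """Return the (x, y) pixel coordinate predicted for the current step.

    The argument names follow the signature above; see evaluate.py for what is
    actually passed in config_data, history, base_dir, and output_dir.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Hypothetical placeholder: call your own model here and parse its output
    # into pixel coordinates on the current screenshot.
    x, y = 0.0, 0.0
    return x, y
```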
Please cite the paper and star this repo if you find it interesting or useful. Thanks!
@article{ye2025realwebassist,
title={RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users},
author={Ye, Suyu and Shi, Haojun and Shih, Darren and Yun, Hyokun and Roosta, Tanya and Shu, Tianmin},
journal={arXiv preprint arXiv:2504.10445},
year={2025}
}