This repo open-sources the training and evaluation code for AutoGUI, an automatic and scalable GUI annotation pipeline.
TODO:
- AutoGUI-v2 data collected on thousands of mobile apps is coming soon.
- Functionality-grounding-assisted GUI agents are coming soon.
Existing UI annotation methods typically collect data from static UIs, focusing on describing either the visual appearance (e.g., a button beside the navigation bar), element categories (e.g., “menu button”), or brief functions weakly related to the UI context (e.g., “show more information”).
Here, we are thrilled to unveil AutoGUI, a groundbreaking and scalable UI annotation pipeline. AutoGUI can autonomously annotate the contextual functionalities of diverse UI elements at scale, entirely eliminating the need for human experts. This innovation not only accelerates the data collection process but also enhances the depth and accuracy of UI functionality descriptions, opening a new path in the field of UI annotation.
AutoGUI starts by collecting interaction trajectories on Common Crawl websites. Each trajectory step captures all interactable elements and the accessibility tree (AXTree) that briefly outlines the UI structure. The content changes in the AXTrees before and after each interaction are then fed to an open-source LLM (e.g., Llama-3-70B), which predicts functionality annotations for the interacted elements.
This process yields annotations rich in functional semantics, allowing us to curate a GUI dataset that can potentially enhance the GUI understanding capabilities of GUI agents.
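For concreteness, here is a minimal sketch of this annotation step. The prompt wording, helper name, and endpoint are illustrative assumptions rather than the exact code in this repo; it only shows how the interacted element and the before/after AXTrees can be handed to an LLM served behind an OpenAI-compatible API (e.g., vLLM hosting Llama-3-70B):

# Illustrative sketch (not the repo's exact prompting code): annotate one
# interacted element by asking an LLM to compare the AXTrees before/after.
from openai import OpenAI  # any OpenAI-compatible server works, e.g., vLLM

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate_functionality(elem_html, axtree_before, axtree_after):
    prompt = (
        "An element on a web page was clicked.\n"
        f"Element HTML: {elem_html}\n"
        f"Accessibility tree BEFORE the click:\n{axtree_before}\n"
        f"Accessibility tree AFTER the click:\n{axtree_after}\n"
        "Based on the content changes, describe this element's functionality in one sentence."
    )
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content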
You can install the AutoGUI package by cloning the repository and running the following commands:
git clone https://github.com/BraveGroup/AutoGUI
cd AutoGUI
pip install -e .
Please also follow the installation instructions of LLaVA, vLLM==0.7.3, and SGLang if you need to evaluate LLaVA.
Note that Qwen2-VL needs transformers >= 4.47.1 to avoid the M-RoPE bugs, while Qwen2.5-VL needs transformers >= 4.49.0.
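A quick way to confirm which transformers build is active in your environment (the version thresholds come from the note above):

# Sanity-check the installed transformers version against the requirements above.
from packaging import version
import transformers

v = version.parse(transformers.__version__)
if v < version.parse("4.47.1"):
    print("Qwen2-VL needs transformers >= 4.47.1 to avoid the M-RoPE bugs")
if v < version.parse("4.49.0"):
    print("Qwen2.5-VL needs transformers >= 4.49.0")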
We provide 702k functionality grounding/captioning tasks generated by populating task templates with the collected element-functionality pairs. To mitigate the gap between device types, the screenshots are rendered at multiple resolutions to mimic web browsers and mobile devices.
Please view the training data here.
A functionality grounding example:
User: In this web page image, please locate the element as I describe it (with point). This element triggers a user registration process, allowing new users to create a PayPal account and gain access to the platform's services.
Assistant: (91,6)
A functionality captioning example:
User: What happens when you tap position (61,73) on the screen?
Assistant: This element serves as an input field for users to provide their birth date, contributing to the registration process by ensuring that users meet the age requirements for creating a Yahoo account.
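To make the template-population step concrete, here is a minimal sketch. The template strings and helper name are illustrative rather than the exact templates used to build the dataset, but the coordinate convention (integers normalized to 0-100) matches the examples above:

# Illustrative sketch: turn one element-functionality pair into a grounding
# sample and a captioning sample (templates are hypothetical).
GROUNDING_TEMPLATE = (
    "In this web page image, please locate the element as I describe it "
    "(with point). {func}"
)
CAPTIONING_TEMPLATE = "What happens when you tap position ({x},{y}) on the screen?"

def make_samples(func_desc, point):
    x, y = point  # element center, normalized to 0-100
    grounding = {
        "user": GROUNDING_TEMPLATE.format(func=func_desc),
        "assistant": f"({x},{y})",
    }
    captioning = {
        "user": CAPTIONING_TEMPLATE.format(x=x, y=y),
        "assistant": func_desc,
    }
    return grounding, captioning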
We also curate a 2k split for evaluating the functionality grounding capabilities of existing vision-language models (VLMs). This split contains 1k samples at web resolution (1280x720) and 1k at mobile resolution (428x746).
Download this test split on Google Drive.
Each test sample contains:
- image: the GUI screenshot.
- func: the functionality annotation of a target element on the screenshot.
- point: the center point (X, Y) of the target element. Note that the coordinates are normalized to the range 0-100.
- unnormalized_box: the bounding box of the target element in the image coordinate frame.
- elem_text: the displayed or alt text of the element.
- elem_tag: the HTML tag of the element.
- device: the device type of the screenshot.
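A minimal usage sketch, assuming the split is stored as a JSON list with the fields above; the file name, the assumption that image is a file path, and the (left, top, right, bottom) box order are all hypothetical and should be adapted to the downloaded data:

# Sketch: map the 0-100 normalized point back to pixels and compare it with the box.
import json
from PIL import Image

with open("autogui_test_2k.json") as f:  # hypothetical file name
    samples = json.load(f)

sample = samples[0]
img = Image.open(sample["image"])  # assuming "image" is stored as a file path
w, h = img.size

# "point" is the element center (X, Y), normalized to 0-100.
px = sample["point"][0] / 100 * w
py = sample["point"][1] / 100 * h

# "unnormalized_box" is in pixel coordinates; (left, top, right, bottom) is assumed here.
left, top, right, bottom = sample["unnormalized_box"]
print(sample["func"], sample["elem_tag"], sample["device"])
print("center falls inside box:", left <= px <= right and top <= py <= bottom)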
- Prepare Data
After downloading the tar-format data, please generate a JSON file that records all samples with the absolute image paths required by the Qwen-VL model.
For example, the conversations field of each sample must start with a user message that looks like "<img>path/to/autogui_625k/1_web.png</img>\n (instruction)".
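A minimal sketch of this conversion step, assuming the raw annotations carry image, instruction, and answer fields; these field names and file names are hypothetical and should be adapted to the data you actually downloaded:

# Sketch: build a Qwen-VL-style JSON whose conversations embed absolute image paths.
import json, os

image_root = "/abs/path/to/autogui_625k"                # absolute directory of the extracted screenshots
raw = json.load(open("autogui_raw_annotations.json"))   # hypothetical raw annotation file

records = []
for i, item in enumerate(raw):
    img_path = os.path.join(image_root, item["image"])  # hypothetical field names
    records.append({
        "id": f"autogui_{i}",
        "conversations": [
            {"from": "user", "value": f"<img>{img_path}</img>\n{item['instruction']}"},
            {"from": "assistant", "value": item["answer"]},
        ],
    })

with open("autogui_qwenvl_train.json", "w") as f:
    json.dump(records, f, ensure_ascii=False)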
- Finetuning Qwen-VL-Chat
Set the data_path in finetune/finetune_autogui_lora.sh and then run it.
Our evaluation code is adapted from lmms-eval. To evaluate a model on a specific UI grounding benchmark, run this command:
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model autogui \
--model_args pretrained=WebAgent/AutoGUI-Qwen-v0.1-LoRA \
--tasks func_pred_rec,screenspot_rec \
--batch_size 1 \
--log_samples \
--log_samples_suffix autogui_funcpred \
--output_path ./logs/ \
["--limit", "0.01"] \ # For debugging
The evaluation tasks used in our paper include: func_pred_rec, screenspot_rec, screenspot_v2_rec, refexp, motif, vwb_ag, vwb_eg.
The supported models include: autogui, qwen_vl_chat, llava_sglang, llava_hf, deepseek_vl_chat, cogagent, qwen2_vl, ferret_ui, UGround, and OS-ATLAS (please see lmms_eval/models). If autogui is used, the pretrained argument can be either a LoRA model path that contains only the adapter or a merged model path.
Our project code is based on Qwen-VL, SeeClick, and lmms-eval. We thank the authors for their open-source work.

