This repo open-sources the training and evaluation code for AutoGUI, an automatic and scalable GUI annotation pipeline.
TODO:
- AutoGUI-v2 data collected on thousands of mobile apps is coming soon.
- Functionality-grounding-assisted GUI agents are coming soon.
Existing UI annotation methods typically collect data from static UIs, focusing on describing either the visual appearance (e.g., a button beside the navigation bar), element categories (e.g., “menu button”), or brief functions weakly related to the UI context (e.g., “show more information”).
Here, we are thrilled to unveil AutoGUI, a groundbreaking and scalable UI annotation pipeline. AutoGUI can autonomously annotate the contextual functionalities of diverse UI elements at scale, entirely eliminating the need for human experts. This innovation not only accelerates the data collection process but also enhances the depth and accuracy of UI functionality descriptions, opening a new path in the field of UI annotation.
AutoGUI starts by collecting interaction trajectories on Common Crawl websites. Each trajectory step captures all interactable elements and the accessibility tree (AXTree) that briefly outlines the UI structure. The content changes in the AXTrees before and after each interaction are then fed to an open-source LLM (e.g., Llama-3-70B), which predicts functionality annotations for the interacted elements.
This process yields annotations rich in functional semantics, allowing us to curate a GUI dataset that can potentially enhance the GUI understanding capabilities of GUI agents.
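For concreteness, here is a minimal sketch of this annotation step. The prompt wording, helper name, and endpoint are illustrative assumptions rather than the exact code in this repo; it only shows how the interacted element and the before/after AXTrees can be handed to an LLM served behind an OpenAI-compatible API (e.g., vLLM hosting Llama-3-70B):

# Illustrative sketch (not the repo's exact prompting code): annotate one
# interacted element by asking an LLM to compare the AXTrees before/after.
from openai import OpenAI  # any OpenAI-compatible server works, e.g., vLLM

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def annotate_functionality(elem_html, axtree_before, axtree_after):
    prompt = (
        "An element on a web page was clicked.\n"
        f"Element HTML: {elem_html}\n"
        f"Accessibility tree BEFORE the click:\n{axtree_before}\n"
        f"Accessibility tree AFTER the click:\n{axtree_after}\n"
        "Based on the content changes, describe this element's functionality in one sentence."
    )
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content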
You can install the AutoGUI package by cloning the repository and running the following commands:
git clone https://github.com/BraveGroup/AutoGUI
cd AutoGUI
pip install -e .
Please also follow the installation instructions of LLaVA, vLLM==0.7.3, and SGLang if you need to evaluate LLaVA.
Note that Qwen2-VL needs transformers >= 4.47.1 to avoid the M-RoPE bugs, while Qwen2.5-VL needs transformers >= 4.49.0.
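A quick way to confirm which transformers build is active in your environment (the version thresholds come from the note above):

# Sanity-check the installed transformers version against the requirements above.
from packaging import version
import transformers

v = version.parse(transformers.__version__)
if v < version.parse("4.47.1"):
    print("Qwen2-VL needs transformers >= 4.47.1 to avoid the M-RoPE bugs")
if v < version.parse("4.49.0"):
    print("Qwen2.5-VL needs transformers >= 4.49.0")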
We provide 702k functionality grounding/captioning tasks generated by populating task templates with the collected element-functionality pairs. To mitigate the gap between device types, the screenshots are rendered at multiple resolutions to mimic web browsers and mobile devices.
Please view the training data here.
A functionality grounding example:
User: In this web page image, please locate the element as I describe it (with point). This element triggers a user registration process, allowing new users to create a PayPal account and gain access to the platform's services.
Assistant: (91,6)
A functionality captioning example:
User: What happens when you tap position (61,73) on the screen?
Assistant: This element serves as an input field for users to provide their birth date, contributing to the registration process by ensuring that users meet the age requirements for creating a Yahoo account.
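To make the template-population step concrete, here is a minimal sketch. The template strings and helper name are illustrative rather than the exact templates used to build the dataset, but the coordinate convention (integers normalized to 0-100) matches the examples above:

# Illustrative sketch: turn one element-functionality pair into a grounding
# sample and a captioning sample (templates are hypothetical).
GROUNDING_TEMPLATE = (
    "In this web page image, please locate the element as I describe it "
    "(with point). {func}"
)
CAPTIONING_TEMPLATE = "What happens when you tap position ({x},{y}) on the screen?"

def make_samples(func_desc, point):
    x, y = point  # element center, normalized to 0-100
    grounding = {
        "user": GROUNDING_TEMPLATE.format(func=func_desc),
        "assistant": f"({x},{y})",
    }
    captioning = {
        "user": CAPTIONING_TEMPLATE.format(x=x, y=y),
        "assistant": func_desc,
    }
    return grounding, captioning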
We also curate a 2k split for evaluating the functionality grounding capabilities of existing vision-language models (VLMs). This split contains 1k samples at web resolution (1280x720) and 1k at mobile resolution (428x746).
Download this test split on Google Drive.
Each test sample contains:
- image: the GUI screenshot.
- func: the functionality annotation of a target element on the screenshot.
- point: the center point (X, Y) of the target element. Note that the coordinates are normalized to the range 0-100.
- unnormalized_box: the bounding box of the target element in the image coordinate frame.
- elem_text: the displayed or alt text of the element.
- elem_tag: the HTML tag of the element.
- device: the device type of the screenshot.
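A minimal usage sketch, assuming the split is stored as a JSON list with the fields above; the file name, the assumption that image is a file path, and the (left, top, right, bottom) box order are all hypothetical and should be adapted to the downloaded data:

# Sketch: map the 0-100 normalized point back to pixels and compare it with the box.
import json
from PIL import Image

with open("autogui_test_2k.json") as f:  # hypothetical file name
    samples = json.load(f)

sample = samples[0]
img = Image.open(sample["image"])  # assuming "image" is stored as a file path
w, h = img.size

# "point" is the element center (X, Y), normalized to 0-100.
px = sample["point"][0] / 100 * w
py = sample["point"][1] / 100 * h

# "unnormalized_box" is in pixel coordinates; (left, top, right, bottom) is assumed here.
left, top, right, bottom = sample["unnormalized_box"]
print(sample["func"], sample["elem_tag"], sample["device"])
print("center falls inside box:", left <= px <= right and top <= py <= bottom)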
- Prepare Data
After downloading the tar-format data, please generate a JSON file that records all samples with the absolute image paths required by the Qwen-VL model.
For example, the conversations field of each sample must start with a user message that looks like "<img>path/to/autogui_625k/1_web.png</img>\n (instruction)".
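A minimal sketch of this conversion step, assuming the raw annotations carry image, instruction, and answer fields; these field names and file names are hypothetical and should be adapted to the data you actually downloaded:

# Sketch: build a Qwen-VL-style JSON whose conversations embed absolute image paths.
import json, os

image_root = "/abs/path/to/autogui_625k"                # absolute directory of the extracted screenshots
raw = json.load(open("autogui_raw_annotations.json"))   # hypothetical raw annotation file

records = []
for i, item in enumerate(raw):
    img_path = os.path.join(image_root, item["image"])  # hypothetical field names
    records.append({
        "id": f"autogui_{i}",
        "conversations": [
            {"from": "user", "value": f"<img>{img_path}</img>\n{item['instruction']}"},
            {"from": "assistant", "value": item["answer"]},
        ],
    })

with open("autogui_qwenvl_train.json", "w") as f:
    json.dump(records, f, ensure_ascii=False)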
- Finetuning Qwen-VL-Chat
Set the data_path in finetune/finetune_autogui_lora.sh and then run it.
Our evaluation code is adapted from lmms-eval. To evaluate a model on a specific UI grounding benchmark, run this command:
python3 -m accelerate.commands.launch \
--num_processes=8 \
-m lmms_eval \
--model autogui \
--model_args pretrained=WebAgent/AutoGUI-Qwen-v0.1-LoRA \
--tasks func_pred_rec,screenspot_rec \
--batch_size 1 \
--log_samples \
--log_samples_suffix autogui_funcpred \
--output_path ./logs/ \
["--limit", "0.01"] \ # For debugging
The evaluation tasks used in our paper include: func_pred_rec, screenspot_rec, screenspot_v2_rec, refexp, motif, vwb_ag, vwb_eg.
The supported models include: autogui, qwen_vl_chat, llava_sglang, llava_hf, deepseek_vl_chat, cogagent, qwen2_vl, ferret_ui, UGround, and OS-ATLAS (please see lmms_eval/models). If autogui is used, the pretrained argument can be either a LoRA model path that contains only the adapter or a merged model path.
Our project code is based on Qwen-VL, SeeClick, and lmms-eval. We thank the authors for their open-source work.

