# Grounding Verifier for GUI-Actor

We developed a grounding verifier to assess whether a selected action position aligns with a given language instruction. This model is particularly effective for GUI-Actor, as GUI-Actor's attention map produces diverse candidate positions from a single inference. With the verifier, we can efficiently evaluate actions **in hindsight**—after identifying the chosen position on the image—and make more informed decisions.

<img src="https://cdn-uploads.huggingface.co/production/uploads/64d45451c34a346181b130dd/1LTBORYJsO9Ru6B4q_SKl.png" alt="image" width="500"/>

## Training

The verifier is trained to take a language instruction and an image (with a red circle marking the candidate position) as input, and predict whether the position is correct—outputting "True" or "False."
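
The input construction can be sketched as follows. This is a minimal, hypothetical illustration: the marker radius, outline width, and prompt wording here are assumptions, not the exact settings used in `verifier_data_generation.py`.

```python
from PIL import Image, ImageDraw


def mark_candidate(image: Image.Image, x: int, y: int, radius: int = 12) -> Image.Image:
    """Return a copy of `image` with a red circle drawn at the candidate position."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    # Outline-only circle so the underlying UI element stays visible.
    draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                 outline=(255, 0, 0), width=3)
    return marked


def build_prompt(instruction: str) -> str:
    # Hypothetical prompt template; the verifier answers "True" or "False".
    return (f"Instruction: {instruction}\n"
            "Does the position marked by the red circle correctly fulfill "
            "this instruction? Answer True or False.")
```
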

We use the [OS-Atlas dataset](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) and process it with `verifier_data_generation.py` to curate training data. The model is fine-tuned via supervised fine-tuning (SFT) starting from the [UI-TARS-2B-SFT](https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT) checkpoint, which provides strong performance at a relatively small model size.

### Data Preparation

To prepare the dataset:

1. Download and unzip the [OS-Atlas dataset](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) following the instructions on the Hugging Face page.
2. Organize the images into the following directory structure:

```python
image_folder_dict = {
    'windows_splited': f'{root_path}/desktop_domain/windows_images',
    'linux_splited': f'{root_path}/desktop_domain/linux_images',
    'macos_splited': f'{root_path}/desktop_domain/macos_images',
    'widget_captioning': f'{root_path}/mobile_domain/combined',
    'uibert_raw': f'{root_path}/mobile_domain/UIBert',
    'ricosca': f'{root_path}/mobile_domain/combined',
    'amex_raw': f'{root_path}/mobile_domain/amex_images',
    'seeclick_web': f'{root_path}/web_domain/seeclick_web_imgs',
    'fineweb_3m': f'{root_path}/web_domain/fineweb'
}
```

Each training sample includes a positive and one or more negative examples:

* **Positive samples**: taken directly from the original dataset, with a red circle marking the correct target.
* **Negative samples**: created by either (a) selecting another meaningful UI element or (b) randomly sampling a point, which may not correspond to any actionable item.
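
The negative-sampling strategies above can be sketched as follows. `sample_negative` and its arguments are hypothetical names for illustration; the actual logic lives in `verifier_data_generation.py`.

```python
import random


def sample_negative(elements, correct_idx, width, height, p_element=0.5):
    """Return one negative candidate position for a training sample.

    Mirrors the two strategies: (a) the center of another labeled UI
    element, or (b) a uniformly random screen point. `elements` is a list
    of (x0, y0, x1, y1) boxes; `correct_idx` marks the ground-truth element.
    """
    distractors = [b for i, b in enumerate(elements) if i != correct_idx]
    if distractors and random.random() < p_element:
        x0, y0, x1, y1 = random.choice(distractors)
        return ((x0 + x1) // 2, (y0 + y1) // 2)  # another meaningful element
    return (random.randint(0, width - 1), random.randint(0, height - 1))
```
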

To generate the dataset, run the following commands (the full dataset is very large, so `--selected_size` lets you cap the number of samples drawn per domain):

```bash
python verifier_data_generation.py --root_path ${path_to_OS-Atlas-data} --new_directory ${save_path} --file_dict_key desktop_domain --selected_size 30000
python verifier_data_generation.py --root_path ${path_to_OS-Atlas-data} --new_directory ${save_path} --file_dict_key mobile_domain --selected_size 30000
python verifier_data_generation.py --root_path ${path_to_OS-Atlas-data} --new_directory ${save_path} --file_dict_key web_domain --selected_size 30000
```
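
Each generated sample pairs a marked screenshot with the instruction and a binary label. The record below is purely illustrative; the field names and file path are hypothetical, and the exact schema is whatever `verifier_data_generation.py` emits.

```python
import json

# Hypothetical layout of one training record (illustrative only).
sample = {
    "image": "desktop_domain/windows_images/example.png",  # screenshot with red circle
    "instruction": "click the OK button",
    "label": "True",  # "True" for positives, "False" for negatives
}

print(json.dumps(sample))
```
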


### SFT

We use the official code from [Aguvis](https://github.com/xlang-ai/aguvis) for SFT training. Make sure the file paths are set correctly in the `stage1.yaml` configuration. We use [**UI-TARS-2B-SFT**](https://huggingface.co/ByteDance-Seed/UI-TARS-2B-SFT) as the base model, training for one epoch with a learning rate of $2 \times 10^{-5}$.


## Evaluation

We evaluate our method using the attention weights generated by GUI-Actor and the grounding verifier, saved in a JSON file (e.g., `screenspot_all_preds_Original.json`). Before running the evaluation scripts, please update the file paths in `run_ss_v1.sh`, `run_ss_v2.sh`, and `run_ss_pro.sh` accordingly.

Make sure to download the ScreenSpot datasets and ensure their paths exactly match those specified in the shell scripts. Specifically, download **ScreenSpot** and **ScreenSpot-Pro** from [ss-v1](https://huggingface.co/datasets/rootsautomation/ScreenSpot) and [ss-pro](https://huggingface.co/datasets/likaixin/ScreenSpot-Pro), respectively.
For **ScreenSpot-v2**, we provide a converted version (`ScreenSpot-v2-new`) that aligns with the format used by the other datasets. However, you still need to download the original images from [ss-v2](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2).
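
When reranking GUI-Actor's candidate positions, the verifier's judgment reduces to a two-way softmax over the logits of its "True" and "False" answers. Below is a minimal sketch, assuming per-candidate logits are available (e.g., parsed from the saved predictions); `verifier_score` and `pick_best` are illustrative names, not functions from the evaluation scripts.

```python
import math


def verifier_score(logit_true: float, logit_false: float) -> float:
    """Two-way softmax: probability the marked position is judged correct."""
    m = max(logit_true, logit_false)  # subtract max for numerical stability
    e_t = math.exp(logit_true - m)
    e_f = math.exp(logit_false - m)
    return e_t / (e_t + e_f)


def pick_best(candidates):
    """candidates: list of ((x, y), logit_true, logit_false) tuples.

    Return the position the verifier rates most likely correct.
    """
    return max(candidates, key=lambda c: verifier_score(c[1], c[2]))[0]
```
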


Once everything is set up, run the following commands:

```bash
bash run_ss_v1.sh
bash run_ss_v2.sh
bash run_ss_pro.sh
```