
Commit 89f9fff

Qianhui Wu authored and committed
Update arXiv link.
1 parent da0eaba commit 89f9fff

File tree

1 file changed (+4, −4 lines)


index.html

Lines changed: 4 additions & 4 deletions

@@ -110,12 +110,12 @@ <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Groun
 
   <!-- arXiv Link. -->
   <span class="link-block">
-    <a target="_blank" href="https://www.arxiv.org/pdf/2502.13130"
+    <a target="_blank" href="https://www.arxiv.org/pdf/2506.03143"
       class="external-link button is-normal is-rounded is-dark">
       <span class="icon">
         <i class="ai ai-arxiv"></i>
       </span>
-      <span>ArXiv</span>
+      <span>Paper</span>
     </a>
   </span>
   <span class="link-block">
@@ -129,7 +129,7 @@ <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Groun
   </span>
   <!-- Dataset Link. -->
   <span class="link-block">
-    <a href="https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-Vl"
+    <a href="https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL"
       class="external-link button is-normal is-rounded is-dark">
       <span class="icon">
         <img src="static/images/hf_icon.svg" />
@@ -182,7 +182,7 @@ <h3 class="title is-3" style="padding: 0 0 0 0;">Abstract</h3>
 <p style="font-size: 100%">
 One of the principal challenges in building VLM-powered GUI agents is visual grounding—localizing the appropriate screen region for action execution based on both the visual content and the textual plans. Most existing work formulates this as a text-based coordinate generation task. However, these approaches suffer from several limitations: weak spatial-semantic alignment due to the lack of explicit spatial supervision; inability to handle ambiguous supervision targets, as single-point predictions penalize valid variations; and a mismatch between the dense nature of screen coordinates and the coarse, patch-level granularity of visual features extracted by models like Vision Transformers. In this paper, we propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <i>&lt;ACTOR&gt;</i> token with all relevant visual patch tokens, enabling the model to propose one or more action regions in a single forward pass. In line with this, we further design a grounding verifier to evaluate and select the most plausible action region from the candidates proposed for action execution.
 Extensive experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks, with improved generalization to unseen screen resolutions and layouts.
-Notably GUI-Actor-7B (40.7) even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, with much fewer parameters and training data.
+Notably, GUI-Actor-7B even surpasses UI-TARS-72B (38.1) on ScreenSpot-Pro, achieving scores of 40.7 with Qwen2-VL and 44.6 with Qwen2.5-VL as backbones.
 Furthermore, by incorporating the verifier, we find that fine-tuning only the newly introduced action head (~100M parameters for 7B model) while keeping the VLM backbone frozen is sufficient to achieve performance comparable to previous state-of-the-art models, highlighting that GUI-Actor can endow the underlying VLM with effective grounding capabilities without compromising its general-purpose strengths.
 </p>
 </div>
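The abstract in the diff above describes an attention-based action head that aligns a dedicated &lt;ACTOR&gt; token with visual patch tokens, so that attention weights themselves score candidate action regions. A minimal, hypothetical sketch of that scoring idea follows; all names, shapes, and projections here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: score visual patches by scaled dot-product attention
# from a dedicated <ACTOR> token. Names/shapes are illustrative assumptions.
import numpy as np

def action_head_scores(actor_vec, patch_vecs, w_q, w_k):
    """Return a probability over patches from <ACTOR>-to-patch attention."""
    q = actor_vec @ w_q                       # project actor token to a query
    k = patch_vecs @ w_k                      # project patch tokens to keys
    logits = k @ q / np.sqrt(q.shape[-1])     # scaled dot-product attention
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    return weights / weights.sum()

rng = np.random.default_rng(0)
d = 16                                        # toy embedding dimension
actor = rng.normal(size=d)                    # <ACTOR> token embedding
patches = rng.normal(size=(64, d))            # 64 patch tokens, e.g. 8x8 grid
w_q = rng.normal(size=(d, d)) * 0.1           # learned projections (random here)
w_k = rng.normal(size=(d, d)) * 0.1

scores = action_head_scores(actor, patches, w_q, w_k)
best_patch = int(scores.argmax())             # highest-scoring action region
```

Because the output is a distribution over patches rather than a single coordinate, several high-weight patches can be proposed at once, which is the property the paper's grounding verifier then exploits to pick the most plausible candidate.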
