<p>🤔 <strong>There are several <span style="color: rgb(182, 30, 30);">intrinsic limitations</span> in existing <span style="color: rgb(182, 30, 30);">coordinate-generation-based methods</span> (i.e., outputting screen positions as text tokens x=..., y=...) for GUI grounding:</strong></p>
<ul>
<li><i>Spatial-semantic alignment is weak</i>: generating discrete coordinate tokens requires the model to implicitly map visual inputs to numeric outputs via a language modeling head, without any explicit spatial inductive bias. This process is inefficient, data-intensive, and prone to errors due to the lack of direct supervision linking visual features to action locations.</li>
<li><i>Supervision signals are ambiguous</i>: many GUI actions, such as clicking within a button, allow for a range of valid target positions. However, coordinate-based methods typically treat the task as single-point prediction, penalizing all deviations—even reasonable ones—and failing to capture the natural ambiguity of human interaction.</li>
<li><i>Granularity mismatch between vision and action space</i>: while coordinates are continuous and high-resolution, vision models like Vision Transformers (ViTs) operate on patch-level features. This mismatch forces the model to infer dense, pixel-level actions from coarse visual tokens, which undermines generalization to diverse screen layouts and resolutions.</li>
</ul>
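<p>To make the granularity mismatch concrete, a back-of-the-envelope calculation (assuming a 28-pixel effective patch size, as in ViT encoders that use 14-pixel patches with 2x2 token merging; treat the numbers as illustrative assumptions) compares the patch grid to the pixel coordinate space:</p>

```python
# Illustrative only: how coarse a ViT patch grid is relative to the
# pixel-level coordinate space a coordinate-generation model must emit.
width, height = 1920, 1080   # a common screenshot resolution
patch = 28                   # assumed effective patch size in pixels

cols, rows = width // patch, height // patch
print(f"patch grid: {cols} x {rows} = {cols * rows} visual tokens")
print(f"distinct pixel positions: {width * height:,}")
print(f"each visual token covers ~{patch * patch} pixels")
```

<p>Under these assumptions, a few thousand visual tokens must account for roughly two million candidate pixel positions, which is the gap coordinate generation asks the model to bridge implicitly.</p>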
<p>💡 <strong>Rethink how humans interact with digital interfaces: <span style="color: rgb(182, 30, 30);">humans do NOT calculate precise screen coordinates before acting—they perceive the target element and interact with it directly.</span></strong></p>
<p>🚀 <strong>We propose <span style="color: rgb(182, 30, 30);">GUI-Actor</span>, a VLM enhanced by an <i>action head</i>, to mitigate the above limitations. The attention-based action head not only enables GUI-Actor to perform coordinate-free GUI grounding that more closely aligns with human behavior, but also generates multiple candidate regions in a single forward pass, offering flexibility for downstream modules such as search strategies.</strong></p>
<ul>
<li>We introduce a dedicated <i><ACTOR></i> token as the contextual anchor and adopt an <i>attention-based action head</i> to align the <i><ACTOR></i> token with the most relevant GUI regions by directly attending over visual patch tokens from the screenshot. ✅ Explicit spatial-semantic alignment.</li>
<li>GUI-Actor is trained with <i>multi-patch supervision</i>: all visual patches overlapping the ground-truth bounding box are labeled as positives, and all others as negatives. ✅ Reduced supervision ambiguity and no over-penalization of valid action variations.</li>
<li>GUI-Actor <i>grounds actions directly at the vision module's native spatial resolution</i>. ✅ No granularity mismatch, and more robust generalization to unseen screen resolutions and layouts.</li>
</ul>
<p>➕ <strong>We design a <i>grounding verifier</i> to evaluate and select the most plausible action region among the candidates proposed from the action attention map. We show that this verifier can be easily integrated with other grounding methods for a further performance boost.</strong></p>
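<p>A minimal, dependency-free Python sketch of the two training ingredients described above, the attention-based action head and multi-patch supervision. The function names, list-based "tensors", and the per-patch binary cross-entropy form are illustrative assumptions, not the released implementation:</p>

```python
import math

def action_scores(actor_hidden, patch_hidden):
    """Scaled dot-product scores of the <ACTOR> token against each
    visual patch token; one forward pass yields the whole attention map."""
    d = len(actor_hidden)
    return [sum(a * h for a, h in zip(actor_hidden, patch)) / math.sqrt(d)
            for patch in patch_hidden]

def overlaps(box_a, box_b):
    """True if two (x1, y1, x2, y2) boxes intersect."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def multi_patch_loss(scores, patch_boxes, gt_box):
    """Multi-patch supervision: every patch overlapping the ground-truth
    box is a positive, all remaining patches are negatives (mean BCE)."""
    eps = 1e-9
    total = 0.0
    for s, box in zip(scores, patch_boxes):
        p = 1.0 / (1.0 + math.exp(-s))                 # sigmoid
        y = 1.0 if overlaps(box, gt_box) else 0.0
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(scores)
```

<p>At inference time, the peaks of this score map directly mark candidate actionable regions at the patch grid's native resolution, with no coordinate tokens decoded at all.</p>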
<p>🎯 <strong>GUI-Actor achieves <span style="color: rgb(182, 30, 30);">state-of-the-art performance</span> on multiple GUI action grounding benchmarks with the same Qwen2-VL backbone, demonstrating its effectiveness and generalization to unseen screen resolutions and layouts. Notably, GUI-Actor-7B even surpasses UI-TARS-72B on ScreenSpot-Pro.</strong></p>
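<p>The verifier-based selection described earlier can be sketched as a tiny rerank step over attention-map peaks. Here <code>verifier_score</code> is a hypothetical stand-in for the separate verifier model, not an actual API:</p>

```python
def top_k_candidates(attn, k=3):
    """Indices of the k highest-attention patches (candidate regions)."""
    return sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:k]

def select_action(attn, patch_centers, verifier_score, k=3):
    """Propose the top-k attention peaks, then let the verifier pick one.

    verifier_score: callable (x, y) -> float scoring how plausible it is
    that clicking (x, y) satisfies the instruction (hypothetical stand-in
    for the grounding-verifier model).
    """
    candidates = [patch_centers[i] for i in top_k_candidates(attn, k)]
    return max(candidates, key=lambda pt: verifier_score(*pt))
```

<p>Because the candidates come from a single forward pass, the same rerank step can wrap any grounder that exposes multiple scored proposals, which is how the verifier composes with other grounding methods.</p>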