
Commit 0af2d22

committed: Update (kz)
1 parent 5590fd4 commit 0af2d22

File tree

1 file changed (+7 additions, -4 deletions)


index.html

Lines changed: 7 additions & 4 deletions
@@ -75,7 +75,7 @@
 <div class="columns is-centered">
 <div class="column has-text-centered">
 <div style="display: flex; align-items: center; justify-content: center;">
-<img src="./static/images/logo.png" alt="GUI-Actor Logo" style="margin-right: 5px; height: 80px;margin-top: -10px;">
+<img src="./static/images/logo.png" alt="GUI-Actor Logo" style="margin-right: 15px; height: 80px;margin-top: -10px;">
 <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</h1>
 </div>
 <div class="is-size-5 publication-authors">
@@ -164,7 +164,7 @@ <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Groun
 <br>
 <!-- <h2 class="subtitle has-text-centered"> -->
 <p style="font-size: 100%"><strong>Figure 1.</strong>
-<strong>Left: </strong> Model performance vs. training data scale on the ScreenSpot-Pro benchmark. Higher and more left is better; larger points indicate models with more parameters.
+<strong>Left: </strong> Model performance vs. training data scale on the ScreenSpot-Pro. Higher and more left is better; larger points indicate models with more parameters.
 <strong>Right: </strong> Illustration of action attention. GUI-Actor grounds target elements by attending to the most relevant visual regions.
 </p>
 
@@ -200,9 +200,12 @@ <h3 class="title is-3" style="padding-bottom: 20px;"><span>Key Takeaways</span><
 <div class="content has-text-justified">
 <p>🤔 <strong>There are several <span style="color: rgb(182, 30, 30);">intrinsic limitations</span> in the existing <span style="color: rgb(182, 30, 30);">coordinate-generation based methods</span> (i.e., output screen positions as text tokens x=..., y=...) for GUI grounding:</strong></p>
 <ul>
-<li><i>Spatial-semantic alignment is weak</i>: generating discrete coordinate tokens requires the model to implicitly map visual inputs to numeric outputs via a language modeling head, without any explicit spatial inductive bias. This process is inefficient, data-intensive, and prone to errors due to the lack of direct supervision linking visual features to action locations.</li>
+<!-- <li><i>Spatial-semantic alignment is weak</i>: generating discrete coordinate tokens requires the model to implicitly map visual inputs to numeric outputs via a language modeling head, without any explicit spatial inductive bias. This process is inefficient, data-intensive, and prone to errors due to the lack of direct supervision linking visual features to action locations.</li>
 <li><i>Supervision signals are ambiguous</i>: many GUI actions, such as clicking within a button, allow for a range of valid target positions. However, coordinate-based methods typically treat the task as single-point prediction, penalizing all deviations—even reasonable ones—and failing to capture the natural ambiguity of human interaction.</li>
-<li><i>Granularity mismatch between vision and action space</i>: while coordinates are continuous and high-resolution, vision models like Vision Transformers (ViTs) operate on patch-level features. This mismatch forces the model to infer dense, pixel-level actions from coarse visual tokens, which undermines generalization to diverse screen layouts and resolutions.</li>
+<li><i>Granularity mismatch between vision and action space</i>: while coordinates are continuous and high-resolution, vision models like Vision Transformers (ViTs) operate on patch-level features. This mismatch forces the model to infer dense, pixel-level actions from coarse visual tokens, which undermines generalization to diverse screen layouts and resolutions.</li> -->
+<li><i>Spatial-semantic alignment is weak</i>: Mapping visual inputs to numeric coordinates via language modeling lacks explicit spatial inductive bias.</li>
+<li><i>Single-point supervision is ambiguous</i>: Simplifying GUI element boxes to single click points penalizes valid deviations and makes training inefficient.</li>
+<li><i>Granularity mismatch between vision and action space</i>: Patch-level features (ViT) vs. fine-grained coordinates lead to a generalization gap.</li>
 </ul>
 
 <p>💡 <strong>Rethink how humans interact with digital interfaces: <span style="color: rgb(182, 30, 30);">humans do NOT calculate precise screen coordinates before acting—they perceive the target element and interact with it directly.</span></strong></p>
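The contrast the revised bullets draw — emitting coordinate tokens versus attending over patch-level visual features — can be sketched in a few lines. This is a hypothetical toy illustration, not the GUI-Actor implementation: a 4x4 grid of one-hot patch embeddings stands in for ViT features, and the "grounding" step is a softmax over patches rather than a text-token coordinate prediction.

```python
import numpy as np

# Toy sketch (assumption: one-hot patch embeddings, dot-product attention;
# not the GUI-Actor code). Grounding selects a patch-level region by
# attention instead of generating "x=..., y=..." coordinate tokens.

grid_h = grid_w = 4
patches = np.eye(grid_h * grid_w)          # one embedding per visual patch
query = patches[9].copy()                  # action query matching patch 9

def ground_by_attention(query, patches, grid_w):
    """Softmax-attend over patches and return the (row, col) of the peak."""
    scores = patches @ query               # dot-product attention logits
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # attention distribution over patches
    best = int(weights.argmax())           # most-attended region, not a pixel
    return best // grid_w, best % grid_w

print(ground_by_attention(query, patches, grid_w))  # -> (2, 1)
```

Note how the output is a patch cell rather than a continuous coordinate, which matches the granularity of the visual features and sidesteps the single-point supervision issue described above.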
