|
75 | 75 | <div class="columns is-centered"> |
76 | 76 | <div class="column has-text-centered"> |
77 | 77 | <div style="display: flex; align-items: center; justify-content: center;"> |
78 | | - <img src="./static/images/logo.png" alt="GUI-Actor Logo" style="margin-right: 5px; height: 80px;margin-top: -10px;"> |
| 78 | + <img src="./static/images/logo.png" alt="GUI-Actor Logo" style="margin-right: 15px; height: 80px;margin-top: -10px;"> |
79 | 79 | <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents</h1> |
80 | 80 | </div> |
81 | 81 | <div class="is-size-5 publication-authors"> |
@@ -164,7 +164,7 @@ <h1 class="title is-2 publication-title">GUI-Actor: Coordinate-Free Visual Groun |
164 | 164 | <br> |
165 | 165 | <!-- <h2 class="subtitle has-text-centered"> --> |
166 | 166 | <p style="font-size: 100%"><strong>Figure 1.</strong> |
167 | | - <strong>Left: </strong> Model performance vs. training data scale on the ScreenSpot-Pro benchmark. Higher and more left is better; larger points indicate models with more parameters. |
| 167 | + <strong>Left: </strong> Model performance vs. training data scale on ScreenSpot-Pro. Higher and further left is better; larger points indicate models with more parameters. |
168 | 168 | <strong>Right: </strong> Illustration of action attention. GUI-Actor grounds target elements by attending to the most relevant visual regions. |
169 | 169 | </p> |
170 | 170 |
|
@@ -200,9 +200,12 @@ <h3 class="title is-3" style="padding-bottom: 20px;"><span>Key Takeaways</span>< |
200 | 200 | <div class="content has-text-justified"> |
201 | 201 | <p>🤔 <strong>There are several <span style="color: rgb(182, 30, 30);">intrinsic limitations</span> in the existing <span style="color: rgb(182, 30, 30);">coordinate-generation based methods</span> (i.e., methods that output screen positions as text tokens x=..., y=...) for GUI grounding:</strong></p> |
202 | 202 | <ul> |
203 | | - <li><i>Spatial-semantic alignment is weak</i>: generating discrete coordinate tokens requires the model to implicitly map visual inputs to numeric outputs via a language modeling head, without any explicit spatial inductive bias. This process is inefficient, data-intensive, and prone to errors due to the lack of direct supervision linking visual features to action locations.</li> |
| 203 | + <!-- <li><i>Spatial-semantic alignment is weak</i>: generating discrete coordinate tokens requires the model to implicitly map visual inputs to numeric outputs via a language modeling head, without any explicit spatial inductive bias. This process is inefficient, data-intensive, and prone to errors due to the lack of direct supervision linking visual features to action locations.</li> |
204 | 204 | <li><i>Supervision signals are ambiguous</i>: many GUI actions, such as clicking within a button, allow for a range of valid target positions. However, coordinate-based methods typically treat the task as single-point prediction, penalizing all deviations—even reasonable ones—and failing to capture the natural ambiguity of human interaction.</li> |
205 | | - <li><i>Granularity mismatch between vision and action space</i>: while coordinates are continuous and high-resolution, vision models like Vision Transformers (ViTs) operate on patch-level features. This mismatch forces the model to infer dense, pixel-level actions from coarse visual tokens, which undermines generalization to diverse screen layouts and resolutions.</li> |
| 205 | + <li><i>Granularity mismatch between vision and action space</i>: while coordinates are continuous and high-resolution, vision models like Vision Transformers (ViTs) operate on patch-level features. This mismatch forces the model to infer dense, pixel-level actions from coarse visual tokens, which undermines generalization to diverse screen layouts and resolutions.</li> --> |
| 206 | + <li><i>Spatial-semantic alignment is weak</i>: Mapping visual inputs to numeric coordinates via language modeling lacks explicit spatial inductive bias.</li> |
| 207 | + <li><i>Single-point supervision is ambiguous</i>: Simplifying GUI element boxes to single click points penalizes valid deviations and makes training inefficient.</li> |
| 208 | + <li><i>Granularity mismatch between vision and action space</i>: Patch-level features from ViTs are coarse while coordinates are fine-grained, creating a generalization gap.</li> |
206 | 209 | </ul> |
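The attention-based alternative sketched in Figure 1 can be illustrated in a few lines. This is a hedged, illustrative Python sketch only, not the paper's implementation: the function name `ground_by_attention`, the toy patch embeddings, the query vector, and the 14-pixel patch size are all assumptions made for the example.

```python
import numpy as np

def ground_by_attention(patch_feats, query, grid_w, patch_size=14):
    """Hypothetical sketch of action attention: score visual patches
    against an action query and softmax over them, instead of
    generating coordinate tokens with a language-model head."""
    # patch_feats: (N, D) visual patch embeddings; query: (D,) action query
    scores = patch_feats @ query / np.sqrt(patch_feats.shape[1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    idx = int(weights.argmax())          # most-attended patch
    row, col = divmod(idx, grid_w)
    # Return the patch centre in pixels: a region-level answer,
    # sidestepping the patch-vs-pixel granularity mismatch above.
    return (col + 0.5) * patch_size, (row + 0.5) * patch_size, weights

# Toy example: a 2x2 patch grid where the second patch matches the query.
feats = np.eye(4, 8)                     # 4 patches, 8-dim embeddings
q = np.zeros(8); q[1] = 1.0              # query aligned with patch 1
x, y, w = ground_by_attention(feats, q, grid_w=2)
```

Because the softmax spreads probability mass over all plausible patches, any point inside the target region can be rewarded, which is one way to address the single-point supervision ambiguity noted above.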
207 | 210 |
|
208 | 211 | <p>💡 <strong>Rethink how humans interact with digital interfaces: <span style="color: rgb(182, 30, 30);">humans do NOT calculate precise screen coordinates before acting—they perceive the target element and interact with it directly.</span></strong></p> |
|