<p>💡 <strong>Rethink how humans interact with digital interfaces: <span style="color: rgb(182, 30, 30);">humans do NOT calculate precise screen coordinates before acting; they perceive the target element and interact with it directly.</span></strong></p>
212
-
<p>🚀 <strong>We propose <span style="color: rgb(182, 30, 30);">GUI-Actor</span>, a VLM-based method for <span style="color: rgb(182, 30, 30);">coordinate-free</span> GUI grounding that <span style="color: rgb(182, 30, 30);">more closely aligns with human behavior</span> while <span style="color: rgb(182, 30, 30);">addressing the above limitations</span>:</strong></p>
212
+
<!-- <p>🚀 <strong>We propose <span style="color: rgb(182, 30, 30);">GUI-Actor</span>, a VLM-based method for <span style="color: rgb(182, 30, 30);">coordinate-free</span> GUI grounding that <span style="color: rgb(182, 30, 30);">more closely aligns with human behavior</span> while <span style="color: rgb(182, 30, 30);">addressing the above limitations</span>:</strong></p>
213
213
<ul>
214
214
<li>We introduce a dedicated <i>&lt;ACTOR&gt;</i> token as the contextual anchor to encode the grounding context by jointly processing visual input and NL instructions, and adopt an <i>attention-based action head</i> to align the <i>&lt;ACTOR&gt;</i> token with the most relevant GUI regions by attending over visual patch tokens from the screenshot. ✅ Explicit spatial-semantic alignment. ✅ The resulting attention map naturally identifies (multiple) actionable regions in a single forward pass, offering flexibility for downstream modules such as search strategies.</li>
215
215
<li>GUI-Actor is trained using <i>multi-patch supervision</i>. All visual patches overlapping with ground-truth bounding boxes are labeled as positives, while others are labeled as negatives. ✅ Reduce supervision signal ambiguity and over-penalization of valid action variations.</li>
216
216
<li>GUI-Actor <i>grounds actions directly at the vision module's native spatial resolution</i>. ✅ Avoid the granularity mismatch and generalize more robustly to unseen screen resolutions and layouts.</li>
217
217
<li>We design a <i>grounding verifier</i> to evaluate and select the most plausible action region from the candidates proposed for action execution. ✅ Can be easily integrated with other grounding methods for further performance boost.</li>
218
+
</ul> -->
219
+
<p>🚀 <strong>We propose <span style="color: rgb(182, 30, 30);">GUI-Actor</span>, a VLM-based method for <span style="color: rgb(182, 30, 30);">coordinate-free</span> GUI grounding that more closely aligns with human behavior while addressing the above limitations:</strong></p>
220
+
<ul>
221
+
<li>We introduce a dedicated <i>&lt;ACTOR&gt;</i> token as the contextual anchor and adopt an <i>attention-based action head</i> to align the <i>&lt;ACTOR&gt;</i> token with the most relevant GUI regions by directly attending over visual patch tokens from the screenshot. ✅ Explicit spatial-semantic alignment.</li>
222
+
<li>GUI-Actor is trained using <i>multi-patch supervision</i>. All visual patches overlapping with ground-truth bounding boxes are labeled as positives, while others are labeled as negatives. ✅ Reduce supervision signal ambiguity and over-penalization of valid action variations.</li>
223
+
<li>GUI-Actor <i>grounds actions directly at the vision module's native spatial resolution</i>. ✅ Avoid the granularity mismatch and generalize more robustly to unseen screen resolutions and layouts.</li>
224
+
<li>We design a <i>grounding verifier</i> to evaluate and select the most plausible action region from the candidates proposed for action execution. ✅ Can be easily integrated with other grounding methods for further performance boost.</li>