Description
Survey state-of-the-art computer control benchmarks for next Thursday
- Focus on newer, more complex benchmarks beyond Mind2Web
- Communicate on this ticket and document findings in this Coda doc
You can start by checking whether industry standards have changed. Try alphaXiv's AI search engine, Hugging Face trending papers, Gemini/ChatGPT deep research, etc. Also look at SOTA CUA models on the OSWorld, ScreenSpot-Pro, and other SOTA leaderboards, and see which benchmarks those models were evaluated on.
Last I checked, many GUI agent papers still use OSWorld, ScreenSpot-Pro, and OSWorld-G. There are also text-only or hybrid GUI+API CUA models that use many more benchmarks, like tau^2, WebShop, etc.
A few interesting directions: GUI-Robust (wondering if they have any eval sets; I think there is also a benchmark paper focusing on robustness), Mind2Web-Live (uses a reward model; not sure how many SOTA CUA models are actually evaluated on it), etc.
Think about what would be beneficial to include in the survey. Off the top of my head: data size, domain, distribution, format, innovations compared to previous benchmarks, and SOTA models' performance on each benchmark.
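To make the comparison dimensions above concrete, here is a minimal sketch of what one row of the survey table could capture per benchmark. This is just an illustrative Python schema; the class name, field names, and all example values (including the OSWorld row) are hypothetical placeholders, not verified figures.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkEntry:
    """One row of the survey comparison table (fields from this ticket)."""
    name: str
    data_size: str                  # e.g. number of tasks / episodes
    domain: str                     # web, desktop OS, mobile, ...
    distribution: str               # how tasks spread across apps/sites
    data_format: str                # screenshots, accessibility tree, DOM, ...
    innovations: str                # what's new vs. prior benchmarks
    sota_results: dict = field(default_factory=dict)  # model name -> score

# Hypothetical example row -- all values are placeholders to fill in:
example = BenchmarkEntry(
    name="OSWorld",
    data_size="TBD (count tasks from the paper)",
    domain="desktop OS",
    distribution="TBD (apps covered)",
    data_format="TBD (observation/action space)",
    innovations="TBD (vs. prior benchmarks)",
)
example.sota_results["placeholder-cua-model"] = 0.0  # placeholder score
```

Keeping the rows in a structured form like this would also make it easy to export the table into the Coda doc later.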
I might drop some good benchmark papers in this ticket if any come to mind.