SOTA Computer Use Benchmark Survey #213

@Locke0

Description

Survey state-of-the-art computer control benchmarks for next Thursday

  • Focus on newer, more complex benchmarks beyond Mind2Web
  • Communicate on this ticket and document findings in the Coda doc

Start by checking whether industry standards have changed. You can try alphaXiv's AI search engine, Hugging Face trending papers, Gemini/ChatGPT deep research, etc. Also look at the SOTA CUA models on OSWorld, ScreenSpot-Pro, and other leaderboards to see which benchmarks those models were evaluated on.

Last I checked, many GUI-agent papers still use OSWorld, ScreenSpot-Pro, and OSWorld-G. There are also text-only or hybrid GUI+API CUA models that are evaluated on many more benchmarks, such as tau^2, WebShop, etc.

A few interesting directions: GUI-Robust (I wonder whether they release any eval sets; I think there is also a benchmark paper focused on robustness), Mind2Web-Live (uses a reward model; I'm not sure how many SOTA CUA models have actually been evaluated on it), etc.

Think about what would be beneficial to include in the survey. Off the top of my head: data size, domain, distribution, format, innovations compared to previous benchmarks, and SOTA models' performance on each benchmark.
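One way to keep the survey consistent across benchmarks is to fix a schema up front that mirrors the dimensions above. A minimal Python sketch (the class name, field names, and all example values are hypothetical, not from any paper):

```python
from dataclasses import dataclass, field

# Hypothetical schema for one survey row; fields mirror the dimensions
# listed above: data size, domain, format, innovations, SOTA performance.
@dataclass
class BenchmarkEntry:
    name: str
    data_size: int                # number of tasks/episodes (verify against the paper)
    domains: list[str]            # e.g. ["web", "desktop OS"]
    task_format: str              # e.g. "screenshot + action trace"
    innovations: str              # what it adds over prior benchmarks
    sota_scores: dict[str, float] = field(default_factory=dict)  # model -> success rate

# Illustrative entry; the score below is a placeholder, not a real result.
entry = BenchmarkEntry(
    name="OSWorld",
    data_size=369,
    domains=["desktop OS"],
    task_format="real OS environment, execution-based evaluation",
    innovations="executable, verifiable tasks across real applications",
)
entry.sota_scores["some-cua-model"] = 0.42  # placeholder, fill in from leaderboards
```

A flat schema like this also exports cleanly to the Coda doc as a table, one row per benchmark.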

I might drop some good benchmark papers in this ticket if any come to mind.

Metadata

Labels

research, experiments, papers
