SOTA Computer Use Benchmark Survey #213

@Locke0

Description

Survey state-of-the-art computer control benchmarks for next Thursday

  • Focus on newer, more complex benchmarks beyond Mind2Web
  • Communicate on this ticket and document findings in the Coda doc

Start by checking whether industry standards have changed. You can try alphaXiv's AI search engine, Hugging Face trending papers, Gemini/ChatGPT deep research, etc. Also look at the SOTA CUA models on OSWorld, ScreenSpot-Pro, and other leaderboards to see which benchmarks those models were evaluated on.

Last I checked, many GUI-agent papers still use OSWorld, ScreenSpot-Pro, and OSWorld-G. There are also text-only or hybrid GUI+API CUA models that are evaluated on many more benchmarks, such as tau^2, WebShop, etc.

A few interesting directions: GUI-Robust (I wonder whether they release any eval sets; I think there is also a benchmark paper focused on robustness), Mind2Web-Live (uses a reward model; I'm not sure how many SOTA CUA models have actually been evaluated on it), etc.

Think about what would be beneficial to include in the survey. Off the top of my head: data size, domain, distribution, format, innovations compared to previous benchmarks, and SOTA models' performance on each benchmark.
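One way to keep the survey consistent across benchmarks is to fix a schema up front that mirrors the dimensions above. A minimal Python sketch (the class name, field names, and all example values are hypothetical, not from any paper):

```python
from dataclasses import dataclass, field

# Hypothetical schema for one survey row; fields mirror the dimensions
# listed above: data size, domain, format, innovations, SOTA performance.
@dataclass
class BenchmarkEntry:
    name: str
    data_size: int                # number of tasks/episodes (verify against the paper)
    domains: list[str]            # e.g. ["web", "desktop OS"]
    task_format: str              # e.g. "screenshot + action trace"
    innovations: str              # what it adds over prior benchmarks
    sota_scores: dict[str, float] = field(default_factory=dict)  # model -> success rate

# Illustrative entry; the score below is a placeholder, not a real result.
entry = BenchmarkEntry(
    name="OSWorld",
    data_size=369,
    domains=["desktop OS"],
    task_format="real OS environment, execution-based evaluation",
    innovations="executable, verifiable tasks across real applications",
)
entry.sota_scores["some-cua-model"] = 0.42  # placeholder, fill in from leaderboards
```

A flat schema like this also exports cleanly to the Coda doc as a table, one row per benchmark.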

I might drop some good benchmark papers in this ticket if any come to mind.

Metadata

Labels

research, experiments, papers
