Factual error in Table 1: Tongyi DeepResearch is a BrowseComp result, not BrowseComp-Plus #3
Description
Hi authors,
Table 1 appears to contain a factual benchmark error.
Tongyi DeepResearch (44.5) is listed under BrowseComp-Plus, but this number is very likely from BrowseComp, not BrowseComp-Plus.
This distinction matters because BrowseComp-Plus is a much easier setting than BrowseComp: it uses a pre-collected webpage set and embedding-based retrieval, so scores are typically much higher than on the original BrowseComp.
In our own reproduction, Tongyi DeepResearch reaches about 68% on BrowseComp-Plus, which is far above the 44.5 reported in the table. That strongly suggests the table is mixing results from two different benchmarks.
Unless this is corrected, the current comparison is misleading.
A second issue is that the BrowseComp baselines are incomplete and somewhat outdated. If you want to compare against representative recent open deep-research/search-agent systems, at least the following should be considered:
- WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
- REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
In particular, REDSearcher should not be omitted from a modern BrowseComp comparison. Its open-source data is already strong enough that SFT alone can reach ~37% on BrowseComp.
Resources:
- REDSearcher SFT: https://huggingface.co/datasets/Zchu/REDSearcher_SFT_10K
- REDSearcher RL: https://huggingface.co/datasets/Zchu/REDSearcher_RL_1K
I suggest:
- Correct the Tongyi DeepResearch row if 44.5 is in fact a BrowseComp number.
- Avoid mixing BrowseComp and BrowseComp-Plus results in the same subtable.
- Add stronger recent BrowseComp baselines, especially WebSailor-V2 and REDSearcher.
As it stands, Table 1 presents a comparison that is not apples-to-apples: it appears to mix results from two different benchmarks while also omitting strong recent baselines.