Factual error in Table 1: Tongyi DeepResearch is a BrowseComp result, not BrowseComp-Plus (#3)

@BeyonderXX

Hi authors,

Table 1 appears to contain a factual benchmark error.

The Tongyi DeepResearch score (44.5) is listed under BrowseComp-Plus, but this number very likely comes from the original BrowseComp instead.

This distinction matters because BrowseComp-Plus is a much easier setting than BrowseComp: it uses a pre-collected webpage set and embedding-based retrieval, so scores are typically much higher than on the original BrowseComp.

In our own reproduction, Tongyi DeepResearch reaches about 68% on BrowseComp-Plus, which is far above the 44.5 reported in the table. That strongly suggests the table is mixing results from two different benchmarks.

So the current comparison is misleading unless this is corrected.

A second issue is that the BrowseComp baselines are incomplete and somewhat outdated. If you want to compare against representative recent open deep-research/search-agent systems, at least the following should be considered:

  • WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
  • REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents

In particular, REDSearcher should not be omitted from a modern BrowseComp comparison. Its open-source data is already strong enough that SFT alone can reach ~37% on BrowseComp.

Resources:

  • REDSearcher SFT: https://huggingface.co/datasets/Zchu/REDSearcher_SFT_10K
  • REDSearcher RL: https://huggingface.co/datasets/Zchu/REDSearcher_RL_1K

I suggest:

  1. Correct the Tongyi DeepResearch row if 44.5 is in fact a BrowseComp number.
  2. Avoid mixing BrowseComp and BrowseComp-Plus results in the same subtable.
  3. Add stronger recent BrowseComp baselines, especially WebSailor-V2 and REDSearcher.
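
To make suggestions 1 and 2 concrete, here is a minimal sketch of a sanity check on the table rows. It is only an illustration: the row layout, the field names, and the 10-point tolerance are my assumptions, and the only numbers used are the two quoted in this issue (44.5 as reported, about 68 in our reproduction).

```python
from collections import defaultdict

# Illustrative row: both scores come from this issue. The field names and
# the row layout are assumptions, not the paper's actual data format.
rows = [
    {
        "system": "Tongyi DeepResearch",
        "benchmark": "BrowseComp-Plus",  # benchmark the table lists it under
        "reported": 44.5,                # score shown in Table 1
        "reproduced": 68.0,              # approximate score from our reproduction
    },
]


def split_by_benchmark(rows):
    """Group rows so that each subtable contains exactly one benchmark."""
    subtables = defaultdict(list)
    for row in rows:
        subtables[row["benchmark"]].append(row)
    return dict(subtables)


def flag_mismatches(rows, tolerance=10.0):
    """Return rows whose reported score differs from a reproduced score
    by more than `tolerance` points (the threshold is an arbitrary assumption)."""
    return [
        row
        for row in rows
        if row.get("reproduced") is not None
        and abs(row["reported"] - row["reproduced"]) > tolerance
    ]


for row in flag_mismatches(rows):
    gap = abs(row["reported"] - row["reproduced"])
    print(f'{row["system"]} on {row["benchmark"]}: gap of {gap:.1f} points')
```

A large gap does not prove which benchmark a number came from, but it is a cheap signal that a row deserves a second look before it goes into a comparison table.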

Right now, Table 1 overstates the fairness of the comparison, because it appears to contain both a benchmark mismatch and a gap where strong recent baselines should be.
