Factual error in Table 1: Tongyi DeepResearch is a BrowseComp result, not BrowseComp-Plus #3
Description
Hi authors,
Table 1 appears to contain a factual benchmark error.
Tongyi DeepResearch (44.5) is listed under BrowseComp-Plus, but this number is very likely from BrowseComp, not BrowseComp-Plus.
This distinction matters because BrowseComp-Plus is a much easier setting than BrowseComp: it uses a pre-collected webpage set and embedding-based retrieval, so scores are typically much higher than on the original BrowseComp.
In our own reproduction, Tongyi DeepResearch reaches about 68% on BrowseComp-Plus, which is far above the 44.5 reported in the table. That strongly suggests the table is mixing results from two different benchmarks.
Unless this is corrected, the current comparison is misleading.
A second issue is that the BrowseComp baselines are incomplete and somewhat outdated. If you want to compare against representative recent open deep-research/search-agent systems, at least the following should be considered:
- WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning
- REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
In particular, REDSearcher should not be omitted from a modern BrowseComp comparison. Its open-source data is already strong enough that SFT alone can reach ~37% on BrowseComp.
Resources:
- REDSearcher SFT: https://huggingface.co/datasets/Zchu/REDSearcher_SFT_10K
- REDSearcher RL: https://huggingface.co/datasets/Zchu/REDSearcher_RL_1K
I suggest:
- Correct the Tongyi DeepResearch row if 44.5 is in fact a BrowseComp number.
- Avoid mixing BrowseComp and BrowseComp-Plus results in the same subtable.
- Add stronger recent BrowseComp baselines, especially WebSailor-V2 and REDSearcher.
As it stands, Table 1 presents a comparison that is not apples-to-apples: it appears to mix results from two different benchmarks while also omitting strong recent baselines.