DR³-Eval is a realistic, reproducible, and multimodal evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks.
Existing benchmarks face a fundamental tension between realism, controllability, and reproducibility when evaluating deep research agents. DR³-Eval addresses this through the following design:
- 🔬 Real User Scenarios: Tasks are constructed from real user-provided multimodal files, covering 3 major domains and 13 sub-domains
- 📦 Static Sandbox Corpora: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
- 🎯 Reverse Construction Method: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
- 📊 Multi-dimensional Evaluation: Five dimensions — Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality
Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.
- 📦 HuggingFace Dataset: The DR³-Eval dataset is now available on HuggingFace and can be downloaded directly.
DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.
Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.
The overall framework of DR³-Eval consists of three core components:
- 📝 Data Construction: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
- 🤖 DR³-Agent: Hierarchical multi-agent architecture (see next section)
- 📊 Evaluation Protocol: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance
Figure 3. DR³-Eval framework overview. Includes data construction, DR³-Agent multi-agent system, and multi-dimensional evaluation protocol.
To validate the effectiveness of DR³-Eval, we developed DR³-Agent — an LLM-driven multi-agent deep research system built on the MiroFlow framework. Its core architecture is as follows:
- Main Agent: The reasoning hub of the system, integrating perception tools for video, audio, etc. It maintains global task context, runs a dynamic "Plan-Execute-Observe" loop, and coordinates sub-agents to complete information retrieval tasks
- RAG Search Sub-agent: Interacts with the static sandbox corpus using an iterative dense retrieval mechanism based on text-embedding-3-small, refining queries under the ReAct paradigm to obtain evidence
- File Reading Sub-agent: Specializes in parsing long-form user files, supporting keyword queries and page-number-based content retrieval
Sub-agents do not share global state; they only return highly condensed summaries to the main agent to reduce its context burden.
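To make the retrieval loop concrete, here is a minimal sketch of how such an iterative dense-retrieval sub-agent could work. It is not DR³-Agent's actual implementation: the helper names, the refinement prompt, the `gpt-4o-mini` judge model, and the fixed round count are all illustrative assumptions; only the embedding model name comes from the description above.

```python
# Illustrative sketch of iterative dense retrieval with ReAct-style query
# refinement. Not DR³-Agent's actual code: helper names, prompts, and the
# gpt-4o-mini model choice are assumptions; text-embedding-3-small is taken
# from the description above. Assumes OPENAI_API_KEY is set.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
    # Cosine similarity via normalized dot products.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return list(np.argsort(d @ q)[::-1][:k])

def refine_query(question: str, evidence: list[str]) -> str:
    # ReAct-style "thought" step: ask an LLM to propose the next search query.
    prompt = (f"Research question: {question}\n\nEvidence so far:\n"
              + "\n---\n".join(evidence)
              + "\n\nReply with one refined search query targeting missing evidence.")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def retrieve(question: str, docs: list[str], rounds: int = 3) -> list[str]:
    doc_vecs = embed(docs)  # in practice, precomputed once per sandbox corpus
    query, evidence = question, []
    for _ in range(rounds):
        for i in top_k(embed([query])[0], doc_vecs):
            if docs[i] not in evidence:
                evidence.append(docs[i])
        query = refine_query(question, evidence)
    return evidence  # the real sub-agent returns a condensed summary instead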
- 100 independent tasks (50 English + 50 Chinese)
- 3 major domains, 13 sub-domains
- 68% of tasks involve multimodal input
- Average of 2.24 user files per task, up to 6
- Sandbox corpus contains an average of 465.5 web pages under the 512k configuration
Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.
| Dimension | Metric | Description |
|---|---|---|
| Information Retrieval | IR (Information Recall) | Coverage of key insights from user files and sandbox corpus in the report |
| Information Retrieval | CC (Citation Coverage) | Extent to which the report cites necessary source documents |
| Report Generation | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| Report Generation | IF (Instruction Following) | Whether the report satisfies all requirements specified in the task query |
| Report Generation | DQ (Depth Quality) | Analytical depth and logical rigor of the report |
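As an illustration of how these dimension scores might roll up into the single average scores reported below, here is a hedged sketch assuming uniform weighting across the five dimensions; the benchmark's actual aggregation scheme may differ.

```python
# Hypothetical aggregation of the five dimension scores into one report-level
# average. Uniform weighting is an illustrative assumption, not necessarily
# the scheme used by DR³-Eval.
from statistics import mean

DIMENSIONS = ("IR", "CC", "FA", "IF", "DQ")

def overall_score(scores: dict[str, float]) -> float:
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)

# Example with made-up numbers:
print(overall_score({"IR": 70.2, "CC": 61.5, "FA": 68.0, "IF": 72.3, "DQ": 55.8}))
```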
Comprehensive evaluation was conducted on 8 state-of-the-art LLMs. Key findings:
- DR³-Eval is extremely challenging: the best model, Claude Sonnet 4, achieves an average score of only 65.6 under the 512k configuration
- Longer context → lower performance: noise and distracting information make it difficult for models to locate effective evidence
- Instruction following ≠ factual accuracy: some models generate reports that "look" complete but contain factual errors
- Significant performance variation across domains
Figure 5. Evaluation results of different models under 64k/128k/512k sandbox scales.
Figure 6. Performance heatmap of different models across 13 domains.
📦 More Experimental Results
Figure 7. Performance trends under different sandbox corpus scales (32k–512k).
Figure 8. Error type distribution across different models. Hallucination is the primary cause of failure for most models.
Figure 9. Ablation study comparing long-context vs. RAG approaches.
Figure 10. Performance comparison between static sandbox corpus and real-world web search.
Figure 11. Comparison of different retrieval methods.
Figure 12. Consistency analysis between LLM-as-Judge and human evaluation.
The dataset is hosted on HuggingFace and can be downloaded directly.
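A minimal download sketch using the huggingface_hub client is shown below; the `repo_id` is a placeholder, so substitute the actual dataset identifier from the HuggingFace page.

```python
# Hedged sketch: download the dataset snapshot with huggingface_hub.
# NOTE: "NJU-LINK/DR3-Eval" is a placeholder repo_id, not confirmed by the
# repo; replace it with the identifier shown on the HuggingFace dataset page.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="NJU-LINK/DR3-Eval",  # placeholder; check the dataset page
    repo_type="dataset",
    local_dir="data",             # matches the data/ paths used in quick start
)
```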
```bash
# Install dependencies
uv sync

# Configure environment variables
cp .env.example .env
# Edit .env to fill in API keys (OPENROUTER_API_KEY, etc.)

# Verify installation
uv run python main.py --help
```

```bash
# Single task
uv run python main.py run \
    --folder data/datasets_en/001 \
    --query "Analyze the documents and generate a research report." \
    --offline

# Batch tasks
uv run python main.py batch \
    --data-dir data/datasets_en \
    --context-size 128k \
    --llm-config gpt-4 \
    --offline
```

```bash
uv run python eval.py all \
    --result-base results_main/datasets_en \
    --datasets-dir data/datasets_en \
    --workers 4
```

If you find this work useful, please cite:

```bibtex
@article{dr3eval2026,
  title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}
```

This project is licensed under the Apache License 2.0. See LICENSE for details.
The DR³-Agent in this project is built on the MiroFlow framework by MiroMind AI. We extended it with the DR³-Eval evaluation framework, including multi-dimensional report quality metrics, benchmark support, and multi-model comparison capabilities.
For questions, please reach out via GitHub Issues.
NJU-LINK Team, Nanjing University