
DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation




✨ Overview

DR³-Eval is a realistic, reproducible, and multimodal evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks.

Existing benchmarks face a fundamental tension between realism, controllability, and reproducibility when evaluating deep research agents. DR³-Eval addresses this through the following design:

  • 🔬 Real User Scenarios: Tasks are constructed from real user-provided multimodal files, covering 3 major domains and 13 sub-domains
  • 📦 Static Sandbox Corpora: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
  • 🎯 Reverse Construction Method: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
  • 📊 Multi-dimensional Evaluation: Five dimensions — Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality

Comparison of DR³-Eval with other benchmarks
Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.


📰 News

  • 📦 HuggingFace Dataset: The DR³-Eval dataset is now available for direct download on HuggingFace!

🏆 Benchmark Comparison

DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.

Benchmark Comparison
Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.


🧩 Framework and Pipeline

The overall framework of DR³-Eval consists of three core components:

  1. 📝 Data Construction: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
  2. 🤖 DR³-Agent: Hierarchical multi-agent architecture (see next section)
  3. 📊 Evaluation Protocol: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance

Framework Overview
Figure 3. DR³-Eval framework overview. Includes data construction, DR³-Agent multi-agent system, and multi-dimensional evaluation protocol.


🤖 DR³-Agent

To validate the effectiveness of DR³-Eval, we developed DR³-Agent — an LLM-driven multi-agent deep research system built on the MiroFlow framework. Its core architecture is as follows:

  • Main Agent: The reasoning hub of the system, integrating perception tools for video, audio, etc. It maintains global task context, runs a dynamic "Plan-Execute-Observe" loop, and coordinates sub-agents to complete information retrieval tasks
  • RAG Search Sub-agent: Interacts with the static sandbox corpus using an iterative dense retrieval mechanism based on text-embedding-3-small, refining queries under the ReAct paradigm to obtain evidence
  • File Reading Sub-agent: Specializes in parsing long-form user files, supporting keyword queries and page-number-based content retrieval

Sub-agents do not share global state; they only return highly condensed summaries to the main agent to reduce its context burden.
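
The RAG Search Sub-agent's retrieval loop can be sketched as follows. This is a minimal illustration, not the repository's implementation: `embed` stands in for calls to text-embedding-3-small and here uses a toy bag-of-words vector so the example runs without API access, and the corpus strings are hypothetical sandbox documents.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a text-embedding-3-small call: a toy bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank sandbox documents by embedding similarity to the query and
    # return the top-k; under ReAct, the agent would observe these results
    # and issue a refined query on the next iteration.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

corpus = [
    "solar panel efficiency report 2023",
    "history of renewable energy policy",
    "noise document about cooking recipes",
]
print(retrieve("solar panel efficiency", corpus, k=1))
```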


📊 Dataset Statistics

  • 100 independent tasks (50 English + 50 Chinese)
  • 3 major domains, 13 sub-domains
  • 68% of tasks involve multimodal input
  • Average of 2.24 user files per task, up to 6
  • Sandbox corpus contains an average of 465.5 web pages under the 512k configuration

Dataset Statistics
Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.


📐 Evaluation Metrics

| Dimension | Metric | Description |
|-----------|--------|-------------|
| Information Retrieval | IR (Information Recall) | Coverage of key insights from user files and the sandbox corpus in the report |
| Information Retrieval | CC (Citation Coverage) | Extent to which the report cites the necessary source documents |
| Report Generation | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| Report Generation | IF (Instruction Following) | Whether the report satisfies all requirements specified in the task query |
| Report Generation | DQ (Depth Quality) | Analytical depth and logical rigor of the report |
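
The per-dimension scores can be combined into a single report-level score. As a sketch only: the aggregation below assumes an unweighted mean over the five dimensions, which is not specified in this README, and the sample scores are made up.

```python
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the five DR3-Eval dimensions (assumed weighting)."""
    dims = ("IR", "CC", "FA", "IF", "DQ")
    return sum(scores[d] for d in dims) / len(dims)

# Hypothetical per-dimension scores for one generated report.
report_scores = {"IR": 60.0, "CC": 65.0, "FA": 70.0, "IF": 75.0, "DQ": 80.0}
print(average_score(report_scores))  # -> 70.0
```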

📈 Experimental Results

Comprehensive evaluation was conducted on 8 state-of-the-art LLMs. Key findings:

  1. DR³-Eval is extremely challenging — the best model, Claude Sonnet 4, achieves an average score of only 65.6 under the 512k setting
  2. Longer context → Lower performance — Noise and distracting information make it difficult for models to locate effective evidence
  3. Instruction following ≠ Factual accuracy — Some models generate reports that "look" complete but contain factual errors
  4. Significant performance variation across domains

Main Experimental Results
Figure 5. Evaluation results of different models under 64k/128k/512k sandbox scales.

Cross-domain Performance Heatmap
Figure 6. Performance heatmap of different models across 13 domains.

📦 More Experimental Results

Scale Analysis
Figure 7. Performance trends under different sandbox corpus scales (32k–512k).

Error Type Analysis
Figure 8. Error type distribution across different models. Hallucination is the primary cause of failure for most models.

Ablation Study
Figure 9. Ablation study comparing long-context vs. RAG approaches.

Sandbox vs. Online Corpus
Figure 10. Performance comparison between static sandbox corpus and real-world web search.

Retriever Analysis
Figure 11. Comparison of different retrieval methods.

Human Evaluation Consistency
Figure 12. Consistency analysis between LLM-as-Judge and human evaluation.


🚀 Quick Start

📥 Dataset Access

The dataset is hosted on HuggingFace and can be downloaded directly.

🔧 Environment Setup

# Install dependencies
uv sync

# Configure environment variables
cp .env.example .env
# Edit .env to fill in API keys (OPENROUTER_API_KEY, etc.)

# Verify installation
uv run python main.py --help

▶️ Running DR³-Agent

# Single task
uv run python main.py run \
    --folder data/datasets_en/001 \
    --query "Analyze the documents and generate a research report." \
    --offline

# Batch tasks
uv run python main.py batch \
    --data-dir data/datasets_en \
    --context-size 128k \
    --llm-config gpt-4 \
    --offline

📊 Evaluation

uv run python eval.py all \
    --result-base results_main/datasets_en \
    --datasets-dir data/datasets_en \
    --workers 4

📝 Citation

If you find this work useful, please cite:

@article{dr3eval2026,
  title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}

🌟 License

This project is licensed under the Apache License 2.0. See LICENSE for details.

🙏 Acknowledgements

The DR³-Agent in this project is built on the MiroFlow framework by MiroMind AI. We extended it with the DR³-Eval evaluation framework, including multi-dimensional report quality metrics, benchmark support, and multi-model comparison capabilities.

📧 Contact

For questions, please reach out via GitHub Issues.


NJU-LINK Team, Nanjing University
