
DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation




✨ Overview

DR³-Eval is a realistic, reproducible, and multimodal evaluation benchmark for Deep Research Agents, focusing on multi-file report generation tasks.

Existing benchmarks face a fundamental tension between realism, controllability, and reproducibility when evaluating deep research agents. DR³-Eval addresses this through the following design:

  • 🔬 Real User Scenarios: Tasks are constructed from real user-provided multimodal files, covering 3 major domains and 13 sub-domains
  • 📦 Static Sandbox Corpora: An independent static research sandbox is built for each task, containing supportive, distracting, and noisy documents
  • 🎯 Reverse Construction Method: Queries are reverse-engineered from verified evidence documents, eliminating evaluation ambiguity
  • 📊 Multi-dimensional Evaluation: Five dimensions — Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality

Comparison of DR³-Eval with other benchmarks
Figure 1. Comparison of DR³-Eval with existing deep research benchmarks. DR³-Eval supports both user files and sandbox corpora, providing a realistic and reproducible multimodal evaluation environment.


📰 News

  • 📦 HuggingFace Dataset: The DR³-Eval dataset is now available for direct download on HuggingFace!

🏆 Benchmark Comparison

DR³-Eval is the first deep research evaluation benchmark that simultaneously satisfies all of the following: user file input, static sandbox corpora, multimodality, real-world scenarios, multi-file upload, and reverse construction.

Benchmark Comparison
Figure 2. Comprehensive comparison of DR³-Eval with representative benchmarks.


🧩 Framework and Pipeline

The overall framework of DR³-Eval consists of three core components:

  1. 📝 Data Construction: Synthesizes search paths from real multimodal files through a diverge-converge mechanism, establishes static sandboxes with controllable signal-to-noise ratios, and generates queries via reverse engineering
  2. 🤖 DR³-Agent: Hierarchical multi-agent architecture (see next section)
  3. 📊 Evaluation Protocol: A multi-dimensional metric suite that comprehensively evaluates evidence retrieval and report generation performance

Framework Overview
Figure 3. DR³-Eval framework overview. Includes data construction, DR³-Agent multi-agent system, and multi-dimensional evaluation protocol.


🤖 DR³-Agent

To validate the effectiveness of DR³-Eval, we developed DR³-Agent — an LLM-driven multi-agent deep research system built on the MiroFlow framework. Its core architecture is as follows:

  • Main Agent: The reasoning hub of the system, integrating perception tools for video, audio, etc. It maintains global task context, runs a dynamic "Plan-Execute-Observe" loop, and coordinates sub-agents to complete information retrieval tasks
  • RAG Search Sub-agent: Interacts with the static sandbox corpus using an iterative dense retrieval mechanism based on text-embedding-3-small, refining queries under the ReAct paradigm to obtain evidence
  • File Reading Sub-agent: Specializes in parsing long-form user files, supporting keyword queries and page-number-based content retrieval

Sub-agents do not share global state; they only return highly condensed summaries to the main agent to reduce its context burden.
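
The RAG Search Sub-agent's retrieval loop can be sketched as follows. This is a minimal illustration, not the repository's implementation: `embed` stands in for calls to text-embedding-3-small and here uses a toy bag-of-words vector so the example runs without API access, and the corpus strings are hypothetical sandbox documents.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a text-embedding-3-small call: a toy bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank sandbox documents by embedding similarity to the query and
    # return the top-k; under ReAct, the agent would observe these results
    # and issue a refined query on the next iteration.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

corpus = [
    "solar panel efficiency report 2023",
    "history of renewable energy policy",
    "noise document about cooking recipes",
]
print(retrieve("solar panel efficiency", corpus, k=1))
```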


📊 Dataset Statistics

  • 100 independent tasks (50 English + 50 Chinese)
  • 3 major domains, 13 sub-domains
  • 68% of tasks involve multimodal input
  • Average of 2.24 user files per task, up to 6
  • Sandbox corpus contains an average of 465.5 web pages under the 512k configuration

Dataset Statistics
Figure 4. Dataset statistics. (a) Domain distribution. (b) File type distribution. (c) Distribution of user files per task.


📐 Evaluation Metrics

| Dimension | Metric | Description |
|-----------|--------|-------------|
| Information Retrieval | IR (Information Recall) | Coverage of key insights from user files and the sandbox corpus in the report |
| Information Retrieval | CC (Citation Coverage) | Extent to which the report cites the necessary source documents |
| Report Generation | FA (Factual Accuracy) | Factual correctness of cited claims in the report |
| Report Generation | IF (Instruction Following) | Whether the report satisfies all requirements specified in the task query |
| Report Generation | DQ (Depth Quality) | Analytical depth and logical rigor of the report |
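
The per-dimension scores can be combined into a single report-level score. As a sketch only: the aggregation below assumes an unweighted mean over the five dimensions, which is not specified in this README, and the sample scores are made up.

```python
def average_score(scores: dict[str, float]) -> float:
    """Unweighted mean over the five DR3-Eval dimensions (assumed weighting)."""
    dims = ("IR", "CC", "FA", "IF", "DQ")
    return sum(scores[d] for d in dims) / len(dims)

# Hypothetical per-dimension scores for one generated report.
report_scores = {"IR": 60.0, "CC": 65.0, "FA": 70.0, "IF": 75.0, "DQ": 80.0}
print(average_score(report_scores))  # -> 70.0
```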

📈 Experimental Results

Comprehensive evaluation was conducted on 8 state-of-the-art LLMs. Key findings:

  1. DR³-Eval is extremely challenging — the best model, Claude Sonnet 4, achieves an average score of only 65.6 under the 512k setting
  2. Longer context → Lower performance — Noise and distracting information make it difficult for models to locate effective evidence
  3. Instruction following ≠ Factual accuracy — Some models generate reports that "look" complete but contain factual errors
  4. Significant performance variation across domains

Main Experimental Results
Figure 5. Evaluation results of different models under 64k/128k/512k sandbox scales.

Cross-domain Performance Heatmap
Figure 6. Performance heatmap of different models across 13 domains.

📦 More Experimental Results

Scale Analysis
Figure 7. Performance trends under different sandbox corpus scales (32k–512k).

Error Type Analysis
Figure 8. Error type distribution across different models. Hallucination is the primary cause of failure for most models.

Ablation Study
Figure 9. Ablation study comparing long-context vs. RAG approaches.

Sandbox vs. Online Corpus
Figure 10. Performance comparison between static sandbox corpus and real-world web search.

Retriever Analysis
Figure 11. Comparison of different retrieval methods.

Human Evaluation Consistency
Figure 12. Consistency analysis between LLM-as-Judge and human evaluation.


🚀 Quick Start

📥 Dataset Access

The dataset is hosted on HuggingFace and can be downloaded directly.

🔧 Environment Setup

# Install dependencies
uv sync

# Configure environment variables
cp .env.example .env
# Edit .env to fill in API keys (OPENROUTER_API_KEY, etc.)

# Verify installation
uv run python main.py --help

▶️ Running DR³-Agent

# Single task
uv run python main.py run \
    --folder data/datasets_en/001 \
    --query "Analyze the documents and generate a research report." \
    --offline

# Batch tasks
uv run python main.py batch \
    --data-dir data/datasets_en \
    --context-size 128k \
    --llm-config gpt-4 \
    --offline

📊 Evaluation

uv run python eval.py all \
    --result-base results_main/datasets_en \
    --datasets-dir data/datasets_en \
    --workers 4

📝 Citation

If you find this work useful, please cite:

@article{dr3eval2026,
  title={DR3-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author={},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}

🌟 License

This project is licensed under the Apache License 2.0. See LICENSE for details.

🙏 Acknowledgements

The DR³-Agent in this project is built on the MiroFlow framework by MiroMind AI. We extended it with the DR³-Eval evaluation framework, including multi-dimensional report quality metrics, benchmark support, and multi-model comparison capabilities.

📧 Contact

For questions, please reach out via GitHub Issues.


NJU-LINK Team, Nanjing University
