Personalized Deep Research Bench (PDR-Bench) is the first benchmark to systematically evaluate personalization in Deep Research Agents (DRAs).
It pairs 50 real-world deep-research tasks across 10 domains with 25 authentic user profiles, yielding 250 personalized task–user pairs.
Evaluation is conducted via the PQR Framework, which jointly measures:
- Personalization Alignment (P) – user–task fit
- Content Quality (Q) – report depth and reasoning
- Factual Reliability (R) – factual soundness and citation integrity
| Component | Description |
|---|---|
| Tasks | 50 complex research tasks across 10 domains (Education, Career, Health, Finance, etc.) |
| User Profiles | 25 authentic, structured personas with dynamic contexts (age, occupation, lifestyle, financial traits, etc.) |
| Queries | 250 paired task–user scenarios simulating realistic deep research interactions |
| Languages | Chinese and English |
The PQR Evaluation Framework provides a holistic, user-centered methodology for assessing Personalized Deep Research (PDR) reports along three complementary axes:
Personalization Alignment (P) evaluates how well a report aligns with the user persona and task through:
- Dynamic Weighting: an LLM meta-evaluator assigns relative importance to the personalization dimensions
- Granular Criteria Generation: creates task- and persona-specific sub-criteria for detailed assessment
- Dimension Scoring: computes a weighted score across four key dimensions (a minimal scoring sketch follows this list):
  - Goal Alignment – Alignment with the user's goals
  - Content Alignment – Relevance and depth suited to the user's background
  - Presentation Fit – Style and structure matching user preferences
  - Actionability – Practical value and usefulness of insights
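The repository's eval_personalization.py computes the P score; its exact implementation is not reproduced here. Below is a minimal sketch of the weighted-scoring step, assuming the meta-evaluator returns per-dimension weights and an LLM judge returns per-dimension scores on a 0-10 scale (all function and variable names are illustrative, not the repo's API):

```python
# Minimal sketch of weighted dimension scoring (illustrative; not the repo's exact code).
from typing import Dict

P_DIMENSIONS = ["goal_alignment", "content_alignment", "presentation_fit", "actionability"]

def personalization_score(dim_scores: Dict[str, float], dim_weights: Dict[str, float]) -> float:
    """Combine per-dimension judge scores (0-10) with meta-evaluator weights.

    Weights are normalized so the aggregated P score stays on the same 0-10 scale.
    """
    total_weight = sum(dim_weights[d] for d in P_DIMENSIONS)
    return sum(dim_scores[d] * dim_weights[d] for d in P_DIMENSIONS) / total_weight

# Example: a persona/task pair where actionability is weighted most heavily.
weights = {"goal_alignment": 0.3, "content_alignment": 0.2,
           "presentation_fit": 0.1, "actionability": 0.4}
scores = {"goal_alignment": 6.5, "content_alignment": 5.8,
          "presentation_fit": 7.0, "actionability": 4.9}
print(round(personalization_score(scores, weights), 2))  # 5.77
```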
Content Quality (Q) assesses the intrinsic quality of report writing and reasoning, independent of personalization:
- Dynamic Criteria Generation: produces task-specific sub-criteria for quality assessment (an illustrative criteria-generation sketch follows this list)
- Weighted Scoring across three key dimensions:
  - Depth & Insight – Analytical richness and originality
  - Logical Coherence – Rigor and flow of reasoning
  - Clarity & Readability – Language fluency and presentation quality
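generate_criteria.py produces the dimension weights and sub-criteria, with its actual prompts under code/prompt/. The sketch below only illustrates the general pattern of asking an LLM meta-evaluator for task-specific criteria as JSON; the prompt wording, model name, and output schema are assumptions, not the benchmark's own:

```python
# Illustrative criteria-generation call (prompt, model name, and schema are assumptions).
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_quality_criteria(task: str) -> dict:
    """Ask an LLM meta-evaluator for task-specific sub-criteria and weights."""
    prompt = (
        "You are a meta-evaluator for deep research reports.\n"
        f"Task: {task}\n"
        "Return JSON with keys 'depth_insight', 'logical_coherence', 'clarity_readability'.\n"
        "Each key maps to {'weight': float, 'sub_criteria': [str, ...]}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; use whichever judge model you configure
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```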
Factual Reliability (R) evaluates factual correctness and evidence grounding through automated verification:
- Claim Extraction & Deduplication: identifies unique verifiable statements and their citations
- Automated Verification: uses retrieval and LLM judgment to check factual support
- Citation Metrics (see the metric sketch after this list):
  - Factual Accuracy (FA) – Proportion of claims supported by sources
  - Citation Coverage (CC) – Proportion of total claims properly cited
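Once claims have been extracted and verified, FA and CC reduce to simple ratios. A minimal sketch, assuming each claim is a dict with a "cited" flag and a "supported" verification verdict (field names and exact denominators are illustrative; the benchmark's own reliability scoring lives in code/utils/):

```python
# Minimal sketch of the citation metrics (field names and denominators are illustrative).
from typing import Dict, List

def citation_metrics(claims: List[Dict]) -> Dict[str, float]:
    """FA = share of claims supported by sources; CC = share of claims carrying a citation."""
    if not claims:
        return {"factual_accuracy": 0.0, "citation_coverage": 0.0}
    supported = sum(1 for c in claims if c.get("supported"))
    cited = sum(1 for c in claims if c.get("cited"))
    return {
        "factual_accuracy": supported / len(claims),
        "citation_coverage": cited / len(claims),
    }

claims = [
    {"text": "X rose 12% in 2024", "cited": True, "supported": True},
    {"text": "Y was founded in 1998", "cited": True, "supported": False},
    {"text": "Z leads the market", "cited": False, "supported": False},
]
print(citation_metrics(claims))  # FA ≈ 0.33, CC ≈ 0.67
```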
Together, P, Q, and R form a unified framework balancing personalization, quality, and reliability for robust DRA evaluation.
We benchmarked a broad range of systems, including:
| Category | Representative Models |
|---|---|
| Commercial Deep Research Agents | Gemini 2.5-Pro Deep Research, O3 Deep Research, Perplexity Deep Research |
| Open-Source DRAs | OAgents, DeerFlow, MiroFlow |
| LLMs + Search Tools | GPT-4.1-Search, Claude 3.7-Sonnet-Search, Perplexity Sonar Reasoning Pro |
| Memory Systems | Mem0, Memory OS, O-Mem |
| Model | Overall | GOAL | CONT | PRES | ACTI | DEIN | LOGC | CLAR | FA | CC |
|---|---|---|---|---|---|---|---|---|---|---|
| Commercial Deep Research Agents | – | – | – | – | – | – | – | – | – | – |
| Gemini-2.5-Pro Deep Research | 6.58 | 5.27 | 5.78 | 5.83 | 4.56 | 5.32 | 6.13 | 6.16 | 8.40 | 9.26 |
| O3 Deep Research | 6.11 | 5.67 | 5.95 | 5.57 | 5.10 | 5.68 | 6.40 | 5.58 | 6.84 | 7.14 |
| Perplexity Deep Research | 5.99 | 4.69 | 4.93 | 4.72 | 4.33 | 4.93 | 5.43 | 4.68 | 7.68 | 9.02 |
| Open-Source Deep Research Agents | – | – | – | – | – | – | – | – | – | – |
| OAgents | 6.64 | 6.68 | 6.44 | 7.13 | 6.92 | 6.99 | 7.44 | 6.85 | 3.77 | 8.32 |
| DeerFlow | 5.30 | 5.20 | 4.97 | 6.71 | 5.41 | 5.43 | 6.25 | 6.44 | 6.85 | 2.32 |
| MiroFlow | 5.78 | 6.65 | 6.45 | 7.03 | 6.65 | 6.53 | 7.31 | 6.68 | 7.29 | 0.44 |
| LLM with Search Tools | – | – | – | – | – | – | – | – | – | – |
| Gemini-2.5-Pro w/Search | 5.53 | 4.85 | 5.20 | 5.61 | 4.19 | 4.54 | 5.57 | 5.41 | 6.99 | 6.62 |
| Claude-3.7-Sonnet w/Search | 4.83 | 4.27 | 4.24 | 5.43 | 4.28 | 4.26 | 5.09 | 5.34 | 8.27 | 2.37 |
| Perplexity-Sonar-Reasoning-Pro | 5.02 | 4.27 | 4.37 | 5.27 | 4.15 | 4.22 | 5.03 | 5.23 | 8.44 | 3.67 |
| GPT-4.1 w/Search | 4.28 | 4.59 | 4.86 | 5.74 | 4.07 | 4.21 | 5.27 | 5.54 | 6.75 | 0.10 |

Column abbreviations: GOAL = Goal Alignment, CONT = Content Alignment, PRES = Presentation Fit, ACTI = Actionability, DEIN = Depth & Insight, LOGC = Logical Coherence, CLAR = Clarity & Readability, FA = Factual Accuracy, CC = Citation Coverage.
Key Findings
- Open-source Agents: Achieve the strongest personalization, with OAgents scoring 6.64 and leading in GOAL, PRES, and LOGC. MiroFlow performs well in CONT and FA, but overall reliability is a weakness (e.g., OAgents' low FA of 3.77, and poor citation coverage in MiroFlow/DeerFlow).
- Commercial Agents: Provide more balanced quality and reliability, with Gemini-2.5-Pro Deep Research leading in FA and CC while maintaining strong quality scores. O3 Deep Research excels in personalization within this category but slightly lags in overall performance compared to open-source agents.
- LLMs with Search Tools: Underperform compared to specialized agents. While some models like Perplexity-Sonar-Reasoning-Pro achieve high FA (8.44), they fall short in CC and personalization. For example, GPT-4.1 w/Search almost fails in CC (0.10), indicating that adding search alone is insufficient to match the personalization and quality of dedicated deep research agents.
```
PersonaDeepResearchBench/
├── code/
│   ├── prompt/                     # Prompt templates
│   ├── utils/                      # Reliability scoring and tool functions
│   ├── eval_personalization.py     # P: personalization scoring
│   ├── eval_quality.py             # Q: quality scoring
│   └── generate_criteria.py        # Dimension weights and criteria generation
├── data/
│   ├── criteria_data/              # Criteria data for evaluation
│   ├── persona_data/               # 25 personas
│   │   ├── personas_en.jsonl
│   │   └── personas_zh.jsonl
│   ├── prompt_data/                # User-task pairs (queries)
│   │   ├── queries150_en.jsonl     # 150 English queries
│   │   ├── queries150_zh.jsonl     # 150 Chinese queries
│   │   ├── queries250_en.jsonl     # 250 English queries
│   │   └── queries250_zh.jsonl     # 250 Chinese queries
│   ├── task_data/
│   │   ├── tasks_en.jsonl          # 50 English deep research tasks
│   │   └── tasks_zh.jsonl          # 50 Chinese deep research tasks
│   ├── test_data/
│   │   ├── cleaned_data/           # Cleaned article data
│   │   └── raw_data/               # ← Put your TARGET_MODEL outputs here (model_name.jsonl)
│   └── results/
│       ├── output_logs/            # Output logs during evaluation
│       ├── overall/                # P, Q, R overall scores and final overall score
│       ├── personalization/        # Personalization scoring results
│       ├── quality/                # Quality scoring results
│       └── reliability/            # Reliability scoring results
├── README.md                       # This file
├── requirements.txt                # Dependencies
└── run_eval.sh                     # Evaluation pipeline; set TARGET_MODEL, then run
```
- Python: 3.9+
- OpenAI API key
- Jina API key
```bash
conda create -n pdrbench python=3.10
conda activate pdrbench
pip install -r requirements.txt

# Set up OpenAI API key
export OPENAI_API_KEY=<your_api_key>

# Set up BASE_URL (optional)
export BASE_URL=<base_url>

# Set up Jina API key
export JINA_API_KEY=<your_api_key>
```

Run your DRA on the benchmark queries and save outputs in the required format:
Input: Use queries from data/prompt_data/queries150_zh.jsonl.
Each line contains a JSON object with a "query" field – this is the input text you should feed into your DRA.
Output: Save results to data/test_data/raw_data/<model_name>.jsonl
Target format (each line should contain):

```json
{
  "id": "query id",
  "language": "task and user profile language",
  "taskid": "task id",
  "userid": "user id",
  "article": "generated deep research article"
}
```

For detailed technical instructions and parameters, see README_EN.md.
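A minimal sketch of this step is shown below; run_dra is a stand-in for your own agent, and input fields other than "query" are read defensively since only "query" is guaranteed by this README:

```python
# Sketch: feed benchmark queries to your DRA and write outputs in the expected format.
# run_dra is a placeholder for your own agent; only the output fields follow this README.
import json
from pathlib import Path

def run_dra(query: str) -> str:
    """Replace this stub with a call to your deep research agent."""
    raise NotImplementedError("plug in your DRA here")

in_path = Path("data/prompt_data/queries150_zh.jsonl")
out_path = Path("data/test_data/raw_data/my_model.jsonl")  # <model_name>.jsonl

with in_path.open(encoding="utf-8") as fin, out_path.open("w", encoding="utf-8") as fout:
    for line in fin:
        item = json.loads(line)
        record = {
            "id": item.get("id"),
            "language": item.get("language", "zh"),
            "taskid": item.get("taskid"),
            "userid": item.get("userid"),
            "article": run_dra(item["query"]),  # the generated deep research report
        }
        fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```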
Run the evaluation pipeline:

```bash
bash run_eval.sh
```

Evaluation results will be saved to:

```
data/results/
├── output_logs/
├── overall/
├── personalization/
├── quality/
└── reliability/
```

Our implementation builds upon DeepResearch Bench, with significant modifications and extensions to adapt it to the Personalized Deep Research Bench (PDR-Bench) project. We sincerely thank the original authors for their excellent work.
If you find this work useful, please star the repo and cite our paper.
```bibtex
@misc{liang2025personalizeddeepresearchbenchmarks,
  title={Towards Personalized Deep Research: Benchmarks and Evaluations},
  author={Yuan Liang and Jiaxian Li and Yuqing Wang and Piaohong Wang and Motong Tian and Pai Liu and Shuofei Qiao and Runnan Fang and He Zhu and Ge Zhang and Minghao Liu and Yuchen Eleanor Jiang and Ningyu Zhang and Wangchunshu Zhou},
  year={2025},
  eprint={2509.25106},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.25106},
}
```
