Towards Personalized Deep Research: Benchmarks and Evaluations


📖 Overview

Personalized Deep Research Bench (PDR-Bench) is the first benchmark to systematically evaluate personalization in Deep Research Agents (DRAs).
It pairs 50 real-world deep-research tasks across 10 domains with 25 authentic user profiles, yielding 250 personalized task–user pairs.

Evaluation is conducted via the PQR Framework, which jointly measures:

  • 🎯 Personalization Alignment (P) – user–task fit
  • 🧠 Content Quality (Q) – report depth and reasoning
  • 🔍 Factual Reliability (R) – factual soundness and citation integrity

🧩 Benchmark Composition

Data Construction Pipeline

| Component | Description |
| --- | --- |
| Tasks | 50 complex research tasks across 10 domains (Education, Career, Health, Finance, etc.) |
| User Profiles | 25 authentic, structured personas with dynamic contexts (age, occupation, lifestyle, financial traits, etc.) |
| Queries | 250 paired task–user scenarios simulating realistic deep research interactions |
| Languages | Chinese and English |

🧮 PQR Evaluation Framework

Framework Overview

The PQR Evaluation Framework provides a holistic, user-centered methodology for assessing Personalized Deep Research (PDR) reports along three complementary axes:


🎯 P – Personalization Alignment

Evaluates how well a report aligns with the user persona and task through:

  • Dynamic Weighting: An LLM meta-evaluator assigns relative importance to the personalization dimensions
  • Granular Criteria Generation: Creates task- and persona-specific sub-criteria for detailed assessment
  • Dimension Scoring: Computes a weighted score across four key dimensions (a minimal scoring sketch follows this list):
    • 🎯 Goal Alignment – Alignment with the user’s goals
    • 📚 Content Alignment – Relevance and depth suited to the user’s background
    • ✍️ Presentation Fit – Style and structure matching user preferences
    • 🧩 Actionability – Practical value and usefulness of insights
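
To make the weighted scoring concrete, here is a minimal sketch in Python. The dimension names come from the list above; the weights, sub-scores, and function names are placeholders for illustration only, and the actual implementation lives in code/eval_personalization.py and code/generate_criteria.py.

```python
# Illustrative only: hard-coded weights and sub-scores stand in for the
# LLM meta-evaluator and per-criterion judging used in PDR-Bench.
P_DIMENSIONS = ["goal_alignment", "content_alignment", "presentation_fit", "actionability"]

def p_score(dim_scores: dict[str, float], dim_weights: dict[str, float]) -> float:
    """Weighted average over the four personalization dimensions (0-10 scale assumed)."""
    total_weight = sum(dim_weights[d] for d in P_DIMENSIONS)
    return sum(dim_weights[d] * dim_scores[d] for d in P_DIMENSIONS) / total_weight

weights = {"goal_alignment": 0.35, "content_alignment": 0.30,
           "presentation_fit": 0.15, "actionability": 0.20}   # e.g. assigned by the meta-evaluator
scores = {"goal_alignment": 6.5, "content_alignment": 5.8,
          "presentation_fit": 7.0, "actionability": 6.0}      # e.g. averaged over sub-criteria
print(f"P = {p_score(scores, weights):.2f}")                  # weighted P score (≈ 6.27 here)
```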

🧠 Q – Content Quality

Assesses the intrinsic quality of report writing and reasoning, independent of personalization:

  • Dynamic Criteria Generation: Produces task-specific sub-criteria for quality assessment (see the sketch after this list)
  • Weighted Scoring across three key dimensions:
    • 💡 Depth & Insight – Analytical richness and originality
    • 🔗 Logical Coherence – Rigor and flow of reasoning
    • 📖 Clarity & Readability – Language fluency and presentation quality
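
As a rough illustration of the dynamic criteria generation step, the sketch below asks an LLM judge for task-specific sub-criteria through the OpenAI API. The prompt wording, model name, and JSON schema are assumptions made for this example; the prompts actually used by the benchmark are in code/prompt/, and the scoring logic is in code/generate_criteria.py and code/eval_quality.py.

```python
# Illustrative sketch of dynamic criteria generation; not the repository's actual prompts.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (a custom endpoint can be passed via base_url=...)

def generate_quality_criteria(task: str, model: str = "gpt-4o") -> list[dict]:
    """Ask an LLM judge to propose task-specific sub-criteria for the three Q dimensions."""
    prompt = (
        "You are a meta-evaluator for deep research reports.\n"
        f"Research task: {task}\n\n"
        "For each dimension (Depth & Insight, Logical Coherence, Clarity & Readability), "
        "propose three task-specific sub-criteria. Answer with a JSON object of the form "
        '{"criteria": [{"dimension": "...", "criterion": "..."}]}.'
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)["criteria"]

# Example: criteria = generate_quality_criteria("Compare retirement savings strategies for freelancers")
```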

๐Ÿ” R โ€” Factual Reliability

Evaluates factual correctness and evidence grounding through automated verification:

  • Claim Extraction & Deduplication: Identifies unique verifiable statements and their citations
  • Automated Verification: Uses retrieval and LLM judgment to check factual support
  • Citation Metrics (computed as in the example after this list):
    • ✅ Factual Accuracy (FA) – Proportion of claims supported by sources
    • 🔗 Citation Coverage (CC) – Proportion of total claims properly cited
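
Once claims have been extracted and verified, FA and CC reduce to simple ratios. The toy example below assumes both are computed over all unique claims and that a claim may be judged supported even when it carries no citation; the exact definitions and scaling used by the benchmark are in the paper and code/utils/.

```python
# Toy example of computing FA and CC from verified claims (illustrative definitions).
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    cited: bool       # the claim carries a citation
    supported: bool   # retrieval + LLM judgment found factual support for it

def reliability_metrics(claims: list[Claim]) -> tuple[float, float]:
    """Return (Factual Accuracy, Citation Coverage) over all unique claims."""
    if not claims:
        return 0.0, 0.0
    fa = sum(c.supported for c in claims) / len(claims)
    cc = sum(c.cited for c in claims) / len(claims)
    return fa, cc

claims = [
    Claim("example claim 1", cited=True, supported=True),
    Claim("example claim 2", cited=True, supported=True),
    Claim("example claim 3", cited=False, supported=True),
    Claim("example claim 4", cited=False, supported=False),
]
fa, cc = reliability_metrics(claims)
print(f"FA = {fa:.2f}, CC = {cc:.2f}")   # FA = 0.75, CC = 0.50
```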

Together, P, Q, and R form a unified framework balancing personalization, quality, and reliability for robust DRA evaluation.

🧪 Experiments

Evaluated Systems

We benchmarked a broad range of systems, including:

| Category | Representative Models |
| --- | --- |
| Commercial Deep Research Agents | Gemini 2.5-Pro Deep Research, O3 Deep Research, Perplexity Deep Research |
| Open-Source DRAs | OAgents, DeerFlow, MiroFlow |
| LLMs + Search Tools | GPT-4.1-Search, Claude 3.7-Sonnet-Search, Perplexity Sonar Reasoning Pro |
| Memory Systems | Mem0, Memory OS, O-Mem |

Main Results

| Model | Overall | GOAL | CONT | PRES | ACTI | DEIN | LOGC | CLAR | FA | CC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **Commercial Deep Research Agents** | | | | | | | | | | |
| Gemini-2.5-Pro Deep Research | 6.58 | 5.27 | 5.78 | 5.83 | 4.56 | 5.32 | 6.13 | 6.16 | 8.40 | 9.26 |
| O3 Deep Research | 6.11 | 5.67 | 5.95 | 5.57 | 5.10 | 5.68 | 6.40 | 5.58 | 6.84 | 7.14 |
| Perplexity Deep Research | 5.99 | 4.69 | 4.93 | 4.72 | 4.33 | 4.93 | 5.43 | 4.68 | 7.68 | 9.02 |
| **Open-Source Deep Research Agents** | | | | | | | | | | |
| OAgents | 6.64 | 6.68 | 6.44 | 7.13 | 6.92 | 6.99 | 7.44 | 6.85 | 3.77 | 8.32 |
| DeerFlow | 5.30 | 5.20 | 4.97 | 6.71 | 5.41 | 5.43 | 6.25 | 6.44 | 6.85 | 2.32 |
| MiroFlow | 5.78 | 6.65 | 6.45 | 7.03 | 6.65 | 6.53 | 7.31 | 6.68 | 7.29 | 0.44 |
| **LLMs with Search Tools** | | | | | | | | | | |
| Gemini-2.5-Pro w/Search | 5.53 | 4.85 | 5.20 | 5.61 | 4.19 | 4.54 | 5.57 | 5.41 | 6.99 | 6.62 |
| Claude-3.7-Sonnet w/Search | 4.83 | 4.27 | 4.24 | 5.43 | 4.28 | 4.26 | 5.09 | 5.34 | 8.27 | 2.37 |
| Perplexity-Sonar-Reasoning-Pro | 5.02 | 4.27 | 4.37 | 5.27 | 4.15 | 4.22 | 5.03 | 5.23 | 8.44 | 3.67 |
| GPT-4.1 w/Search | 4.28 | 4.59 | 4.86 | 5.74 | 4.07 | 4.21 | 5.27 | 5.54 | 6.75 | 0.10 |

Column key: GOAL, CONT, PRES, and ACTI are the P dimensions (Goal Alignment, Content Alignment, Presentation Fit, Actionability); DEIN, LOGC, and CLAR are the Q dimensions (Depth & Insight, Logical Coherence, Clarity & Readability); FA and CC are Factual Accuracy and Citation Coverage (R).

Key Findings

  • Open-source Agents: Achieve the strongest personalization, with OAgents scoring 6.64 and leading in GOAL, PRES, and LOGC. MiroFlow performs well in CONT and FA, but overall reliability is a weakness (e.g., OAgents' low FA of 3.77, and poor citation coverage in MiroFlow/DeerFlow).
  • Commercial Agents: Provide more balanced quality and reliability, with Gemini-2.5-Pro Deep Research leading in FA and CC while maintaining strong quality scores. O3 Deep Research excels in personalization within this category but slightly lags in overall performance compared to open-source agents.
  • LLMs with Search Tools: Underperform compared to specialized agents. While some models like Perplexity-Sonar-Reasoning-Pro achieve high FA (8.44), they fall short in CC and personalization. For example, GPT-4.1 w/Search almost fails in CC (0.10), indicating that adding search alone is insufficient to match the personalization and quality of dedicated deep research agents.

๐Ÿ“ Repository Structure

PersonaDeepResearchBench/
├── code/
│   ├── prompt/                   # Prompt templates
│   ├── utils/                    # Reliability scoring and tool functions
│   ├── eval_personalization.py   # P: personalization scoring
│   ├── eval_quality.py           # Q: quality scoring
│   └── generate_criteria.py      # Dimension weight and criteria generation
├── data/
│   ├── criteria_data/            # Criteria data for evaluation
│   ├── persona_data/             # 25 personas
│   │   ├── personas_en.jsonl
│   │   └── personas_zh.jsonl
│   ├── prompt_data/              # User–task pairs (queries)
│   │   ├── queries150_en.jsonl   # 150 English queries
│   │   ├── queries150_zh.jsonl   # 150 Chinese queries
│   │   ├── queries250_en.jsonl   # 250 English queries
│   │   └── queries250_zh.jsonl   # 250 Chinese queries
│   ├── task_data/
│   │   ├── tasks_en.jsonl        # 50 English deep research tasks
│   │   └── tasks_zh.jsonl        # 50 Chinese deep research tasks
│   ├── test_data/
│   │   ├── cleaned_data/         # Cleaned article data
│   │   └── raw_data/             # ← Put your TARGET_MODEL outputs here (model_name.jsonl)
│   └── results/
│       ├── output_logs/          # Output logs during evaluation
│       ├── overall/              # P, Q, R overall scores and final overall score
│       ├── personalization/      # Personalization scoring results
│       ├── quality/              # Quality scoring results
│       └── reliability/          # Reliability scoring results
├── README.md                     # This file
├── requirements.txt              # Dependencies
└── run_eval.sh                   # Evaluation pipeline; set TARGET_MODEL, then run

🚀 Quick Start

Prerequisites

  • Python: 3.9+
  • OpenAI API key
  • Jina API key

Environment Setup

conda create -n pdrbench python=3.10
conda activate pdrbench
pip install -r requirements.txt

API Configuration

# Set your OpenAI API key
export OPENAI_API_KEY=<your_api_key>

# Optionally set a custom base URL
export BASE_URL=<base_url>

# Set your Jina API key
export JINA_API_KEY=<your_api_key>

Prepare Your Data

Run your DRA on the benchmark queries and save outputs in the required format:

Input: Use a query file from data/prompt_data/ (e.g., queries150_zh.jsonl; English and 250-query variants are also provided). Each line contains a JSON object with a "query" field – this is the input text to feed to your DRA.

Output: Save results to data/test_data/raw_data/<model_name>.jsonl

Target format (each line is a JSON object with the fields below; a minimal conversion sketch follows the example):

{
    "id": "query id", 
    "language": "task and user profile language",
    "taskid": "task id",
    "userid": "user id",
    "article": "generated deep research article"
}
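
Below is a minimal sketch of producing such a file. run_my_dra is a hypothetical stand-in for your own agent, and the id/language/taskid/userid fields are assumed to be present in the query file alongside "query" – check your copy of the data and adjust the key names if they differ.

```python
# Minimal sketch: run a DRA over the benchmark queries and write the target JSONL.
import json
from pathlib import Path

def run_my_dra(query: str) -> str:
    """Hypothetical placeholder: call your deep research agent and return the report text."""
    return f"(replace with your agent's report for: {query})"

queries_path = Path("data/prompt_data/queries150_zh.jsonl")
output_path = Path("data/test_data/raw_data/my_agent.jsonl")   # <model_name>.jsonl

with queries_path.open(encoding="utf-8") as fin, output_path.open("w", encoding="utf-8") as fout:
    for line in fin:
        record = json.loads(line)
        article = run_my_dra(record["query"])    # the "query" field is the DRA input
        fout.write(json.dumps({
            "id": record.get("id"),              # assumed to exist in the query file
            "language": record.get("language"),
            "taskid": record.get("taskid"),
            "userid": record.get("userid"),
            "article": article,
        }, ensure_ascii=False) + "\n")
```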

Run Evaluation

Set TARGET_MODEL in run_eval.sh to the <model_name> of your output file, then run:

bash run_eval.sh

For detailed technical instructions and parameters, see README_EN.md.

Evaluation results will be saved to:

data/results/
├── output_logs/
├── overall/
├── personalization/
├── quality/
└── reliability/

🌻 Acknowledgement

Our implementation builds upon DeepResearch Bench, with significant modifications and extensions to adapt it for the Personalized Deep Research Bench (PDR-Bench). We sincerely thank the original authors for their excellent work.

📚 Citation

If you find this work useful, please ⭐ the repo and cite our paper.

@misc{liang2025personalizeddeepresearchbenchmarks,
      title={Towards Personalized Deep Research: Benchmarks and Evaluations}, 
      author={Yuan Liang and Jiaxian Li and Yuqing Wang and Piaohong Wang and Motong Tian and Pai Liu and Shuofei Qiao and Runnan Fang and He Zhu and Ge Zhang and Minghao Liu and Yuchen Eleanor Jiang and Ningyu Zhang and Wangchunshu Zhou},
      year={2025},
      eprint={2509.25106},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.25106}, 
}
