A Discipline-Aware Benchmark for Evaluating Automatic Survey Generation
The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), so they cannot assess whether ASG methods adhere to the distinct standards of other academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on which ASG systems produce high-quality surveys that meet the standards of their own discipline.
SurveyLens is the first discipline-aware benchmark for evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines, and propose a dual-lens evaluation framework:
- Discipline-Aware Rubric Evaluation — Utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards.
- Canonical Alignment Evaluation — Rigorously measures content coverage and synthesis quality against human-written survey papers via embedding similarity.
We conduct extensive experiments evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
Figure 1: The SurveyLens dual-lens evaluation framework. Our approach combines (1) Discipline-Aware Rubric Evaluation with human preference-aligned weighting, and (2) Canonical Alignment Evaluation via embedding-based similarity metrics.
SurveyLens-1k contains 1,000 curated, high-quality human-written survey papers spanning 10 academic disciplines (100 surveys each):
| Biology | Business | Computer Science | Education | Engineering | Env. Science | Medicine | Physics | Psychology | Sociology |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Download: The full dataset is available on Google Drive.
After downloading, extract the data and place the survey files into the results/original/ directory.
SurveyLens/
├── scripts/
│ ├── config/ # Pipeline configuration files
│ │ ├── data_processing_config.json
│ │ ├── eval_qualitative_config.json
│ │ ├── eval_quantitative_config.json
│ │ ├── eval_preference_config.json
│ │ ├── bt_config.json
│ │ ├── apply_bt_weights_config.json
│ │ └── ablation_config.json
│ ├── evaluation/ # Evaluation scripts
│ │ ├── eval_qualitative.py
│ │ ├── eval_quantitative.py
│ │ ├── eval_preference.py
│ │ ├── eval_ablation.py
│ │ ├── bt.py
│ │ ├── apply_bt_weights.py
│ │ ├── analyze_results.py
│ │ └── merge_results.py
│ ├── guideline_generation/ # Discipline-specific criteria generation
│ │ ├── aggregate_aspects.py
│ │ ├── expand_aspects.py
│ │ └── merge_aspects.py
│ ├── utils/ # Utility modules
│ │ └── markdown_to_json.py
│ └── data_processing_pipeline.py # Data preprocessing pipeline
├── outputs/
│ ├── criteria/ # Discipline-specific evaluation criteria
│ └── surveys/ # Sample human-written survey papers
├── results/
│ ├── original/ # Raw survey outputs (organized by system)
│ ├── processed/ # Processed JSON surveys
│ └── evaluation/ # Evaluation results
├── LICENSE
└── README.md
- Python 3.8+
- An OpenAI-compatible LLM API endpoint (for qualitative/preference evaluation)
- An embedding API endpoint (for quantitative evaluation)
pip install openai python-dotenv chromadb numpy scipy tqdm
Create a .env file in the project root with your API credentials:
API_KEY=your_llm_api_key
BASE_URL=https://your-llm-api-endpoint/v1
# For quantitative evaluation (embedding-based)
EMBEDDING_API_KEY=your_embedding_api_key
EMBEDDING_API_BASE=https://your-embedding-api-endpoint/v1
The evaluation pipeline consists of 7 steps, each controlled by a configuration file in scripts/config/. All scripts are run from the project root directory.
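The seven steps can be chained from a small driver script. The sketch below is not part of the repository: it only assembles the commands documented in this README, in pipeline order, and the names build_commands, run_pipeline, and dry_run are illustrative assumptions. Order matters because Step 5 reads the outputs of Steps 2 and 4, and Step 6 reads the weights produced by Step 5.

```python
import subprocess
import sys

# (script, config) pairs in pipeline order, copied from the commands in this README.
STEPS = [
    ("scripts/data_processing_pipeline.py", "scripts/config/data_processing_config.json"),
    ("scripts/evaluation/eval_qualitative.py", "scripts/config/eval_qualitative_config.json"),
    ("scripts/evaluation/eval_quantitative.py", "scripts/config/eval_quantitative_config.json"),
    ("scripts/evaluation/eval_preference.py", "scripts/config/eval_preference_config.json"),
    ("scripts/evaluation/bt.py", "scripts/config/bt_config.json"),
    ("scripts/evaluation/apply_bt_weights.py", "scripts/config/apply_bt_weights_config.json"),
    ("scripts/evaluation/eval_ablation.py", "scripts/config/ablation_config.json"),
]


def build_commands(python=sys.executable):
    """Return one argument list per pipeline step (hypothetical helper)."""
    return [[python, script, "--config", config] for script, config in STEPS]


def run_pipeline(dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for number, command in enumerate(build_commands(), start=1):
        print(f"Step {number}: {' '.join(command)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Run it from the project root, and switch dry_run to False only once the .env file and the configuration files are in place.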
Step 1. Config: scripts/config/data_processing_config.json
Converts raw markdown survey files into structured JSON format. Normalizes outlines, content sections, and references. Optionally performs LLM-based calibration and quality checks.
python scripts/data_processing_pipeline.py --config scripts/config/data_processing_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| input_dir | Directory containing raw survey files (default: results/original) |
| output_dir | Output directory for processed JSON (default: results/processed) |
| normalize_outline | Normalize heading hierarchy |
| normalize_content | Clean and normalize section content |
| normalize_references | Standardize reference entries |
| llm_quality_check | Enable LLM-based reference quality checking |
| llm_model | LLM model for calibration tasks |
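To make the markdown-to-JSON conversion concrete, here is a minimal sketch that splits a markdown survey into an outline and per-heading content. It is an illustration under assumptions, not the logic of data_processing_pipeline.py or utils/markdown_to_json.py: the function names and the output keys (outline, sections, level, title, content) are hypothetical, and reference handling and LLM calibration are omitted.

```python
import json
import re

# ATX-style markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def markdown_to_sections(text):
    """Split markdown into an outline and per-heading content (hypothetical schema).

    Text that appears before the first heading is ignored in this sketch.
    """
    outline, sections = [], []
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            outline.append({"level": level, "title": title})
            current = {"level": level, "title": title, "content": []}
            sections.append(current)
        elif current is not None:
            current["content"].append(line)
    for section in sections:
        section["content"] = "\n".join(section["content"]).strip()
    return {"outline": outline, "sections": sections}


def normalize_levels(outline):
    """Shift levels so the shallowest heading becomes level 1 (one possible normalization)."""
    if not outline:
        return []
    offset = min(entry["level"] for entry in outline) - 1
    return [{"level": e["level"] - offset, "title": e["title"]} for e in outline]


sample = "## Introduction\nWhy surveys matter.\n### Scope\nWhat is covered.\n"
parsed = markdown_to_sections(sample)
print(json.dumps(normalize_levels(parsed["outline"]), indent=2))
```

Keeping the outline separate from the section bodies mirrors the way the later steps score outline, content, and references as distinct components.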
Step 2. Config: scripts/config/eval_qualitative_config.json
Evaluates survey quality using discipline-specific criteria and LLM scoring. Loads domain-specific rubrics from outputs/criteria/ and scores each survey on outline, content, and reference quality at per-aspect and per-criterion granularity.
python scripts/evaluation/eval_qualitative.py --config scripts/config/eval_qualitative_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| processed_dir | Input directory of processed surveys |
| criteria_base_dir | Directory containing discipline-specific criteria |
| criteria_filename | Criteria JSON filename (default: merged_aspects.json) |
| per_aspect_scoring | Enable per-aspect scoring |
| per_criterion_scoring | Enable per-criterion scoring |
| llm_model | LLM model for evaluation |
| max_total_tokens_in_prompt | Maximum tokens per evaluation prompt |
Step 3. Config: scripts/config/eval_quantitative_config.json
Measures how closely system-generated surveys align with human-written surveys using embedding similarity. Embeds outline entries, content sections, and references into a ChromaDB vector store, then computes alignment metrics.
python scripts/evaluation/eval_quantitative.py --config scripts/config/eval_quantitative_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| embedding_model | Embedding model name |
| embedding_api_base | Embedding API endpoint |
| chroma_db_dir | ChromaDB persistence directory |
| use_ams | Enable Average Maximum Similarity |
| use_bms | Enable Bidirectional Matching Score |
| use_hungarian_matching | Enable Hungarian matching for optimal alignment |
| outline_threshold / content_threshold / reference_threshold | Similarity thresholds |
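The metric names above are not defined in this README, so the following is only a sketch of one common reading. Average Maximum Similarity is taken here to mean: for each item on one side, find its best cosine similarity on the other side, then average those maxima. The bidirectional score is taken to be the mean of the two directions. The function names, the toy vectors, and the symmetric-mean definition are all assumptions; eval_quantitative.py may compute these differently, for example with thresholds or Hungarian matching.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_max_similarity(source, target):
    """Mean over source items of the best cosine match in target (assumed AMS)."""
    if not source or not target:
        return 0.0
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def bidirectional_score(generated, human):
    """Mean of both AMS directions (assumed definition of a bidirectional score)."""
    forward = average_max_similarity(generated, human)   # precision-like
    backward = average_max_similarity(human, generated)  # recall-like
    return (forward + backward) / 2.0


# Toy 2-D "embeddings"; real vectors would come from the embedding API.
generated = [[1.0, 0.0], [0.0, 1.0]]
human = [[1.0, 0.0]]
print(average_max_similarity(generated, human))  # 0.5
print(average_max_similarity(human, generated))  # 1.0
print(bidirectional_score(generated, human))     # 0.75
```

The two directions differ on purpose: a generated survey that adds off-topic sections lowers the forward score, while one that omits sections of the human survey lowers the backward score.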
Step 4. Config: scripts/config/eval_preference_config.json
Performs pairwise LLM comparisons of surveys within a single system (e.g., Human), one discipline at a time. Within each discipline, every pair of surveys is compared and the surveys are ranked by Elo rating. Supports a double round-robin that evaluates both the A vs. B and the B vs. A ordering.
python scripts/evaluation/eval_preference.py --config scripts/config/eval_preference_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| input_dir | Directory containing surveys to compare (e.g., results/processed/Human) |
| double_round_robin | Compare in both orders to reduce position bias |
| initial_elo | Starting Elo rating (default: 1500) |
| k_factor | Elo K-factor (default: 32) |
| eval_outline / eval_content / eval_reference | Which components to evaluate |
| llm_model | LLM model for pairwise comparison |
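The textbook Elo update shows how initial_elo and k_factor interact. This is a sketch of the standard formula with the defaults listed above, not the code in eval_preference.py; the function names and the judge callback, which stands in for the LLM comparison, are hypothetical.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k_factor=32):
    """Return new (A, B) ratings; score_a is 1.0 (A wins), 0.5 (tie), or 0.0."""
    expected_a = expected_score(rating_a, rating_b)
    delta = k_factor * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


def rank_surveys(survey_ids, judge, initial_elo=1500, k_factor=32,
                 double_round_robin=True):
    """Rate surveys from pairwise judgments and return them best-first.

    judge(first, second) is a hypothetical callback returning the score of
    `first` (1.0, 0.5, or 0.0) when it is presented before `second`.
    """
    ratings = {sid: float(initial_elo) for sid in survey_ids}
    for a, b in combinations(survey_ids, 2):
        # A double round-robin presents each pair in both orders.
        orderings = [(a, b), (b, a)] if double_round_robin else [(a, b)]
        for first, second in orderings:
            score = judge(first, second)
            ratings[first], ratings[second] = update_elo(
                ratings[first], ratings[second], score, k_factor
            )
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass in a discipline stays constant. Elo results also depend on the order in which comparisons are processed, which is one reason to present each pair in both orders.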
Step 5. Config: scripts/config/bt_config.json
Fits aspect weights from pairwise preference data using the Bradley-Terry model. Determines which scoring aspects (from Step 2) are most predictive of human/LLM preferences (from Step 4). Fits separate models per component (outline, content, reference) and optionally per domain.
python scripts/evaluation/bt.py --config scripts/config/bt_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| preference_eval_file | Path to preference evaluation results (from Step 4) |
| evaluation_summary_file | Path to qualitative evaluation summary (from Step 2) |
| system | System whose surveys were compared (default: Human) |
| components | Components to fit (default: [outline, content, reference]) |
| feature_level | Granularity: aspect, criterion, or both |
| fitting_mode | Fitting scope: domain, global, or both |
| regularization_alpha | L2 regularization strength |
| augment_with_criteria | Enable criterion-expansion data augmentation |
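One way to read this step is as a feature-based Bradley-Terry model: the probability that survey A is preferred over survey B is a logistic function of a weighted difference of their aspect scores, and the weights maximize the L2-regularized likelihood of the observed preferences. The sketch below uses plain gradient ascent on toy data. The function names, learning rate, epoch count, and data are assumptions, and bt.py may use a different parameterization, optimizer, or augmentation scheme.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_bt_weights(pairs, n_features, alpha=0.1, lr=0.1, epochs=500):
    """Fit w so that P(A preferred over B) = sigmoid(w . (x_A - x_B)).

    pairs: list of (x_winner, x_loser) aspect-score vectors.
    alpha: L2 regularization strength (cf. regularization_alpha).
    """
    weights = [0.0] * n_features
    if not pairs:
        return weights
    for _ in range(epochs):
        # Start from the gradient of the penalty -(alpha / 2) * ||w||^2.
        grad = [-alpha * w for w in weights]
        for winner, loser in pairs:
            diff = [a - b for a, b in zip(winner, loser)]
            margin = sum(w * d for w, d in zip(weights, diff))
            p_win = sigmoid(margin)
            # Gradient of log(sigmoid(margin)) with respect to w.
            for i, d in enumerate(diff):
                grad[i] += (1.0 - p_win) * d
        weights = [w + lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Toy data: the winner always has the higher score on aspect 0,
# while aspect 1 varies without tracking the outcome.
pairs = [
    ([5.0, 3.0], [2.0, 3.0]),
    ([4.0, 1.0], [1.0, 4.0]),
    ([4.0, 4.0], [3.0, 1.0]),
    ([5.0, 2.0], [3.0, 5.0]),
]
print(fit_bt_weights(pairs, n_features=2))
```

On data like this the fitted weight for aspect 0 should dominate, which is the sense in which the weights indicate which aspects are most predictive of preferences.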
Step 6. Config: scripts/config/apply_bt_weights_config.json
Applies the learned BT weights (from Step 5) to rescore evaluation summaries for any system. Produces a comparison of original (equal-weight) scores vs. BT-weighted scores, along with ranking changes.
python scripts/evaluation/apply_bt_weights.py --config scripts/config/apply_bt_weights_config.json
Key configuration options:
| Parameter | Description |
|---|---|
| bt_weights_file | Path to BT weights JSON (from Step 5) |
| evaluation_summary_files | List of evaluation summary files to rescore |
| weight_level | Which weight level to apply: aspect, criterion, or both |
| compute_aspect_from_criterion | Recompute aspect scores from criterion scores |
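As a rough illustration of rescoring, the sketch below compares an equal-weight mean of aspect scores with a weighted mean under learned weights, and shows how the ranking can change. The normalization (clipping negative weights to zero and rescaling to sum to one), the aspect names, the system names, and the function names are assumptions made for this example; apply_bt_weights.py may combine weights and scores differently.

```python
def equal_weight_score(aspect_scores):
    """Unweighted mean of the aspect scores."""
    return sum(aspect_scores.values()) / len(aspect_scores)


def bt_weighted_score(aspect_scores, weights):
    """Weighted mean with non-negative weights rescaled to sum to 1 (assumed scheme)."""
    clipped = {name: max(weights.get(name, 0.0), 0.0) for name in aspect_scores}
    total = sum(clipped.values())
    if total == 0.0:
        # Fall back to equal weights when no usable weight is available.
        return equal_weight_score(aspect_scores)
    return sum(aspect_scores[name] * clipped[name] / total for name in aspect_scores)


def rank(systems, scorer):
    """System names ordered from best to worst under the given scorer."""
    return sorted(systems, key=lambda name: scorer(systems[name]), reverse=True)


# Hypothetical aspect scores for two systems and hypothetical learned weights.
systems = {
    "SystemA": {"coverage": 4.0, "structure": 2.0},
    "SystemB": {"coverage": 2.0, "structure": 5.0},
}
weights = {"coverage": 3.0, "structure": 1.0}

print(rank(systems, equal_weight_score))                       # ['SystemB', 'SystemA']
print(rank(systems, lambda s: bt_weighted_score(s, weights)))  # ['SystemA', 'SystemB']
```

Here SystemB leads under equal weights (3.5 vs. 3.0), but SystemA leads once coverage is weighted three times as heavily as structure (3.5 vs. 2.75).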
Step 7. Config: scripts/config/ablation_config.json
Runs an ablation study using generic (non-discipline-aware) rubrics. Instead of the discipline-specific criteria used in Step 2, this uses a fixed set of rubrics: Coverage, Structure, Relevance, Language, Criticalness, Outline, and Reference.
python scripts/evaluation/eval_ablation.py --config scripts/config/ablation_config.json
Configuration options are the same as Step 2, but the built-in fixed rubrics are used instead of discipline-specific criteria.
After running evaluations, use the following scripts to aggregate and analyze results:
Aggregate results into CSVs:
python scripts/evaluation/analyze_results.py <evaluation_summary.json> --output-dir results/analysis/
Merge multiple analysis runs (mean/std):
python scripts/evaluation/merge_results.py results/analysis/analysis_1/ results/analysis/analysis_2/ --output-dir results/analysis/merged/
SurveyLens evaluates the following ASG methods:
| System | Type |
|---|---|
| Human | Human-written surveys (ground truth) |
| Qwen | Vanilla LLM & Deep Research Agent |
| Gemini | Vanilla LLM & Deep Research Agent |
| AutoSurvey | ASG System |
| AutoSurvey2 | ASG System |
| InteractiveSurvey | ASG System |
| LLMxMapReduce_V2 | ASG System |
| SciSage | ASG System |
| SurveyForge | ASG System |
| SurveyX | ASG System |
Figure 2: Radar chart comparing ASG methods across key dimensions. The visualization reveals distinct strengths and weaknesses of each paradigm—Vanilla LLMs, specialized ASG systems, and commercial Deep Research agents—across different evaluation criteria.
This project is licensed under the MIT License. See the LICENSE file for details.