TVIR (Text-Visual Interleaved Report) is a deep research agent system powered by large language models, designed to automatically generate high-quality research reports with interleaved text and visual content. The system features:
- Intelligent Information Retrieval: Multi-round deep information retrieval via the Google Search API
- Data Visualization: Automatic generation of charts and data visualizations
- Multimodal Fusion: Integration of text, images, charts, and other media formats
- Structured Reports: Generation of professional research reports with clear logic and proper citations
- Task Alignment: Precise understanding of user requirements to generate scenario-specific reports
The benchmark dataset is located at benchmark/data/query.json, containing 100 carefully curated deep research tasks (50 in Chinese, 50 in English) across 10 domains. Each task includes:
- Query ID: Unique identifier (e.g., `000001`, `000002`)
- Task Description: User-specific research requirements with explicit multimodal integration needs (text, images, charts)
- Evaluation Checklist: Structured verification criteria aligned with task requirements
Tasks are designed around real-world research scenarios, requiring multi-source information integration, critical analysis, and text-visual interleaved report generation.
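As an illustration, tasks of this shape can be loaded and looked up by ID. The field names below are assumptions for illustration only; the actual schema of `benchmark/data/query.json` may differ:

```python
import json

# Illustrative structure of one task entry; the real field names in
# benchmark/data/query.json may differ from these assumed ones.
sample_tasks = json.loads("""
[
  {
    "query_id": "000001",
    "task_description": "A research task with explicit multimodal needs",
    "checklist": ["criterion 1", "criterion 2"]
  }
]
""")

def get_task(tasks, query_id):
    """Return the task entry matching a given query ID, or None."""
    return next((t for t in tasks if t["query_id"] == query_id), None)

task = get_task(sample_tasks, "000001")
print(task["task_description"])
```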
| Metric | Description |
|---|---|
| Citation Support | Whether claims in the report are backed by the cited sources |
| Analytical Depth & Breadth | Depth of analysis and coverage of the topic |
| Factual & Logical Consistency | Absence of factual errors and logical contradictions |
| Instruction Alignment | How well the report matches the task requirements |
| Writing Quality | Clarity, organization, and fluency of the prose |
| Chart-Source Consistency | Whether generated charts faithfully reflect the underlying source data |
| Figure Caption Quality | Accuracy and informativeness of figure captions |
| Figure Context Integration | How well figures are integrated with the surrounding text |
| Figure Quality | Visual clarity and correctness of figures |
| Multimodal Composition | Overall balance and arrangement of text and visual content |
```
TVIR/
├── agent/                  # Agent core module
│   ├── main.py             # Main entry point
│   ├── run_agent.sh        # Run script
│   ├── conf/               # Configuration files
│   │   ├── config.yaml     # Main configuration
│   │   ├── agent/          # Agent configurations
│   │   └── llm/            # LLM model configurations
│   └── src/                # Source code
│       ├── core/           # Core logic (orchestrator, pipeline)
│       ├── io/             # Input/output handling
│       ├── llm/            # LLM client wrappers
│       ├── logging/        # Logging system
│       └── utils/          # Utility functions
├── benchmark/              # Evaluation benchmark
│   ├── eval.py             # Evaluation script
│   ├── preprocess.py       # Preprocessing script
│   ├── data/               # Benchmark dataset
│   ├── reports/            # Reports to be evaluated
│   └── scripts/            # Evaluation scripts
│       ├── preprocess/     # Preprocessing modules
│       └── evaluation/     # Evaluation metric modules
├── libs/                   # Third-party libraries
│   └── miroflow-tools/     # MCP tool integration
└── logs/                   # Log output directory
```
- Python 3.12+
- macOS / Linux / Windows
- uv package manager
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

Clone the repository:

```bash
git clone https://github.com/NJU-LINK/TVIR.git
cd TVIR
```

Copy the environment variable template and fill in the required API keys:

```bash
cp .env.example .env
```

Edit the `.env` file and configure the following required API keys:
```
# ============================================
# Required Configuration for Agent
# ============================================

# API for Google Search
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL=https://google.serper.dev

# API for Linux Sandbox
E2B_API_KEY=your_e2b_key

# Model for VQA
VQA_MODEL_NAME=gpt-41-0414-global

# API for OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# ============================================
# Required Configuration for Benchmark
# ============================================

# API for Google Search
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL=https://google.serper.dev

# API for OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Evaluation Settings
EVAL_MODEL_NAME=gpt-5.2-1211-global
MAX_RETRIES=3
TEMPERATURE=0
STREAMING=true
```

Install the project dependencies:

```bash
uv sync
```

Load predefined query tasks from `benchmark/data/query.json`:
```bash
cd agent
bash run_agent.sh claude-4-5 000001
```

Parameters:
- `claude-4-5`: LLM model configuration to use (options: `claude-4-5`, `qwen-3`, `glm-4-7`, etc.)
- `000001`: Query ID corresponding to a task in `benchmark/data/query.json`
Write your task description directly in `agent/main.py`:

```python
# Modify task description in main.py
task_description = """
Write a technical analysis report on memory mechanisms in large language model agents, including architecture diagrams, timelines, and other visualizations...
"""
```

Then run:

```bash
cd agent
bash run_agent.sh claude-4-5
```

Organize reports to be evaluated in the following directory structure:
```
benchmark/reports/{eval_system_name}/{query_id}/
├── report.md           # Main report file
├── images/             # Images retrieved from web
│   ├── image1.jpg
│   └── image2.png
└── charts/             # Charts generated by tools
    ├── chart1.png
    └── chart2.svg
```
Notes:
- Images in `report.md` can use local paths (e.g., `./images/pic.jpg`) or HTTP links
- `eval_system_name` is the evaluation system name (e.g., `claude`, `gpt4`, `qwen`)
- `query_id` must correspond to an ID in `benchmark/data/query.json`
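As a sanity check before preprocessing, the expected layout can be verified with a short script. This is a minimal sketch: it treats only `report.md` as mandatory and `images/` and `charts/` as optional, which matches the layout shown above but is not enforced by the repository itself.

```python
import tempfile
from pathlib import Path

def check_report_dir(report_dir):
    """Return a list of problems found in a report directory.

    Assumes report.md is required while images/ and charts/ are
    optional subdirectories (an assumption based on the layout above).
    """
    report_dir = Path(report_dir)
    problems = []
    if not (report_dir / "report.md").is_file():
        problems.append("missing report.md")
    for sub in ("images", "charts"):
        d = report_dir / sub
        if d.exists() and not d.is_dir():
            problems.append(f"{sub} exists but is not a directory")
    return problems

# Example: build a minimal valid report directory and check it.
tmp = Path(tempfile.mkdtemp())
(tmp / "report.md").write_text("# Report\n")
print(check_report_dir(tmp))  # → []
```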
Preprocess reports to extract citations, charts, and other information:
```bash
cd benchmark

# Preprocess a single report
uv run python preprocess.py \
    --report_root_dir reports \
    --eval_system_name claude \
    --query_id 000001

# Batch preprocessing
uv run python preprocess.py \
    --report_root_dir reports \
    --eval_system_name claude gpt4 \
    --query_id 000001 000002 000003
```

Parameters:
- `--report_root_dir`: Report root directory (default: `reports`)
- `--eval_system_name`: Evaluation system name(s); supports multiple values
- `--query_id`: Query ID(s); supports multiple values
Execute automated evaluation to generate results:
```bash
cd benchmark

# Evaluate a single report
uv run python eval.py \
    --report_root_dir reports \
    --result_root_dir eval_results \
    --eval_system_name claude \
    --query_id 000001

# Batch evaluation
uv run python eval.py \
    --report_root_dir reports \
    --result_root_dir eval_results \
    --eval_system_name claude gpt4 \
    --query_id 000001 000002 000003
```

Parameters:
- `--report_root_dir`: Report root directory
- `--result_root_dir`: Evaluation result output directory
- `--eval_system_name`: Evaluation system name(s); supports multiple values
- `--query_id`: Query ID(s); supports multiple values
Run the summary script to aggregate evaluation results across all models and dimensions:
```bash
uv run python score_generate_result.py
```

This will generate `model_dimension_summary.xlsx` containing average scores for each model across text dimensions, visual dimensions, and overall performance.
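The aggregation itself amounts to averaging per-metric scores within each dimension group. A minimal sketch follows; the grouping of the ten metrics into text and visual dimensions, and the flat score format, are assumptions for illustration and are not taken from `score_generate_result.py`:

```python
# Assumed grouping of the ten metrics into text and visual dimensions;
# the actual grouping used by score_generate_result.py may differ.
TEXT_METRICS = ["Citation Support", "Analytical Depth & Breadth",
                "Factual & Logical Consistency", "Instruction Alignment",
                "Writing Quality"]
VISUAL_METRICS = ["Chart-Source Consistency", "Figure Caption Quality",
                  "Figure Context Integration", "Figure Quality",
                  "Multimodal Composition"]

def summarize(scores):
    """Average per-metric scores into text, visual, and overall means."""
    def mean(keys):
        vals = [scores[k] for k in keys if k in scores]
        return sum(vals) / len(vals) if vals else float("nan")
    return {
        "text": mean(TEXT_METRICS),
        "visual": mean(VISUAL_METRICS),
        "overall": mean(TEXT_METRICS + VISUAL_METRICS),
    }

example = {m: 4.0 for m in TEXT_METRICS} | {m: 3.0 for m in VISUAL_METRICS}
print(summarize(example))  # → {'text': 4.0, 'visual': 3.0, 'overall': 3.5}
```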
This project is licensed under the MIT License.
If this project helps you, please give us a Star ⭐️