# TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

License: MIT · Python 3.12+


## 📖 Overview

TVIR (Text-Visual Interleaved Report) is a deep research agent system powered by large language models, designed to automatically generate high-quality research reports with interleaved text and visual content. The system features:

- 🔍 **Intelligent Information Retrieval**: multi-round deep information retrieval via the Google Search API
- 📊 **Data Visualization**: automatic generation of charts and data visualizations
- 🖼️ **Multimodal Fusion**: integration of text, images, charts, and other media formats
- 📝 **Structured Reports**: professional research reports with clear logic and proper citations
- 🎯 **Task Alignment**: precise understanding of user requirements to generate scenario-specific reports

## 📦 Dataset

The benchmark dataset is located at `benchmark/data/query.json` and contains 100 carefully curated deep research tasks (50 in Chinese, 50 in English) across 10 domains. Each task includes:

- **Query ID**: unique identifier (e.g., `000001`, `000002`)
- **Task Description**: user-specific research requirements with explicit multimodal integration needs (text, images, charts)
- **Evaluation Checklist**: structured verification criteria aligned with the task requirements

Tasks are designed around real-world research scenarios, requiring multi-source information integration, critical analysis, and text-visual interleaved report generation.
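As an illustration, a single dataset entry can be pictured as follows. The field names used here (`query_id`, `task_description`, `checklist`) are assumptions for demonstration, not the verified schema of `query.json`:

```python
import json

# Hypothetical entry shaped like the dataset description above; the actual
# key names in benchmark/data/query.json may differ.
sample = json.loads("""
[
  {
    "query_id": "000001",
    "task_description": "Write a report on ..., interleaving text, images, and charts.",
    "checklist": ["Covers the requested topic", "Includes at least one generated chart"]
  }
]
""")

# Look up a task by its query ID, as run_agent.sh does with its second argument.
tasks_by_id = {t["query_id"]: t for t in sample}
task = tasks_by_id["000001"]
```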

## 📊 Evaluation Metrics

Reports are scored along ten dimensions:

- Citation Support
- Analytical Depth & Breadth
- Factual & Logical Consistency
- Instruction Alignment
- Writing Quality
- Chart-Source Consistency
- Figure Caption Quality
- Figure Context Integration
- Figure Quality
- Multimodal Composition

πŸ—οΈ Project Architecture

```
TVIR/
├── agent/                     # Agent core module
│   ├── main.py                # Main entry point
│   ├── run_agent.sh           # Run script
│   ├── conf/                  # Configuration files
│   │   ├── config.yaml        # Main configuration
│   │   ├── agent/             # Agent configurations
│   │   └── llm/               # LLM model configurations
│   └── src/                   # Source code
│       ├── core/              # Core logic (orchestrator, pipeline)
│       ├── io/                # Input/output handling
│       ├── llm/               # LLM client wrappers
│       ├── logging/           # Logging system
│       └── utils/             # Utility functions
├── benchmark/                 # Evaluation benchmark
│   ├── eval.py                # Evaluation script
│   ├── preprocess.py          # Preprocessing script
│   ├── data/                  # Benchmark dataset
│   ├── reports/               # Reports to be evaluated
│   └── scripts/               # Evaluation scripts
│       ├── preprocess/        # Preprocessing modules
│       └── evaluation/        # Evaluation metric modules
├── libs/                      # Third-party libraries
│   └── miroflow-tools/        # MCP tool integration
└── logs/                      # Log output directory
```

## 🚀 Quick Start

### Requirements

- Python 3.12+
- macOS / Linux / Windows
- The `uv` package manager

### Installation

**1. Install the uv package manager**

```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**2. Clone the repository**

```bash
git clone https://github.com/NJU-LINK/TVIR.git
cd TVIR
```

**3. Configure environment variables**

Copy the environment variable template and fill in the required API keys:

```bash
cp .env.example .env
```

Edit the `.env` file and configure the following required API keys:

```bash
# ============================================
# Required Configuration for Agent
# ============================================

# API for Google Search
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL=https://google.serper.dev

# API for Linux Sandbox
E2B_API_KEY=your_e2b_key

# Model for VQA
VQA_MODEL_NAME=gpt-41-0414-global

# API for OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# ============================================
# Required Configuration for Benchmark
# ============================================

# API for Google Search
SERPER_API_KEY=your_serper_key
SERPER_BASE_URL=https://google.serper.dev

# API for OpenAI
OPENAI_API_KEY=your_openai_key
OPENAI_BASE_URL=https://api.openai.com/v1

# Evaluation Settings
EVAL_MODEL_NAME=gpt-5.2-1211-global
MAX_RETRIES=3
TEMPERATURE=0
STREAMING=true
```
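A quick pre-flight check can catch missing keys before a run fails mid-pipeline. The helper below is a sketch, not part of the repository, and assumes the key names listed above:

```python
import os

# Required keys from .env, per the configuration section above.
REQUIRED_KEYS = ("SERPER_API_KEY", "E2B_API_KEY", "OPENAI_API_KEY")

def missing_keys(env=os.environ, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in the environment."""
    return [k for k in required if not env.get(k)]

if missing := missing_keys():
    print(f"Missing from environment: {', '.join(missing)}")
```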

**4. Install dependencies**

```bash
uv sync
```

## 💻 Usage

### Running the Agent to Generate Reports

**Method 1: Using the Benchmark Dataset**

Load a predefined query task from `benchmark/data/query.json`:

```bash
cd agent
bash run_agent.sh claude-4-5 000001
```

Parameters:

- `claude-4-5`: the LLM model configuration to use (options: `claude-4-5`, `qwen-3`, `glm-4-7`, etc.)
- `000001`: the query ID corresponding to a task in `benchmark/data/query.json`
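To run the agent over a range of query IDs, a small loop over zero-padded IDs works. This batch script is a sketch, not part of the repository; `echo` is included so the commands are printed rather than executed:

```shell
# Generate zero-padded query IDs 000001..000003 and invoke run_agent.sh for
# each. Drop the echo to actually run the agent.
for id in $(seq -f "%06g" 1 3); do
  echo bash run_agent.sh claude-4-5 "$id"
done
```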

**Method 2: Custom Task Description**

Write your task description directly in `agent/main.py`:

```python
# Modify the task description in main.py
task_description = """
Write a technical analysis report on memory mechanisms in large language model agents, including architecture diagrams, timelines, and other visualizations...
"""
```

Then run:

```bash
cd agent
bash run_agent.sh claude-4-5
```

### Running the Benchmark Evaluation

**Step 1: Prepare Report Files**

Organize the reports to be evaluated in the following directory structure:

```
benchmark/reports/{eval_system_name}/{query_id}/
├── report.md          # Main report file
├── images/            # Images retrieved from the web
│   ├── image1.jpg
│   └── image2.png
└── charts/            # Charts generated by tools
    ├── chart1.png
    └── chart2.svg
```

Notes:

- Images in `report.md` can use local paths (e.g., `./images/pic.jpg`) or HTTP links
- `eval_system_name` is the name of the system whose reports are being evaluated (e.g., `claude`, `gpt4`, `qwen`)
- `query_id` must correspond to an ID in `benchmark/data/query.json`
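Before preprocessing, it can help to verify that each report directory at least contains its `report.md`. This validation helper is a sketch under the layout above; treating `images/` and `charts/` as optional is an assumption:

```python
from pathlib import Path

def check_report_dir(root: str, eval_system_name: str, query_id: str) -> bool:
    """Return True if the report directory contains the mandatory report.md.

    images/ and charts/ are treated as optional here (an assumption).
    """
    report_dir = Path(root) / eval_system_name / query_id
    return (report_dir / "report.md").is_file()
```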

**Step 2: Preprocess Reports**

Preprocess the reports to extract citations, charts, and other information:

```bash
cd benchmark

# Preprocess a single report
uv run python preprocess.py \
  --report_root_dir reports \
  --eval_system_name claude \
  --query_id 000001

# Batch preprocessing
uv run python preprocess.py \
  --report_root_dir reports \
  --eval_system_name claude gpt4 \
  --query_id 000001 000002 000003
```

Parameters:

- `--report_root_dir`: report root directory (default: `reports`)
- `--eval_system_name`: evaluation system name(s); multiple values supported
- `--query_id`: query ID(s); multiple values supported

**Step 3: Run Evaluation**

Execute the automated evaluation to generate results:

```bash
cd benchmark

# Evaluate a single report
uv run python eval.py \
  --report_root_dir reports \
  --result_root_dir eval_results \
  --eval_system_name claude \
  --query_id 000001

# Batch evaluation
uv run python eval.py \
  --report_root_dir reports \
  --result_root_dir eval_results \
  --eval_system_name claude gpt4 \
  --query_id 000001 000002 000003
```

Parameters:

- `--report_root_dir`: report root directory
- `--result_root_dir`: evaluation result output directory
- `--eval_system_name`: evaluation system name(s); multiple values supported
- `--query_id`: query ID(s); multiple values supported

**Step 4: Generate a Summary Report**

Run the summary script to aggregate evaluation results across all models and dimensions:

```bash
uv run python score_generate_result.py
```

This generates `model_dimension_summary.xlsx`, containing average scores for each model across text dimensions, visual dimensions, and overall performance.
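The aggregation performed by the summary step can be pictured as a per-dimension average over queries. The sketch below assumes a simplified result schema (one `{dimension: score}` dict per query) rather than the script's actual output format:

```python
from collections import defaultdict

def average_scores(results):
    """Average each evaluation dimension across per-query score dicts."""
    totals, counts = defaultdict(float), defaultdict(int)
    for scores in results:
        for dim, score in scores.items():
            totals[dim] += score
            counts[dim] += 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```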

## 📄 License

This project is licensed under the MIT License.

## 🌟 Star History

If this project helps you, please give us a star ⭐️.

