DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search

DeepRead is a document-structure-aware RAG agent. This repo includes the core parsing, indexing, retrieval, and agent runtime code used in the demos.

News

2026.3.16 🔥 DeepRead has been featured in New Intelligence (新智元)!

Repository Layout

Code/DeepRead.py: agent runtime + retrieval + tool calls.
Code/parser_pdf.py: PDF -> OCR (PaddleOCRVL) -> merged Markdown/JSON -> corpus; optional embeddings.
Code/paddleocr.sh: Docker-based PaddleOCRVL vLLM server runner.
Demo/TradingAgent/: demo corpus + embeddings (with images).
Demo/金山办公2023年报/: demo corpus + embeddings (Kingsoft Office 2023 annual report, in Chinese).

Quickstart

0) Set API Environment Variables

Set these before running DeepRead.py (adjust to your provider):

# LLM
export OPENROUTER_API_KEY="<YOUR_OPENROUTER_KEY>"
export OPENROUTER_BASE_URL="https://api.openai.com/v1"
export OPENROUTER_MODEL="gpt-4o"

# Optional: Embedding service
export EMBED_API_KEY="<YOUR_EMBEDDING_KEY>"
export EMBED_BASE_URL="http://127.0.0.1:8756/v1"
export EMBEDDING_MODEL="Qwen/Qwen3-Embedding-8B"

# Optional: Reranker service (for semantic retrieval)
export RERANK_API_KEY="<YOUR_RERANK_KEY>"
export RERANK_BASE_URL="https://api.siliconflow.cn/v1"
export RERANK_MODEL="Qwen/Qwen3-Reranker-8B"
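
Before launching the agent, it can help to check which of these variables are actually set. The following is a minimal sketch and not part of the repo: the helper name `available_services` and the rule that a service counts as configured only when all three of its variables are non-empty are assumptions made for illustration.

```python
import os

# Variable names come from the export lines above; the grouping and the
# "all three must be set" rule are assumptions of this sketch.
SERVICE_VARS = {
    "llm": ("OPENROUTER_API_KEY", "OPENROUTER_BASE_URL", "OPENROUTER_MODEL"),
    "embedding": ("EMBED_API_KEY", "EMBED_BASE_URL", "EMBEDDING_MODEL"),
    "rerank": ("RERANK_API_KEY", "RERANK_BASE_URL", "RERANK_MODEL"),
}


def available_services(env=None):
    """Return {service: bool}, True when every variable of the service is non-empty."""
    env = os.environ if env is None else env
    return {
        name: all(env.get(var, "").strip() for var in variables)
        for name, variables in SERVICE_VARS.items()
    }


if __name__ == "__main__":
    for name, ok in available_services().items():
        print(f"{name}: {'configured' if ok else 'missing variables'}")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the shell environment.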

1) Start PaddleOCRVL server for PDF OCR

We use the official PaddleOCRVL Docker image published by PaddlePaddle, which is based on vLLM. The launcher is provided in Code/paddleocr.sh. Run:

bash Code/paddleocr.sh

By default the server listens at http://127.0.0.1:8956/v1, which is also the default endpoint that the PDF parsing code calls.
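
Since the parser depends on this endpoint, a quick reachability check before parsing can save a failed run. This is a hedged sketch using only the Python standard library; `server_reachable` is a hypothetical helper, and it only confirms that something accepts TCP connections on the port, not that the OCR model has finished loading.

```python
import socket
from urllib.parse import urlparse


def server_reachable(base_url, timeout=2.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    # Fall back to the scheme's standard port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For the default setup, the call would be `server_reachable("http://127.0.0.1:8956/v1")`.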

2) PDF -> Corpus (Structure-Aware)

python Code/parser_pdf.py \
  --input /path/to/your.pdf \
  --output /path/to/output_dir

Optional embeddings (requires an embedding API server):

python Code/parser_pdf.py \
  --input /path/to/your.pdf \
  --output /path/to/output_dir \
  --build-embeddings \
  --embedding-model Qwen/Qwen3-Embedding-8B \
  --embed-base-url http://127.0.0.1:8756/v1 \
  --embed-api-key <YOUR_KEY>

This produces:

*_corpus.json (structured nodes)
*_emb.npy + *_idmap.json (optional vector store)
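
To confirm that a parsing run produced what is expected, the output directory can be scanned for these suffixes. This is an illustrative sketch only: `find_outputs` and `has_vector_store` are hypothetical helpers, and the sketch assumes the files sit directly in the output directory rather than in subfolders.

```python
from pathlib import Path


def find_outputs(output_dir):
    """Group files in output_dir by the three suffixes listed above."""
    root = Path(output_dir)
    suffixes = {
        "corpus": "_corpus.json",
        "embeddings": "_emb.npy",
        "idmap": "_idmap.json",
    }
    return {
        kind: sorted(root.glob(f"*{suffix}"))
        for kind, suffix in suffixes.items()
    }


def has_vector_store(outputs):
    """True when both halves of the optional vector store are present."""
    return bool(outputs["embeddings"]) and bool(outputs["idmap"])
```

If `has_vector_store` returns False, the corpus was probably built without `--build-embeddings`.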

3) Ask Questions with DeepRead

Run:

python Code/DeepRead.py \
  --doc /path/to/output_dir/your_corpus.json \
  --question "What is your question?" \
  --enable-semantic \
  --neighbor-window 1,-1 \
  --log run_log.jsonl
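
The schema of the log file is not documented here. The `.jsonl` extension suggests JSON Lines (one JSON value per line), so a generic reader like the sketch below should be able to load it under that assumption; `read_jsonl` is a hypothetical helper, not something the repo provides.

```python
import json


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of the file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

A first look at a run could then be `records = list(read_jsonl("run_log.jsonl"))`, followed by inspecting the keys of each record.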

4) Recommended Retrieval Modes

Choose a retrieval mode based on your service availability:

No Embedding API: use BM25 (available by default, no extra flags)

python Code/DeepRead.py --doc /path/to/your_corpus.json --question "..." --log run_log.jsonl

Embedding API only (no reranker): use Vector retrieval

python Code/DeepRead.py \
  --doc /path/to/your_corpus.json \
  --question "..." \
  --enable-vector \
  --disable-bm25 \
  --disable-regex \
  --log run_log.jsonl

Embedding + Reranker: use Semantic retrieval (vector recall + rerank)

python Code/DeepRead.py \
  --doc /path/to/your_corpus.json \
  --question "..." \
  --enable-semantic \
  --log run_log.jsonl
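
The three recommendations above can be sketched as a small selection function. The flag names are the ones used in the commands above; the helper names `retrieval_flags` and `build_command` are hypothetical, and treating "reranker without embeddings" as the BM25 case is an assumption of this sketch, since that combination is not covered above.

```python
def retrieval_flags(has_embedding, has_reranker):
    """Map service availability to the flag sets recommended above."""
    if has_embedding and has_reranker:
        return ["--enable-semantic"]
    if has_embedding:
        return ["--enable-vector", "--disable-bm25", "--disable-regex"]
    return []  # BM25 is available by default, so no extra flags


def build_command(doc, question, has_embedding=False, has_reranker=False,
                  log="run_log.jsonl"):
    """Assemble the argument list for Code/DeepRead.py."""
    return [
        "python", "Code/DeepRead.py",
        "--doc", doc,
        "--question", question,
        *retrieval_flags(has_embedding, has_reranker),
        "--log", log,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repository root.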

Demo

Demo 1: TradingAgent (multimodal + embeddings)

python Code/DeepRead.py \
  --doc "Demo/TradingAgent/TradingAgent_corpus.json" \
  --question "Which roles are included in the overall TradingAgents framework?" \
  --enable-semantic \
  --enable-multimodal \
  --log demo_trading.jsonl

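Demo 2: Kingsoft Office 2023 Annual Report (金山办公2023年年报)

The corpus is in Chinese, so the question is asked in Chinese. It translates to: "Which of the company's in-progress R&D projects have a cumulative investment exceeding 100 million yuan?"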

python Code/DeepRead.py \
  --doc "Demo/金山办公2023年报/11724-金山办公：金山办公2023年年度报告_corpus.json" \
  --question "公司有哪些累计投入金额超过一亿元的在研项目？" \
  --enable-semantic \
  --neighbor-window 0,0 \
  --log demo_xx.jsonl

Full Usage

DeepRead.py

All options:

python Code/DeepRead.py --help

Common flags:

Input/basics: --doc, --question, --log, --max_rounds, --temperature
Retrieval toggles: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
Semantic retrieval: --semantic-stage1 (vector/bm25/hybrid), --semantic-topk1, --semantic-topk2
Neighbor window: --neighbor-window up,down
Multimodal: --enable-multimodal
Embedding: --embedding-model, --embed-base-url, --embed-api-key
Rerank: --rerank-api-key, --rerank-base-url, --rerank-model

parser_pdf.py

All options:

python Code/parser_pdf.py --help

Common flags:

Input/output: --input (PDF), --output
OCR server: --paddle-vl-rec-backend, --paddle-vl-rec-server-url
Embedding: --build-embeddings, --embedding-model, --embedding-batch-size, --embed-base-url, --embed-api-key

Configuration Reference

DeepRead reads from environment variables and CLI flags:

LLM: OPENROUTER_API_KEY, OPENROUTER_BASE_URL, OPENROUTER_MODEL
Embedding: EMBED_API_KEY, EMBED_BASE_URL, EMBEDDING_MODEL
Rerank (optional): RERANK_API_KEY, RERANK_BASE_URL, RERANK_MODEL
Retrieval: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
Neighbor window: --neighbor-window up,down (e.g. 1,-1; use 0,0 to disable)
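
The `up,down` value is a comma-separated pair of integers. As a rough illustration of how such a value can be validated before it is passed on the command line, here is a hypothetical `parse_neighbor_window` helper. It only checks the shape of the string and does not interpret what the two numbers mean; the parsing inside DeepRead.py may differ.

```python
def parse_neighbor_window(value):
    """Parse an "up,down" string such as "1,-1" into a pair of ints."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected 'up,down', got {value!r}")
    # int() raises ValueError for non-numeric parts, which we let propagate.
    up, down = (int(part) for part in parts)
    return up, down
```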

Notes

parser_pdf.py currently accepts PDF only.
OCR requires paddleocr and PaddleOCRVL (or run the provided Docker server).
tiktoken is optional; if missing, token counting falls back to a simple tokenizer.
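
The optional-dependency pattern described in the last note can be sketched as follows. This is not the repo's code: the `cl100k_base` encoding and the regex-based fallback are assumptions chosen for illustration, since neither the encoding nor the "simple tokenizer" is specified above.

```python
import re

try:
    import tiktoken  # optional dependency

    # Assumed encoding; the repo may use a different one.
    _ENCODER = tiktoken.get_encoding("cl100k_base")
except Exception:  # not installed, or the encoding data could not be loaded
    _ENCODER = None


def simple_token_count(text):
    """Fallback: count runs of word characters and single punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def count_tokens(text):
    """Use tiktoken when it loaded successfully, otherwise the fallback."""
    if _ENCODER is not None:
        return len(_ENCODER.encode(text))
    return simple_token_count(text)
```

The two paths generally return different counts for the same text, so token budgets tuned with tiktoken installed may behave differently without it.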

Related Work

Related Outstanding Work: PaddleOCR, PageIndex

Citation

If you find DeepRead helpful, please cite our paper:

@article{li2026deepread,
  title={DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search},
  author={Li, Zhanli and Tian, Huiwen and Luo, Lvzhou and Cao, Yixuan and Luo, Ping},
  journal={arXiv preprint arXiv:2602.05014},
  year={2026}
}

License

See LICENSE.
