Skip to content

Zhanli-Li/DeepRead

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search

license Ask DeepWiki

main fig

DeepRead is a document-structure-aware RAG Agent. This repo already includes the core parsing, indexing, retrieval, and agent runtime used in the demo.

News

Repository Layout

  • Code/DeepRead.py: agent runtime + retrieval + tool calls.
  • Code/parser_pdf.py: PDF -> OCR (PaddleOCRVL) -> merged Markdown/JSON -> corpus; optional embeddings.
  • Code/paddleocr.sh: Docker-based PaddleOCRVL vLLM server runner.
  • Demo/TradingAgent/: demo corpus + embeddings (with images).
  • Demo/金山办公2023年报/: demo corpus + embeddings.

Quickstart

0) Set API Environment Variables

Set these before running DeepRead.py (adjust to your provider):

# LLM
export OPENROUTER_API_KEY="<YOUR_OPENROUTER_KEY>"
export OPENROUTER_BASE_URL="https://api.openai.com/v1"
export OPENROUTER_MODEL="gpt-4o"

# Optional: Embedding service
export EMBED_API_KEY="<YOUR_EMBEDDING_KEY>"
export EMBED_BASE_URL="http://127.0.0.1:8756/v1"
export EMBEDDING_MODEL="Qwen/Qwen3-Embedding-8B"

# Optional: Reranker service (for semantic retrieval)
export RERANK_API_KEY="<YOUR_RERANK_KEY>"
export RERANK_BASE_URL="https://api.siliconflow.cn/v1"
export RERANK_MODEL="Qwen/Qwen3-Reranker-8B"

1) Start PaddleOCRVL server for PDF OCR

We use the official PaddleOCRVL Docker image published by PaddlePaddle, and it based on VLLM. The launcher is provided in Code/paddleocr.sh. Run:

bash Code/paddleocr.sh

By default it exposes http://127.0.0.1:8956/v1, and our PDF parsing code will call this port by default.

2) PDF -> Corpus (Structure-Aware)

python Code/parser_pdf.py \
  --input /path/to/your.pdf \
  --output /path/to/output_dir

Optional embeddings (requires an embedding API server):

python Code/parser_pdf.py \
  --input /path/to/your.pdf \
  --output /path/to/output_dir \
  --build-embeddings \
  --embedding-model Qwen/Qwen3-Embedding-8B \
  --embed-base-url http://127.0.0.1:8756/v1 \
  --embed-api-key <YOUR_KEY>

This produces:

  • *_corpus.json (structured nodes)
  • *_emb.npy + *_idmap.json (optional vector store)

3) Ask Questions with DeepRead

Run:

python Code/DeepRead.py \
  --doc /path/to/output_dir/your_corpus.json \
  --question "What is your question?" \
  --enable-semantic \
  --neighbor-window 1,-1 \
  --log run_log.jsonl

4) Recommended Retrieval Modes

Choose a retrieval mode based on your service availability:

  • No Embedding API: use BM25 (available by default, no extra flags)
    python Code/DeepRead.py --doc /path/to/your_corpus.json --question "..." --log run_log.jsonl
  • Embedding API only (no reranker): use Vector retrieval
    python Code/DeepRead.py \
      --doc /path/to/your_corpus.json \
      --question "..." \
      --enable-vector \
      --disable-bm25 \
      --disable-regex \
      --log run_log.jsonl
  • Embedding + Reranker: use Semantic retrieval (vector recall + rerank)
    python Code/DeepRead.py \
      --doc /path/to/your_corpus.json \
      --question "..." \
      --enable-semantic \
      --log run_log.jsonl

Demo

Demo 1: TradingAgent (multimodal + embeddings)

python Code/DeepRead.py \
  --doc "Demo/TradingAgent/TradingAgent_corpus.json" \
  --question "Which roles are included in the overall TradingAgents framework?" \
  --enable-semantic \
  --enable-multimodal \
  --log demo_trading.jsonl

Demo 2: 金山办公23年年报

python Code/DeepRead.py \
  --doc "Demo/金山办公2023年报/11724-金山办公:金山办公2023年年度报告_corpus.json" \
  --question "公司有哪些累计投入金额超过一亿元的在研项目?" \
  --enable-semantic \
  --neighbor-window 0,0 \
  --log demo_xx.jsonl

Full Usage

DeepRead.py

All options:

python Code/DeepRead.py --help

Common flags:

  • Input/basics: --doc, --question, --log, --max_rounds, --temperature
  • Retrieval toggles: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
  • Semantic retrieval: --semantic-stage1 (vector/bm25/hybrid), --semantic-topk1, --semantic-topk2
  • Neighbor window: --neighbor-window up,down
  • Multimodal: --enable-multimodal
  • Embedding: --embedding-model, --embed-base-url, --embed-api-key
  • Rerank: --rerank-api-key, --rerank-base-url, --rerank-model

parser_pdf.py

All options:

python Code/parser_pdf.py --help

Common flags:

  • Input/output: --input (PDF), --output
  • OCR server: --paddle-vl-rec-backend, --paddle-vl-rec-server-url
  • Embedding: --build-embeddings, --embedding-model, --embedding-batch-size, --embed-base-url, --embed-api-key

Configuration Reference

DeepRead reads from environment variables and CLI flags:

  • LLM: OPENROUTER_API_KEY, OPENROUTER_BASE_URL, OPENROUTER_MODEL
  • Embedding: EMBED_API_KEY, EMBED_BASE_URL, EMBEDDING_MODEL
  • Rerank (optional): RERANK_API_KEY, RERANK_BASE_URL, RERANK_MODEL
  • Retrieval: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
  • Neighbor window: --neighbor-window up,down (e.g. 1,-1, 0,0 disables)

Notes

  • parser_pdf.py currently accepts PDF only.
  • OCR requires paddleocr and PaddleOCRVL (or run the provided Docker server).
  • tiktoken is optional; if missing, token counting falls back to a simple tokenizer.

Related Work

Related Outstanding Work: PaddleOCR, PageIndex

Citation

If DeepRead is helpful for you, please cite us.

@article{li2026deepread,
  title={DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search},
  author={Li, Zhanli and Tian, Huiwen and Luo, Lvzhou and Cao, Yixuan and Luo, Ping},
  journal={arXiv preprint arXiv:2602.05014},
  year={2026}
}

License

See LICENSE.

About

[DeepRead] This is the official implementation of the DeepRead paper.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors