DeepRead is a document-structure-aware RAG Agent. This repo already includes the core parsing, indexing, retrieval, and agent runtime used in the demo.
- 2026.3.16 🔥 DeepRead has been featured in New Intelligence (新智元)!
- Code/DeepRead.py: agent runtime + retrieval + tool calls.
- Code/parser_pdf.py: PDF -> OCR (PaddleOCRVL) -> merged Markdown/JSON -> corpus; optional embeddings.
- Code/paddleocr.sh: Docker-based PaddleOCRVL vLLM server runner.
- Demo/TradingAgent/: demo corpus + embeddings (with images).
- Demo/金山办公2023年报/: demo corpus + embeddings.
Set these before running DeepRead.py (adjust to your provider):
# LLM
export OPENROUTER_API_KEY="<YOUR_OPENROUTER_KEY>"
export OPENROUTER_BASE_URL="https://api.openai.com/v1"
export OPENROUTER_MODEL="gpt-4o"
# Optional: Embedding service
export EMBED_API_KEY="<YOUR_EMBEDDING_KEY>"
export EMBED_BASE_URL="http://127.0.0.1:8756/v1"
export EMBEDDING_MODEL="Qwen/Qwen3-Embedding-8B"
# Optional: Reranker service (for semantic retrieval)
export RERANK_API_KEY="<YOUR_RERANK_KEY>"
export RERANK_BASE_URL="https://api.siliconflow.cn/v1"
export RERANK_MODEL="Qwen/Qwen3-Reranker-8B"
We use the official PaddleOCRVL Docker image published by PaddlePaddle, which is based on vLLM. The launcher is provided in Code/paddleocr.sh. Run:
bash Code/paddleocr.sh
By default it exposes http://127.0.0.1:8956/v1, and the PDF parsing code calls this endpoint by default.
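Before parsing, it can help to confirm that the OCR server is actually listening. The following is a minimal sketch, not part of this repo: `server_reachable` is a hypothetical helper that only checks that a TCP connection to the URL's host and port succeeds. It does not verify that the service behind the port is PaddleOCRVL.

```python
import socket
from urllib.parse import urlparse


def server_reachable(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(base_url)
    host = parsed.hostname or "127.0.0.1"
    # Fall back to the scheme's usual port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


if __name__ == "__main__":
    url = "http://127.0.0.1:8956/v1"  # default OCR endpoint stated above
    print(f"{url} reachable: {server_reachable(url)}")
```

The same check can be pointed at the embedding endpoint (for example http://127.0.0.1:8756/v1) before building embeddings.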
python Code/parser_pdf.py \
--input /path/to/your.pdf \
--output /path/to/output_dir
Optional embeddings (requires an embedding API server):
python Code/parser_pdf.py \
--input /path/to/your.pdf \
--output /path/to/output_dir \
--build-embeddings \
--embedding-model Qwen/Qwen3-Embedding-8B \
--embed-base-url http://127.0.0.1:8756/v1 \
--embed-api-key <YOUR_KEY>
This produces:
- *_corpus.json (structured nodes)
- *_emb.npy + *_idmap.json (optional vector store)
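To check what a parsing run left behind, the output directory can be scanned for the filename suffixes listed above. This is a rough sketch: `find_artifacts` is a hypothetical helper, and it only looks at file names, not at the contents or schema of the files.

```python
from pathlib import Path


def find_artifacts(output_dir: str) -> dict:
    """Locate parser outputs in output_dir by the documented filename suffixes."""
    root = Path(output_dir)
    found = {
        "corpus": sorted(root.glob("*_corpus.json")),
        "embeddings": sorted(root.glob("*_emb.npy")),
        "idmap": sorted(root.glob("*_idmap.json")),
    }
    # The optional vector store needs both the .npy matrix and its id map.
    found["has_vector_store"] = bool(found["embeddings"]) and bool(found["idmap"])
    return found


if __name__ == "__main__":
    result = find_artifacts("/path/to/output_dir")
    for name, value in result.items():
        print(name, value)
```

If `has_vector_store` is false, the BM25 mode is the one that needs no extra files.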
Run:
python Code/DeepRead.py \
--doc /path/to/output_dir/your_corpus.json \
--question "What is your question?" \
--enable-semantic \
--neighbor-window 1,-1 \
--log run_log.jsonl
Choose a retrieval mode based on your service availability:
- No Embedding API: use BM25 (available by default, no extra flags)
python Code/DeepRead.py --doc /path/to/your_corpus.json --question "..." --log run_log.jsonl
- Embedding API only (no reranker): use Vector retrieval
python Code/DeepRead.py \
--doc /path/to/your_corpus.json \
--question "..." \
--enable-vector \
--disable-bm25 \
--disable-regex \
--log run_log.jsonl
- Embedding + Reranker: use Semantic retrieval (vector recall + rerank)
python Code/DeepRead.py \
--doc /path/to/your_corpus.json \
--question "..." \
--enable-semantic \
--log run_log.jsonl
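The three cases above can be expressed as a small decision function. This is an illustrative sketch, not code from the repo: `retrieval_flags` and `build_command` are hypothetical names, and the handling of "reranker but no embedding API" (falling back to BM25) is an assumption, since that combination is not covered above.

```python
from typing import List


def retrieval_flags(has_embedding_api: bool, has_reranker: bool) -> List[str]:
    """Map service availability to the DeepRead.py flags shown above."""
    if has_embedding_api and has_reranker:
        # Semantic retrieval: vector recall followed by reranking.
        return ["--enable-semantic"]
    if has_embedding_api:
        # Vector retrieval only, with the lexical tools switched off.
        return ["--enable-vector", "--disable-bm25", "--disable-regex"]
    # Assumption: without an embedding API we fall back to BM25,
    # which is available by default and needs no extra flags.
    return []


def build_command(corpus_path: str, question: str,
                  has_embedding_api: bool, has_reranker: bool,
                  log_path: str = "run_log.jsonl") -> List[str]:
    """Assemble an argv list suitable for subprocess.run."""
    return [
        "python", "Code/DeepRead.py",
        "--doc", corpus_path,
        "--question", question,
        *retrieval_flags(has_embedding_api, has_reranker),
        "--log", log_path,
    ]


if __name__ == "__main__":
    print(build_command("/path/to/your_corpus.json", "...", True, False))
```

Building the command as a list rather than a shell string avoids quoting problems when the question contains spaces or non-ASCII text.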
python Code/DeepRead.py \
--doc "Demo/TradingAgent/TradingAgent_corpus.json" \
--question "Which roles are included in the overall TradingAgents framework?" \
--enable-semantic \
--enable-multimodal \
--log demo_trading.jsonl
python Code/DeepRead.py \
--doc "Demo/金山办公2023年报/11724-金山办公:金山办公2023年年度报告_corpus.json" \
--question "公司有哪些累计投入金额超过一亿元的在研项目?" \
--enable-semantic \
--neighbor-window 0,0 \
--log demo_xx.jsonl
All options:
python Code/DeepRead.py --help
Common flags:
- Input/basics: --doc, --question, --log, --max_rounds, --temperature
- Retrieval toggles: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
- Semantic retrieval: --semantic-stage1 (vector/bm25/hybrid), --semantic-topk1, --semantic-topk2
- Neighbor window: --neighbor-window up,down
- Multimodal: --enable-multimodal
- Embedding: --embedding-model, --embed-base-url, --embed-api-key
- Rerank: --rerank-api-key, --rerank-base-url, --rerank-model
All options:
python Code/parser_pdf.py --help
Common flags:
- Input/output: --input (PDF), --output
- OCR server: --paddle-vl-rec-backend, --paddle-vl-rec-server-url
- Embedding: --build-embeddings, --embedding-model, --embedding-batch-size, --embed-base-url, --embed-api-key
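For a folder of PDFs, the parser command can be generated once per file. The sketch below is hypothetical and only uses flags listed above. Writing each document to its own subdirectory named after the PDF is an assumption made for illustration, not a requirement of parser_pdf.py.

```python
from pathlib import Path
from typing import List, Optional


def parser_command(pdf: Path, out_dir: Path,
                   embed_base_url: Optional[str] = None,
                   embed_api_key: Optional[str] = None,
                   embedding_model: str = "Qwen/Qwen3-Embedding-8B") -> List[str]:
    """Build the argv list for one parser_pdf.py run."""
    cmd = ["python", "Code/parser_pdf.py",
           "--input", str(pdf), "--output", str(out_dir)]
    if embed_base_url:
        # Embeddings are optional and need a reachable embedding server.
        cmd += ["--build-embeddings",
                "--embedding-model", embedding_model,
                "--embed-base-url", embed_base_url]
        if embed_api_key:
            cmd += ["--embed-api-key", embed_api_key]
    return cmd


def batch_commands(pdf_dir: str, out_root: str,
                   embed_base_url: Optional[str] = None) -> List[List[str]]:
    """One command per PDF; other file types are skipped (PDF-only parser)."""
    pdfs = sorted(p for p in Path(pdf_dir).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")
    # Assumption: one output subdirectory per document, named after the file.
    return [parser_command(p, Path(out_root) / p.stem, embed_base_url)
            for p in pdfs]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root.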
DeepRead reads from environment variables and CLI flags:
- LLM: OPENROUTER_API_KEY, OPENROUTER_BASE_URL, OPENROUTER_MODEL
- Embedding: EMBED_API_KEY, EMBED_BASE_URL, EMBEDDING_MODEL
- Rerank (optional): RERANK_API_KEY, RERANK_BASE_URL, RERANK_MODEL
- Retrieval: --enable-vector, --enable-hybrid, --enable-semantic, --disable-bm25, --disable-regex, --disable-read
- Neighbor window: --neighbor-window up,down (e.g. 1,-1; 0,0 disables the window)
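One common way to combine environment variables with CLI flags is to let an explicit flag win and fall back to the environment otherwise. The sketch below illustrates that pattern only. The precedence order and the flag-to-variable pairing are assumptions inferred from the names above, and `resolve` and `embedding_settings` are hypothetical helpers, not functions from DeepRead.py.

```python
import os
from typing import Dict, Mapping, Optional


def resolve(cli_value: Optional[str], env_name: str,
            env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer a CLI value; otherwise read the environment (assumed precedence)."""
    if cli_value:
        return cli_value
    # Treat an empty variable the same as an unset one.
    return env.get(env_name) or None


def embedding_settings(cli: Mapping[str, Optional[str]],
                       env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Resolve the embedding settings from flags, then env vars.

    Assumed pairing: --embed-api-key / EMBED_API_KEY,
    --embed-base-url / EMBED_BASE_URL, --embedding-model / EMBEDDING_MODEL.
    """
    return {
        "api_key": resolve(cli.get("embed_api_key"), "EMBED_API_KEY", env),
        "base_url": resolve(cli.get("embed_base_url"), "EMBED_BASE_URL", env),
        "model": resolve(cli.get("embedding_model"), "EMBEDDING_MODEL", env),
    }


if __name__ == "__main__":
    # Only the model is given on the command line in this example.
    print(embedding_settings({"embedding_model": "Qwen/Qwen3-Embedding-8B"}))
```

The same helper would cover the rerank settings by swapping in the RERANK_* names.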
- parser_pdf.py currently accepts PDF only.
- OCR requires paddleocr and PaddleOCRVL (or run the provided Docker server).
- tiktoken is optional; if missing, token counting falls back to a simple tokenizer.
Related Outstanding Work: PaddleOCR, PageIndex
If DeepRead is helpful to you, please cite us.
@article{li2026deepread,
title={DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search},
author={Li, Zhanli and Tian, Huiwen and Luo, Lvzhou and Cao, Yixuan and Luo, Ping},
journal={arXiv preprint arXiv:2602.05014},
year={2026}
}
See LICENSE.
