Convert scientific publications in PDF to structured Markdown via only lightweight ONNX OCR models

yuanjua/PaperStructure

PaperStructure

PaperStructure is a lightweight CLI tool that converts academic papers in PDF into clean, structured Markdown. It runs its layout, text, and formula models through ONNX, which keeps inference fast enough for a standard laptop. It works well on formula-heavy papers; table recognition is currently less accurate.

Features

  • Layout Detection -- YOLOX detects titles, sections, paragraphs, formulas, tables, figures
  • Text Recognition -- PP-OCRv5 ONNX pipeline
  • Formula Recognition -- Encoder-decoder LaTeX OCR
  • Markdown Export -- clean, readable Markdown output
  • Parallel Processing -- multi-threaded PDF page processing
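
As a rough illustration of the Parallel Processing feature, page-level work can be fanned out to a thread pool while the results stay in page order. This is a minimal sketch under stated assumptions, not PaperStructure's actual implementation: `process_page` is a hypothetical stand-in for the per-page layout and OCR work, and `process_pages` is an illustrative name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def process_page(page_number: int) -> str:
    """Hypothetical stand-in for per-page layout detection + OCR."""
    return f"<!-- page {page_number} -->"


def process_pages(
    page_numbers: Sequence[int],
    worker: Callable[[int], str] = process_page,
    max_workers: int = 4,
) -> List[str]:
    """Run `worker` on every page concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs,
        # even when individual pages finish out of order.
        return list(pool.map(worker, page_numbers))


# Join the per-page fragments into one document.
markdown = "\n\n".join(process_pages(range(1, 4)))
print(markdown)
```

Collecting results through `map` rather than `as_completed` means no separate sorting step is needed before the pages are joined.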

Demo

[Screenshots: a source PDF page and the Markdown generated from it]

Installation

pip install paper-structure

This registers the paper-structure CLI and installs the Python package.

CLI Usage

# Process a PDF (full pipeline: layout + OCR + formula)
paper-structure process paper.pdf -o output.md

# Shorthand:
paper-structure paper.pdf -o output.md

# OCR an image (text recognition, no layout detection)
paper-structure process photo.png -o output.txt

# Recognize a formula image as LaTeX
paper-structure process formula.png --formula

# PDF options
paper-structure process paper.pdf --max-pages 5 -v --save-images

# Generate annotated preview PDF with bounding boxes
paper-structure preview paper.pdf -o preview.pdf

# Manage models
paper-structure models status
paper-structure models download

Python API

PDF processing (full pipeline)

from paper_structure import PaperStructurePipeline

pipeline = PaperStructurePipeline()
result = pipeline.process_pdf("paper.pdf")
print(result["markdown"])
pipeline.save_markdown(result, "output.md")
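
The same two calls extend naturally to a folder of papers. The sketch below relies only on `process_pdf` and `save_markdown` as used above; `convert_folder` is a hypothetical helper, and it assumes one pipeline instance can be reused across several PDFs and that both methods accept string paths.

```python
from pathlib import Path
from typing import List


def convert_folder(pipeline, src_dir: str, out_dir: str) -> List[Path]:
    """Convert every *.pdf in src_dir to a same-named .md file in out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".md")
        result = pipeline.process_pdf(str(pdf))
        pipeline.save_markdown(result, str(target))
        written.append(target)
    return written


# Usage (requires the package and its models):
#   from paper_structure import PaperStructurePipeline
#   convert_folder(PaperStructurePipeline(), "papers", "markdown")
```

Taking the pipeline as a parameter keeps model loading out of the loop and lets the helper be exercised with a stub object.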

Image OCR

from paper_structure import OCR

ocr = OCR()

# Text recognition (default)
print(ocr("table.png"))

# LaTeX formula recognition
print(ocr("formula.png", formula=True))
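
When a recognized formula is pasted into a Markdown file, it usually needs display-math delimiters. The helper below is a sketch: `as_display_math` is a hypothetical name, and it assumes the formula call returns a plain LaTeX string. Since it is not stated here whether that string already carries delimiters, the helper strips one enclosing pair if present.

```python
def as_display_math(latex: str) -> str:
    """Wrap a LaTeX string in $$ ... $$, dropping one enclosing delimiter pair if present."""
    body = latex.strip()
    for left, right in (("$$", "$$"), (r"\[", r"\]"), ("$", "$")):
        long_enough = len(body) >= len(left) + len(right)
        if long_enough and body.startswith(left) and body.endswith(right):
            body = body[len(left):len(body) - len(right)].strip()
            break
    return "$$\n" + body + "\n$$"


print(as_display_math(r"E = mc^2"))

# Usage with the OCR object above (assumes a string return value):
#   block = as_display_math(ocr("formula.png", formula=True))
```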

Model Management

from paper_structure.models import registry

registry.ensure_all()       # pre-download everything
print(registry.status())    # show cache status

Models

Models are downloaded automatically the first time they are needed. All model weights are hosted at hpllduck/PaperStructure (~399 MB total) and cached locally via huggingface_hub.

| Group | Files | Description |
| --- | --- | --- |
| latex_ocr | encoder, decoder, image_resizer, tokenizer | RapidLaTeXOCR formula recognition |
| yolox | yolox_l0.05.onnx | YOLOX-L document layout detection |
| paddle_ocr | det, cls, rec, dictionary | PP-OCRv5 text detection/recognition |
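
To see where the cached weights are likely to land on disk, the sketch below mirrors how huggingface_hub resolves its default cache directory from environment variables. It assumes PaperStructure uses that default location rather than a custom cache directory, which is not confirmed here; `default_hf_hub_cache` is a hypothetical helper name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_hf_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the huggingface_hub cache directory from environment variables."""
    env = os.environ if env is None else env
    # An explicit cache path takes precedence over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache lives in the "hub" folder under HF_HOME.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Fall back to the XDG cache root, or ~/.cache when it is unset.
    cache_root = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(cache_root).expanduser() / "huggingface" / "hub"


print(default_hf_hub_cache())
```

Under the same assumption, setting `HF_HOME` before the first run would be one way to place the download on a different drive.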

License

Apache License 2.0. Individual model weights retain their original licenses (MIT for LaTeX OCR, Apache-2.0 for YOLOX and PaddleOCR).
