PaperStructure

PaperStructure is a lightweight CLI tool designed to transform academic papers into clean, structured Markdown. By leveraging ONNX models, it delivers high-performance inference optimized for standard laptops. It is a reliable companion for formula-heavy research, though users may currently observe lower accuracy in table recognition.

Features

Layout Detection -- YOLOX detects titles, sections, paragraphs, formulas, tables, figures
Text Recognition -- PP-OCRv5 ONNX pipeline
Formula Recognition -- Encoder-decoder LaTeX OCR
Markdown Export -- clean, readable markdown output
Parallel Processing -- multi-threaded PDF page processing

Demo

PDF	Markdown

Installation

pip install paper-structure

This registers the paper-structure CLI and installs the Python package.

CLI Usage

# Process a PDF (full pipeline: layout + OCR + formula)
paper-structure process paper.pdf -o output.md

# Shorthand:
paper-structure paper.pdf -o output.md

# OCR an image (text recognition, no layout detection)
paper-structure process photo.png -o output.txt

# Recognize a formula image as LaTeX
paper-structure process formula.png --formula

# PDF options
paper-structure process paper.pdf --max-pages 5 -v --save-images

# Generate annotated preview PDF with bounding boxes
paper-structure preview paper.pdf -o preview.pdf

# Manage models
paper-structure models status
paper-structure models download

Python API

PDF processing (full pipeline)

from paper_structure import PaperStructurePipeline

pipeline = PaperStructurePipeline()
result = pipeline.process_pdf("paper.pdf")
print(result["markdown"])
pipeline.save_markdown(result, "output.md")

Image OCR

from paper_structure import OCR

ocr = OCR()

# Text recognition (default)
print(ocr("table.png"))

# LaTeX formula recognition
print(ocr("formula.png", formula=True))

Model Management

from paper_structure.models import registry

registry.ensure_all()       # pre-download everything
print(registry.status())    # show cache status

Models

The tool automatically downloads models on its first call. All model weights are hosted at hpllduck/PaperStructure (~399 MB total) and cached locally via huggingface_hub.

Group	Files	Description
`latex_ocr`	encoder, decoder, image_resizer, tokenizer	RapidLaTeXOCR formula recognition
`yolox`	yolox_l0.05.onnx	YOLOX-L document layout detection
`paddle_ocr`	det, cls, rec, dictionary	PP-OCRv5 text detection/recognition

License

Apache License 2.0. Individual model weights retain their original licenses (MIT for LaTeX OCR, Apache-2.0 for YOLOX and PaddleOCR).

Acknowledgments

PaddleOCR -- text recognition
OnnxOCR -- ONNX OCR pipeline
RapidLaTeXOCR -- formula recognition
unstructured-inference -- layout detection

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
demos		demos
src/paper_structure		src/paper_structure
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PaperStructure

Features

Demo

Installation

CLI Usage

Python API

PDF processing (full pipeline)

Image OCR

Model Management

Models

License

Acknowledgments

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

License

yuanjua/PaperStructure

Folders and files

Latest commit

History

Repository files navigation

PaperStructure

Features

Demo

Installation

CLI Usage

Python API

PDF processing (full pipeline)

Image OCR

Model Management

Models

License

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages