
JMW95/pyPDFStructure


pyPDFStructure

Python library to parse Tagged PDFs and extract document structure and text.

It extracts the usually hidden structural information that Tagged PDFs (introduced in PDF 1.4) store for accessibility.

This information makes it much easier to automatically read tables and other structured content from a PDF document.
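Before parsing, it can help to check roughly whether a file carries this structural information at all. The sketch below is only a heuristic and is not part of pyPDFStructure: the function name `looks_tagged` is hypothetical, and the check relies on the fact that a Tagged PDF references its structure tree through a `/StructTreeRoot` entry in the document catalog. It can report a false negative when the catalog is stored inside a compressed object stream.

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention a structure tree root.

    Hypothetical helper for illustration; not part of pyPDFStructure.
    """
    # Tagged PDFs point to their structure tree via a /StructTreeRoot entry
    # in the document catalog. A plain byte search can miss it when the
    # catalog is stored inside a compressed object stream.
    return b"/StructTreeRoot" in pdf_bytes
```

A file that fails this check may still be tagged, so treat a negative result as a hint rather than a verdict.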

See the comments at the top of the library's source file for more usage information and details.

Example Usage:

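from pyPDFStructure import PDFDocument

# Read the whole PDF as bytes; the context manager closes the file.
with open("somedoc.pdf", "rb") as fin:
    doc = PDFDocument(fin.read())

# Fetch the document's structure tree.
tree = doc.get_structure_tree()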
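The same steps can be wrapped in a small reusable function. This is a minimal sketch: `load_structure_tree` is a hypothetical helper, not something the library provides, and it uses only the `PDFDocument` constructor and `get_structure_tree()` call shown in the example above.

```python
from pathlib import Path


def load_structure_tree(path):
    """Read a PDF from disk and return its structure tree.

    Hypothetical convenience wrapper; not part of pyPDFStructure.
    """
    # Imported inside the function so this sketch can be loaded even where
    # the library is not installed.
    from pyPDFStructure import PDFDocument

    # read_bytes() opens, reads, and closes the file; it raises
    # FileNotFoundError if the path does not exist.
    data = Path(path).read_bytes()
    return PDFDocument(data).get_structure_tree()
```

Calling `load_structure_tree("somedoc.pdf")` then gives the same result as the example above.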
