jaPdfExtractor

Extract text from PDFs that use Japanese layout

Japanese text is read from inside the bounding boxes of annotations and output in top-to-bottom, right-to-left (T2B, R2L) order.
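As a rough illustration of the T2B, R2L order, the sketch below sorts glyphs inside one box into columns from right to left and reads each column from the top down. It is not taken from extract_text.py: the `Glyph` tuple, the `vertical_reading_order` function, and the `column_tolerance` parameter are hypothetical names, and the sketch assumes a coordinate system in which y grows downward.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x, y).
# Assumption: y grows downward, so a smaller y is closer to the top of the page.
Glyph = Tuple[str, float, float]


def vertical_reading_order(glyphs: List[Glyph], column_tolerance: float = 5.0) -> str:
    """Join glyphs top-to-bottom within a column, with columns right-to-left."""
    columns: List[List[Glyph]] = []
    # Visit glyphs from the rightmost x position to the leftmost.
    for glyph in sorted(glyphs, key=lambda g: -g[1]):
        # A glyph joins the current column if its x is close to that column's first glyph.
        if columns and abs(columns[-1][0][1] - glyph[1]) <= column_tolerance:
            columns[-1].append(glyph)
        else:
            columns.append([glyph])
    # Within each column, read from the top (smallest y) down.
    return "".join(
        ch for column in columns for ch, _, _ in sorted(column, key=lambda g: g[2])
    )


sample = [("本", 100.0, 20.0), ("日", 100.0, 10.0), ("語", 80.0, 10.0), ("で", 80.0, 20.0)]
print(vertical_reading_order(sample))
```

The tolerance-based grouping is a simplification; real pages with mixed font sizes would likely need a more careful column detector.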

To define the regions of the page that you wish to parse for Japanese text, add rectangular annotations to the PDF.

The script parses the annotations in the order they were added to the page, and then reuses those rectangles for all subsequent pages. For example, you can draw one set of rectangles on the title page and another set on the first body page; the second set is then used for all subsequent pages.
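The carry-forward behaviour described above can be sketched as follows. This is a minimal illustration rather than the script's actual code: `regions_per_page` and the `Rect` tuple are hypothetical names, and the sketch assumes that pages before the first annotated page have no regions.

```python
from typing import Dict, List, Tuple

# Hypothetical rectangle: (x0, y0, x1, y1).
Rect = Tuple[float, float, float, float]


def regions_per_page(annotations: Dict[int, List[Rect]], page_count: int) -> List[List[Rect]]:
    """Return the rectangles to use on each page.

    `annotations` maps a zero-based page index to that page's rectangles,
    listed in the order they were added.
    """
    current: List[Rect] = []  # assumption: no regions until a page defines some
    result: List[List[Rect]] = []
    for page in range(page_count):
        # A page with its own rectangles replaces the set carried forward.
        if annotations.get(page):
            current = annotations[page]
        result.append(list(current))
    return result


title_box = (50.0, 50.0, 500.0, 700.0)
body_right = (300.0, 40.0, 560.0, 780.0)
body_left = (40.0, 40.0, 290.0, 780.0)
plan = regions_per_page({0: [title_box], 1: [body_right, body_left]}, page_count=4)
for index, rects in enumerate(plan):
    print(index, rects)
```

In this example page 0 uses the title box, and pages 1 through 3 all use the two body rectangles, right one first.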

The script currently attempts to detect paragraph breaks and inserts newlines in the text output accordingly, but it does not detect furigana (ruby) characters.
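The README does not say how paragraph breaks are detected. One possible heuristic, shown here only as an assumption, is to treat a column that starts noticeably below the top of its box as an indented first line and to insert a newline before it. The names `join_columns`, `box_top`, and `indent` are hypothetical, and y is again assumed to grow downward.

```python
from typing import List, Tuple


def join_columns(columns: List[Tuple[float, str]], box_top: float, indent: float = 8.0) -> str:
    """Join column texts, starting a new line before each indented column.

    `columns` holds (y of the first glyph, column text) pairs in reading order.
    Assumed heuristic: a column whose first glyph sits at least `indent`
    units below `box_top` begins a new paragraph.
    """
    parts: List[str] = []
    for i, (top, text) in enumerate(columns):
        if i > 0 and top - box_top >= indent:
            parts.append("\n")
        parts.append(text)
    return "".join(parts)


cols = [(10.0, "吾輩は猫で"), (10.0, "ある。"), (22.0, "名前はまだ"), (10.0, "無い。")]
print(join_columns(cols, box_top=10.0))
```

A threshold based on the measured glyph height would probably be more robust than the fixed `indent` value used here.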

Usage

python extract_text.py input.pdf output.txt
