
This project is intended to fix PDF files so that they are PDF/UA compliant. It does not yet achieve 100% compliance.

It processes the PDFs in an input folder by performing the following actions on each file:

Renders each page as a compressed JPEG image to reduce file size.
Performs Optical Character Recognition (OCR) on the image to extract text.
Creates a new PDF with the compressed image and an invisible text layer on top.
Sets basic document metadata (Title, Author, Language).
Adds a compliant XMP metadata stream and a PDF/UA identifier to improve accessibility and standards compliance.
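As an illustration of the last step, here is a minimal sketch of how an XMP packet carrying the PDF/UA identifier could be built with only the Python standard library. It is not taken from the scripts in this repository. The function name `build_xmp_packet` is hypothetical, and the value `1` for `pdfuaid:part` assumes the target is PDF/UA-1.

```python
import xml.etree.ElementTree as ET

# Namespaces used in the packet. "pdfuaid" is the PDF/UA identification schema.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfuaid": "http://www.aiim.org/pdfua/ns/id/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, name):
    """Return the namespace-qualified tag name that ElementTree expects."""
    return f"{{{NS[prefix]}}}{name}"


def build_xmp_packet(title, author, language):
    """Build an XMP packet string (hypothetical helper, PDF/UA-1 assumed)."""
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)

    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # PDF/UA identifier: part 1 of the standard (assumption: PDF/UA-1).
    ET.SubElement(desc, q("pdfuaid", "part")).text = "1"

    # dc:title is a language alternative with an x-default entry.
    title_el = ET.SubElement(desc, q("dc", "title"))
    alt = ET.SubElement(title_el, q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list of authors.
    creator_el = ET.SubElement(desc, q("dc", "creator"))
    seq = ET.SubElement(creator_el, q("rdf", "Seq"))
    ET.SubElement(seq, q("rdf", "li")).text = author

    # dc:language is an unordered bag of language codes.
    lang_el = ET.SubElement(desc, q("dc", "language"))
    bag = ET.SubElement(lang_el, q("rdf", "Bag"))
    ET.SubElement(bag, q("rdf", "li")).text = language

    body = ET.tostring(xmpmeta, encoding="unicode")
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        f"{body}\n"
        '<?xpacket end="w"?>'
    )


if __name__ == "__main__":
    print(build_xmp_packet("Annual Report", "Example Author", "en-US"))
```

The resulting string would still have to be written into the PDF as the document's metadata stream by a PDF library. That step depends on the library used and is left out here.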

There is more to do to fix the remaining PDF/UA compliance issues. This is a work in progress and will be updated as needed. The script is not perfect and may not work for all PDFs; random manual reviews are needed at this stage.
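One possible way to organise the random manual reviews is to draw a small sample of converted files for a person to check. The sketch below is only a suggestion and is not part of the repository's scripts. The function name `pick_for_review` and its parameters are hypothetical, and it assumes the converted files are `.pdf` files in a single output folder such as `output_pdfs`.

```python
import random
from pathlib import Path


def pick_for_review(output_dir, sample_size=5, seed=None):
    """Return a random sample of PDF paths from output_dir for manual review.

    Hypothetical helper: pass a seed to make the selection reproducible.
    """
    pdfs = sorted(Path(output_dir).glob("*.pdf"))
    rng = random.Random(seed)
    # Never ask for more files than exist in the folder.
    return rng.sample(pdfs, min(sample_size, len(pdfs)))


if __name__ == "__main__":
    for path in pick_for_review("output_pdfs", sample_size=3):
        print(f"Review manually: {path}")
```

Each selected file can then be opened in a PDF/UA checker or a screen reader to confirm that the text layer and metadata behave as expected.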

input_pdfs
output_pdfs
temp_images
.DS_Store
.gitignore
README.md
batch_pdf_ocr_converter.py
lym_PDFUA_final.py
ocr_and_compress.sh
ocr_image_compressed.py
ocr_image_jpx_compressed.py
ocr_image_text_combined.py
ocr_text_aligned_converter.py
pdfua1-ocr-compressed.py
pdfua2-ocr-compressed.py
pdfua3_ocr_compressed.py
pikepdf-test.py

yhan818/pdf_ua