PDF OCR Text Extraction Tool

A Python script that extracts text from PDF files using Optical Character Recognition (OCR) with Tesseract and pdf2image.

Overview

This tool converts PDF pages to images and then uses Tesseract OCR to extract text content. It's particularly useful for:

Scanned PDFs without a text layer
Image-based PDFs
Documents that need OCR processing

Features

Converts PDF pages to images
Performs OCR on each page using Tesseract
Extracts and combines text from all pages
Cleans extracted text (replaces null characters with spaces)
Configurable OCR settings

Prerequisites

Python 3.8+
Tesseract OCR installed on your system
Poppler (required by pdf2image)

Installation

1. Install Tesseract OCR

Windows:

Download from: https://github.com/UB-Mannheim/tesseract/wiki
Install to: C:\Program Files\Tesseract-OCR\ (or update the path in code)

macOS:

brew install tesseract

Linux (Debian/Ubuntu):

sudo apt-get install tesseract-ocr

2. Install Poppler

Windows:

Download from: https://github.com/oschwartz10612/poppler-windows/releases
Add its bin folder to PATH, or place it in the project directory and pass that folder to pdf2image via the poppler_path argument

macOS:

brew install poppler

Linux (Debian/Ubuntu):

sudo apt-get install poppler-utils

3. Install Python packages

pip install pytesseract pdf2image Pillow
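
After the three installation steps, a quick check can confirm that everything is reachable. The following is a minimal sketch and not part of the original script: `check_environment` is a hypothetical helper name, and it assumes Poppler can be detected through its `pdftoppm` binary on PATH.

```python
import importlib.util
import shutil


def check_environment():
    """Report which of the tool's dependencies can be found."""
    return {
        # System binaries: looked up on PATH only
        "tesseract": shutil.which("tesseract") is not None,
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
        # Python packages: Pillow is imported under the name "PIL"
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        "Pillow": importlib.util.find_spec("PIL") is not None,
    }


for name, found in check_environment().items():
    print(f"{name}: {'OK' if found else 'MISSING'}")
```

A binary that is installed outside PATH and configured explicitly (for example through `tesseract_cmd`) will still show as MISSING here, since the sketch only inspects PATH.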

Configuration

If the tesseract executable is not on your system PATH, point pytesseract at it explicitly (adjust the path to match your installation):

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

Common paths:

Windows: C:\Program Files\Tesseract-OCR\tesseract.exe
macOS: /usr/local/bin/tesseract (Intel) or /opt/homebrew/bin/tesseract (Apple Silicon)
Linux: /usr/bin/tesseract
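
Rather than hardcoding one path, the lookup can be automated. This is a rough sketch under stated assumptions: `find_tesseract` and `configure_pytesseract` are hypothetical helper names, and the fallback list simply repeats the common paths listed above.

```python
import os
import shutil

# Fallback locations copied from the "Common paths" list
COMMON_TESSERACT_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/usr/bin/tesseract",
]


def find_tesseract(candidates=COMMON_TESSERACT_PATHS):
    """Return a path to the tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")  # prefer whatever PATH resolves to
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract():
    """Set pytesseract's tesseract_cmd to the first executable found."""
    import pytesseract  # imported here so the lookup works without it

    path = find_tesseract()
    if path is None:
        raise FileNotFoundError(
            "Tesseract not found; install it or set tesseract_cmd manually"
        )
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once at the top of the script would replace the hardcoded assignment.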

Usage

1. Place your PDF file in the project directory.
2. Update the PDF filename in the code:

with open("your-file.pdf", "rb") as f:

3. Run the script:

python ocr.py

Code Example

import pytesseract
from pdf2image import convert_from_bytes

# Point pytesseract at the Tesseract executable (adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Read the PDF file as bytes
with open("your-file.pdf", "rb") as f:
    pdf_bytes = f.read()

# Convert the PDF to images, one per page
images = convert_from_bytes(pdf_bytes)

# Tesseract options: default engine (--oem 3), single uniform text block (--psm 6)
custom_config = r"--oem 3 --psm 6"

# Run OCR on each page and join the pages with a blank line
pdf_txt = "\n\n".join(
    pytesseract.image_to_string(image, config=custom_config)
    for image in images
)

# Replace stray null characters with spaces
pdf_txt = pdf_txt.replace("\x00", " ")

# Print the extracted text
print(pdf_txt)
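
The example above requires editing the filename in the source. One possible alternative is to take it from the command line. The sketch below is an assumption about how that could look, not the script's actual interface: `parse_args`, `main`, and the `--dpi` and `--psm` options are hypothetical, and `convert_from_path` is used in place of reading the bytes by hand.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options (hypothetical interface)."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF with Tesseract OCR."
    )
    parser.add_argument("pdf_path", help="path to the input PDF")
    parser.add_argument("--dpi", type=int, default=200,
                        help="rendering resolution for pdf2image")
    parser.add_argument("--psm", type=int, default=6,
                        help="Tesseract page segmentation mode")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)

    # Third-party imports are deferred so argument parsing works on its own
    import pytesseract
    from pdf2image import convert_from_path

    images = convert_from_path(args.pdf_path, dpi=args.dpi)
    config = f"--oem 3 --psm {args.psm}"
    text = "\n\n".join(
        pytesseract.image_to_string(image, config=config) for image in images
    )
    print(text.replace("\x00", " "))
```

With a call to `main()` added at the end of the script, it could then be run as `python ocr.py your-file.pdf --dpi 300`.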

OCR Configuration

The script uses Tesseract configuration: --oem 3 --psm 6

OEM (OCR Engine Mode)

0: Legacy engine only
1: Neural nets LSTM engine only
2: Legacy + LSTM engines
3: Default, based on what is available (recommended)

PSM (Page Segmentation Mode)

0: Orientation and script detection (OSD) only
1: Automatic page segmentation with OSD
3: Fully automatic page segmentation, no OSD (Tesseract's default)
4: Assume a single column of text
6: Assume a single uniform block of text (used by this script)
7: Treat the image as a single text line
11: Sparse text. Find as much text as possible

Change configuration as needed:

custom_config = r'--oem 3 --psm 3'  # For automatic page segmentation
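
Because the configuration is a plain string, a typo in a mode number only surfaces when Tesseract runs. A small builder can catch that earlier. This is an illustrative sketch: `build_config` is a hypothetical helper, and the accepted ranges (OEM 0 to 3, PSM 0 to 13) are the ones Tesseract documents.

```python
VALID_OEM = range(0, 4)    # 0..3
VALID_PSM = range(0, 14)   # 0..13


def build_config(oem=3, psm=6):
    """Return a Tesseract config string after validating both modes."""
    if oem not in VALID_OEM:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


print(build_config())        # the script's current settings
print(build_config(psm=3))   # fully automatic page segmentation
```

The result can be passed directly as the `config` argument of `pytesseract.image_to_string`.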

Output

The script prints the extracted text to the console with:

Double line breaks between pages (\n\n)
Cleaned text (null characters replaced with spaces)

Save to file:

with open("output.txt", "w", encoding="utf-8") as output_file:
    output_file.write(pdf_txt)
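
The joining, cleaning, and saving steps can also be kept in small functions so they are easy to reuse or test. This is a sketch only: `combine_pages` and `save_text` are hypothetical names, and the page texts stand in for what `pytesseract.image_to_string` would return for each page.

```python
from pathlib import Path


def combine_pages(page_texts, separator="\n\n"):
    """Replace null characters in each page and join the pages."""
    cleaned = (text.replace("\x00", " ") for text in page_texts)
    return separator.join(cleaned)


def save_text(text, path="output.txt"):
    """Write the combined text to a UTF-8 file."""
    Path(path).write_text(text, encoding="utf-8")


# Placeholder strings in place of real OCR output
combined = combine_pages(["first\x00page", "second page"])
print(repr(combined))
```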

Performance Considerations

Processing Time: Depends on PDF size and number of pages
Memory Usage: Large PDFs with many pages may consume significant memory
Image Quality: Higher DPI = better accuracy but slower processing

Adjust DPI (optional):

images = convert_from_bytes(pdf_bytes, dpi=300)  # Default is 200
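
To get a feel for the DPI trade-off, the size of one rendered page can be estimated with simple arithmetic. The numbers below are an approximation under assumptions that are not from the original: a US Letter page (8.5 x 11 inches) held in memory as an uncompressed 8-bit RGB image.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def page_bytes(width_in, height_in, dpi, channels=3):
    """Approximate uncompressed size in bytes (one byte per channel)."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * channels


for dpi in (200, 300):
    width_px, height_px = page_pixels(8.5, 11, dpi)
    megabytes = page_bytes(8.5, 11, dpi) / 1_000_000
    print(f"{dpi} DPI: {width_px}x{height_px} px, about {megabytes:.1f} MB")
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, so both memory use and OCR time per page can be expected to grow noticeably.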

Troubleshooting

Common Issues

Tesseract not found
- Verify installation path
- Update tesseract_cmd path in code
Poppler not found
- Install Poppler
- Add to system PATH
Poor OCR accuracy
- Increase DPI: convert_from_bytes(pdf_bytes, dpi=300)
- Try different PSM modes
- Ensure PDF has good image quality
Memory errors
- Process PDFs page by page instead of all at once
- Reduce DPI setting
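
For the memory errors case, page-by-page processing could look roughly like the sketch below. It is not the script's current behavior: `page_ranges` and `ocr_in_batches` are hypothetical helpers, and the page count is read with pdf2image's `pdfinfo_from_bytes`, so only one batch of images is held in memory at a time.

```python
def page_ranges(total_pages, batch_size=1):
    """Yield 1-based (first_page, last_page) pairs covering the document."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_bytes, batch_size=1, dpi=200, config="--oem 3 --psm 6"):
    """OCR a PDF a few pages at a time instead of rendering it all at once."""
    import pytesseract
    from pdf2image import convert_from_bytes, pdfinfo_from_bytes

    total_pages = int(pdfinfo_from_bytes(pdf_bytes)["Pages"])
    texts = []
    for first, last in page_ranges(total_pages, batch_size):
        # Render only this batch; earlier images can be garbage collected
        images = convert_from_bytes(
            pdf_bytes, dpi=dpi, first_page=first, last_page=last
        )
        texts.extend(
            pytesseract.image_to_string(image, config=config) for image in images
        )
    return "\n\n".join(texts).replace("\x00", " ")
```

A `batch_size` of 1 keeps memory lowest; larger batches trade memory for fewer Poppler invocations.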

Language Support

Tesseract supports 100+ languages. Install additional language packs:

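Windows:

Download the .traineddata file for your language from: https://github.com/tesseract-ocr/tessdata
Copy it into the tessdata folder of your Tesseract installation

macOS:

# Installs all language packs
brew install tesseract-lang

Linux (Debian/Ubuntu):

# For Turkish
sudo apt-get install tesseract-ocr-tur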

Specify language in code:

pdf_txt = pytesseract.image_to_string(image, lang='tur', config=custom_config)
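
Tesseract also accepts several languages at once, joined with `+` (for example `eng+tur`). The sketch below checks that the requested packs are installed before running OCR. The helper names are hypothetical; the installed list is read with `pytesseract.get_languages`, which assumes a reasonably recent pytesseract release.

```python
def lang_argument(codes):
    """Join language codes into the form Tesseract expects, e.g. 'eng+tur'."""
    return "+".join(codes)


def missing_languages(requested, installed):
    """Return the requested codes that are not installed, in request order."""
    installed_set = set(installed)
    return [code for code in requested if code not in installed_set]


def ocr_multilang(image, codes=("eng", "tur"), config="--oem 3 --psm 6"):
    """Run OCR with several languages, failing early if a pack is missing."""
    import pytesseract

    missing = missing_languages(codes, pytesseract.get_languages(config=""))
    if missing:
        raise RuntimeError(f"Missing Tesseract language packs: {missing}")
    return pytesseract.image_to_string(
        image, lang=lang_argument(codes), config=config
    )
```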

Advanced Usage

Process specific pages only

images = convert_from_bytes(pdf_bytes, first_page=1, last_page=5)

Get bounding boxes

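data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
boxes = list(zip(data['left'], data['top'], data['width'], data['height']))  # one (x, y, w, h) per entry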

Get confidence scores

data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
confidences = data['conf']
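
The bounding boxes and confidence scores come from the same dictionary, so they can be combined to keep only words Tesseract is reasonably sure about. This is a minimal sketch: `confident_words` and the threshold of 60 are illustrative choices, and the sample dictionary is made-up data in the shape `image_to_data` returns with `Output.DICT`.

```python
def confident_words(data, min_conf=60.0):
    """Return words at or above min_conf together with their bounding boxes."""
    words = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as int, float, or string; -1 marks non-word entries
        conf = float(data["conf"][i])
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])
            words.append({"text": text, "conf": conf, "box": box})
    return words


# Made-up sample, not real OCR output
sample = {
    "text": ["", "Hello", "wrld"],
    "conf": ["-1", 96, 41],
    "left": [0, 10, 80],
    "top": [0, 12, 12],
    "width": [200, 60, 50],
    "height": [40, 18, 18],
}
print(confident_words(sample))
```

With real data, `sample` would be replaced by the result of `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)`.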

This project builds on:

Tesseract OCR as the OCR engine
pdf2image for PDF-to-image conversion
pytesseract as the Python wrapper for Tesseract

Repository: AbdulSametTurkmenoglu/ocr
