AbdulSametTurkmenoglu/ocr

PDF OCR Text Extraction Tool

A Python script that extracts text from PDF files using Optical Character Recognition (OCR) with Tesseract and pdf2image.

Overview

This tool converts PDF pages to images and then uses Tesseract OCR to extract text content. It's particularly useful for:

  • Scanned PDFs without a text layer
  • Image-based PDFs
  • Documents that need OCR processing

Features

  • Converts PDF pages to images
  • Performs OCR on each page using Tesseract
  • Extracts and combines text from all pages
  • Cleans extracted text (replaces null characters with spaces)
  • Configurable OCR settings

Prerequisites

  • Python 3.8+
  • Tesseract OCR installed on your system
  • Poppler (required by pdf2image)

Installation

1. Install Tesseract OCR

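Windows: download and run a Tesseract installer for Windows, and note the install path for the Configuration section.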

macOS:

brew install tesseract

Linux:

sudo apt-get install tesseract-ocr

2. Install Poppler

Windows: download a Poppler build for Windows, extract it, and add its bin directory to the system PATH.

macOS:

brew install poppler

Linux:

sudo apt-get install poppler-utils

3. Install Python packages

pip install pytesseract pdf2image Pillow

Configuration

By default pytesseract looks for tesseract on the system PATH. If Tesseract is installed somewhere else, set the path explicitly:

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

Common paths:

  • Windows: C:\Program Files\Tesseract-OCR\tesseract.exe
  • macOS: /usr/local/bin/tesseract (Homebrew on Intel) or /opt/homebrew/bin/tesseract (Homebrew on Apple Silicon)
  • Linux: /usr/bin/tesseract
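
Rather than hard-coding one path, the lookup can be automated. The following is a minimal sketch, assuming the common paths listed above; `find_tesseract` and `configure_pytesseract` are hypothetical helper names for illustration and are not part of pytesseract or of this repository's script.

```python
import os
import shutil

# Candidate locations taken from the list above; extend as needed.
COMMON_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/opt/homebrew/bin/tesseract",
    "/usr/local/bin/tesseract",
    "/usr/bin/tesseract",
]


def find_tesseract(candidates=COMMON_PATHS):
    """Return a path to the tesseract binary, or None if none is found."""
    # Prefer whatever is already on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the first candidate that exists on disk.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract():
    """Point pytesseract at the detected binary (hypothetical helper)."""
    import pytesseract  # imported lazily so the lookup works without it

    path = find_tesseract()
    if path is None:
        raise RuntimeError("Tesseract not found; install it or set tesseract_cmd manually")
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling `configure_pytesseract()` once at the top of the script would replace the hard-coded assignment; if detection fails, setting `tesseract_cmd` by hand as shown above still works.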

Usage

  1. Place your PDF file in the project directory
  2. Update the PDF filename in the code:
with open("your-file.pdf", "rb") as f:
  3. Run the script:
python ocr_pdf.py

Code Example

import pytesseract
from pdf2image import convert_from_bytes

# Point pytesseract at the Tesseract binary (adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Read PDF file as bytes
with open("your-file.pdf", "rb") as f:
    pdf_bytes = f.read()

# Convert PDF to images (one image per page)
images = convert_from_bytes(pdf_bytes)

# OCR configuration (for Tesseract)
custom_config = r'--oem 3 --psm 6'

# Perform OCR on each page and combine text
pdf_txt = '\n\n'.join(
    pytesseract.image_to_string(image, config=custom_config)
    for image in images
)

# Replace stray null characters with spaces
pdf_txt = pdf_txt.replace("\x00", " ")

# Print extracted text
print(pdf_txt)

OCR Configuration

The script uses Tesseract configuration: --oem 3 --psm 6

OEM (OCR Engine Mode)

  • 0: Legacy engine only
  • 1: Neural nets LSTM engine only
  • 2: Legacy + LSTM engines
  • 3: Default, based on what is available (recommended)

PSM (Page Segmentation Mode)

  • 0: Orientation and script detection (OSD) only
  • 1: Automatic page segmentation with OSD
  • 3: Fully automatic page segmentation, but no OSD (Tesseract's default)
  • 4: Assume a single column of text
  • 6: Assume a single uniform block of text (used by this script)
  • 7: Treat the image as a single text line
  • 11: Sparse text. Find as much text as possible

Change configuration as needed:

custom_config = r'--oem 3 --psm 3'  # For automatic page segmentation
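
To avoid typos when switching modes, the config string can be assembled by a small helper. This is a sketch only: `build_config` is a hypothetical name, and the accepted ranges (OEM 0 to 3, PSM 0 to 13) are assumed from Tesseract 4 and later.

```python
def build_config(oem=3, psm=6):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    # Ranges assumed from Tesseract 4+: OEM 0-3, PSM 0-13.
    if oem not in range(0, 4):
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    if psm not in range(0, 14):
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


# Same value as the line above, built through the helper.
custom_config = build_config(psm=3)
```

The result can be passed directly as the `config` argument of `pytesseract.image_to_string`.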

Output

The script prints the extracted text to the console with:

  • Double line breaks between pages (\n\n)
  • Cleaned text (null characters replaced with spaces)

Save to file:

with open("output.txt", "w", encoding="utf-8") as output_file:
    output_file.write(pdf_txt)
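
If one file per page is more convenient than a single combined file, something like the following could work. It is a sketch under assumptions: `save_pages`, the `ocr_output` directory, and the `page_001.txt` naming are all hypothetical, and `page_texts` is assumed to be a list holding one OCR result string per page.

```python
from pathlib import Path


def save_pages(page_texts, out_dir="ocr_output"):
    """Write each page's text to its own UTF-8 file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(page_texts, start=1):
        target = out / f"page_{number:03d}.txt"
        # Apply the same null-character cleanup as the main script.
        target.write_text(text.replace("\x00", " "), encoding="utf-8")
        written.append(target)
    return written
```

A per-page list can be produced with `[pytesseract.image_to_string(image, config=custom_config) for image in images]` in place of the joined string.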

Performance Considerations

  • Processing Time: Depends on PDF size and number of pages
  • Memory Usage: Large PDFs with many pages may consume significant memory
  • Image Quality: Higher DPI usually improves accuracy but increases processing time and memory use

Adjust DPI (optional):

images = convert_from_bytes(pdf_bytes, dpi=300)  # Default is 200
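
To get a feel for the memory cost of a DPI setting, the size of one rendered page can be estimated. This back-of-the-envelope sketch assumes A4 pages and uncompressed RGB images (3 bytes per pixel); `page_pixels` and `page_bytes` are hypothetical helpers, and real usage will vary with page size and image mode.

```python
A4_INCHES = (8.27, 11.69)  # assumed page size: A4 in inches


def page_pixels(dpi, size_inches=A4_INCHES):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    width_in, height_in = size_inches
    return round(width_in * dpi), round(height_in * dpi)


def page_bytes(dpi, channels=3, size_inches=A4_INCHES):
    """Estimate the uncompressed size in bytes of one rendered page."""
    width_px, height_px = page_pixels(dpi, size_inches)
    return width_px * height_px * channels


for dpi in (200, 300):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {page_bytes(dpi) / 1e6:.1f} MB per page")
```

Because both dimensions scale with DPI, the estimate grows with the square of the DPI, so going from 200 to 300 costs about 2.25 times the memory per page.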

Troubleshooting

Common Issues

  1. Tesseract not found

    • Verify installation path
    • Update tesseract_cmd path in code
  2. Poppler not found

    • Install Poppler
    • Add to system PATH
  3. Poor OCR accuracy

    • Increase DPI: convert_from_bytes(pdf_bytes, dpi=300)
    • Try different PSM modes
    • Ensure PDF has good image quality
  4. Memory errors

    • Process PDFs page by page instead of all at once
    • Reduce DPI setting
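
The page-by-page suggestion for memory errors can be sketched as follows, using the `first_page` and `last_page` arguments of `convert_from_bytes` together with pdf2image's `pdfinfo_from_bytes` to read the page count. `page_batches` and `ocr_in_batches` are hypothetical helper names, and the batch size of 5 is an arbitrary assumption.

```python
def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) ranges."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_bytes, batch_size=5, dpi=200, config="--oem 3 --psm 6"):
    """OCR a PDF a few pages at a time and return the combined text."""
    # Imported lazily so page_batches can be used without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_bytes, pdfinfo_from_bytes

    total_pages = pdfinfo_from_bytes(pdf_bytes)["Pages"]
    texts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only this batch of page images is held in memory at once.
        images = convert_from_bytes(
            pdf_bytes, dpi=dpi, first_page=first, last_page=last
        )
        texts.extend(pytesseract.image_to_string(img, config=config) for img in images)
    return "\n\n".join(texts).replace("\x00", " ")
```

Smaller batches lower peak memory at the cost of more Poppler invocations; a batch size of 1 gives strict page-by-page processing.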

Language Support

Tesseract supports 100+ languages. Install additional language packs:

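Windows: select the additional languages in the Tesseract installer, or copy the matching .traineddata file into the tessdata folder of the Tesseract installation.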

macOS/Linux:

# For Turkish on Debian/Ubuntu
sudo apt-get install tesseract-ocr-tur
# On macOS (Homebrew), installs all language packs
brew install tesseract-lang

Specify language in code:

pdf_txt = pytesseract.image_to_string(image, lang='tur', config=custom_config)
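
Tesseract also accepts several languages at once, joined with `+` (for example `tur+eng`). The sketch below builds that string and checks which packs are installed; `lang_arg` and `missing_languages` are hypothetical helpers, and the check relies on `pytesseract.get_languages`, which is assumed to be available in the installed pytesseract version.

```python
def lang_arg(codes):
    """Join language codes into the form Tesseract expects, e.g. 'tur+eng'."""
    codes = [c.strip() for c in codes if c.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def missing_languages(codes):
    """Return the requested codes that Tesseract does not have installed."""
    import pytesseract  # lazy import; requires Tesseract to be installed

    installed = set(pytesseract.get_languages(config=""))
    return [c for c in codes if c not in installed]
```

For a mixed Turkish and English document, `lang=lang_arg(["tur", "eng"])` can be passed to `image_to_string`.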

Advanced Usage

Process specific pages only

images = convert_from_bytes(pdf_bytes, first_page=1, last_page=5)

Get bounding boxes

data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

Get confidence scores

data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
confidences = data['conf']
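
The dictionary returned by `image_to_data` holds parallel lists, so bounding boxes and confidences can be combined per word. A minimal sketch of filtering out low-confidence words follows; `confident_words` and the threshold of 60 are assumptions for illustration, and `sample` is hand-made data in the shape of `Output.DICT`, not real OCR output.

```python
def confident_words(data, min_conf=60.0):
    """Return (text, confidence, (left, top, width, height)) per confident word."""
    results = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # may arrive as int, float, or str
        # Non-word rows (blocks, lines) carry a confidence of -1 and empty text.
        if text.strip() and conf >= min_conf:
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            results.append((text, conf, box))
    return results


# Hand-made sample in the shape of Output.DICT, for illustration only.
sample = {
    "text": ["", "Hello", "w0rld"],
    "conf": ["-1", 96, 41.5],
    "left": [0, 10, 60],
    "top": [0, 12, 12],
    "width": [200, 45, 48],
    "height": [40, 14, 14],
}
print(confident_words(sample))  # [('Hello', 96.0, (10, 12, 45, 14))]
```

With real data, pass the `data` dictionary from the call above instead of `sample`.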
