A Python script that extracts text from PDF files using Optical Character Recognition (OCR) with Tesseract and pdf2image.
This tool converts PDF pages to images and then uses Tesseract OCR to extract text content. It's particularly useful for:
- Scanned PDFs without text layer
- Image-based PDFs
- Documents that need OCR processing
Features:
- Converts PDF pages to images
- Performs OCR on each page using Tesseract
- Extracts and combines text from all pages
- Cleans extracted text (replaces null characters with spaces)
- Configurable OCR settings
Requirements:
- Python 3.8+
- Tesseract OCR installed on your system
- Poppler (required by pdf2image)
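To see which of these prerequisites are already present, a small check along the following lines may help. It is only a sketch: `check_prerequisites` is a hypothetical helper, it assumes the executables are reachable through PATH, and it uses `pdftoppm` (one of Poppler's command-line tools) as a stand-in for a working Poppler install.

```python
import importlib.util
import shutil


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True if it was found."""
    # Executables looked up on PATH; pdftoppm ships with Poppler.
    executables = ["tesseract", "pdftoppm"]
    # Import names of the Python packages (Pillow is imported as PIL).
    modules = ["pytesseract", "pdf2image", "PIL"]

    found = {name: shutil.which(name) is not None for name in executables}
    found.update(
        {name: importlib.util.find_spec(name) is not None for name in modules}
    )
    return found


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

A Tesseract install that is not on PATH will be reported as missing here even though it can still be used by setting `tesseract_cmd` explicitly.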
Install Tesseract:
Windows:
- Download from: https://github.com/UB-Mannheim/tesseract/wiki
- Install to: C:\Program Files\Tesseract-OCR\ (or update the path in code)
macOS:
brew install tesseract
Linux:
sudo apt-get install tesseract-ocr
Install Poppler:
Windows:
- Download from: https://github.com/oschwartz10612/poppler-windows/releases
- Add to PATH or place in project directory
macOS:
brew install poppler
Linux:
sudo apt-get install poppler-utils
Install the Python packages:
pip install pytesseract pdf2image Pillow
Update the Tesseract path if installed in a different location:
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
Common paths:
- Windows: C:\Program Files\Tesseract-OCR\tesseract.exe
- macOS: /usr/local/bin/tesseract (or /opt/homebrew/bin/tesseract)
- Linux: /usr/bin/tesseract
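Rather than hard-coding one of these paths, the script could look for the binary at run time. The sketch below is an assumption about how that might be done, not part of the original script: `find_tesseract` and `COMMON_PATHS` are hypothetical names, and the lookup checks PATH first, then the common locations listed above.

```python
import os
import shutil

# Candidate locations, copied from the common paths listed above.
COMMON_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
    "/usr/bin/tesseract",
]


def find_tesseract(candidates=COMMON_PATHS):
    """Return the first Tesseract path found, or None if there is none."""
    # Prefer whatever is already reachable through PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the known install locations.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it or set tesseract_cmd manually.")
    else:
        print(f"Tesseract found at {path}")
```

When a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` in place of the hard-coded Windows path.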
Usage:
- Place your PDF file in the project directory
- Update the PDF filename in the code:
  with open("your-file.pdf", "rb") as f:
- Run the script:
  python ocr_pdf.py
Code:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
from pdf2image import convert_from_bytes
# Read PDF file as bytes
with open("x.pdf", "rb") as f:
    pdf_bytes = f.read()
# Convert PDF to images (multiple pages possible)
images = convert_from_bytes(pdf_bytes)
# OCR configuration (for Tesseract)
custom_config = r'--oem 3 --psm 6'
# Perform OCR on each page and combine text
pdf_txt = '\n\n'.join(
    pytesseract.image_to_string(image, config=custom_config)
    for image in images
)
# Replace stray null characters with spaces
pdf_txt = pdf_txt.replace("\x00", " ")
# Print extracted text
print(pdf_txt)
The script uses Tesseract configuration: --oem 3 --psm 6
OCR Engine Mode (--oem):
- 0: Legacy engine only
- 1: Neural nets LSTM engine only
- 2: Legacy + LSTM engines
- 3: Default, based on what is available (recommended)
Page Segmentation Mode (--psm):
- 0: Orientation and script detection (OSD) only
- 1: Automatic page segmentation with OSD
- 3: Fully automatic page segmentation, no OSD (Tesseract's default)
- 4: Assume a single column of text
- 6: Assume a single uniform block of text (used by this script)
- 7: Treat the image as a single text line
- 11: Sparse text. Find as much text as possible
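Since the configuration is passed as a plain string, a typo in a mode number only surfaces when Tesseract runs. One possible way to catch it earlier is sketched below; `build_config` is a hypothetical helper, not part of pytesseract, and it accepts the full ranges Tesseract defines (OEM 0 to 3, PSM 0 to 13) rather than only the modes listed above.

```python
VALID_OEM = range(0, 4)   # OCR engine modes 0 through 3
VALID_PSM = range(0, 14)  # page segmentation modes 0 through 13


def build_config(oem=3, psm=6):
    """Build a Tesseract config string, rejecting out-of-range modes."""
    if oem not in VALID_OEM:
        raise ValueError(f"oem must be between 0 and 3, got {oem!r}")
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm!r}")
    return f"--oem {oem} --psm {psm}"


if __name__ == "__main__":
    print(build_config())        # --oem 3 --psm 6
    print(build_config(psm=3))   # --oem 3 --psm 3
```

The returned string can be passed as the `config` argument of `pytesseract.image_to_string`, the same way `custom_config` is used in the script.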
Change configuration as needed:
custom_config = r'--oem 3 --psm 3'  # For automatic page segmentation
The script outputs extracted text to console with:
- Double line breaks between pages (\n\n)
- Cleaned text (null characters replaced with spaces)
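The output format can be illustrated without running OCR at all. In the sketch below, `combine_pages` is a hypothetical helper that mirrors the joining and cleaning steps of the script, applied to plain strings standing in for per-page OCR results.

```python
def combine_pages(page_texts, separator="\n\n"):
    """Join per-page text and replace null characters with spaces."""
    cleaned = (text.replace("\x00", " ") for text in page_texts)
    return separator.join(cleaned)


if __name__ == "__main__":
    # Two fake "pages"; the first contains a stray null character.
    pages = ["first\x00page", "second page"]
    print(repr(combine_pages(pages)))  # 'first page\n\nsecond page'
```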
Save to file:
with open("output.txt", "w", encoding="utf-8") as output_file:
    output_file.write(pdf_txt)
Performance notes:
- Processing Time: Depends on PDF size and number of pages
- Memory Usage: Large PDFs with many pages may consume significant memory
- Image Quality: Higher DPI = better accuracy but slower processing
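To make the DPI and memory trade-off concrete, the size of one rendered page can be estimated from its dimensions. The sketch below is a rough estimate only: `rendered_page_bytes` is a hypothetical helper, and it assumes uncompressed 8-bit RGB images (3 bytes per pixel) and a US Letter page of 8.5 x 11 inches.

```python
def rendered_page_bytes(width_in, height_in, dpi, channels=3):
    """Approximate uncompressed size of one rendered page, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels


if __name__ == "__main__":
    # US Letter page (8.5 x 11 inches), 8-bit RGB.
    for dpi in (200, 300):
        size_mb = rendered_page_bytes(8.5, 11, dpi) / 1e6
        print(f"{dpi} DPI: about {size_mb:.1f} MB per page")
    # 200 DPI: about 11.2 MB per page
    # 300 DPI: about 25.2 MB per page
```

Under these assumptions, going from 200 to 300 DPI multiplies the pixel count by 2.25, so a 100-page PDF held fully in memory grows from roughly 1.1 GB to roughly 2.5 GB.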
Adjust DPI (optional):
images = convert_from_bytes(pdf_bytes, dpi=300)  # Default is 200
Troubleshooting:
- Tesseract not found
  - Verify installation path
  - Update the tesseract_cmd path in code
- Poppler not found
  - Install Poppler
  - Add to system PATH
- Poor OCR accuracy
  - Increase DPI: convert_from_bytes(pdf_bytes, dpi=300)
  - Try different PSM modes
  - Ensure PDF has good image quality
- Memory errors
  - Process PDFs page by page instead of all at once
  - Reduce DPI setting
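The page-by-page suggestion for memory errors could look roughly like the sketch below. It is not the original script: `page_ranges` and `ocr_in_batches` are hypothetical helpers, and the batch size of 5 is an arbitrary choice. The sketch relies on pdf2image's `pdfinfo_from_bytes` for the page count and on the `first_page` and `last_page` arguments of `convert_from_bytes`, so only one batch of images is held in memory at a time.

```python
def page_ranges(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) pairs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_bytes, batch_size=5, dpi=200, config="--oem 3 --psm 6"):
    """OCR a PDF a few pages at a time and return the combined text."""
    # Imported lazily so page_ranges can be used without the OCR stack.
    import pytesseract
    from pdf2image import convert_from_bytes, pdfinfo_from_bytes

    total_pages = pdfinfo_from_bytes(pdf_bytes)["Pages"]
    page_texts = []
    for first, last in page_ranges(total_pages, batch_size):
        # Only this batch of pages is rendered and kept in memory.
        images = convert_from_bytes(
            pdf_bytes, dpi=dpi, first_page=first, last_page=last
        )
        page_texts.extend(
            pytesseract.image_to_string(image, config=config) for image in images
        )
    # Same joining and cleaning as the main script.
    return "\n\n".join(page_texts).replace("\x00", " ")


if __name__ == "__main__":
    print(list(page_ranges(7, 3)))  # [(1, 3), (4, 6), (7, 7)]
```

Smaller batches lower peak memory but start Poppler more often, so the best batch size depends on page size and DPI.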
Tesseract supports 100+ languages. Install additional language packs:
Windows:
- Download from: https://github.com/tesseract-ocr/tessdata
Linux (Debian/Ubuntu):
# For Turkish
sudo apt-get install tesseract-ocr-tur
macOS (Homebrew, installs all language packs):
brew install tesseract-lang
Specify language in code:
pdf_txt = pytesseract.image_to_string(image, lang='tur', config=custom_config)
Process only specific pages:
images = convert_from_bytes(pdf_bytes, first_page=1, last_page=5)
Get detailed OCR data:
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
Get confidence scores:
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
confidences = data['conf']
Acknowledgments:
- Tesseract OCR, the OCR engine
- pdf2image, for PDF to image conversion
- pytesseract, the Python wrapper for Tesseract