Convert PDFs to markdown using Large Language Models (LLMs) with vision capabilities.
## Features

- High-quality OCR using vision-capable LLMs
- Batch processing of multiple PDF pages
- Multiple provider support (Gemini, OpenAI)
- Configurable processing settings
- Automatic retry logic for transient errors
- Clean markdown output
## Installation

```bash
pip install ocr-llm
```

You also need to install poppler (required for PDF processing):

```bash
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Fedora/RHEL
sudo yum install poppler-utils
```

The library requires:

- System:
  - `poppler-utils` for PDF processing
- Python:
  - `google-genai` for Gemini provider
  - `openai` for OpenAI provider
  - `pdf2image` and `Pillow` for PDF processing
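Before running a conversion, you may want to confirm that poppler is actually on your `PATH`. A minimal self-contained check (the helper name is ours, not part of the library; `pdftoppm` and `pdfinfo` are the poppler tools that `pdf2image` invokes):

```python
import shutil

def poppler_available() -> bool:
    """Return True if the poppler CLI tools that pdf2image relies on are on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdftoppm", "pdfinfo"))

if __name__ == "__main__":
    print("poppler found" if poppler_available() else "poppler missing")
```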
## Quick Start

### OpenAI

```python
import asyncio
from llm_ocr import LLMOCR, OpenAI

async def main():
    # Initialize OpenAI provider
    provider = OpenAI(
        api_key="your-api-key",  # Or set OPENAI_API_KEY env var
        model=OpenAI.GPT_4O_MINI,
    )

    # Create OCR processor
    async with LLMOCR(provider) as ocr:
        # Convert PDF to markdown
        markdown = await ocr.convert(
            "document.pdf",
            output_path="output.md",
        )
        print(markdown)

asyncio.run(main())
```

### Gemini

```python
import asyncio
from llm_ocr import LLMOCR, Gemini

async def main():
    # Initialize Gemini provider
    provider = Gemini(
        api_key="your-api-key",  # Or set GEMINI_API_KEY env var
        model=Gemini.FLASH_2_5,  # Or Gemini.PRO_2_5 for best quality
    )

    # Create OCR processor
    async with LLMOCR(provider) as ocr:
        # Convert PDF to markdown
        markdown = await ocr.convert(
            "document.pdf",
            output_path="output.md",
        )
        print(markdown)

asyncio.run(main())
```

## Supported Models

### OpenAI

- `OpenAI.GPT_4O`
- `OpenAI.GPT_4O_MINI` (default)
Additional models: `O1`, `O3`, `O4_MINI`, `GPT_5`, `GPT_5_MINI`, `GPT_4_1`, and more. See `llm_ocr/providers/openai.py` for the complete list.
### Gemini

- `Gemini.PRO_2_5`
- `Gemini.FLASH_2_5` (default)

Additional models: `PRO_2_0`, `FLASH_2_0`. See `llm_ocr/providers/gemini.py` for the complete list.
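Since the model constants above are plain class attributes, a caller could map a quality preference to a model at runtime. A minimal sketch (the constant names come from the lists above; the mapping and helper are illustrative and use strings so the example is self-contained):

```python
# Illustrative mapping from a quality preference to the model constants
# documented above, represented here as strings.
MODEL_BY_PREFERENCE = {
    "fast": "Gemini.FLASH_2_5",  # the default: quicker and cheaper
    "best": "Gemini.PRO_2_5",    # highest quality
}

def pick_model(preference: str) -> str:
    """Return the documented model constant for a preference, defaulting to 'fast'."""
    return MODEL_BY_PREFERENCE.get(preference, MODEL_BY_PREFERENCE["fast"])

print(pick_model("best"))  # -> Gemini.PRO_2_5
```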
## Configuration

Customize the OCR processing with `OCRConfig`:

```python
from llm_ocr import LLMOCR, OpenAI, OCRConfig

config = OCRConfig(
    dpi=300,                    # Higher DPI for better quality
    max_pages=10,               # Limit number of pages to process
    llm_batch_size=2,           # Send 2 pages to LLM at once
    convert_to_grayscale=True,  # Convert images to grayscale
    max_retries=3,              # Retry failed requests
    retry_delay=1.0,            # Wait 1 second between retries
    include_page_markers=True,  # Add page markers in output
)

provider = OpenAI()
ocr = LLMOCR(provider, config=config)
```

| Option | Default | Description |
|---|---|---|
| `dpi` | `200` | DPI for PDF-to-image conversion (72-600) |
| `max_pages` | `None` | Maximum number of pages to process |
| `batch_size` | `5` | PDF-to-image conversion batch size |
| `llm_batch_size` | `1` | Number of pages to send to the LLM at once |
| `thread_count` | `4` | Number of threads for PDF conversion |
| `convert_to_grayscale` | `False` | Convert images to grayscale |
| `optimize_png` | `True` | Optimize PNG compression |
| `use_cropbox` | `True` | Use PDF cropbox for conversion |
| `max_retries` | `3` | Maximum retry attempts for failed requests |
| `retry_delay` | `1.0` | Delay between retries in seconds |
| `include_page_markers` | `False` | Add page markers in markdown output |
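To illustrate how the options above trade throughput against fidelity, here are two hypothetical presets expressed as plain keyword dicts. The groupings are our own, not part of the library, but every key comes from the table, and either dict could be passed as `OCRConfig(**preset)`:

```python
# Illustrative presets built from the options in the table above.
FAST_DRAFT = {
    "dpi": 150,                   # lower resolution: smaller images, faster uploads
    "convert_to_grayscale": True, # smaller payloads
    "llm_batch_size": 2,          # fewer LLM round-trips
    "max_retries": 1,
}

HIGH_QUALITY = {
    "dpi": 400,                    # more detail for small fonts and tables
    "convert_to_grayscale": False, # keep color information
    "llm_batch_size": 1,           # one page per request keeps context focused
    "max_retries": 5,
    "retry_delay": 2.0,
}

# With the library installed, either preset could be applied as:
# config = OCRConfig(**HIGH_QUALITY)
```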
## Provider-Specific Parameters

Pass additional parameters to the LLM provider:

```python
# OpenAI with custom parameters
provider = OpenAI(
    model=OpenAI.GPT_4O,
    max_tokens=4000,
    temperature=0.0,
)

# Gemini with custom parameters
provider = Gemini(
    model=Gemini.PRO_2_5,
    temperature=0.0,
)
```

## Batch Processing

```python
import asyncio
from pathlib import Path
from llm_ocr import LLMOCR, OpenAI

async def process_documents():
    provider = OpenAI()
    async with LLMOCR(provider) as ocr:
        pdf_files = Path("pdfs").glob("*.pdf")
        for pdf_file in pdf_files:
            output_file = pdf_file.with_suffix(".md")
            await ocr.convert(pdf_file, output_path=output_file)
            print(f"Converted {pdf_file.name} -> {output_file.name}")

asyncio.run(process_documents())
```

If you prefer not to use the context manager:
```python
import asyncio
from llm_ocr import LLMOCR, OpenAI

async def main():
    provider = OpenAI()
    ocr = LLMOCR(provider)
    try:
        markdown = await ocr.convert("document.pdf")
        print(markdown)
    finally:
        await ocr.aclose()  # Don't forget to close!

asyncio.run(main())
```

## Environment Variables

Set API keys via environment variables:
```bash
# For OpenAI
export OPENAI_API_KEY="your-openai-api-key"

# For Gemini
export GEMINI_API_KEY="your-gemini-api-key"
```

Then use providers without passing API keys:

```python
# API key read from environment variable
provider = OpenAI()  # Uses OPENAI_API_KEY
# or
provider = Gemini()  # Uses GEMINI_API_KEY
```

## Error Handling

The library uses a fail-fast approach with automatic retries:
```python
import asyncio
from llm_ocr import LLMOCR, OpenAI, OCRConfig

async def main():
    provider = OpenAI()
    config = OCRConfig(
        max_retries=5,    # Retry up to 5 times
        retry_delay=2.0,  # Wait 2 seconds between retries
    )
    async with LLMOCR(provider, config) as ocr:
        try:
            markdown = await ocr.convert("document.pdf")
            print(markdown)
        except Exception as e:
            print(f"Failed to process document: {e}")

asyncio.run(main())
```

## License

See LICENSE file for details.