Feed it crusty scanned PDFs, get clean markdown. Tesseract when you're broke, Claude when you're bougie.
A powerful CLI tool for transcribing scanned PDF documents to markdown using OCR. Built specifically for difficult documents that defeat traditional OCR: faded text, low-resolution scans, typewriter fonts, highlighter marks, and decades-old legal paperwork.
Supports both Tesseract (free, local) for clean modern scans, and Claude Vision AI for everything else. When your documents look like they survived a flood, a fire, and a fax machine, Claude Vision will still read them.
- Dual OCR Engines - Tesseract for free local processing, Claude Vision for AI-powered accuracy
- Batch Processing - Drop PDFs in the `input/` folder and process them all at once
- Parallel Processing - Multi-threaded/multi-process execution with tier-based concurrency
- Image Preprocessing - Binarize, denoise, sharpen, remove red highlights, and more
- AI Text Cleanup - Optional post-processing to fix OCR errors
- Auto-Rotation - Detects and corrects page orientation
- Streaming Output - Results saved page-by-page as processing happens
- Flexible Page Selection - Process specific pages, ranges, or just the first N pages
TL;DR: If Tesseract gives you garbage, Claude Vision will probably nail it.
Traditional OCR engines like Tesseract work great on clean, modern scans. But real-world documents are often a mess. Claude Vision dramatically outperforms Tesseract on:
| Document Type | Tesseract | Claude Vision |
|---|---|---|
| Clean modern scans | Great | Great |
| Faded or low-contrast text | | Excellent |
| Low resolution scans | | Handles well |
| Highlighter marks / annotations | Fails | Ignores marks, reads text |
| Typewriter fonts | | Excellent |
| Degraded legal documents | Often unusable | Accurate |
| Noisy backgrounds / speckles | | Handles natively |
| Mixed fonts / handwriting | Poor | Good |
This tool was built to transcribe decades-old legal documents that were:
- Scanned at low resolution from microfilm
- Typed on manual typewriters with uneven ink
- Covered in red/yellow highlighter marks
- Faded and noisy with age
Tesseract produced mostly unusable output even with aggressive preprocessing. Claude Vision transcribed them nearly perfectly, understanding context to fill in degraded characters and ignoring highlighter marks entirely.
| Engine | Cost | Speed | Quality on Bad Docs |
|---|---|---|---|
| Tesseract | Free | Fast | Poor |
| Claude Haiku (`--cheapo`) | ~$0.001/page | Fast | Good |
| Claude Sonnet (default) | ~$0.01/page | Medium | Excellent |
| Claude Opus (`--expensive`) | ~$0.05/page | Slower | Best |
Warning: DPI affects AI cost. Higher DPI = larger images = more tokens = higher cost. The default 150 DPI works well for most documents. Only increase DPI (`--dpi 300`) if you're seeing quality issues. At 300 DPI, expect roughly 4x the cost per page, because doubling the DPI quadruples the pixel count.
Recommendation: Start with --engine claude (Sonnet) at default DPI. Use --cheapo for bulk processing of moderately difficult docs. Only use --expensive or high DPI for the most challenging documents.
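
The "roughly 4x at 300 DPI" figure follows from image area: width and height both scale with DPI, so pixel count scales with its square. Here is a minimal sketch of that arithmetic, assuming cost is proportional to pixel count. The function names are hypothetical and not part of pdf-scribe, and the per-page prices are the approximate figures from the table above.

```python
def relative_image_cost(dpi: int, baseline_dpi: int = 150) -> float:
    """Pixel-count ratio of a page rendered at `dpi` versus the baseline DPI.

    Assumption: AI cost is proportional to pixel count.
    """
    if dpi <= 0 or baseline_dpi <= 0:
        raise ValueError("DPI values must be positive")
    # Width and height each scale linearly with DPI, so area scales quadratically.
    return (dpi / baseline_dpi) ** 2


def estimate_cost(pages: int, cost_per_page: float, dpi: int = 150) -> float:
    """Rough total cost, where `cost_per_page` is the price at the 150 DPI default."""
    return pages * cost_per_page * relative_image_cost(dpi)


if __name__ == "__main__":
    # 100 pages on Sonnet (~$0.01/page) at the default DPI and at 300 DPI.
    print(f"150 DPI: ~${estimate_cost(100, 0.01):.2f}")
    print(f"300 DPI: ~${estimate_cost(100, 0.01, dpi=300):.2f}")
```

Treat the result as a ballpark only: actual token usage depends on how the API resizes and tiles images.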
Before installing pdf-scribe, you need two system dependencies:
| Dependency | Purpose | Required For |
|---|---|---|
| Tesseract OCR | Optical character recognition engine | --engine tesseract (default) |
| Poppler | PDF to image conversion | All PDF processing |

macOS (Homebrew):
# Install Tesseract and Poppler
brew install tesseract poppler
# Install ALL language packs (recommended)
brew install tesseract-lang
# Or install specific languages only
brew install tesseract-lang # Includes all languages

Ubuntu/Debian:
# Install Tesseract and Poppler
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils
# Install ALL language packs
sudo apt-get install tesseract-ocr-all
# Or install specific languages
sudo apt-get install tesseract-ocr-spa # Spanish
sudo apt-get install tesseract-ocr-fra # French
sudo apt-get install tesseract-ocr-deu # German
sudo apt-get install tesseract-ocr-por # Portuguese

Windows:
- Tesseract: Download installer from UB-Mannheim
  - Run the installer
  - Important: Check "Add to PATH" during installation
  - Select additional languages in the installer
- Poppler: Download from poppler-windows
  - Extract to `C:\Program Files\poppler`
  - Add `C:\Program Files\poppler\bin` to your PATH

# Check Tesseract
tesseract --version
# Should show: tesseract 5.x.x
# Check available languages
tesseract --list-langs
# Should show: eng, spa, fra, etc.
# Check Poppler
pdftoppm -v
# Should show: pdftoppm version x.x.x

| Code | Language |
|---|---|
| `eng` | English (default) |
| `spa` | Spanish |
| `fra` | French |
| `deu` | German |
| `por` | Portuguese |
| `ita` | Italian |
| `rus` | Russian |
| `chi_sim` | Chinese (Simplified) |
| `chi_tra` | Chinese (Traditional) |
| `jpn` | Japanese |
| `kor` | Korean |
| `ara` | Arabic |

Use multiple languages with +: --lang eng+spa
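
A `+`-joined value like `eng+spa` is just a list of Tesseract language codes. The sketch below shows one way to split and validate such a value against the codes reported by `tesseract --list-langs`. The helper name `parse_lang_spec` is hypothetical, and whether pdf-scribe validates languages this way is an assumption.

```python
def parse_lang_spec(spec: str, installed: set[str]) -> list[str]:
    """Split a spec such as 'eng+spa' and check every code is installed.

    `installed` is assumed to hold the codes printed by `tesseract --list-langs`.
    """
    codes = [code.strip() for code in spec.split("+") if code.strip()]
    if not codes:
        raise ValueError("language spec is empty")
    missing = [code for code in codes if code not in installed]
    if missing:
        raise ValueError(f"language pack(s) not installed: {', '.join(missing)}")
    return codes


if __name__ == "__main__":
    available = {"eng", "spa", "fra"}
    print(parse_lang_spec("eng+spa", available))
```

A failure here usually means the matching language pack from the install steps is missing.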
Requirements: Python 3.10 or higher
# Clone the repository
git clone https://github.com/yourusername/pdf-scribe.git
cd pdf-scribe

Using a virtual environment keeps dependencies isolated and avoids conflicts with other projects.

macOS/Linux:
# Create virtual environment
python3 -m venv .venv
# Activate it (run this every time you open a new terminal)
source .venv/bin/activate
# Your prompt should now show (.venv)

Windows (PowerShell):
# Create virtual environment
python -m venv .venv
# Activate it
.venv\Scripts\Activate.ps1
# If you get an execution policy error, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Windows (Command Prompt):
python -m venv .venv
.venv\Scripts\activate.bat

# Make sure your venv is activated (you should see (.venv) in your prompt)
# Install core dependencies
pip install -r requirements.txt
# That's it! Claude Vision dependencies are included in requirements.txt

When you're done:
deactivate

Only needed if using --engine claude:
cp .env.example .env
# Edit .env and add your Anthropic API key

Get your API key at: https://console.anthropic.com/

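
To confirm the key is visible before starting a long run, a check along these lines can help. This is a standard-library sketch, not pdf-scribe's own loader (the project lists `python-dotenv` for that). The variable name `ANTHROPIC_API_KEY` is an assumption, since `.env.example` is not shown here, and both function names are hypothetical.

```python
import os
from pathlib import Path


def read_env_file(path: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and full-line comments.

    Simplification: inline comments after a value are not stripped.
    """
    values: dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path: str = ".env") -> str:
    """Return the key from the environment, falling back to the .env file."""
    key = os.environ.get("ANTHROPIC_API_KEY")  # assumed variable name
    if not key and Path(env_path).is_file():
        key = read_env_file(env_path).get("ANTHROPIC_API_KEY")
    if not key:
        raise RuntimeError("ANTHROPIC_API_KEY not found in the environment or .env")
    return key


if __name__ == "__main__":
    try:
        get_api_key()
        print("API key found")
    except RuntimeError as err:
        print(err)
```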
# Basic OCR with Tesseract
python main.py document.pdf
# Use Claude Vision for better quality
python main.py document.pdf --engine claude
# Spanish document with image enhancement
python main.py document.pdf --engine claude --lang spa --enhance

# Place PDFs in input/ folder, then:
python main.py --engine claude --lang spa
# All PDFs will be processed and saved to output/<document_name>/

python main.py [pdf] [options]

| Method | Command |
|---|---|
| Single file | python main.py document.pdf |
| Batch mode | python main.py (processes all PDFs in input/) |
| Engine | Flag | Description |
|---|---|---|
| Tesseract | `--engine tesseract` | Free, local OCR (default) |
| Claude Vision | `--engine claude` | AI-powered, best for degraded docs |

| Option | Description |
|---|---|
| `-e, --engine` | OCR engine: `tesseract` or `claude` |
| `-l, --lang` | Language code: `eng`, `spa`, `fra`, etc. |
| `-o, --output` | Custom output path |
| `--dpi` | Resolution for PDF conversion (default: 150) |
| `-w, --workers` | Parallel workers (`auto` for CPU count) |

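
The `--workers` value is either a number or `auto`. One plausible way to resolve it is sketched below using only the standard library. `resolve_workers` is a hypothetical name, and the tool's real logic may differ, for example by capping workers according to API tier.

```python
import os


def resolve_workers(value: str) -> int:
    """Turn a --workers argument ('auto' or a positive integer) into a count."""
    if value.strip().lower() == "auto":
        # os.cpu_count() can return None on unusual platforms, so fall back to 1.
        return os.cpu_count() or 1
    workers = int(value)
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return workers


if __name__ == "__main__":
    print(resolve_workers("auto"))
    print(resolve_workers("4"))
```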
| Mode | Flag | Description |
|---|---|---|
| None | `--preprocess none` | No preprocessing (fastest) |
| Grayscale | `--preprocess grayscale` | Convert to grayscale |
| Binarize | `--preprocess binarize` | Black/white (good for faded text) |
| Contrast | `--preprocess contrast` | Enhance contrast |
| Sharpen | `--preprocess sharpen` | Sharpen edges |
| Denoise | `--preprocess denoise` | Remove noise/speckles |
| Remove Red | `--preprocess remove-red` | Remove red highlights/marks |
| Clean | `--preprocess clean` | Remove red + all enhancements |
| All | `--preprocess all` | All enhancements (no red removal) |

# --enhance is equivalent to: --dpi 300 --preprocess all --rotate
python main.py document.pdf --enhance

# First N pages only
python main.py document.pdf --first 5
# Specific pages
python main.py document.pdf --pages 1,3,7
# Page ranges
python main.py document.pdf --pages 1-5,10-15
# Mixed
python main.py document.pdf --pages 1-3,7,10-12

| Option | Description |
|---|---|
| `--cleanup` | Post-process with AI to fix OCR errors |
| `--reflow` | Intelligently join lines into paragraphs |
| `--cheapo` | Use Haiku 3.5 (faster, cheaper) |
| `--expensive` | Use Opus 4 (highest quality) |

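
The `--pages` syntax shown earlier (single pages, ranges, or a mix such as `1-3,7,10-12`) expands to a plain list of page numbers. Here is a minimal sketch of such a parser. `parse_pages` is a hypothetical helper, and it is an assumption that the tool deduplicates and sorts pages as this version does.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like '1-3,7,10-12' into a sorted list of 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-3,7,10-12"))  # [1, 2, 3, 7, 10, 11, 12]
```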
| Option | Description |
|---|---|
| `--rotate` | Auto-detect and correct page orientation |
| `--rotate-confidence` | Minimum confidence for rotation (default: 5.0) |
| `--psm` | Page Segmentation Mode (3, 4, 6, 11, 12) |
| `--oem` | OCR Engine Mode (0-3) |

| Mode | Description |
|---|---|
| 3 | Fully automatic (default) |
| 4 | Single column of variable sizes |
| 6 | Single uniform block of text |
| 11 | Sparse text (find as much as possible) |
Each processed document gets its own folder:
output/
└── document_name/
    ├── document_name.md           # Full merged transcription
    ├── document_name_clean.md     # AI-cleaned version (if --cleanup)
    └── pages/
        ├── page_001.md            # Individual page
        ├── page_001_clean.md      # Cleaned page (if --cleanup)
        ├── page_002.md
        └── ...

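
Because pages are saved one at a time, the merged file can be rebuilt from `pages/` if a run is interrupted. The sketch below follows the layout above. The helper names are hypothetical, and the blank-line separator between pages is an assumption about the merged format.

```python
from pathlib import Path


def page_path(output_root: str, doc_name: str, page: int, cleaned: bool = False) -> Path:
    """Path of a per-page file, e.g. output/doc/pages/page_001.md."""
    suffix = "_clean" if cleaned else ""
    return Path(output_root) / doc_name / "pages" / f"page_{page:03d}{suffix}.md"


def merge_pages(output_root: str, doc_name: str) -> Path:
    """Join the raw (non-cleaned) page files, in page order, into <doc_name>.md."""
    doc_dir = Path(output_root) / doc_name
    raw_pages = [
        p for p in (doc_dir / "pages").glob("page_*.md")
        if not p.stem.endswith("_clean")
    ]
    # Sort numerically so page_1000 comes after page_999.
    raw_pages.sort(key=lambda p: int(p.stem.split("_")[1]))
    merged = doc_dir / f"{doc_name}.md"
    text = "\n\n".join(p.read_text(encoding="utf-8").strip() for p in raw_pages)
    merged.write_text(text + "\n", encoding="utf-8")
    return merged


if __name__ == "__main__":
    print(page_path("output", "document_name", 1))
```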
# Kitchen sink approach - everything enabled
python main.py old_scan.pdf --enhance --lang spa --workers auto

# Remove red highlights before OCR
python main.py marked_up.pdf --preprocess clean --engine claude

# Test settings on first 3 pages before full run
python main.py big_document.pdf --first 3 --engine claude

# Opus model + cleanup + high DPI
python main.py important.pdf --engine claude --expensive --cleanup --dpi 300

# Haiku model for faster/cheaper processing
python main.py document.pdf --engine claude --cheapo

# Process all PDFs in input/ with Spanish + cleanup
python main.py --engine claude --lang spa --cleanup

For Claude Vision, set your API tier in .env for optimal concurrency:
# Check your tier at: https://console.anthropic.com/settings/limits
ANTHROPIC_TIER=2 # 1=50 RPM, 2=1000 RPM, 3=2000 RPM, 4=4000 RPM

# List available Tesseract languages
python main.py --list-langs
# List preprocessing options
python main.py --list-preprocess

Core:
- `pdf2image` - PDF to image conversion
- `Pillow` - Image processing
- `pytesseract` - Tesseract OCR wrapper

For Claude Vision:
- `anthropic` - Anthropic API client
- `python-dotenv` - Environment variable management

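
The `ANTHROPIC_TIER` comment in the `.env` snippet earlier maps each tier to a requests-per-minute limit. The sketch below turns that mapping into a minimum spacing between requests. The function names and the tier 1 default are assumptions, and the tool's actual concurrency logic is not shown in this README.

```python
import os

# Requests-per-minute limits as listed in the .env comment.
TIER_RPM = {1: 50, 2: 1000, 3: 2000, 4: 4000}


def requests_per_minute(env=None) -> int:
    """Look up the RPM limit for ANTHROPIC_TIER (tier 1 is an assumed default)."""
    env = os.environ if env is None else env
    tier = int(env.get("ANTHROPIC_TIER", "1"))
    if tier not in TIER_RPM:
        raise ValueError(f"unknown ANTHROPIC_TIER: {tier}")
    return TIER_RPM[tier]


def min_request_interval(rpm: int) -> float:
    """Seconds to wait between requests to stay under `rpm`."""
    return 60.0 / rpm


if __name__ == "__main__":
    rpm = requests_per_minute()
    print(f"{rpm} RPM -> at most one request every {min_request_interval(rpm):.3f}s")
```

At tier 1 this works out to one request every 1.2 seconds, so extra workers mostly wait; higher tiers leave room for real parallelism.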
- Start with `--first 3` to test settings before processing large documents
- Use `--enhance` for poor quality scans (combines DPI boost, preprocessing, rotation)
- Use `--preprocess clean` for documents with red highlights or marks
- Use `--workers auto` to speed up processing with parallel execution
- Use `--cleanup` for AI-powered post-processing to fix OCR errors
- Check `output/<doc>/pages/` for individual page results if something looks wrong

MIT License - see LICENSE for details.