pdf-scribe

📄➡️✨ Feed it crusty scanned PDFs, get clean markdown. Tesseract when you're broke, Claude when you're bougie.

A CLI tool for transcribing scanned PDF documents to markdown using OCR. Built specifically for difficult documents that defeat traditional OCR: faded text, low-resolution scans, typewriter fonts, highlighter marks, and decades-old legal paperwork.

Supports both Tesseract (free, local) for clean modern scans, and Claude Vision AI for everything else. When your documents look like they survived a flood, a fire, and a fax machine, Claude Vision will still read them.

Features

- Dual OCR Engines: Tesseract for free local processing, Claude Vision for AI-powered accuracy
- Batch Processing: Drop PDFs in the input/ folder and process them all at once
- Parallel Processing: Multi-threaded/multi-process execution with tier-based concurrency
- Image Preprocessing: Binarize, denoise, sharpen, remove red highlights, and more
- AI Text Cleanup: Optional post-processing to fix OCR errors
- Auto-Rotation: Detects and corrects page orientation
- Streaming Output: Results saved page-by-page as processing happens
- Flexible Page Selection: Process specific pages, ranges, or just the first N pages

When to Use Claude Vision

TL;DR: If Tesseract gives you garbage, Claude Vision will probably nail it.

Traditional OCR engines like Tesseract work great on clean, modern scans. But real-world documents are often a mess, and that is where Claude Vision dramatically outperforms Tesseract:

Document Type	Tesseract	Claude Vision
Clean modern scans	✅ Great	✅ Great
Faded or low-contrast text	⚠️ Struggles	✅ Excellent
Low resolution scans	⚠️ Poor	✅ Handles well
Highlighter marks / annotations	❌ Fails	✅ Ignores marks, reads text
Typewriter fonts	⚠️ Hit or miss	✅ Excellent
Degraded legal documents	❌ Often unusable	✅ Accurate
Noisy backgrounds / speckles	⚠️ Needs preprocessing	✅ Handles natively
Mixed fonts / handwriting	❌ Poor	✅ Good

Real-World Example

This tool was built to transcribe decades-old legal documents that were:

- Scanned at low resolution from microfilm
- Typed on manual typewriters with uneven ink
- Covered in red/yellow highlighter marks
- Faded and noisy with age

Tesseract produced mostly unusable output even with aggressive preprocessing. Claude Vision transcribed them nearly perfectly, understanding context to fill in degraded characters and ignoring highlighter marks entirely.

Cost vs Quality Trade-off

Engine	Cost	Speed	Quality on Bad Docs
Tesseract	Free	Fast	Poor
Claude Haiku (`--cheapo`)	~$0.001/page	Fast	Good
Claude Sonnet (default)	~$0.01/page	Medium	Excellent
Claude Opus (`--expensive`)	~$0.05/page	Slower	Best
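
To get a feel for what a batch might cost, the per-page figures in the table can be multiplied out. This is only a rough sketch: the prices are the approximate values from the table above, and `estimate_batch_cost` and the model keys are hypothetical names for illustration, not part of pdf-scribe.

```python
# Approximate per-page prices in USD, copied from the table above
# (assumed to apply at the default 150 DPI).
APPROX_COST_PER_PAGE = {
    "tesseract": 0.0,
    "haiku": 0.001,   # --cheapo
    "sonnet": 0.01,   # default Claude model
    "opus": 0.05,     # --expensive
}


def estimate_batch_cost(pages: int, model: str) -> float:
    """Return a rough cost estimate in USD for transcribing `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * APPROX_COST_PER_PAGE[model]


if __name__ == "__main__":
    for model in APPROX_COST_PER_PAGE:
        print(f"{model:>9}: ${estimate_batch_cost(500, model):.2f} for 500 pages")
```

Real costs depend on page content and image size, so treat the result as an order-of-magnitude guide.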

⚠️ DPI affects AI cost: Higher DPI = larger images = more tokens = higher cost. The default 150 DPI works well for most documents. Only increase DPI (--dpi 300) if you're seeing quality issues. At 300 DPI, expect roughly 4x the cost per page.
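
The 4x figure follows from pixel count: doubling the DPI doubles both the width and the height of the rendered page. A small sketch of the arithmetic, assuming a US Letter page (8.5 x 11 in); `page_pixels` is a hypothetical helper, and how pixel count translates into tokens on the API side is not modelled here.

```python
def page_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in a page rendered at the given DPI."""
    return round(width_in * dpi) * round(height_in * dpi)


LETTER = (8.5, 11.0)  # inches; an assumed page size for illustration

base = page_pixels(*LETTER, dpi=150)  # 1275 x 1650 px
high = page_pixels(*LETTER, dpi=300)  # 2550 x 3300 px

print(f"150 DPI: {base:,} px")
print(f"300 DPI: {high:,} px")
print(f"ratio:   {high / base:.1f}x")
```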

Recommendation: Start with --engine claude (Sonnet) at default DPI. Use --cheapo for bulk processing of moderately difficult docs. Only use --expensive or high DPI for the most challenging documents.

Installation

System Dependencies

Before installing pdf-scribe, you need two system dependencies:

Dependency	Purpose	Required For
Tesseract OCR	Optical character recognition engine	`--engine tesseract` (default)
Poppler	PDF to image conversion	All PDF processing

macOS (Homebrew)

# Install Tesseract and Poppler
brew install tesseract poppler

# Install additional language packs (recommended)
# Homebrew ships every language in the single tesseract-lang formula;
# there is no per-language install.
brew install tesseract-lang

Ubuntu/Debian

# Install Tesseract and Poppler
sudo apt-get update
sudo apt-get install tesseract-ocr poppler-utils

# Install ALL language packs
sudo apt-get install tesseract-ocr-all

# Or install specific languages
sudo apt-get install tesseract-ocr-spa  # Spanish
sudo apt-get install tesseract-ocr-fra  # French
sudo apt-get install tesseract-ocr-deu  # German
sudo apt-get install tesseract-ocr-por  # Portuguese

Windows

Tesseract: Download installer from UB-Mannheim
- Run the installer
- Important: Check "Add to PATH" during installation
- Select additional languages in the installer
Poppler: Download from poppler-windows
- Extract to C:\Program Files\poppler
- Add C:\Program Files\poppler\bin to your PATH

Verify Installation

# Check Tesseract
tesseract --version
# Should show: tesseract 5.x.x

# Check available languages
tesseract --list-langs
# Should show: eng, spa, fra, etc.

# Check Poppler
pdftoppm -v
# Should show: pdftoppm version x.x.x
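
If you would rather check from Python (for example at the top of a script), the same verification can be sketched with the standard library. `find_tools` is a hypothetical helper, not something pdf-scribe provides; it only checks that the executables are on PATH, not their versions.

```python
import shutil

# Executables the two system dependencies install.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```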

Common Language Codes

Code	Language
`eng`	English (default)
`spa`	Spanish
`fra`	French
`deu`	German
`por`	Portuguese
`ita`	Italian
`rus`	Russian
`chi_sim`	Chinese (Simplified)
`chi_tra`	Chinese (Traditional)
`jpn`	Japanese
`kor`	Korean
`ara`	Arabic

Use multiple languages with +: --lang eng+spa
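
As a rough illustration of how a `+`-joined value breaks down, the sketch below splits a spec such as `eng+spa` and reports which codes are missing from a set of installed languages. Both function names are hypothetical, and the installed set would in practice come from `tesseract --list-langs`.

```python
def split_langs(spec: str) -> list[str]:
    """Split a Tesseract-style language spec such as 'eng+spa' into codes."""
    codes = [code.strip() for code in spec.split("+")]
    if not all(codes):
        raise ValueError(f"empty language code in {spec!r}")
    return codes


def missing_langs(spec: str, installed: set[str]) -> list[str]:
    """Return the codes in `spec` that are not in the installed set."""
    return [code for code in split_langs(spec) if code not in installed]


if __name__ == "__main__":
    installed = {"eng", "osd"}  # example data, not a real system query
    print(split_langs("eng+spa"))
    print(missing_langs("eng+spa", installed))
```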

Python Setup

Requirements: Python 3.10 or higher

# Clone the repository
git clone https://github.com/yourusername/pdf-scribe.git
cd pdf-scribe

Virtual Environment (Recommended)

Using a virtual environment keeps dependencies isolated and avoids conflicts with other projects.

macOS/Linux:

# Create virtual environment
python3 -m venv .venv

# Activate it (run this every time you open a new terminal)
source .venv/bin/activate

# Your prompt should now show (.venv)

Windows (PowerShell):

# Create virtual environment
python -m venv .venv

# Activate it
.venv\Scripts\Activate.ps1

# If you get an execution policy error, run:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Windows (Command Prompt):

python -m venv .venv
.venv\Scripts\activate.bat

Install Dependencies

# Make sure your venv is activated (you should see (.venv) in your prompt)

# Install core dependencies
pip install -r requirements.txt

# That's it! Claude Vision dependencies are included in requirements.txt

Deactivate Virtual Environment

When you're done:

deactivate

API Key (for Claude Vision)

Only needed if using --engine claude:

cp .env.example .env
# Edit .env and add your Anthropic API key

Get your API key at: https://console.anthropic.com/

Quick Start

Single File

# Basic OCR with Tesseract
python main.py document.pdf

# Use Claude Vision for better quality
python main.py document.pdf --engine claude

# Spanish document with image enhancement
python main.py document.pdf --engine claude --lang spa --enhance

Batch Processing

# Place PDFs in input/ folder, then:
python main.py --engine claude --lang spa

# All PDFs will be processed and saved to output/<document_name>/
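
Conceptually, batch mode pairs every PDF in `input/` with a folder under `output/`. The sketch below shows one way that mapping could be computed; `plan_batch` is a hypothetical name, and the real discovery logic in pdf-scribe may differ (for example in how it treats subfolders or file-name case).

```python
from pathlib import Path


def plan_batch(input_dir, output_dir):
    """Return (pdf_path, output_folder) pairs for every PDF in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    if not input_dir.is_dir():
        return []
    pdfs = sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Each document gets its own folder named after the file stem.
    return [(pdf, output_dir / pdf.stem) for pdf in pdfs]


if __name__ == "__main__":
    for pdf, folder in plan_batch("input", "output"):
        print(f"{pdf} -> {folder}/")
```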

Usage

python main.py [pdf] [options]

Document Selection

Method	Command
Single file	`python main.py document.pdf`
Batch mode	`python main.py` (processes all PDFs in `input/`)

OCR Engines

Engine	Flag	Description
Tesseract	`--engine tesseract`	Free, local OCR (default)
Claude Vision	`--engine claude`	AI-powered, best for degraded docs

Common Options

Option	Description
`-e, --engine`	OCR engine: `tesseract` or `claude`
`-l, --lang`	Language code: `eng`, `spa`, `fra`, etc.
`-o, --output`	Custom output path
`--dpi`	Resolution for PDF conversion (default: 150)
`-w, --workers`	Parallel workers (`auto` for CPU count)
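
For reference, resolving `auto` to a CPU count can be sketched as follows. `resolve_workers` is a hypothetical helper; whether pdf-scribe also caps the value (for instance by API tier) is not shown here.

```python
import os


def resolve_workers(value: str) -> int:
    """Turn a --workers value ('auto' or a positive integer) into a count."""
    if value == "auto":
        # os.cpu_count() can return None on unusual platforms.
        return os.cpu_count() or 1
    count = int(value)
    if count < 1:
        raise ValueError("workers must be >= 1")
    return count


if __name__ == "__main__":
    print(resolve_workers("auto"))
    print(resolve_workers("4"))
```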

Image Preprocessing

Mode	Flag	Description
None	`--preprocess none`	No preprocessing (fastest)
Grayscale	`--preprocess grayscale`	Convert to grayscale
Binarize	`--preprocess binarize`	Black/white (good for faded text)
Contrast	`--preprocess contrast`	Enhance contrast
Sharpen	`--preprocess sharpen`	Sharpen edges
Denoise	`--preprocess denoise`	Remove noise/speckles
Remove Red	`--preprocess remove-red`	Remove red highlights/marks
Clean	`--preprocess clean`	Remove red + all enhancements
All	`--preprocess all`	All enhancements (no red removal)
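
To make two of these modes concrete, here is a dependency-free sketch of what binarizing and red removal mean at the pixel level. It is a conceptual illustration only: the fixed thresholds are assumptions, the function names are hypothetical, and the tool itself presumably works on Pillow images rather than nested lists.

```python
WHITE = (255, 255, 255)


def binarize(gray, threshold=128):
    """Map rows of 0-255 grayscale values to pure black (0) or white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def remove_red(rgb, min_red=150, max_other=100):
    """Replace strongly red pixels with white; leave other pixels untouched."""
    return [
        [
            WHITE if (r >= min_red and g <= max_other and b <= max_other) else (r, g, b)
            for (r, g, b) in row
        ]
        for row in rgb
    ]


if __name__ == "__main__":
    print(binarize([[10, 200], [128, 127]]))
    print(remove_red([[(200, 30, 30), (20, 20, 20)]]))
```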

Enhancement Shortcut

# --enhance is equivalent to: --dpi 300 --preprocess all --rotate
python main.py document.pdf --enhance
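
A shortcut flag like this is typically expanded right after argument parsing. The sketch below shows the idea with `argparse`; it is not the actual `main.py` code. The `none` default for `--preprocess` and the choice to let `--enhance` override explicitly passed values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("pdf", nargs="?")
    parser.add_argument("--dpi", type=int, default=150)
    parser.add_argument("--preprocess", default="none")  # assumed default
    parser.add_argument("--rotate", action="store_true")
    parser.add_argument("--enhance", action="store_true")
    return parser


def parse_args(argv):
    """Parse argv and expand --enhance into the options it stands for."""
    args = build_parser().parse_args(argv)
    if args.enhance:
        # Equivalent to: --dpi 300 --preprocess all --rotate
        args.dpi = 300
        args.preprocess = "all"
        args.rotate = True
    return args


if __name__ == "__main__":
    print(parse_args(["document.pdf", "--enhance"]))
```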

Page Selection

# First N pages only
python main.py document.pdf --first 5

# Specific pages
python main.py document.pdf --pages 1,3,7

# Page ranges
python main.py document.pdf --pages 1-5,10-15

# Mixed
python main.py document.pdf --pages 1-3,7,10-12
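
The `--pages` syntax is a comma-separated list of single pages and inclusive ranges. A minimal parser for that syntax might look like the sketch below; `parse_pages` is a hypothetical name, and details such as deduplication and sorting are assumptions rather than documented behaviour.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like '1-3,7,10-12' into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-3,7,10-12"))  # [1, 2, 3, 7, 10, 11, 12]
```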

Claude-Specific Options

Option	Description
`--cleanup`	Post-process with AI to fix OCR errors
`--reflow`	Intelligently join lines into paragraphs
`--cheapo`	Use Haiku 3.5 (faster, cheaper)
`--expensive`	Use Opus 4 (highest quality)

Tesseract-Specific Options

Option	Description
`--rotate`	Auto-detect and correct page orientation
`--rotate-confidence`	Minimum confidence for rotation (default: 5.0)
`--psm`	Page Segmentation Mode (3, 4, 6, 11, 12)
`--oem`	OCR Engine Mode (0-3)
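
The rotation confidence threshold can be pictured as a gate on Tesseract's orientation and script detection (OSD) report. The sketch below parses OSD-style text and only returns a rotation when the confidence clears the threshold. It assumes pdf-scribe relies on OSD output, which the README does not state; `rotation_from_osd` is a hypothetical helper and the sample values are made up.

```python
import re

# Illustrative text in the format Tesseract prints for OSD (values invented).
SAMPLE_OSD = """\
Page number: 0
Orientation in degrees: 270
Rotate: 90
Orientation confidence: 7.31
Script: Latin
Script confidence: 2.22
"""


def rotation_from_osd(osd_text: str, min_confidence: float = 5.0) -> int:
    """Return the degrees to rotate, or 0 if confidence is below the threshold."""
    rotate = re.search(r"^Rotate:\s*(\d+)", osd_text, re.MULTILINE)
    confidence = re.search(
        r"^Orientation confidence:\s*([\d.]+)", osd_text, re.MULTILINE
    )
    if not rotate or not confidence:
        return 0
    if float(confidence.group(1)) < min_confidence:
        return 0
    return int(rotate.group(1))


if __name__ == "__main__":
    print(rotation_from_osd(SAMPLE_OSD))
    print(rotation_from_osd(SAMPLE_OSD, min_confidence=10.0))
```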

PSM Modes

Mode	Description
3	Fully automatic (default)
4	Single column of variable sizes
6	Single uniform block of text
11	Sparse text (find as much as possible)
12	Sparse text with orientation and script detection (OSD)

Output Structure

Each processed document gets its own folder:

output/
└── document_name/
    ├── document_name.md        # Full merged transcription
    ├── document_name_clean.md  # AI-cleaned version (if --cleanup)
    └── pages/
        ├── page_001.md         # Individual page
        ├── page_001_clean.md   # Cleaned page (if --cleanup)
        ├── page_002.md
        └── ...
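
If you post-process results in your own scripts, the layout above can be reproduced with `pathlib`. This is a sketch based only on the tree shown; `page_path` and `merged_path` are hypothetical helpers, and the three-digit zero padding is inferred from the `page_001.md` example.

```python
from pathlib import Path


def merged_path(output_root, doc_name: str, clean: bool = False) -> Path:
    """Path of the merged transcription (or its cleaned variant)."""
    suffix = "_clean" if clean else ""
    return Path(output_root) / doc_name / f"{doc_name}{suffix}.md"


def page_path(output_root, doc_name: str, page: int, clean: bool = False) -> Path:
    """Path of a single page file, e.g. pages/page_001.md."""
    suffix = "_clean" if clean else ""
    return Path(output_root) / doc_name / "pages" / f"page_{page:03d}{suffix}.md"


if __name__ == "__main__":
    print(merged_path("output", "document_name"))
    print(page_path("output", "document_name", 1))
    print(page_path("output", "document_name", 1, clean=True))
```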

Examples

Poor Quality Scans

# Kitchen sink approach - everything enabled
python main.py old_scan.pdf --enhance --lang spa --workers auto

Highlighted Documents

# Remove red highlights before OCR
python main.py marked_up.pdf --preprocess clean --engine claude

Quick Test Run

# Test settings on first 3 pages before full run
python main.py big_document.pdf --first 3 --engine claude

Maximum Quality

# Opus model + cleanup + high DPI
python main.py important.pdf --engine claude --expensive --cleanup --dpi 300

Budget Processing

# Haiku model for faster/cheaper processing
python main.py document.pdf --engine claude --cheapo

Batch with Custom Settings

# Process all PDFs in input/ with Spanish + cleanup
python main.py --engine claude --lang spa --cleanup

API Tier Configuration

For Claude Vision, set your API tier in .env for optimal concurrency:

# Check your tier at: https://console.anthropic.com/settings/limits
ANTHROPIC_TIER=2  # 1=50 RPM, 2=1000 RPM, 3=2000 RPM, 4=4000 RPM
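
To show how a tier setting could feed into request pacing, the sketch below looks up the requests-per-minute limit for a tier and derives the minimum spacing between requests. The RPM values are the ones listed in the comment above; the function names, the fallback to tier 1 when the variable is unset, and the pacing calculation are assumptions, not pdf-scribe's actual scheduler.

```python
import os

# Requests per minute by tier, as listed in the .env comment above.
TIER_RPM = {1: 50, 2: 1000, 3: 2000, 4: 4000}


def rpm_for_tier(env=None) -> int:
    """Read ANTHROPIC_TIER from the environment and return its RPM limit."""
    env = os.environ if env is None else env
    tier = int(env.get("ANTHROPIC_TIER", "1"))  # assumed default: tier 1
    if tier not in TIER_RPM:
        raise ValueError(f"unknown tier: {tier}")
    return TIER_RPM[tier]


def min_interval_seconds(rpm: int) -> float:
    """Minimum average spacing between requests to stay under the limit."""
    return 60.0 / rpm


if __name__ == "__main__":
    rpm = rpm_for_tier({"ANTHROPIC_TIER": "2"})
    print(f"{rpm} RPM -> one request every {min_interval_seconds(rpm):.3f}s")
```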

Utility Commands

# List available Tesseract languages
python main.py --list-langs

# List preprocessing options
python main.py --list-preprocess

Dependencies

Core:

- pdf2image: PDF to image conversion
- Pillow: Image processing
- pytesseract: Tesseract OCR wrapper

For Claude Vision:

- anthropic: Anthropic API client
- python-dotenv: Environment variable management

Tips

- Start with --first 3 to test settings before processing large documents
- Use --enhance for poor quality scans (combines DPI boost, preprocessing, rotation)
- Use --preprocess clean for documents with red highlights or marks
- Use --workers auto to speed up processing with parallel execution
- Use --cleanup for AI-powered post-processing to fix OCR errors
- Check output/<doc>/pages/ for individual page results if something looks wrong

License

MIT License - see LICENSE for details.
