LLM OCR Package

An LLM-powered package for OCR evaluation and correction, supporting multiple language models for OCR processing and text correction tasks.

Features

  • Multi-Provider LLM Support: Claude, GPT-4, Gemini, and Together AI
  • Multiple Processing Modes: Single-line, sliding window, and full-page OCR
  • Evaluation: Character accuracy, word accuracy, case preservation, and error analysis
  • OCR Correction: LLM-based text correction with configurable output formats
  • ALTO XML Support: Process ALTO XML files with corresponding images
  • Detailed Metrics: Extensive evaluation metrics with error pattern analysis
  • Workflow Management: Complete pipeline orchestration with result tracking

Installation

From source

git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e .

Development installation

git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e ".[dev]"

Quick Start

1. Set up API keys

Copy the .env.template file and fill in your API key values:

cp .env.template .env
# Edit .env and add your API key values

The following environment variables must be set in your .env file:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GEMINI_API_KEY
  • TOGETHER_API_KEY
  • DEEP_SEEK_API_KEY
  • DEEPINFRA_API_KEY
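
A minimal sketch for checking that the keys are available before starting a run (this assumes the python-dotenv package to load .env; trim the key list to the providers you actually use):

import os

from dotenv import load_dotenv

# Load variables from .env into the process environment (python-dotenv).
load_dotenv()

# Check only the providers you plan to call; adjust as needed.
required = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
missing = [key for key in required if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing API keys in .env: {', '.join(missing)}")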

2. Basic usage

from llm_ocr.workflow import OCRPipelineWorkflow
from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion

# Initialize workflow
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",  # Contains .xml, .jpeg files
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[ProcessingMode.FULL_PAGE],
    prompt_version=PromptVersion.V3
)

# Run complete pipeline
results = workflow.run_pipeline()

# Or run individual steps
workflow.run_ocr()
workflow.evaluate_ocr()
workflow.run_correction()
workflow.evaluate_correction()

Architecture

Core Components

  • Pipelines (llm_ocr/pipelines/): OCR and correction processing workflows
  • LLM Models (llm_ocr/llm/): Multi-provider LLM support with unified interface
  • Prompts (llm_ocr/prompts/): Modular prompt generation with version control and context enrichment
  • Evaluators (llm_ocr/evaluators/): Comprehensive metrics and evaluation framework
  • Processors (llm_ocr/processors/): Input format handling (ALTO XML)
  • Workflow (llm_ocr/workflow.py): Main orchestration and result management

Processing Modes

  • SINGLE_LINE: Process each text line individually
  • SLIDING_WINDOW: Process lines with context window
  • FULL_PAGE: Process entire page at once
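
Modes are passed to the workflow as a list (see Quick Start), so a single run can request more than one. A sketch reusing the Quick Start arguments:

from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion
from llm_ocr.workflow import OCRPipelineWorkflow

# Compare all three processing modes on the same document
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[
        ProcessingMode.SINGLE_LINE,
        ProcessingMode.SLIDING_WINDOW,
        ProcessingMode.FULL_PAGE,
    ],
    prompt_version=PromptVersion.V3,
)
results = workflow.run_pipeline()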

Supported Models

  • Claude: Anthropic's Claude models (3.5 Sonnet, etc.)
  • GPT-4: OpenAI's GPT-4 models
  • Gemini: Google's Gemini models
  • Together AI: Various open-source models via Together

Prompt System

The package features a modular prompt generation system that allows for flexible experimentation with different prompt strategies and automatic context enrichment.

Prompt Versions

Different prompt versions are available for various use cases and languages:

  • V1: Basic OCR prompts without additional context
  • V2: Enhanced with historical book metadata (year, title)
  • V3: Advanced context with improved instructions
  • V4: Russian language optimized prompts

from llm_ocr.prompts.prompt import PromptVersion

# Use different prompt versions
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    prompt_version=PromptVersion.V2,  # Will include book metadata context
    modes=[ProcessingMode.FULL_PAGE]
)

Prompt Types and Modes

The system supports different output formats and processing modes:

from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Structured JSON output for single line processing
structured_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,  # JSON format
    version=PromptVersion.V3
)

# Simple text output for full page processing
simple_prompt = builder.build_prompt(
    mode="full_page", 
    prompt_type=PromptType.SIMPLE,  # Plain text format
    version=PromptVersion.V1
)

Automatic Metadata Enrichment

When document metadata is available, prompts can be automatically enriched with context:

from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Automatic enrichment using document ID
enriched_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,
    version=PromptVersion.V2,
    document_id="historical_doc_001"  # Auto-loads book metadata
)

# Manual context variables
manual_prompt = builder.build_prompt(
    mode="sliding_window",
    prompt_type=PromptType.STRUCTURED, 
    version=PromptVersion.V2,
    book_title="История государства Российского",
    book_year="1767"
)

Custom Prompt Configuration

For advanced users, prompts can be fully customized via JSON configuration:

# Create custom prompt builder with custom config
custom_builder = PromptBuilder(
    config_path="path/to/custom_prompts.json",
    metadata_path="path/to/document_metadata.json"
)

# Use convenience functions for common cases
from llm_ocr.prompts.prompt_builder import get_prompt, PromptType, PromptVersion

prompt = get_prompt(
    mode="correction",
    prompt_type=PromptType.SIMPLE,
    version=PromptVersion.V4,
    book_title="Тестовая книга"
)

Prompt Configuration Format

Custom prompt configurations use JSON format with modular components:

{
  "components": {
    "base_ocr": "Extract OCR text from 18th century Russian book",
    "orthography": "Preserve ѣ, Ѳ, ѳ, ѵ, ъ characters",
    "json_format": "Respond with JSON: {\"line\": \"text\"}"
  },
  "context_enrichment": {
    "v1": "",
    "v2": " from {book_year} book \"{book_title}\"",
    "v3": " processing \"{book_title}\" ({book_year})",
    "v4": " обрабатываете \"{book_title}\" {book_year} года"
  },
  "mode_instructions": {
    "single_line": "Process single line",
    "sliding_window": "Process sliding window", 
    "full_page": "Process full page",
    "correction": "Correct OCR text"
  }
}

Configuration

Model Configuration

from llm_ocr.config import ModelConfig

config = ModelConfig(
    max_tokens=2048,
    temperature=0.0,
    sliding_window_size=3,
    batch_size=10
)

Evaluation Configuration

from llm_ocr.config import EvaluationConfig

eval_config = EvaluationConfig(
    use_char_accuracy=True,
    use_word_accuracy=True,
    use_old_char_preservation=True,
    include_detailed_analysis=True
)
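
How these objects are passed into a run depends on the OCRPipelineWorkflow signature; the sketch below is illustrative only, and the model_config / evaluation_config keyword names are assumptions:

from llm_ocr.config import EvaluationConfig, ModelConfig
from llm_ocr.models import ProcessingMode
from llm_ocr.workflow import OCRPipelineWorkflow

# NOTE: model_config / evaluation_config are hypothetical keyword names
# used for illustration; check the OCRPipelineWorkflow signature.
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[ProcessingMode.FULL_PAGE],
    model_config=ModelConfig(max_tokens=2048, temperature=0.0),
    evaluation_config=EvaluationConfig(use_char_accuracy=True, use_word_accuracy=True),
)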

Data Format

Input Requirements

Your data folder should contain:

  • {id}.xml: ALTO XML file with text coordinates
  • {id}.jpeg: Corresponding image file
  • {id}.txt: Ground truth text (optional, for evaluation)
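
A small helper (hypothetical, based only on the naming convention above) to sanity-check a folder before starting a run:

from pathlib import Path

def check_input_folder(folder: str, doc_id: str) -> None:
    """Verify that the ALTO XML and image are present; ground truth is optional."""
    root = Path(folder)
    for suffix in (".xml", ".jpeg"):
        if not (root / f"{doc_id}{suffix}").exists():
            raise FileNotFoundError(f"Missing required file: {doc_id}{suffix}")
    if not (root / f"{doc_id}.txt").exists():
        print(f"No {doc_id}.txt found: ground-truth evaluation will not be possible.")

check_input_folder("ground_truth", "document_001")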

Output Format

Results are saved as JSON files with complete metrics and analysis:

{
  "document_info": {
    "document_name": "document_001",
    "timestamp": "20250124_143022"
  },
  "models": {
    "claude-3-7-sonnet-20250219": {
      "ocr_results": {
        "fullpage": {
          "lines": [...],
          "metrics": {...}
        }
      },
      "correction_results": {
        "original_ocr_text": "...",
        "corrected_text": "...",
        "metrics": {...}
      }
    }
  }
}
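
The structure above can be read back directly; for example, to pull the full-page OCR metrics for each model (the file path is illustrative and depends on where your results are written):

import json
from pathlib import Path

# Illustrative path; actual filenames depend on your output settings.
result_path = Path("results/document_001.json")
report = json.loads(result_path.read_text(encoding="utf-8"))

for model_name, model_data in report["models"].items():
    metrics = model_data["ocr_results"]["fullpage"]["metrics"]
    print(model_name, metrics)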

Advanced Usage

Custom Model Integration

from llm_ocr.llm.base import BaseOCRModel

class CustomOCRModel(BaseOCRModel):
    def process_single_line(self, image_base64: str):
        # Implement single line processing
        pass

    def process_full_page(self, page_image_base64: str, id: str):
        # Implement full page processing
        pass

    def correct_text(self, text: str, image_base64: str):
        # Implement text correction
        pass

Batch Processing

from llm_ocr.workflow import run_multi_model_workflow

results = run_multi_model_workflow(
    xml_path="data/document.xml",
    image_path="data/document.jpeg",
    ground_truth_path="data/document.txt",
    model_names=["claude-3-7-sonnet-20250219", "gpt-4o-2024-08-06"],
    output_dir="results"
)

Development

Environment Setup

# Clone and set up development environment
git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with development dependencies
pip install -e ".[dev]"

Running Tests

# Activate virtual environment first
source venv/bin/activate

# Run all tests with coverage
pytest

# Or use the custom test runner
python run_tests.py

# Run specific test files
pytest tests/unit/test_metrics.py
pytest tests/integration/test_basic_workflow.py

# Run with verbose output
pytest -v

# Run with coverage report
pytest --cov=llm_ocr --cov-report=html

Note: Always activate the virtual environment before running tests to avoid dependency conflicts.

Code Quality

# Activate virtual environment first
source venv/bin/activate

# Format code
python -m black llm_ocr/
python -m isort llm_ocr/

# Lint code
python -m ruff check llm_ocr/

# Type checking (requires Python 3.10+ target)
python -m mypy llm_ocr/ --python-version=3.10

# Run all quality checks
python run_tests.py  # Includes tests + quality checks

Pre-commit Hooks

This project uses pre-commit hooks to maintain code quality:

# Install pre-commit hooks
pre-commit install

# Run hooks manually
pre-commit run --all-files

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Set up development environment: source venv/bin/activate && pip install -e ".[dev]"
  4. Make your changes and add tests
  5. Run the test suite: pytest or python run_tests.py
  6. Ensure code quality: pre-commit run --all-files
  7. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this package in your research, please cite:

@software{llm_ocr_package,
  title = {LLM OCR: Multi-Provider OCR Evaluation and Correction},
  author = {Maria Levchenko},
  year = {2025},
  url = {https://github.com/mary-lev/llm-ocr}
}
