LLM OCR Package

An LLM-powered package for OCR evaluation and correction, supporting multiple language models for OCR processing and text correction tasks.

Features

  • Multi-Provider LLM Support: Claude, GPT-4, Gemini, and Together AI
  • Multiple Processing Modes: Single-line, sliding window, and full-page OCR
  • Evaluation: Character accuracy, word accuracy, case preservation, and error analysis
  • OCR Correction: LLM-based text correction with configurable output formats
  • ALTO XML Support: Process ALTO XML files with corresponding images
  • Detailed Metrics: Extensive evaluation metrics with error pattern analysis
  • Workflow Management: Complete pipeline orchestration with result tracking

Installation

From source

git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e .

Development installation

git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e ".[dev]"

Quick Start

1. Set up API keys

Copy the .env.template file and fill in your API key values:

cp .env.template .env
# Edit .env and add your API key values

The following environment variables must be set in your .env file:

  • OPENAI_API_KEY
  • ANTHROPIC_API_KEY
  • GEMINI_API_KEY
  • TOGETHER_API_KEY
  • DEEP_SEEK_API_KEY
  • DEEPINFRA_API_KEY
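
A minimal sketch for checking that the keys are available before starting a run (this assumes the python-dotenv package to load .env; trim the key list to the providers you actually use):

import os

from dotenv import load_dotenv

# Load variables from .env into the process environment (python-dotenv).
load_dotenv()

# Check only the providers you plan to call; adjust as needed.
required = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"]
missing = [key for key in required if not os.getenv(key)]
if missing:
    raise RuntimeError(f"Missing API keys in .env: {', '.join(missing)}")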

2. Basic usage

from llm_ocr.workflow import OCRPipelineWorkflow
from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion

# Initialize workflow
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",  # Contains .xml, .jpeg files
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[ProcessingMode.FULL_PAGE],
    prompt_version=PromptVersion.V3
)

# Run complete pipeline
results = workflow.run_pipeline()

# Or run individual steps
workflow.run_ocr()
workflow.evaluate_ocr()
workflow.run_correction()
workflow.evaluate_correction()

Architecture

Core Components

  • Pipelines (llm_ocr/pipelines/): OCR and correction processing workflows
  • LLM Models (llm_ocr/llm/): Multi-provider LLM support with unified interface
  • Prompts (llm_ocr/prompts/): Modular prompt generation with version control and context enrichment
  • Evaluators (llm_ocr/evaluators/): Comprehensive metrics and evaluation framework
  • Processors (llm_ocr/processors/): Input format handling (ALTO XML)
  • Workflow (llm_ocr/workflow.py): Main orchestration and result management

Processing Modes

  • SINGLE_LINE: Process each text line individually
  • SLIDING_WINDOW: Process lines with context window
  • FULL_PAGE: Process entire page at once
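
Modes are passed to the workflow as a list (see Quick Start), so a single run can request more than one. A sketch reusing the Quick Start arguments:

from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion
from llm_ocr.workflow import OCRPipelineWorkflow

# Compare all three processing modes on the same document
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[
        ProcessingMode.SINGLE_LINE,
        ProcessingMode.SLIDING_WINDOW,
        ProcessingMode.FULL_PAGE,
    ],
    prompt_version=PromptVersion.V3,
)
results = workflow.run_pipeline()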

Supported Models

  • Claude: Anthropic's Claude models (3.5 Sonnet, etc.)
  • GPT-4: OpenAI's GPT-4 models
  • Gemini: Google's Gemini models
  • Together AI: Various open-source models via Together

Prompt System

The package features a modular prompt generation system that allows for flexible experimentation with different prompt strategies and automatic context enrichment.

Prompt Versions

Different prompt versions are available for various use cases and languages:

  • V1: Basic OCR prompts without additional context
  • V2: Enhanced with historical book metadata (year, title)
  • V3: Advanced context with improved instructions
  • V4: Russian language optimized prompts

from llm_ocr.prompts.prompt import PromptVersion

# Use different prompt versions
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    prompt_version=PromptVersion.V2,  # Will include book metadata context
    modes=[ProcessingMode.FULL_PAGE]
)

Prompt Types and Modes

The system supports different output formats and processing modes:

from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Structured JSON output for single line processing
structured_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,  # JSON format
    version=PromptVersion.V3
)

# Simple text output for full page processing
simple_prompt = builder.build_prompt(
    mode="full_page", 
    prompt_type=PromptType.SIMPLE,  # Plain text format
    version=PromptVersion.V1
)

Automatic Metadata Enrichment

When document metadata is available, prompts can be automatically enriched with context:

from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Automatic enrichment using document ID
enriched_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,
    version=PromptVersion.V2,
    document_id="historical_doc_001"  # Auto-loads book metadata
)

# Manual context variables
manual_prompt = builder.build_prompt(
    mode="sliding_window",
    prompt_type=PromptType.STRUCTURED, 
    version=PromptVersion.V2,
    book_title="История государства Российского",
    book_year="1767"
)

Custom Prompt Configuration

For advanced users, prompts can be fully customized via JSON configuration:

# Create custom prompt builder with custom config
custom_builder = PromptBuilder(
    config_path="path/to/custom_prompts.json",
    metadata_path="path/to/document_metadata.json"
)

# Use convenience functions for common cases
from llm_ocr.prompts.prompt_builder import get_prompt, PromptType, PromptVersion

prompt = get_prompt(
    mode="correction",
    prompt_type=PromptType.SIMPLE,
    version=PromptVersion.V4,
    book_title="Тестовая книга"
)

Prompt Configuration Format

Custom prompt configurations use JSON format with modular components:

{
  "components": {
    "base_ocr": "Extract OCR text from 18th century Russian book",
    "orthography": "Preserve ѣ, Ѳ, ѳ, ѵ, ъ characters",
    "json_format": "Respond with JSON: {\"line\": \"text\"}"
  },
  "context_enrichment": {
    "v1": "",
    "v2": " from {book_year} book \"{book_title}\"",
    "v3": " processing \"{book_title}\" ({book_year})",
    "v4": " обрабатываете \"{book_title}\" {book_year} года"
  },
  "mode_instructions": {
    "single_line": "Process single line",
    "sliding_window": "Process sliding window", 
    "full_page": "Process full page",
    "correction": "Correct OCR text"
  }
}

Configuration

Model Configuration

from llm_ocr.config import ModelConfig

config = ModelConfig(
    max_tokens=2048,
    temperature=0.0,
    sliding_window_size=3,
    batch_size=10
)

Evaluation Configuration

from llm_ocr.config import EvaluationConfig

eval_config = EvaluationConfig(
    use_char_accuracy=True,
    use_word_accuracy=True,
    use_old_char_preservation=True,
    include_detailed_analysis=True
)
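
How these objects are passed into a run depends on the OCRPipelineWorkflow signature; the sketch below is illustrative only, and the model_config / evaluation_config keyword names are assumptions:

from llm_ocr.config import EvaluationConfig, ModelConfig
from llm_ocr.models import ProcessingMode
from llm_ocr.workflow import OCRPipelineWorkflow

# NOTE: model_config / evaluation_config are hypothetical keyword names
# used for illustration; check the OCRPipelineWorkflow signature.
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[ProcessingMode.FULL_PAGE],
    model_config=ModelConfig(max_tokens=2048, temperature=0.0),
    evaluation_config=EvaluationConfig(use_char_accuracy=True, use_word_accuracy=True),
)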

Data Format

Input Requirements

Your data folder should contain:

  • {id}.xml: ALTO XML file with text coordinates
  • {id}.jpeg: Corresponding image file
  • {id}.txt: Ground truth text (optional, for evaluation)
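
A small helper (hypothetical, based only on the naming convention above) to sanity-check a folder before starting a run:

from pathlib import Path

def check_input_folder(folder: str, doc_id: str) -> None:
    """Verify that the ALTO XML and image are present; ground truth is optional."""
    root = Path(folder)
    for suffix in (".xml", ".jpeg"):
        if not (root / f"{doc_id}{suffix}").exists():
            raise FileNotFoundError(f"Missing required file: {doc_id}{suffix}")
    if not (root / f"{doc_id}.txt").exists():
        print(f"No {doc_id}.txt found: ground-truth evaluation will not be possible.")

check_input_folder("ground_truth", "document_001")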

Output Format

Results are saved as JSON files with complete metrics and analysis:

{
  "document_info": {
    "document_name": "document_001",
    "timestamp": "20250124_143022"
  },
  "models": {
    "claude-3-7-sonnet-20250219": {
      "ocr_results": {
        "fullpage": {
          "lines": [...],
          "metrics": {...}
        }
      },
      "correction_results": {
        "original_ocr_text": "...",
        "corrected_text": "...",
        "metrics": {...}
      }
    }
  }
}
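
The structure above can be read back directly; for example, to pull the full-page OCR metrics for each model (the file path is illustrative and depends on where your results are written):

import json
from pathlib import Path

# Illustrative path; actual filenames depend on your output settings.
result_path = Path("results/document_001.json")
report = json.loads(result_path.read_text(encoding="utf-8"))

for model_name, model_data in report["models"].items():
    metrics = model_data["ocr_results"]["fullpage"]["metrics"]
    print(model_name, metrics)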

Advanced Usage

Custom Model Integration

from llm_ocr.llm.base import BaseOCRModel

class CustomOCRModel(BaseOCRModel):
    def process_single_line(self, image_base64: str):
        # Implement single line processing
        pass

    def process_full_page(self, page_image_base64: str, id: str):
        # Implement full page processing
        pass

    def correct_text(self, text: str, image_base64: str):
        # Implement text correction
        pass

Batch Processing

from llm_ocr.workflow import run_multi_model_workflow

results = run_multi_model_workflow(
    xml_path="data/document.xml",
    image_path="data/document.jpeg",
    ground_truth_path="data/document.txt",
    model_names=["claude-3-7-sonnet-20250219", "gpt-4o-2024-08-06"],
    output_dir="results"
)

Development

Environment Setup

# Clone and set up development environment
git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with development dependencies
pip install -e ".[dev]"

Running Tests

# Activate virtual environment first
source venv/bin/activate

# Run all tests with coverage
pytest

# Or use the custom test runner
python run_tests.py

# Run specific test files
pytest tests/unit/test_metrics.py
pytest tests/integration/test_basic_workflow.py

# Run with verbose output
pytest -v

# Run with coverage report
pytest --cov=llm_ocr --cov-report=html

Note: Always activate the virtual environment before running tests to avoid dependency conflicts.

Code Quality

# Activate virtual environment first
source venv/bin/activate

# Format code
python -m black llm_ocr/
python -m isort llm_ocr/

# Lint code
python -m ruff check llm_ocr/

# Type checking (requires Python 3.10+ target)
python -m mypy llm_ocr/ --python-version=3.10

# Run all quality checks
python run_tests.py  # Includes tests + quality checks

Pre-commit Hooks

This project uses pre-commit hooks to maintain code quality:

# Install pre-commit hooks
pre-commit install

# Run hooks manually
pre-commit run --all-files

Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Set up development environment: source venv/bin/activate && pip install -e ".[dev]"
  4. Make your changes and add tests
  5. Run the test suite: pytest or python run_tests.py
  6. Ensure code quality: pre-commit run --all-files
  7. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this package in your research, please cite:

@software{llm_ocr_package,
  title = {LLM OCR: Multi-Provider OCR Evaluation and Correction},
  author = {Maria Levchenko},
  year = {2025},
  url = {https://github.com/mary-lev/llm-ocr}
}
