An LLM-powered OCR evaluation and correction package that supports multiple language models for OCR processing and text correction tasks.

## Features
- Multi-Provider LLM Support: Claude, GPT-4, Gemini, and Together AI
- Multiple Processing Modes: Single-line, sliding window, and full-page OCR
- Evaluation: Character accuracy, word accuracy, case preservation, and error analysis
- OCR Correction: LLM-based text correction with configurable output formats
- ALTO XML Support: Process ALTO XML files with corresponding images
- Detailed Metrics: Extensive evaluation metrics with error pattern analysis
- Workflow Management: Complete pipeline orchestration with result tracking
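To make the evaluation metrics concrete: character accuracy is commonly computed as one minus the normalized edit distance between the ground truth and the OCR output. A minimal, self-contained sketch of that definition (illustrative only, not the package's own implementation):

```python
def char_accuracy(ground_truth: str, prediction: str) -> float:
    """Character accuracy as 1 - normalized Levenshtein distance."""
    m, n = len(ground_truth), len(prediction)
    # Single-row dynamic-programming edit distance
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ground_truth[i - 1] != prediction[j - 1]))  # substitution
            prev = cur
    return 1 - dp[n] / max(m, n, 1)
```

For example, `char_accuracy("abcd", "abcf")` is 0.75: one substitution over four characters.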
## Installation

```bash
git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e .
```

For development:

```bash
git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr
pip install -e ".[dev]"
```

## Configuration

Copy the `.env.template` file and fill in your API key values:

```bash
cp .env.template .env
# Edit .env and add your API key values
```

The following environment variables must be set in your `.env` file:

- `OPENAI_API_KEY`
- `ANTHROPIC_API_KEY`
- `GEMINI_API_KEY`
- `TOGETHER_API_KEY`
- `DEEP_SEEK_API_KEY`
- `DEEPINFRA_API_KEY`
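These are standard `KEY=VALUE` entries; python-dotenv's `load_dotenv()` is the usual way to read them, but for illustration a minimal hand-rolled loader looks like this:

```python
import os
from pathlib import Path

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines; blanks and '#' comments skipped."""
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Never overwrite variables already present in the environment
        os.environ.setdefault(key.strip(), value.strip())
```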
## Quick Start

```python
from llm_ocr.workflow import OCRPipelineWorkflow
from llm_ocr.models import ProcessingMode
from llm_ocr.prompts.prompt import PromptVersion

# Initialize the workflow
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",  # Contains .xml, .jpeg files
    ocr_model_name="claude-3-7-sonnet-20250219",
    modes=[ProcessingMode.FULL_PAGE],
    prompt_version=PromptVersion.V3
)

# Run the complete pipeline
results = workflow.run_pipeline()

# Or run individual steps
workflow.run_ocr()
workflow.evaluate_ocr()
workflow.run_correction()
workflow.evaluate_correction()
```

## Architecture

- Pipelines (`llm_ocr/pipelines/`): OCR and correction processing workflows
- LLM Models (`llm_ocr/llm/`): Multi-provider LLM support with a unified interface
- Prompts (`llm_ocr/prompts/`): Modular prompt generation with version control and context enrichment
- Evaluators (`llm_ocr/evaluators/`): Comprehensive metrics and evaluation framework
- Processors (`llm_ocr/processors/`): Input format handling (ALTO XML)
- Workflow (`llm_ocr/workflow.py`): Main orchestration and result management
## Processing Modes

- `SINGLE_LINE`: Process each text line individually
- `SLIDING_WINDOW`: Process lines with a context window
- `FULL_PAGE`: Process the entire page at once
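For intuition, the sliding-window grouping can be sketched as follows; this is an illustrative stand-in, not the package's actual implementation:

```python
def sliding_windows(lines: list[str], window_size: int = 3) -> list[list[str]]:
    """For each line, collect it together with its neighbours so the model
    sees surrounding context (windows are clipped at page boundaries)."""
    windows = []
    for i in range(len(lines)):
        start = max(0, i - window_size // 2)
        end = min(len(lines), i + window_size // 2 + 1)
        windows.append(lines[start:end])
    return windows
```

With `window_size=3`, each interior line is processed with one line of context on either side.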
## Supported Models

- Claude: Anthropic's Claude models (3.5 Sonnet, etc.)
- GPT-4: OpenAI's GPT-4 models
- Gemini: Google's Gemini models
- Together AI: Various open-source models via Together
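A multi-provider client typically dispatches on the model name; here is a hypothetical sketch of such routing (the package's real dispatch logic may differ):

```python
def infer_provider(model_name: str) -> str:
    """Map a model-name prefix to a provider; unknown names fall through
    to Together AI, which hosts assorted open-source models.
    (Hypothetical helper, not part of the package API.)"""
    prefixes = {"claude": "anthropic", "gpt": "openai", "gemini": "google"}
    for prefix, provider in prefixes.items():
        if model_name.lower().startswith(prefix):
            return provider
    return "together"
```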
## Prompt System

The package features a modular prompt generation system that allows flexible experimentation with different prompt strategies and automatic context enrichment.

Different prompt versions are available for various use cases and languages:

- V1: Basic OCR prompts without additional context
- V2: Enhanced with historical book metadata (year, title)
- V3: Advanced context with improved instructions
- V4: Russian-language optimized prompts
```python
from llm_ocr.prompts.prompt import PromptVersion

# Use different prompt versions
workflow = OCRPipelineWorkflow(
    id="document_001",
    folder="ground_truth",
    ocr_model_name="claude-3-7-sonnet-20250219",
    prompt_version=PromptVersion.V2,  # Will include book metadata context
    modes=[ProcessingMode.FULL_PAGE]
)
```

The system supports different output formats and processing modes:
```python
from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Structured JSON output for single-line processing
structured_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,  # JSON format
    version=PromptVersion.V3
)

# Simple text output for full-page processing
simple_prompt = builder.build_prompt(
    mode="full_page",
    prompt_type=PromptType.SIMPLE,  # Plain text format
    version=PromptVersion.V1
)
```

When document metadata is available, prompts can be automatically enriched with context:
```python
from llm_ocr.prompts.prompt_builder import PromptBuilder, PromptType, PromptVersion

builder = PromptBuilder()

# Automatic enrichment using a document ID
enriched_prompt = builder.build_prompt(
    mode="single_line",
    prompt_type=PromptType.STRUCTURED,
    version=PromptVersion.V2,
    document_id="historical_doc_001"  # Auto-loads book metadata
)

# Manual context variables
manual_prompt = builder.build_prompt(
    mode="sliding_window",
    prompt_type=PromptType.STRUCTURED,
    version=PromptVersion.V2,
    book_title="История государства Российского",
    book_year="1767"
)
```

For advanced users, prompts can be fully customized via JSON configuration:
```python
# Create a custom prompt builder with a custom config
custom_builder = PromptBuilder(
    config_path="path/to/custom_prompts.json",
    metadata_path="path/to/document_metadata.json"
)

# Use convenience functions for common cases
from llm_ocr.prompts.prompt_builder import get_prompt, PromptType, PromptVersion

prompt = get_prompt(
    mode="correction",
    prompt_type=PromptType.SIMPLE,
    version=PromptVersion.V4,
    book_title="Тестовая книга"
)
```

Custom prompt configurations use JSON format with modular components:
```json
{
  "components": {
    "base_ocr": "Extract OCR text from 18th century Russian book",
    "orthography": "Preserve ѣ, Ѳ, ѳ, ѵ, ъ characters",
    "json_format": "Respond with JSON: {\"line\": \"text\"}"
  },
  "context_enrichment": {
    "v1": "",
    "v2": " from {book_year} book \"{book_title}\"",
    "v3": " processing \"{book_title}\" ({book_year})",
    "v4": " обрабатываете \"{book_title}\" {book_year} года"
  },
  "mode_instructions": {
    "single_line": "Process single line",
    "sliding_window": "Process sliding window",
    "full_page": "Process full page",
    "correction": "Correct OCR text"
  }
}
```

## Model Configuration

```python
from llm_ocr.config import ModelConfig

config = ModelConfig(
    max_tokens=2048,
    temperature=0.0,
    sliding_window_size=3,
    batch_size=10
)
```

## Evaluation Configuration

```python
from llm_ocr.config import EvaluationConfig

eval_config = EvaluationConfig(
    use_char_accuracy=True,
    use_word_accuracy=True,
    use_old_char_preservation=True,
    include_detailed_analysis=True
)
```

## Data Format

Your data folder should contain:
- `{id}.xml`: ALTO XML file with text coordinates
- `{id}.jpeg`: Corresponding image file
- `{id}.txt`: Ground truth text (optional, for evaluation)
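Resolving these files for one document might look like the following; `find_document_files` is a hypothetical helper for illustration, not part of the package API:

```python
from pathlib import Path

def find_document_files(folder: str, doc_id: str):
    """Locate the ALTO XML, image, and optional ground-truth files."""
    base = Path(folder)
    xml, image, txt = (base / f"{doc_id}{ext}" for ext in (".xml", ".jpeg", ".txt"))
    missing = [p.name for p in (xml, image) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"required files missing: {missing}")
    # Ground truth is optional: return None when absent
    return xml, image, txt if txt.exists() else None
```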
## Output

Results are saved as JSON files with complete metrics and analysis:

```json
{
  "document_info": {
    "document_name": "document_001",
    "timestamp": "20250124_143022"
  },
  "models": {
    "claude-3-7-sonnet-20250219": {
      "ocr_results": {
        "fullpage": {
          "lines": [...],
          "metrics": {...}
        }
      },
      "correction_results": {
        "original_ocr_text": "...",
        "corrected_text": "...",
        "metrics": {...}
      }
    }
  }
}
```

## Extending

To add support for a custom model, subclass `BaseOCRModel` and implement its processing methods:

```python
from llm_ocr.llm.base import BaseOCRModel

class CustomOCRModel(BaseOCRModel):
    def process_single_line(self, image_base64: str):
        # Implement single-line processing
        pass

    def process_full_page(self, page_image_base64: str, id: str):
        # Implement full-page processing
        pass

    def correct_text(self, text: str, image_base64: str):
        # Implement text correction
        pass
```

To compare several models on the same document, use the multi-model workflow:

```python
from llm_ocr.workflow import run_multi_model_workflow

results = run_multi_model_workflow(
    xml_path="data/document.xml",
    image_path="data/document.jpeg",
    ground_truth_path="data/document.txt",
    model_names=["claude-3-7-sonnet-20250219", "gpt-4o-2024-08-06"],
    output_dir="results"
)
```
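Result dictionaries follow the JSON structure shown above, so per-model metrics can be collected with a small helper (hypothetical, not part of the package API):

```python
def extract_metrics(results: dict) -> dict:
    """Flatten the nested result structure into {(model, mode): metrics}."""
    flat = {}
    for model, data in results.get("models", {}).items():
        for mode, run in data.get("ocr_results", {}).items():
            flat[(model, mode)] = run.get("metrics", {})
    return flat
```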
## Development

```bash
# Clone and set up the development environment
git clone https://github.com/mary-lev/llm-ocr.git
cd llm-ocr

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with development dependencies
pip install -e ".[dev]"
```

### Running Tests

```bash
# Activate the virtual environment first
source venv/bin/activate

# Run all tests with coverage
pytest

# Or use the custom test runner
python run_tests.py

# Run specific test files
pytest tests/unit/test_metrics.py
pytest tests/integration/test_basic_workflow.py

# Run with verbose output
pytest -v

# Run with a coverage report
pytest --cov=llm_ocr --cov-report=html
```

Note: Always activate the virtual environment before running tests to avoid dependency conflicts.
### Code Quality

```bash
# Activate the virtual environment first
source venv/bin/activate

# Format code
python -m black llm_ocr/
python -m isort llm_ocr/

# Lint code
python -m ruff check llm_ocr/

# Type checking (requires a Python 3.10+ target)
python -m mypy llm_ocr/ --python-version=3.10

# Run all quality checks
python run_tests.py  # Includes tests + quality checks
```

This project uses pre-commit hooks to maintain code quality:

```bash
# Install pre-commit hooks
pre-commit install

# Run hooks manually
pre-commit run --all-files
```

## Contributing

- Fork the repository
- Create a feature branch: `git checkout -b feature-name`
- Set up the development environment: `source venv/bin/activate && pip install -e ".[dev]"`
- Make your changes and add tests
- Run the test suite: `pytest` or `python run_tests.py`
- Ensure code quality: `pre-commit run --all-files`
- Submit a pull request
## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Citation

If you use this package in your research, please cite:

```bibtex
@software{llm_ocr_package,
  title = {LLM OCR: Multi-Provider OCR Evaluation and Correction},
  author = {Maria Levchenko},
  year = {2025},
  url = {https://github.com/mary-lev/llm-ocr}
}
```