IDP Template

A production-ready Intelligent Document Processing (IDP) template for extracting, processing, and classifying documents, supporting multiple extraction methods (MarkItDown, AWS Bedrock Data Automation, Tesseract OCR) alongside AI-powered classification.

Architecture

[Architecture diagram: IDP Template]

The architecture follows a serverless, event-driven pattern using AWS Batch for scalable document processing:

  1. User Request: Users invoke the API Gateway endpoint to submit documents for processing
  2. Trigger: API Gateway triggers a Lambda function to handle the request
  3. Job Submission: Lambda submits a batch job to AWS Batch with document details (a minimal handler sketch follows this list)
  4. Queue Management: AWS Batch Queue manages job scheduling and prioritization
  5. Compute: Fargate Batch Compute environment runs containerized processing jobs
  6. AI Classification: Jobs call Amazon Bedrock for AI-powered document classification
  7. Storage: Input documents are read from and results are written to S3
  8. Container Registry: ECR stores the Docker images used by Fargate
  9. CI/CD: GitHub Actions builds and pushes Docker images to ECR
  10. Monitoring: CloudWatch Logs captures logs and metrics for observability
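
As promised in step 3, here is a hedged sketch of the job-submitting Lambda handler for steps 2-3. The queue and job definition names, the DOCUMENT_KEY variable, and the request shape are illustrative placeholders, not the template's actual code:

import json
import os
import uuid

import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Submit a document-processing job to AWS Batch (illustrative only)."""
    body = json.loads(event.get("body") or "{}")
    document_key = body["document_key"]  # S3 key of the uploaded PDF (assumed field)

    response = batch.submit_job(
        jobName=f"idp-{uuid.uuid4().hex}",
        jobQueue=os.environ["BATCH_JOB_QUEUE"],            # placeholder env var
        jobDefinition=os.environ["BATCH_JOB_DEFINITION"],  # placeholder env var
        containerOverrides={
            "environment": [{"name": "DOCUMENT_KEY", "value": document_key}],
        },
    )
    return {"statusCode": 202, "body": json.dumps({"jobId": response["jobId"]})}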

Features

Document Processing

  • MarkItDown: High-quality PDF to Markdown conversion using Microsoft's MarkItDown library (recommended for born-digital PDFs)
  • AWS Bedrock Data Automation (BDA): Enterprise-grade text extraction using AWS Bedrock's Data Automation service
  • Tesseract OCR: Open-source OCR for scanned documents and images
  • Extensible Architecture: Easily add your own processing methods

Document Classification

  • AI-Powered Classification: Classify documents using LiteLLM with AWS Bedrock (Claude Sonnet or Haiku, configurable in config/config.yaml)
  • Configurable Document Types: Define custom document categories via YAML
  • Confidence Scoring: Get confidence scores and reasoning for classifications
  • Batch Processing: Efficiently process multiple documents with statistics

Architecture

  • Type-Safe: Pydantic schemas for documents, classifications, and results
  • Configuration-Driven: Flexible YAML configuration
  • Modular Pipelines: Composable pipelines for different workflows
  • Comprehensive Logging: Built-in structured logging with loguru
  • AWS Batch Ready: Pre-structured for scalable serverless deployment with AWS Batch and Fargate

Project Structure

idp-template/
├── config/
│   └── config.yaml              # Main configuration file
├── data/
│   ├── raw/                     # Input PDFs
│   └── processed/               # (Optional) Intermediate files
├── src/
│   ├── classifiers/             # Document classification
│   │   ├── __init__.py
│   │   └── document_classifier.py
│   ├── pipelines/               # Processing pipelines
│   │   ├── __init__.py
│   │   ├── full_pipeline.py     # End-to-end pipeline
│   │   ├── pdf_processing/      # PDF processing pipeline
│   │   └── classification/      # Classification pipeline
│   ├── processors/              # Core processing methods
│   │   ├── __init__.py
│   │   ├── pdf_processing.py    # MarkItDown processor
│   │   └── ocr_processing.py    # BDA and Tesseract processors
│   ├── schemas/                 # Pydantic data models
│   │   ├── __init__.py
│   │   └── document.py
│   ├── batch/                   # AWS Batch job handlers
│   │   ├── classification/
│   │   └── pdf_processing/
│   ├── scripts/                 # Utility scripts
│   │   └── run_pipeline.py      # Example pipeline runner
│   ├── utils/                   # Helper functions
│   └── config.py                # Configuration management
├── results/                     # Output directory (auto-created)
│   ├── markitdown/             # MarkItDown outputs
│   ├── bda/                    # BDA outputs
│   ├── tesseract/              # Tesseract outputs
│   └── classification/         # Classification results
├── tests/                      # Unit and integration tests
├── Dockerfile                  # Container definition for Batch jobs
├── pyproject.toml              # Project dependencies
└── README.md

Installation

Prerequisites

  • Python 3.11+ (3.13 recommended)
  • uv package manager
  • AWS credentials configured (for BDA and Bedrock classification)
  • Tesseract OCR installed on your system (optional, for OCR processing)
  • Docker (for building container images for AWS Batch)

System Dependencies

Tesseract OCR (Optional)

Fedora/RHEL:

sudo dnf install tesseract tesseract-langpack-eng

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng

macOS:

brew install tesseract

Verify installation:

tesseract --version
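
If you call Tesseract from Python, it is also worth confirming the binary is visible to the Python bindings. The check below assumes pytesseract, the usual Python wrapper (check pyproject.toml for the binding this template actually uses):

import pytesseract

# Raises TesseractNotFoundError if the tesseract binary is not on PATH
print(pytesseract.get_tesseract_version())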

Install Python Dependencies

# Clone the repository
git clone https://github.com/LokaHQ/idp-template.git
cd idp-template

# Install dependencies using uv
uv sync

# Verify installation
uv run python -c "from src.pipelines import FullIDPPipeline; print('Installation successful!')"

Quick Start

1. Update Configuration

Edit config/config.yaml to match your setup:

aws:
  region: "us-east-1"
  s3_bucket: "your-s3-bucket-name"  # Required for BDA

classification:
  model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  document_types:
    - "invoice"
    - "receipt"
    - "contract"
    - "letter"
    - "other"

2. Run the Pipeline

Simple Script Runner

Create a file run_full_pipeline.py:

from src.pipelines.full_pipeline import FullIDPPipeline

def main():
    # Initialize pipeline with desired methods
    pipeline = FullIDPPipeline(
        use_markitdown=True,      # Primary method (recommended)
        use_bda=False,            # Requires S3 bucket
        use_tesseract=True,       # Fallback for scanned docs
        prefer_method="markitdown",
        save_classification_results=True,
    )
    
    # Process a single document
    result = pipeline.process_and_classify("data/raw/sample_document.pdf")
    
    if result:
        print(f"âś“ Document Type: {result.document_type}")
        print(f"âś“ Confidence: {result.classification.confidence:.2%}")
        print(f"âś“ Reasoning: {result.classification.reasoning}")
    else:
        print("âś— Processing failed")

if __name__ == "__main__":
    main()

Run it:

uv run python run_full_pipeline.py

Using the Included Script

# Run the example pipeline
uv run python src/scripts/run_pipeline.py

Expected output:

2025-11-28 13:47:06.205 | INFO     | Configuration loaded from config/config.yaml
2025-11-28 13:47:07.492 | INFO     | Running full IDP pipeline on: data/raw/sample_document.pdf
2025-11-28 13:47:07.567 | SUCCESS  | MarkItDown conversion saved to results/markitdown/sample_document_markitdown.md
2025-11-28 13:47:13.324 | SUCCESS  | Document classified as 'invoice' with confidence 0.99
2025-11-28 13:47:13.324 | SUCCESS  | Pipeline complete: classified as 'DocumentType.INVOICE'

Usage Examples

Example 1: PDF Processing Only

from src.pipelines.pdf_processing import PDFProcessingPipeline

# Initialize with multiple methods
pipeline = PDFProcessingPipeline(
    use_markitdown=True,
    use_tesseract=True,
    use_bda=False,
)

# Process a PDF
results = pipeline.process_pdf("data/raw/invoice.pdf")

# Access results
if results["markitdown"]:
    doc = results["markitdown"]
    print(f"Content length: {len(doc.content)} characters")
    print(f"Processed with: {doc.metadata.processing_method}")
    print(f"Markdown saved to: {doc.content_markdown_path}")

Example 2: Classification Only

from src.pipelines.classification import ClassificationPipeline
from src.schemas import Document, DocumentMetadata

# Read pre-processed markdown
with open("results/markitdown/invoice_markitdown.md", "r") as f:
    content = f.read()

# Create document
metadata = DocumentMetadata(
    file_path="data/raw/invoice.pdf",
    file_name="invoice.pdf"
)
document = Document(metadata=metadata, content=content)

# Classify
pipeline = ClassificationPipeline(save_results=True)
result = pipeline.classify_document(document)

print(f"Type: {result.document_type}")
print(f"Confidence: {result.classification.confidence:.2%}")
print(f"Reasoning: {result.classification.reasoning}")

Example 3: Batch Processing

from pathlib import Path
from src.pipelines.full_pipeline import FullIDPPipeline

# Gather all PDFs
pdf_dir = Path("data/raw")
pdf_files = [str(f) for f in pdf_dir.glob("*.pdf")]

# Initialize pipeline
pipeline = FullIDPPipeline(
    use_markitdown=True,
    prefer_method="markitdown",
)

# Batch process
batch_result = pipeline.batch_process_and_classify(pdf_files)

# Print summary
print(f"\n📊 Batch Processing Summary:")
print(f"   Total: {batch_result.total_count}")
print(f"   Successful: {batch_result.successful_count}")
print(f"   Failed: {batch_result.failed_count}")
print(f"   Success Rate: {batch_result.success_rate:.1%}")
print(f"   Processing Time: {batch_result.processing_time:.2f}s")

# Get classification statistics
if batch_result.documents:
    from src.pipelines.classification import ClassificationPipeline
    pipeline = ClassificationPipeline()
    summary = pipeline.get_classification_summary(batch_result.documents)
    
    print(f"\nđź“‹ Classification Summary:")
    print(f"   Document Types:")
    for doc_type, count in summary["type_distribution"].items():
        print(f"      {doc_type}: {count}")
    print(f"   Average Confidence: {summary['average_confidence']:.2%}")

Example 4: Using Individual Processors

from src.processors import (
    convert_with_markitdown,
    extract_with_tesseract,
    extract_with_bda,
)

pdf_file = "data/raw/invoice.pdf"

# Method 1: MarkItDown (best for digital PDFs)
markdown_path = convert_with_markitdown(pdf_file)
print(f"MarkItDown output: {markdown_path}")

# Method 2: Tesseract OCR (best for scanned documents)
markdown_path = extract_with_tesseract(pdf_file)
print(f"Tesseract output: {markdown_path}")

# Method 3: AWS BDA (enterprise solution)
markdown_path = extract_with_bda(pdf_file, s3_bucket="my-bucket")
print(f"BDA output: {markdown_path}")

Example 5: Direct Classification

from src.classifiers import classify_document, classify_document_file

# Option 1: Classify text directly
document_text = """
Invoice Number: INV-001
Date: 2025-01-15
Amount: $1,234.56
"""

result = classify_document(
    document_text=document_text,
    save_result=False
)

# Option 2: Classify from markdown file
result = classify_document_file("results/markitdown/invoice_markitdown.md")

print(f"Document Type: {result['document_type']}")
print(f"Confidence: {result['confidence']}")
print(f"All Scores: {result['all_scores']}")

Configuration

Complete Configuration Reference

# AWS Configuration
aws:
  region: "us-east-1"
  s3_bucket: "your-idp-bucket"  # Required for BDA

# Output Directories
output:
  base_dir: "results"
  markitdown_dir: "results/markitdown"
  bda_dir: "results/bda"
  tesseract_dir: "results/tesseract"
  classification_dir: "results/classification"

# PDF Processing
pdf:
  dpi: 300  # DPI for PDF to image conversion

# Tesseract OCR Configuration
tesseract:
  lang: "eng"  # Language: 'eng', 'deu', 'fra', 'mkd', etc.
  oem: 3       # OCR Engine Mode (0-3, 3 is default/best)
  psm: 6       # Page Segmentation Mode (6 = uniform block)

# Classification Configuration
classification:
  # Model (use one of):
  # - bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 (best quality)
  # - bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0 (faster/cheaper)
  model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  
  max_tokens: 1000
  temperature: 0.0  # Use 0 for deterministic classification
  
  # Document types to classify
  document_types:
    - "invoice"
    - "receipt"
    - "contract"
    - "letter"
    - "form"
    - "other"
  
  # System prompt (optional customization)
  system_prompt: |
    You are a document classification assistant. Analyze documents and classify
    them into predefined categories based on structure, content, and formatting.
  
  # Confidence threshold (0.0-1.0)
  confidence_threshold: 0.7

# BDA Configuration
bda:
  project_stage: "LIVE"
  extraction:
    granularity_types:
      - "DOCUMENT"
    bounding_box_enabled: false
    generative_field_enabled: false
  output:
    text_formats:
      - "MARKDOWN"
    additional_file_format_enabled: false

Environment Variables

Override configuration with environment variables:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export S3_BUCKET=your-bucket-name
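
How these variables interact with config.yaml depends on src/config.py; a common resolution pattern, shown here as a sketch rather than the template's actual loader, is to let the environment win over the YAML values:

import os

import yaml  # PyYAML

def load_aws_settings(path: str = "config/config.yaml") -> dict:
    """Resolve AWS settings, preferring environment variables over YAML."""
    with open(path, "r", encoding="utf-8") as f:
        aws = (yaml.safe_load(f) or {}).get("aws", {})
    return {
        "region": os.environ.get("AWS_REGION", aws.get("region")),
        "s3_bucket": os.environ.get("S3_BUCKET", aws.get("s3_bucket")),
    }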

Data Schemas

The project uses Pydantic for type-safe data modeling:

Document Schema

from src.schemas import Document, DocumentMetadata

metadata = DocumentMetadata(
    file_path="path/to/document.pdf",
    file_name="document.pdf",
    file_size=102400,  # bytes
    processing_method="markitdown"
)

document = Document(
    metadata=metadata,
    content="Extracted text content...",
    content_markdown_path="results/markitdown/document.md"
)

Classification Result

from src.schemas import ClassificationResult, DocumentType

result = ClassificationResult(
    document_type=DocumentType.INVOICE,
    confidence=0.95,
    reasoning="Document contains invoice number, date, amounts...",
    all_scores={
        "invoice": 0.95,
        "receipt": 0.03,
        "other": 0.02
    },
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    meets_threshold=True
)

Processed Document

from src.schemas import ProcessedDocument

processed_doc = ProcessedDocument(
    document=document,
    classification=result
)

# Convenient properties
print(processed_doc.document_type)  # DocumentType.INVOICE
print(processed_doc.is_classified)  # True

Batch Result

from src.schemas import BatchProcessingResult

batch_result = BatchProcessingResult(
    documents=[processed_doc1, processed_doc2],
    total_count=10,
    successful_count=8,
    failed_count=2,
    processing_time=45.2,
    errors=[
        {"file": "bad.pdf", "error": "Corrupted file"}
    ]
)

print(f"Success rate: {batch_result.success_rate:.1%}")

Extending the Template

Adding a Custom Processing Method

Here's how to add your own document processor (e.g., using a different OCR API):

1. Create Processor Function

Create src/processors/my_custom_processor.py:

from pathlib import Path
from typing import Optional
from loguru import logger
from ..config import config

def process_with_custom_method(pdf_path: str) -> Optional[str]:
    """
    Process PDF with your custom method.
    
    Args:
        pdf_path: Path to PDF file
        
    Returns:
        Path to output markdown file, or None if failed
    """
    try:
        logger.info(f"Processing with Custom Method: {pdf_path}")
        
        # Your processing logic here
        # Example: Call external API, run custom algorithm, etc.
        result_text = your_custom_processing(pdf_path)
        
        # Save to markdown
        pdf_name = Path(pdf_path).stem
        output_dir = Path(config.output.base_dir) / "custom"
        output_dir.mkdir(parents=True, exist_ok=True)
        
        output_path = output_dir / f"{pdf_name}_custom.md"
        output_path.write_text(result_text, encoding="utf-8")
        
        logger.success(f"Custom processing saved to {output_path}")
        return str(output_path)
        
    except Exception as e:
        logger.error(f"Error in custom processing: {e}")
        return None

2. Export the Processor

Update src/processors/__init__.py:

from .pdf_processing import convert_with_markitdown
from .ocr_processing import extract_with_bda, extract_with_tesseract
from .my_custom_processor import process_with_custom_method

__all__ = [
    "convert_with_markitdown",
    "extract_with_bda",
    "extract_with_tesseract",
    "process_with_custom_method",
]

3. Add to Processing Enum

Update src/schemas/document.py:

class ProcessingMethod(str, Enum):
    """Processing method enumeration."""
    MARKITDOWN = "markitdown"
    BDA = "bda"
    TESSERACT = "tesseract"
    CUSTOM = "custom"  # Add your method

4. Integrate with Pipeline

Update src/pipelines/pdf_processing/pipeline.py:

def __init__(
    self,
    use_markitdown: bool = True,
    use_bda: bool = False,
    use_tesseract: bool = False,
    use_custom: bool = False,  # Add parameter
    s3_bucket: Optional[str] = None,
):
    # ... existing code ...
    self.use_custom = use_custom

def process_pdf(self, pdf_path: str) -> Dict[str, Optional[Document]]:
    # ... existing code ...
    
    # Add custom processing
    if self.use_custom:
        from ...processors import process_with_custom_method
        logger.info("Processing with Custom Method...")
        markdown_path = process_with_custom_method(str(pdf_path))
        
        if markdown_path:
            results["custom"] = self._create_document(
                pdf_path, markdown_path, ProcessingMethod.CUSTOM
            )
        else:
            results["custom"] = None
    
    # ... rest of code ...

5. Use Your Custom Processor

from src.pipelines import FullIDPPipeline

pipeline = FullIDPPipeline(
    use_markitdown=True,
    use_custom=True,  # Enable your custom processor
    prefer_method="custom"
)

result = pipeline.process_and_classify("document.pdf")

AWS Bedrock Setup

Prerequisites

  1. AWS Account with Bedrock access
  2. IAM Permissions for Bedrock and S3
  3. Model Access enabled for Claude models

Enable Bedrock Models

  1. Go to AWS Bedrock Console
  2. Navigate to "Model access"
  3. Request access to the Claude models referenced in config/config.yaml:
    • Anthropic Claude Sonnet
    • Anthropic Claude Haiku (optional)
  4. Wait for approval (usually instant); the snippet below shows an optional programmatic check
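
To verify access programmatically (an optional check, not part of the template), list the Anthropic models visible to your account with boto3:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Returns only models your account can see; request access in the console if the list is empty
response = bedrock.list_foundation_models(byProvider="Anthropic")
for model in response["modelSummaries"]:
    print(model["modelId"])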

For Classification

No additional setup is required beyond model access. The template uses LiteLLM, which handles Bedrock authentication automatically via your AWS credentials.
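
Conceptually, the classifier's LLM call reduces to a litellm.completion invocation like the simplified sketch below; the template's real prompt and response parsing live in src/classifiers/document_classifier.py:

from litellm import completion

# Credentials are picked up from the AWS environment/config, as with boto3
response = completion(
    model="bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[
        {"role": "system", "content": "You are a document classification assistant."},
        {"role": "user", "content": "Classify this document: Invoice Number: INV-001 ..."},
    ],
    max_tokens=1000,
    temperature=0.0,
)
print(response.choices[0].message.content)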

For BDA (Bedrock Data Automation)

  1. Create S3 Bucket:

    aws s3 mb s3://your-idp-bucket --region us-east-1
  2. Update Configuration:

    aws:
      s3_bucket: "your-idp-bucket"
  3. IAM Permissions (attach to your role/user):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "bedrock:*",
            "bedrock-data-automation:*",
            "bedrock-data-automation-runtime:*"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your-idp-bucket",
            "arn:aws:s3:::your-idp-bucket/*"
          ]
        }
      ]
    }

AWS Batch Deployment

The template is designed to run at scale using AWS Batch with Fargate compute. This provides serverless, containerized execution without managing infrastructure.

Architecture Components

  • API Gateway: Entry point for document processing requests
  • Lambda: Lightweight function to submit jobs to AWS Batch
  • AWS Batch: Manages job queues and compute environments
  • Fargate: Serverless containers that run the processing jobs
  • ECR: Stores Docker images for the batch jobs
  • S3: Input/output storage for documents and results
  • Bedrock: AI-powered document classification
  • CloudWatch: Logging and monitoring

Building the Docker Image

# Build the image
docker build -t idp-template:latest .

# Tag for ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag idp-template:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest

# Push to ECR
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest

IAM Permissions for Batch

The Fargate task execution role needs permissions for:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock-data-automation:*",
        "bedrock-data-automation-runtime:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-idp-bucket",
        "arn:aws:s3:::your-idp-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Troubleshooting

Common Issues

1. Tesseract Not Found

Error: tesseract is not installed or it's not in your PATH

Solution:

# Fedora
sudo dnf install tesseract tesseract-langpack-eng

# Ubuntu
sudo apt-get install tesseract-ocr

# Verify
tesseract --version

2. AWS Credentials Not Found

Error: Unable to locate credentials

Solution:

# Configure AWS CLI
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1

3. S3 Bucket Not Found (BDA)

Error: NoSuchBucket: The specified bucket does not exist

Solution:

# Create the bucket
aws s3 mb s3://your-idp-bucket --region us-east-1

# Update config.yaml
aws:
  s3_bucket: "your-idp-bucket"

4. Batch Job Fails to Start

Error: Job stuck in RUNNABLE state

Solution:

  • Check that ECR image exists and is accessible
  • Verify Fargate task execution role has correct permissions
  • Check CloudWatch logs for detailed error messages
  • Ensure the VPC has proper networking (NAT gateway for private subnets); the diagnostic sketch below shows how to inspect a stuck job
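
For example, a minimal boto3 check (substitute the job ID returned by submit_job):

import boto3

batch = boto3.client("batch")

# describe_jobs accepts up to 100 job IDs per call
for job in batch.describe_jobs(jobs=["<job-id>"])["jobs"]:
    print(job["jobName"], job["status"], job.get("statusReason", "no status reason yet"))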

Debug Mode

Enable debug logging:

from loguru import logger
import sys

# Add debug level logging
logger.remove()
logger.add(sys.stderr, level="DEBUG")

# Now run your pipeline
pipeline.process_and_classify("document.pdf")

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test
uv run pytest tests/unit/test_classifiers.py -v

Acknowledgments

Built with:

  • MarkItDown by Microsoft
  • LiteLLM for unified LLM APIs
  • Pydantic for data validation
  • loguru for logging
  • AWS Bedrock for AI capabilities
  • AWS Batch for scalable compute
