IDP Template

A production-ready Intelligent Document Processing (IDP) template for extracting, processing, and classifying documents, supporting multiple extraction methods (MarkItDown, AWS Bedrock Data Automation, Tesseract OCR) alongside AI-powered classification.

Architecture

[Architecture diagram: IDP Template]

The architecture follows a serverless, event-driven pattern using AWS Batch for scalable document processing:

  1. User Request: Users invoke the API Gateway endpoint to submit documents for processing
  2. Trigger: API Gateway triggers a Lambda function to handle the request
  3. Job Submission: Lambda submits a batch job to AWS Batch with document details (a minimal handler sketch follows this list)
  4. Queue Management: AWS Batch Queue manages job scheduling and prioritization
  5. Compute: Fargate Batch Compute environment runs containerized processing jobs
  6. AI Classification: Jobs call Amazon Bedrock for AI-powered document classification
  7. Storage: Input documents are read from and results are written to S3
  8. Container Registry: ECR stores the Docker images used by Fargate
  9. CI/CD: GitHub Actions builds and pushes Docker images to ECR
  10. Monitoring: CloudWatch Logs captures logs and metrics for observability
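
As promised in step 3, here is a hedged sketch of the job-submitting Lambda handler for steps 2-3. The queue and job definition names, the DOCUMENT_KEY variable, and the request shape are illustrative placeholders, not the template's actual code:

import json
import os
import uuid

import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Submit a document-processing job to AWS Batch (illustrative only)."""
    body = json.loads(event.get("body") or "{}")
    document_key = body["document_key"]  # S3 key of the uploaded PDF (assumed field)

    response = batch.submit_job(
        jobName=f"idp-{uuid.uuid4().hex}",
        jobQueue=os.environ["BATCH_JOB_QUEUE"],            # placeholder env var
        jobDefinition=os.environ["BATCH_JOB_DEFINITION"],  # placeholder env var
        containerOverrides={
            "environment": [{"name": "DOCUMENT_KEY", "value": document_key}],
        },
    )
    return {"statusCode": 202, "body": json.dumps({"jobId": response["jobId"]})}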

Features

Document Processing

  • MarkItDown: High-quality PDF to Markdown conversion using Microsoft's MarkItDown library (recommended for born-digital PDFs)
  • AWS Bedrock Data Automation (BDA): Enterprise-grade text extraction using AWS Bedrock's Data Automation service
  • Tesseract OCR: Open-source OCR for scanned documents and images
  • Extensible Architecture: Easily add your own processing methods

Document Classification

  • AI-Powered Classification: Classify documents using LiteLLM with AWS Bedrock (Claude Sonnet or Haiku, configurable in config/config.yaml)
  • Configurable Document Types: Define custom document categories via YAML
  • Confidence Scoring: Get confidence scores and reasoning for classifications
  • Batch Processing: Efficiently process multiple documents with statistics

Architecture

  • Type-Safe: Pydantic schemas for documents, classifications, and results
  • Configuration-Driven: Flexible YAML configuration
  • Modular Pipelines: Composable pipelines for different workflows
  • Comprehensive Logging: Built-in structured logging with loguru
  • AWS Batch Ready: Pre-structured for scalable serverless deployment with AWS Batch and Fargate

Project Structure

idp-template/
├── config/
│   └── config.yaml              # Main configuration file
├── data/
│   ├── raw/                     # Input PDFs
│   └── processed/               # (Optional) Intermediate files
├── src/
│   ├── classifiers/             # Document classification
│   │   ├── __init__.py
│   │   └── document_classifier.py
│   ├── pipelines/               # Processing pipelines
│   │   ├── __init__.py
│   │   ├── full_pipeline.py     # End-to-end pipeline
│   │   ├── pdf_processing/      # PDF processing pipeline
│   │   └── classification/      # Classification pipeline
│   ├── processors/              # Core processing methods
│   │   ├── __init__.py
│   │   ├── pdf_processing.py    # MarkItDown processor
│   │   └── ocr_processing.py    # BDA and Tesseract processors
│   ├── schemas/                 # Pydantic data models
│   │   ├── __init__.py
│   │   └── document.py
│   ├── batch/                   # AWS Batch job handlers
│   │   ├── classification/
│   │   └── pdf_processing/
│   ├── scripts/                 # Utility scripts
│   │   └── run_pipeline.py      # Example pipeline runner
│   ├── utils/                   # Helper functions
│   └── config.py                # Configuration management
├── results/                     # Output directory (auto-created)
│   ├── markitdown/             # MarkItDown outputs
│   ├── bda/                    # BDA outputs
│   ├── tesseract/              # Tesseract outputs
│   └── classification/         # Classification results
├── tests/                      # Unit and integration tests
├── Dockerfile                  # Container definition for Batch jobs
├── pyproject.toml              # Project dependencies
└── README.md

Installation

Prerequisites

  • Python 3.11+ (3.13 recommended)
  • uv package manager
  • AWS credentials configured (for BDA and Bedrock classification)
  • Tesseract OCR installed on your system (optional, for OCR processing)
  • Docker (for building container images for AWS Batch)

System Dependencies

Tesseract OCR (Optional)

Fedora/RHEL:

sudo dnf install tesseract tesseract-langpack-eng

Ubuntu/Debian:

sudo apt-get install tesseract-ocr tesseract-ocr-eng

macOS:

brew install tesseract

Verify installation:

tesseract --version
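
If you call Tesseract from Python, it is also worth confirming the binary is visible to the Python bindings. The check below assumes pytesseract, the usual Python wrapper (check pyproject.toml for the binding this template actually uses):

import pytesseract

# Raises TesseractNotFoundError if the tesseract binary is not on PATH
print(pytesseract.get_tesseract_version())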

Install Python Dependencies

# Clone the repository
git clone https://github.com/LokaHQ/idp-template.git
cd idp-template

# Install dependencies using uv
uv sync

# Verify installation
uv run python -c "from src.pipelines import FullIDPPipeline; print('Installation successful!')"

Quick Start

1. Update Configuration

Edit config/config.yaml to match your setup:

aws:
  region: "us-east-1"
  s3_bucket: "your-s3-bucket-name"  # Required for BDA

classification:
  model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  document_types:
    - "invoice"
    - "receipt"
    - "contract"
    - "letter"
    - "other"

2. Run the Pipeline

Simple Script Runner

Create a file run_full_pipeline.py:

from src.pipelines.full_pipeline import FullIDPPipeline

def main():
    # Initialize pipeline with desired methods
    pipeline = FullIDPPipeline(
        use_markitdown=True,      # Primary method (recommended)
        use_bda=False,            # Requires S3 bucket
        use_tesseract=True,       # Fallback for scanned docs
        prefer_method="markitdown",
        save_classification_results=True,
    )
    
    # Process a single document
    result = pipeline.process_and_classify("data/raw/sample_document.pdf")
    
    if result:
        print(f"âś“ Document Type: {result.document_type}")
        print(f"âś“ Confidence: {result.classification.confidence:.2%}")
        print(f"âś“ Reasoning: {result.classification.reasoning}")
    else:
        print("âś— Processing failed")

if __name__ == "__main__":
    main()

Run it:

uv run python run_full_pipeline.py

Using the Included Script

# Run the example pipeline
uv run python src/scripts/run_pipeline.py

Expected output:

2025-11-28 13:47:06.205 | INFO     | Configuration loaded from config/config.yaml
2025-11-28 13:47:07.492 | INFO     | Running full IDP pipeline on: data/raw/sample_document.pdf
2025-11-28 13:47:07.567 | SUCCESS  | MarkItDown conversion saved to results/markitdown/sample_document_markitdown.md
2025-11-28 13:47:13.324 | SUCCESS  | Document classified as 'invoice' with confidence 0.99
2025-11-28 13:47:13.324 | SUCCESS  | Pipeline complete: classified as 'DocumentType.INVOICE'

Usage Examples

Example 1: PDF Processing Only

from src.pipelines.pdf_processing import PDFProcessingPipeline

# Initialize with multiple methods
pipeline = PDFProcessingPipeline(
    use_markitdown=True,
    use_tesseract=True,
    use_bda=False,
)

# Process a PDF
results = pipeline.process_pdf("data/raw/invoice.pdf")

# Access results
if results["markitdown"]:
    doc = results["markitdown"]
    print(f"Content length: {len(doc.content)} characters")
    print(f"Processed with: {doc.metadata.processing_method}")
    print(f"Markdown saved to: {doc.content_markdown_path}")

Example 2: Classification Only

from src.pipelines.classification import ClassificationPipeline
from src.schemas import Document, DocumentMetadata

# Read pre-processed markdown
with open("results/markitdown/invoice_markitdown.md", "r") as f:
    content = f.read()

# Create document
metadata = DocumentMetadata(
    file_path="data/raw/invoice.pdf",
    file_name="invoice.pdf"
)
document = Document(metadata=metadata, content=content)

# Classify
pipeline = ClassificationPipeline(save_results=True)
result = pipeline.classify_document(document)

print(f"Type: {result.document_type}")
print(f"Confidence: {result.classification.confidence:.2%}")
print(f"Reasoning: {result.classification.reasoning}")

Example 3: Batch Processing

from pathlib import Path
from src.pipelines.full_pipeline import FullIDPPipeline

# Gather all PDFs
pdf_dir = Path("data/raw")
pdf_files = [str(f) for f in pdf_dir.glob("*.pdf")]

# Initialize pipeline
pipeline = FullIDPPipeline(
    use_markitdown=True,
    prefer_method="markitdown",
)

# Batch process
batch_result = pipeline.batch_process_and_classify(pdf_files)

# Print summary
print(f"\n📊 Batch Processing Summary:")
print(f"   Total: {batch_result.total_count}")
print(f"   Successful: {batch_result.successful_count}")
print(f"   Failed: {batch_result.failed_count}")
print(f"   Success Rate: {batch_result.success_rate:.1%}")
print(f"   Processing Time: {batch_result.processing_time:.2f}s")

# Get classification statistics
if batch_result.documents:
    from src.pipelines.classification import ClassificationPipeline
    pipeline = ClassificationPipeline()
    summary = pipeline.get_classification_summary(batch_result.documents)
    
    print(f"\nđź“‹ Classification Summary:")
    print(f"   Document Types:")
    for doc_type, count in summary["type_distribution"].items():
        print(f"      {doc_type}: {count}")
    print(f"   Average Confidence: {summary['average_confidence']:.2%}")

Example 4: Using Individual Processors

from src.processors import (
    convert_with_markitdown,
    extract_with_tesseract,
    extract_with_bda,
)

pdf_file = "data/raw/invoice.pdf"

# Method 1: MarkItDown (best for digital PDFs)
markdown_path = convert_with_markitdown(pdf_file)
print(f"MarkItDown output: {markdown_path}")

# Method 2: Tesseract OCR (best for scanned documents)
markdown_path = extract_with_tesseract(pdf_file)
print(f"Tesseract output: {markdown_path}")

# Method 3: AWS BDA (enterprise solution)
markdown_path = extract_with_bda(pdf_file, s3_bucket="my-bucket")
print(f"BDA output: {markdown_path}")

Example 5: Direct Classification

from src.classifiers import classify_document, classify_document_file

# Option 1: Classify text directly
document_text = """
Invoice Number: INV-001
Date: 2025-01-15
Amount: $1,234.56
"""

result = classify_document(
    document_text=document_text,
    save_result=False
)

# Option 2: Classify from markdown file
result = classify_document_file("results/markitdown/invoice_markitdown.md")

print(f"Document Type: {result['document_type']}")
print(f"Confidence: {result['confidence']}")
print(f"All Scores: {result['all_scores']}")

Configuration

Complete Configuration Reference

# AWS Configuration
aws:
  region: "us-east-1"
  s3_bucket: "your-idp-bucket"  # Required for BDA

# Output Directories
output:
  base_dir: "results"
  markitdown_dir: "results/markitdown"
  bda_dir: "results/bda"
  tesseract_dir: "results/tesseract"
  classification_dir: "results/classification"

# PDF Processing
pdf:
  dpi: 300  # DPI for PDF to image conversion

# Tesseract OCR Configuration
tesseract:
  lang: "eng"  # Language: 'eng', 'deu', 'fra', 'mkd', etc.
  oem: 3       # OCR Engine Mode (0-3, 3 is default/best)
  psm: 6       # Page Segmentation Mode (6 = uniform block)

# Classification Configuration
classification:
  # Model (use one of):
  # - bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 (best quality)
  # - bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0 (faster/cheaper)
  model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
  
  max_tokens: 1000
  temperature: 0.0  # Use 0 for deterministic classification
  
  # Document types to classify
  document_types:
    - "invoice"
    - "receipt"
    - "contract"
    - "letter"
    - "form"
    - "other"
  
  # System prompt (optional customization)
  system_prompt: |
    You are a document classification assistant. Analyze documents and classify
    them into predefined categories based on structure, content, and formatting.
  
  # Confidence threshold (0.0-1.0)
  confidence_threshold: 0.7

# BDA Configuration
bda:
  project_stage: "LIVE"
  extraction:
    granularity_types:
      - "DOCUMENT"
    bounding_box_enabled: false
    generative_field_enabled: false
  output:
    text_formats:
      - "MARKDOWN"
    additional_file_format_enabled: false

Environment Variables

Override configuration with environment variables:

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export S3_BUCKET=your-bucket-name
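
How these variables interact with config.yaml depends on src/config.py; a common resolution pattern, shown here as a sketch rather than the template's actual loader, is to let the environment win over the YAML values:

import os

import yaml  # PyYAML

def load_aws_settings(path: str = "config/config.yaml") -> dict:
    """Resolve AWS settings, preferring environment variables over YAML."""
    with open(path, "r", encoding="utf-8") as f:
        aws = (yaml.safe_load(f) or {}).get("aws", {})
    return {
        "region": os.environ.get("AWS_REGION", aws.get("region")),
        "s3_bucket": os.environ.get("S3_BUCKET", aws.get("s3_bucket")),
    }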

Data Schemas

The project uses Pydantic for type-safe data modeling:

Document Schema

from src.schemas import Document, DocumentMetadata

metadata = DocumentMetadata(
    file_path="path/to/document.pdf",
    file_name="document.pdf",
    file_size=102400,  # bytes
    processing_method="markitdown"
)

document = Document(
    metadata=metadata,
    content="Extracted text content...",
    content_markdown_path="results/markitdown/document.md"
)

Classification Result

from src.schemas import ClassificationResult, DocumentType

result = ClassificationResult(
    document_type=DocumentType.INVOICE,
    confidence=0.95,
    reasoning="Document contains invoice number, date, amounts...",
    all_scores={
        "invoice": 0.95,
        "receipt": 0.03,
        "other": 0.02
    },
    model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    meets_threshold=True
)

Processed Document

from src.schemas import ProcessedDocument

processed_doc = ProcessedDocument(
    document=document,
    classification=result
)

# Convenient properties
print(processed_doc.document_type)  # DocumentType.INVOICE
print(processed_doc.is_classified)  # True

Batch Result

from src.schemas import BatchProcessingResult

batch_result = BatchProcessingResult(
    documents=[processed_doc1, processed_doc2],
    total_count=10,
    successful_count=8,
    failed_count=2,
    processing_time=45.2,
    errors=[
        {"file": "bad.pdf", "error": "Corrupted file"}
    ]
)

print(f"Success rate: {batch_result.success_rate:.1%}")

Extending the Template

Adding a Custom Processing Method

Here's how to add your own document processor (e.g., using a different OCR API):

1. Create Processor Function

Create src/processors/my_custom_processor.py:

from pathlib import Path
from typing import Optional
from loguru import logger
from ..config import config

def process_with_custom_method(pdf_path: str) -> Optional[str]:
    """
    Process PDF with your custom method.
    
    Args:
        pdf_path: Path to PDF file
        
    Returns:
        Path to output markdown file, or None if failed
    """
    try:
        logger.info(f"Processing with Custom Method: {pdf_path}")
        
        # Your processing logic here
        # Example: Call external API, run custom algorithm, etc.
        result_text = your_custom_processing(pdf_path)
        
        # Save to markdown
        pdf_name = Path(pdf_path).stem
        output_dir = Path(config.output.base_dir) / "custom"
        output_dir.mkdir(parents=True, exist_ok=True)
        
        output_path = output_dir / f"{pdf_name}_custom.md"
        output_path.write_text(result_text, encoding="utf-8")
        
        logger.success(f"Custom processing saved to {output_path}")
        return str(output_path)
        
    except Exception as e:
        logger.error(f"Error in custom processing: {e}")
        return None

2. Export the Processor

Update src/processors/__init__.py:

from .pdf_processing import convert_with_markitdown
from .ocr_processing import extract_with_bda, extract_with_tesseract
from .my_custom_processor import process_with_custom_method

__all__ = [
    "convert_with_markitdown",
    "extract_with_bda",
    "extract_with_tesseract",
    "process_with_custom_method",
]

3. Add to Processing Enum

Update src/schemas/document.py:

class ProcessingMethod(str, Enum):
    """Processing method enumeration."""
    MARKITDOWN = "markitdown"
    BDA = "bda"
    TESSERACT = "tesseract"
    CUSTOM = "custom"  # Add your method

4. Integrate with Pipeline

Update src/pipelines/pdf_processing/pipeline.py:

def __init__(
    self,
    use_markitdown: bool = True,
    use_bda: bool = False,
    use_tesseract: bool = False,
    use_custom: bool = False,  # Add parameter
    s3_bucket: Optional[str] = None,
):
    # ... existing code ...
    self.use_custom = use_custom

def process_pdf(self, pdf_path: str) -> Dict[str, Optional[Document]]:
    # ... existing code ...
    
    # Add custom processing
    if self.use_custom:
        from ...processors import process_with_custom_method
        logger.info("Processing with Custom Method...")
        markdown_path = process_with_custom_method(str(pdf_path))
        
        if markdown_path:
            results["custom"] = self._create_document(
                pdf_path, markdown_path, ProcessingMethod.CUSTOM
            )
        else:
            results["custom"] = None
    
    # ... rest of code ...

5. Use Your Custom Processor

from src.pipelines import FullIDPPipeline

pipeline = FullIDPPipeline(
    use_markitdown=True,
    use_custom=True,  # Enable your custom processor
    prefer_method="custom"
)

result = pipeline.process_and_classify("document.pdf")

AWS Bedrock Setup

Prerequisites

  1. AWS Account with Bedrock access
  2. IAM Permissions for Bedrock and S3
  3. Model Access enabled for Claude models

Enable Bedrock Models

  1. Go to AWS Bedrock Console
  2. Navigate to "Model access"
  3. Request access to the Claude models referenced in config/config.yaml:
    • Anthropic Claude Sonnet
    • Anthropic Claude Haiku (optional)
  4. Wait for approval (usually instant); the snippet below shows an optional programmatic check
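
To verify access programmatically (an optional check, not part of the template), list the Anthropic models visible to your account with boto3:

import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Returns only models your account can see; request access in the console if the list is empty
response = bedrock.list_foundation_models(byProvider="Anthropic")
for model in response["modelSummaries"]:
    print(model["modelId"])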

For Classification

No additional setup is required beyond model access. The template uses LiteLLM, which handles Bedrock authentication automatically via your AWS credentials.
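
Conceptually, the classifier's LLM call reduces to a litellm.completion invocation like the simplified sketch below; the template's real prompt and response parsing live in src/classifiers/document_classifier.py:

from litellm import completion

# Credentials are picked up from the AWS environment/config, as with boto3
response = completion(
    model="bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[
        {"role": "system", "content": "You are a document classification assistant."},
        {"role": "user", "content": "Classify this document: Invoice Number: INV-001 ..."},
    ],
    max_tokens=1000,
    temperature=0.0,
)
print(response.choices[0].message.content)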

For BDA (Bedrock Data Automation)

  1. Create S3 Bucket:

    aws s3 mb s3://your-idp-bucket --region us-east-1
  2. Update Configuration:

    aws:
      s3_bucket: "your-idp-bucket"
  3. IAM Permissions (attach to your role/user):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "bedrock:*",
            "bedrock-data-automation:*",
            "bedrock-data-automation-runtime:*"
          ],
          "Resource": "*"
        },
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::your-idp-bucket",
            "arn:aws:s3:::your-idp-bucket/*"
          ]
        }
      ]
    }

AWS Batch Deployment

The template is designed to run at scale using AWS Batch with Fargate compute. This provides serverless, containerized execution without managing infrastructure.

Architecture Components

  • API Gateway: Entry point for document processing requests
  • Lambda: Lightweight function to submit jobs to AWS Batch
  • AWS Batch: Manages job queues and compute environments
  • Fargate: Serverless containers that run the processing jobs
  • ECR: Stores Docker images for the batch jobs
  • S3: Input/output storage for documents and results
  • Bedrock: AI-powered document classification
  • CloudWatch: Logging and monitoring

Building the Docker Image

# Build the image
docker build -t idp-template:latest .

# Tag for ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag idp-template:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest

# Push to ECR
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest

IAM Permissions for Batch

The Fargate task execution role needs permissions for:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock-data-automation:*",
        "bedrock-data-automation-runtime:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-idp-bucket",
        "arn:aws:s3:::your-idp-bucket/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Troubleshooting

Common Issues

1. Tesseract Not Found

Error: tesseract is not installed or it's not in your PATH

Solution:

# Fedora
sudo dnf install tesseract tesseract-langpack-eng

# Ubuntu
sudo apt-get install tesseract-ocr

# Verify
tesseract --version

2. AWS Credentials Not Found

Error: Unable to locate credentials

Solution:

# Configure AWS CLI
aws configure

# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1

3. S3 Bucket Not Found (BDA)

Error: NoSuchBucket: The specified bucket does not exist

Solution:

# Create the bucket
aws s3 mb s3://your-idp-bucket --region us-east-1

# Update config.yaml
aws:
  s3_bucket: "your-idp-bucket"

4. Batch Job Fails to Start

Error: Job stuck in RUNNABLE state

Solution:

  • Check that ECR image exists and is accessible
  • Verify Fargate task execution role has correct permissions
  • Check CloudWatch logs for detailed error messages
  • Ensure the VPC has proper networking (NAT gateway for private subnets); the diagnostic sketch below shows how to inspect a stuck job
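
For example, a minimal boto3 check (substitute the job ID returned by submit_job):

import boto3

batch = boto3.client("batch")

# describe_jobs accepts up to 100 job IDs per call
for job in batch.describe_jobs(jobs=["<job-id>"])["jobs"]:
    print(job["jobName"], job["status"], job.get("statusReason", "no status reason yet"))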

Debug Mode

Enable debug logging:

from loguru import logger
import sys

# Add debug level logging
logger.remove()
logger.add(sys.stderr, level="DEBUG")

# Now run your pipeline
pipeline.process_and_classify("document.pdf")

Development

Running Tests

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test
uv run pytest tests/unit/test_classifiers.py -v

Acknowledgments

Built with:

  • MarkItDown by Microsoft
  • LiteLLM for unified LLM APIs
  • Pydantic for data validation
  • loguru for logging
  • AWS Bedrock for AI capabilities
  • AWS Batch for scalable compute
