A production-ready Intelligent Document Processing (IDP) template for extracting, processing, and classifying documents, combining multiple extraction methods (MarkItDown, AWS Bedrock Data Automation, Tesseract OCR) with AI-powered classification.
The architecture follows a serverless, event-driven pattern using AWS Batch for scalable document processing (a minimal job-submission sketch follows the list below):
- User Request: Users invoke the API Gateway endpoint to submit documents for processing
- Trigger: API Gateway triggers a Lambda function to handle the request
- Job Submission: Lambda submits a batch job to AWS Batch with document details
- Queue Management: AWS Batch Queue manages job scheduling and prioritization
- Compute: Fargate Batch Compute environment runs containerized processing jobs
- AI Classification: Jobs call Amazon Bedrock for AI-powered document classification
- Storage: Input documents are read from and results are written to S3
- Container Registry: ECR stores the Docker images used by Fargate
- CI/CD: GitHub Actions builds and pushes Docker images to ECR
- Monitoring: CloudWatch Logs captures logs and metrics for observability
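To make the job-submission step concrete, here is a minimal sketch of a Lambda handler that forwards a request to AWS Batch with boto3. The queue name, job definition, and environment variable names are placeholders, not part of this template; adapt them to your deployment.

import json
import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Hypothetical API Gateway handler: parse the request and submit a Batch job."""
    body = json.loads(event.get("body", "{}"))
    document_key = body["document_key"]  # e.g. an S3 key for the uploaded PDF

    response = batch.submit_job(
        jobName="idp-document-processing",
        jobQueue="idp-job-queue",            # placeholder queue name
        jobDefinition="idp-job-definition",  # placeholder job definition
        containerOverrides={
            "environment": [
                {"name": "S3_BUCKET", "value": "your-idp-bucket"},
                {"name": "DOCUMENT_KEY", "value": document_key},
            ]
        },
    )
    return {"statusCode": 202, "body": json.dumps({"jobId": response["jobId"]})}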
- MarkItDown: High-quality PDF to Markdown conversion using Microsoft's MarkItDown library (recommended for born-digital PDFs)
- AWS Bedrock Data Automation (BDA): Enterprise-grade text extraction using AWS Bedrock's Data Automation service
- Tesseract OCR: Open-source OCR for scanned documents and images
- Extensible Architecture: Easily add your own processing methods
- AI-Powered Classification: Classify documents using LiteLLM with AWS Bedrock (e.g., Claude Sonnet 4.5 or Claude Haiku 4.5)
- Configurable Document Types: Define custom document categories via YAML
- Confidence Scoring: Get confidence scores and reasoning for classifications
- Batch Processing: Efficiently process multiple documents with statistics
- Type-Safe: Pydantic schemas for documents, classifications, and results
- Configuration-Driven: Flexible YAML configuration
- Modular Pipelines: Composable pipelines for different workflows
- Comprehensive Logging: Built-in structured logging with loguru
- AWS Batch Ready: Pre-structured for scalable serverless deployment with AWS Batch and Fargate
idp-template/
├── config/
│ └── config.yaml # Main configuration file
├── data/
│ ├── raw/ # Input PDFs
│ └── processed/ # (Optional) Intermediate files
├── src/
│ ├── classifiers/ # Document classification
│ │ ├── __init__.py
│ │ └── document_classifier.py
│ ├── pipelines/ # Processing pipelines
│ │ ├── __init__.py
│ │ ├── full_pipeline.py # End-to-end pipeline
│ │ ├── pdf_processing/ # PDF processing pipeline
│ │ └── classification/ # Classification pipeline
│ ├── processors/ # Core processing methods
│ │ ├── __init__.py
│ │ ├── pdf_processing.py # MarkItDown processor
│ │ └── ocr_processing.py # BDA and Tesseract processors
│ ├── schemas/ # Pydantic data models
│ │ ├── __init__.py
│ │ └── document.py
│ ├── batch/ # AWS Batch job handlers
│ │ ├── classification/
│ │ └── pdf_processing/
│ ├── scripts/ # Utility scripts
│ │ └── run_pipeline.py # Example pipeline runner
│ ├── utils/ # Helper functions
│ └── config.py # Configuration management
├── results/ # Output directory (auto-created)
│ ├── markitdown/ # MarkItDown outputs
│ ├── bda/ # BDA outputs
│ ├── tesseract/ # Tesseract outputs
│ └── classification/ # Classification results
├── tests/ # Unit and integration tests
├── Dockerfile # Container definition for Batch jobs
├── pyproject.toml # Project dependencies
└── README.md
- Python 3.13+ (recommended) or 3.11+
- uv package manager
- AWS credentials configured (for BDA and Bedrock classification)
- Tesseract OCR installed on your system (optional, for OCR processing)
- Docker (for building container images for AWS Batch)
Fedora/RHEL:
sudo dnf install tesseract tesseract-langpack-eng
Ubuntu/Debian:
sudo apt-get install tesseract-ocr tesseract-ocr-eng
macOS:
brew install tesseract
Verify installation:
tesseract --version
# Clone the repository
git clone https://github.com/LokaHQ/idp-template.git
cd idp-template
# Install dependencies using uv
uv sync
# Verify installation
uv run python -c "from src.pipelines import FullIDPPipeline; print('Installation successful!')"
Edit config/config.yaml to match your setup:
aws:
region: "us-east-1"
s3_bucket: "your-s3-bucket-name" # Required for BDA
classification:
model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
document_types:
- "invoice"
- "receipt"
- "contract"
- "letter"
- "other"Create a file run_full_pipeline.py:
from src.pipelines.full_pipeline import FullIDPPipeline
def main():
# Initialize pipeline with desired methods
pipeline = FullIDPPipeline(
use_markitdown=True, # Primary method (recommended)
use_bda=False, # Requires S3 bucket
use_tesseract=True, # Fallback for scanned docs
prefer_method="markitdown",
save_classification_results=True,
)
# Process a single document
result = pipeline.process_and_classify("data/raw/sample_document.pdf")
if result:
print(f"âś“ Document Type: {result.document_type}")
print(f"âś“ Confidence: {result.classification.confidence:.2%}")
print(f"âś“ Reasoning: {result.classification.reasoning}")
else:
print("âś— Processing failed")
if __name__ == "__main__":
main()
Run it:
uv run python run_full_pipeline.py
# Run the example pipeline
uv run python src/scripts/run_pipeline.py
Expected output:
2025-11-28 13:47:06.205 | INFO | Configuration loaded from config/config.yaml
2025-11-28 13:47:07.492 | INFO | Running full IDP pipeline on: data/raw/sample_document.pdf
2025-11-28 13:47:07.567 | SUCCESS | MarkItDown conversion saved to results/markitdown/sample_document_markitdown.md
2025-11-28 13:47:13.324 | SUCCESS | Document classified as 'invoice' with confidence 0.99
2025-11-28 13:47:13.324 | SUCCESS | Pipeline complete: classified as 'DocumentType.INVOICE'
from src.pipelines.pdf_processing import PDFProcessingPipeline
# Initialize with multiple methods
pipeline = PDFProcessingPipeline(
use_markitdown=True,
use_tesseract=True,
use_bda=False,
)
# Process a PDF
results = pipeline.process_pdf("data/raw/invoice.pdf")
# Access results
if results["markitdown"]:
doc = results["markitdown"]
print(f"Content length: {len(doc.content)} characters")
print(f"Processed with: {doc.metadata.processing_method}")
print(f"Markdown saved to: {doc.content_markdown_path}")from src.pipelines.classification import ClassificationPipeline
from src.schemas import Document, DocumentMetadata
# Read pre-processed markdown
with open("results/markitdown/invoice_markitdown.md", "r") as f:
content = f.read()
# Create document
metadata = DocumentMetadata(
file_path="data/raw/invoice.pdf",
file_name="invoice.pdf"
)
document = Document(metadata=metadata, content=content)
# Classify
pipeline = ClassificationPipeline(save_results=True)
result = pipeline.classify_document(document)
print(f"Type: {result.document_type}")
print(f"Confidence: {result.classification.confidence:.2%}")
print(f"Reasoning: {result.classification.reasoning}")from pathlib import Path
from src.pipelines.full_pipeline import FullIDPPipeline
# Gather all PDFs
pdf_dir = Path("data/raw")
pdf_files = [str(f) for f in pdf_dir.glob("*.pdf")]
# Initialize pipeline
pipeline = FullIDPPipeline(
use_markitdown=True,
prefer_method="markitdown",
)
# Batch process
batch_result = pipeline.batch_process_and_classify(pdf_files)
# Print summary
print(f"\n📊 Batch Processing Summary:")
print(f" Total: {batch_result.total_count}")
print(f" Successful: {batch_result.successful_count}")
print(f" Failed: {batch_result.failed_count}")
print(f" Success Rate: {batch_result.success_rate:.1%}")
print(f" Processing Time: {batch_result.processing_time:.2f}s")
# Get classification statistics
if batch_result.documents:
from src.pipelines.classification import ClassificationPipeline
pipeline = ClassificationPipeline()
summary = pipeline.get_classification_summary(batch_result.documents)
print(f"\nđź“‹ Classification Summary:")
print(f" Document Types:")
for doc_type, count in summary["type_distribution"].items():
print(f" {doc_type}: {count}")
print(f" Average Confidence: {summary['average_confidence']:.2%}")from src.processors import (
convert_with_markitdown,
extract_with_tesseract,
extract_with_bda,
)
pdf_file = "data/raw/invoice.pdf"
# Method 1: MarkItDown (best for digital PDFs)
markdown_path = convert_with_markitdown(pdf_file)
print(f"MarkItDown output: {markdown_path}")
# Method 2: Tesseract OCR (best for scanned documents)
markdown_path = extract_with_tesseract(pdf_file)
print(f"Tesseract output: {markdown_path}")
# Method 3: AWS BDA (enterprise solution)
markdown_path = extract_with_bda(pdf_file, s3_bucket="my-bucket")
print(f"BDA output: {markdown_path}")from src.classifiers import classify_document, classify_document_file
# Option 1: Classify text directly
document_text = """
Invoice Number: INV-001
Date: 2025-01-15
Amount: $1,234.56
"""
result = classify_document(
document_text=document_text,
save_result=False
)
# Option 2: Classify from markdown file
result = classify_document_file("results/markitdown/invoice_markitdown.md")
print(f"Document Type: {result['document_type']}")
print(f"Confidence: {result['confidence']}")
print(f"All Scores: {result['all_scores']}")# AWS Configuration
aws:
region: "us-east-1"
s3_bucket: "your-idp-bucket" # Required for BDA
# Output Directories
output:
base_dir: "results"
markitdown_dir: "results/markitdown"
bda_dir: "results/bda"
tesseract_dir: "results/tesseract"
classification_dir: "results/classification"
# PDF Processing
pdf:
dpi: 300 # DPI for PDF to image conversion
# Tesseract OCR Configuration
tesseract:
lang: "eng" # Language: 'eng', 'deu', 'fra', 'mkd', etc.
oem: 3 # OCR Engine Mode (0-3, 3 is default/best)
psm: 6 # Page Segmentation Mode (6 = uniform block)
# Classification Configuration
classification:
# Model (use one of):
# - bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0 (best quality)
# - bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0 (faster/cheaper)
model: "bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0"
max_tokens: 1000
temperature: 0.0 # Use 0 for deterministic classification
# Document types to classify
document_types:
- "invoice"
- "receipt"
- "contract"
- "letter"
- "form"
- "other"
# System prompt (optional customization)
system_prompt: |
You are a document classification assistant. Analyze documents and classify
them into predefined categories based on structure, content, and formatting.
# Confidence threshold (0.0-1.0)
confidence_threshold: 0.7
# BDA Configuration
bda:
project_stage: "LIVE"
extraction:
granularity_types:
- "DOCUMENT"
bounding_box_enabled: false
generative_field_enabled: false
output:
text_formats:
- "MARKDOWN"
additional_file_format_enabled: false
Override configuration with environment variables:
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export S3_BUCKET=your-bucket-name
The project uses Pydantic for type-safe data modeling:
from src.schemas import Document, DocumentMetadata
metadata = DocumentMetadata(
file_path="path/to/document.pdf",
file_name="document.pdf",
file_size=102400, # bytes
processing_method="markitdown"
)
document = Document(
metadata=metadata,
content="Extracted text content...",
content_markdown_path="results/markitdown/document.md"
)
from src.schemas import ClassificationResult, DocumentType
result = ClassificationResult(
document_type=DocumentType.INVOICE,
confidence=0.95,
reasoning="Document contains invoice number, date, amounts...",
all_scores={
"invoice": 0.95,
"receipt": 0.03,
"other": 0.02
},
model="bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
meets_threshold=True
)
from src.schemas import ProcessedDocument
processed_doc = ProcessedDocument(
document=document,
classification=result
)
# Convenient properties
print(processed_doc.document_type) # DocumentType.INVOICE
print(processed_doc.is_classified)  # True
from src.schemas import BatchProcessingResult
batch_result = BatchProcessingResult(
documents=[processed_doc1, processed_doc2],
total_count=10,
successful_count=8,
failed_count=2,
processing_time=45.2,
errors=[
{"file": "bad.pdf", "error": "Corrupted file"}
]
)
print(f"Success rate: {batch_result.success_rate:.1%}")Here's how to add your own document processor (e.g., using a different OCR API):
Create src/processors/my_custom_processor.py:
from pathlib import Path
from typing import Optional
from loguru import logger
from ..config import config
def process_with_custom_method(pdf_path: str) -> Optional[str]:
"""
Process PDF with your custom method.
Args:
pdf_path: Path to PDF file
Returns:
Path to output markdown file, or None if failed
"""
try:
logger.info(f"Processing with Custom Method: {pdf_path}")
# Your processing logic here
# Example: Call external API, run custom algorithm, etc.
result_text = your_custom_processing(pdf_path)
# Save to markdown
pdf_name = Path(pdf_path).stem
output_dir = Path(config.output.base_dir) / "custom"
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / f"{pdf_name}_custom.md"
output_path.write_text(result_text, encoding="utf-8")
logger.success(f"Custom processing saved to {output_path}")
return str(output_path)
except Exception as e:
logger.error(f"Error in custom processing: {e}")
return None
Update src/processors/__init__.py:
from .pdf_processing import convert_with_markitdown
from .ocr_processing import extract_with_bda, extract_with_tesseract
from .my_custom_processor import process_with_custom_method
__all__ = [
"convert_with_markitdown",
"extract_with_bda",
"extract_with_tesseract",
"process_with_custom_method",
]
Update src/schemas/document.py:
class ProcessingMethod(str, Enum):
"""Processing method enumeration."""
MARKITDOWN = "markitdown"
BDA = "bda"
TESSERACT = "tesseract"
CUSTOM = "custom" # Add your methodUpdate src/pipelines/pdf_processing/pipeline.py:
def __init__(
self,
use_markitdown: bool = True,
use_bda: bool = False,
use_tesseract: bool = False,
use_custom: bool = False, # Add parameter
s3_bucket: Optional[str] = None,
):
# ... existing code ...
self.use_custom = use_custom
def process_pdf(self, pdf_path: str) -> Dict[str, Optional[Document]]:
# ... existing code ...
# Add custom processing
if self.use_custom:
from ...processors import process_with_custom_method
logger.info("Processing with Custom Method...")
markdown_path = process_with_custom_method(str(pdf_path))
if markdown_path:
results["custom"] = self._create_document(
pdf_path, markdown_path, ProcessingMethod.CUSTOM
)
else:
results["custom"] = None
# ... rest of code ...
from src.pipelines import FullIDPPipeline
pipeline = FullIDPPipeline(
use_markitdown=True,
use_custom=True, # Enable your custom processor
prefer_method="custom"
)
result = pipeline.process_and_classify("document.pdf")- AWS Account with Bedrock access
- IAM Permissions for Bedrock and S3
- Model Access enabled for Claude models
- Go to AWS Bedrock Console
- Navigate to "Model access"
- Request access to:
- Anthropic Claude Sonnet 4.5
- Anthropic Claude Haiku 4.5 (optional)
- Wait for approval (usually instant)
No additional setup required beyond model access. The template uses LiteLLM which automatically handles Bedrock authentication via your AWS credentials.
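For reference, this is roughly what a LiteLLM call against Bedrock looks like; the classifier in src/classifiers wraps this for you, so the snippet below is purely illustrative rather than the template's exact code:

from litellm import completion

# AWS credentials are resolved the same way boto3 resolves them (env vars, profile, role).
response = completion(
    model="bedrock/us.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": "Classify this document: Invoice Number INV-001 ..."}],
    max_tokens=1000,
    temperature=0.0,
)
print(response.choices[0].message.content)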
- Create the S3 bucket:
  aws s3 mb s3://your-idp-bucket --region us-east-1
- Update the configuration:
  aws:
    s3_bucket: "your-idp-bucket"
- IAM permissions (attach to your role/user):
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "bedrock:*",
          "bedrock-data-automation:*",
          "bedrock-data-automation-runtime:*"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": [
          "s3:PutObject",
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::your-idp-bucket",
          "arn:aws:s3:::your-idp-bucket/*"
        ]
      }
    ]
  }
The template is designed to run at scale using AWS Batch with Fargate compute. This provides serverless, containerized execution without managing infrastructure. A sketch of what a containerized job entrypoint might look like follows the component list below.
- API Gateway: Entry point for document processing requests
- Lambda: Lightweight function to submit jobs to AWS Batch
- AWS Batch: Manages job queues and compute environments
- Fargate: Serverless containers that run the processing jobs
- ECR: Stores Docker images for the batch jobs
- S3: Input/output storage for documents and results
- Bedrock: AI-powered document classification
- CloudWatch: Logging and monitoring
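As referenced above, here is a minimal sketch of a containerized job entrypoint, assuming the job receives S3_BUCKET and DOCUMENT_KEY environment variables; the template's actual handlers live under src/batch/ and may differ:

import json
import os
from pathlib import Path

import boto3

from src.pipelines.full_pipeline import FullIDPPipeline

def main() -> None:
    bucket = os.environ["S3_BUCKET"]
    key = os.environ["DOCUMENT_KEY"]

    # Download the input document from S3 to local storage.
    s3 = boto3.client("s3")
    local_path = Path("/tmp") / Path(key).name
    s3.download_file(bucket, key, str(local_path))

    # Run the same pipeline used locally.
    pipeline = FullIDPPipeline(use_markitdown=True, prefer_method="markitdown")
    result = pipeline.process_and_classify(str(local_path))

    # Upload a small summary of the classification back to S3.
    if result:
        output_key = f"results/{local_path.stem}_classification.json"
        s3.put_object(
            Bucket=bucket,
            Key=output_key,
            Body=json.dumps({"document_type": str(result.document_type)}),
        )

if __name__ == "__main__":
    main()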
# Build the image
docker build -t idp-template:latest .
# Tag for ECR
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com
docker tag idp-template:latest <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest
# Push to ECR
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/idp-template:latest
The Fargate task execution role needs permissions for:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock-data-automation:*",
"bedrock-data-automation-runtime:*"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-idp-bucket",
"arn:aws:s3:::your-idp-bucket/*"
]
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "arn:aws:logs:*:*:*"
},
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
}
]
}
Error: tesseract is not installed or it's not in your PATH
Solution:
# Fedora
sudo dnf install tesseract tesseract-langpack-eng
# Ubuntu
sudo apt-get install tesseract-ocr
# Verify
tesseract --version
Error: Unable to locate credentials
Solution:
# Configure AWS CLI
aws configure
# Or set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_REGION=us-east-1
Error: NoSuchBucket: The specified bucket does not exist
Solution:
# Create the bucket
aws s3 mb s3://your-idp-bucket --region us-east-1
# Update config.yaml
aws:
s3_bucket: "your-idp-bucket"Error: Job stuck in RUNNABLE state
Solution:
- Check that ECR image exists and is accessible
- Verify Fargate task execution role has correct permissions
- Check CloudWatch logs for detailed error messages
- Ensure VPC has proper networking (NAT gateway for private subnets)
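If the cause is not obvious from the console, a quick programmatic check is to ask Batch for the job's status reason (the job ID below is a placeholder):

import boto3

batch = boto3.client("batch")
jobs = batch.describe_jobs(jobs=["<your-job-id>"])["jobs"]
for job in jobs:
    print(job["jobName"], job["status"], job.get("statusReason", "no status reason yet"))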
Enable debug logging:
from loguru import logger
import sys
# Add debug level logging
logger.remove()
logger.add(sys.stderr, level="DEBUG")
# Now run your pipeline
pipeline.process_and_classify("document.pdf")# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=src --cov-report=html
# Run specific test
uv run pytest tests/unit/test_classifiers.py -v
Built with:
- MarkItDown by Microsoft
- LiteLLM for unified LLM APIs
- Pydantic for data validation
- loguru for logging
- AWS Bedrock for AI capabilities
- AWS Batch for scalable compute
