This is a tool to facilitate LLM experiments with PDFs, especially those that contain sensitive information. Remember to use only services that provide appropriate privacy protections. Because Azure covers HIPAA and provides a BAA for its customers, many of the functions of this library are Azure-centric.
This application provides an API that uses Azure Document Intelligence to convert PDFs to Markdown and structured JSON, handling PDFs of arbitrary size (rather than being limited to Azure's single-request limit of 2000 pages). The system preserves document structure through intelligent segmentation that maintains hierarchical heading context (H1-H6). Every JSON element is automatically assigned a unique ID for tracing back to the source document. The filtering endpoint facilitates stripping out unnecessary JSON components to optimize for LLM token usage.
Docling conversion is handled by a separate docling-serve instance. Use docling-serve for conversion, then pass its outputs into this API’s post-processing endpoints (segmentation, filtering, anonymization, prompt composition, etc.).
There is also an endpoint to anonymize documents using LLM-Guard with the AI4Privacy BERT model for comprehensive PII detection (the supported entity list depends on your LLM-Guard version), as well as an endpoint to compose prompts around large documents with instructions at the beginning and the end (as recommended by the GPT-4.1 documentation).
FastAPI provides automatic interactive API documentation:

1. Start the FastAPI server with `uv run run.py` or `uvicorn main:app --reload`.
2. Visit the interactive documentation:
   - Swagger UI - Interactive API testing
   - ReDoc - Alternative API documentation
3. Test endpoints directly through the Swagger UI interface by:
   - Clicking on any endpoint
   - Clicking "Try it out"
   - Filling in the request parameters
   - Clicking "Execute"
- Python 3.13.x (3.14 is not supported yet due to spaCy/pydantic compatibility)
- Azure Document Intelligence account and credentials
- Install dependencies using uv: `uv sync`
The application requires Azure Document Intelligence credentials. Create a .env file in the project root with:
    # Azure Document Intelligence Configuration
    AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource-name.cognitiveservices.azure.com/
    AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key-here

    # Optional: Logging level (DEBUG, INFO, WARNING, ERROR)
    LOG_LEVEL=INFO

Important: Never commit the .env file to version control. It's already included in .gitignore.
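For reference, a minimal sketch of how these settings might be read at startup, assuming python-dotenv is installed (the application itself may load them differently):

```python
# Minimal sketch: load Azure DI settings from .env.
# Assumes python-dotenv is installed; adjust to how your app actually loads config.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

AZURE_ENDPOINT = os.environ["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"]
AZURE_KEY = os.environ["AZURE_DOCUMENT_INTELLIGENCE_KEY"]
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")
```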
- Azure Document Intelligence: Enterprise-grade PDF to structured data extraction
- Docling-serve: External document conversion service (run separately)
- LLM-Guard: Advanced PII detection and anonymization with the AI4Privacy BERT model (supported entity list depends on your LLM-Guard version)
- FastAPI: Modern, fast web framework with automatic API documentation
- UV: Ultra-fast Python package and project management
- Python 3.13+: Required for latest performance improvements and type hints
Example scripts are provided in the examples/ directory:
- `pseudonymization_demo.py`: Demonstrates stateless pseudonymization and deanonymization workflows
To start the FastAPI server:

    uv run run.py

- Description: Root endpoint that returns information about all available API endpoints.
- Response:
- JSON object with welcome message and list of available endpoints with their descriptions.
- Description: Health check endpoint to verify the service is running.
- Response:
- JSON object with status "healthy" and timestamp.
- Description: Compose a prompt from multiple text fields and/or files, each wrapped in a specified XML tag.
- Request:
  - `multipart/form-data` with a `mapping` field (JSON: `{tag: value}`), where each key is an XML tag and the value is either:
    - A string (content), or
    - The name of an uploaded file field
  - Optionally, upload files with field names matching the mapping values.
- Special Case:
  - If a tag is named `instructions`, its section is wrapped in `<instructions>` and appears at both the top and bottom of the result.
- Response:
- Plain text: The composed prompt with each section wrapped in its XML tag.
- Example (text only):
{ "document": "Text of your document goes here", "transcript": "Transcript of therapy session goes here", "manual": "Scoring manual for scoring therapy session fidelity", "instructions": "Score the attached transcript, wrapped in transcript tags, according to the manual, wrapped in manual tags. Provide a score for each scale in the manual." }
- Description: Extracts structured data and markdown from PDF documents using Azure Document Intelligence with intelligent batch processing and optional segmentation. Implements Phase 1 of the PDF processing strategy with perfect document reconstruction for documents of any size. Can automatically segment results into semantically meaningful chunks for LLM processing.
- Request:
  - `multipart/form-data` with the following fields:
    - `file`: PDF file to process (required)
    - `batch_size`: Number of pages per batch (optional, default: 1500)
    - `include_element_ids`: Add unique IDs to all elements (optional, default: true)
    - `return_both`: Return both original and ID-enriched versions (optional, default: false)
    - `enable_segmentation`: Enable automatic segmentation of results (optional, default: false)
    - `segment_min_tokens`: Minimum tokens per segment (optional, default: 10000)
    - `segment_max_tokens`: Maximum tokens per segment (optional, default: 30000)
- Advanced Features:
- Intelligent Batch Processing: Automatically processes large documents in configurable page batches
- Perfect Stitching: Reconstructs complete documents with 100% accuracy using advanced stitching algorithms
  - Element ID Generation: Automatically adds stable `_id` fields to all elements for tracking through filtering/segmentation
  - Automatic Segmentation: Optional segmentation into semantically meaningful chunks respecting document structure
- Token-Based Control: Configurable min/max tokens per segment for optimal LLM context window usage
- Structural Awareness: Segments break at logical boundaries (H1/H2 headings) while maintaining context
- Automatic Offset Calculation: Seamlessly handles page numbering and content offsets across batches
- Concurrent Processing: Processes multiple batches simultaneously for optimal performance
- Input Validation: Comprehensive validation of Azure DI structure and batch sequences
  - Production Ready: Validated on documents of 353+ pages with robust error handling
- Performance:
- Sub-second execution with minimal memory usage
- Validated with complete 353-page document reconstruction
- Perfect accuracy: 100% ground truth matching for content integrity
- Response:
  - A JSON object containing:

        {
          "markdown_content": "Complete markdown of entire document",
          "json_content": {
            "content": "Full document content...",
            "paragraphs": [
              {
                "_id": "para_1_0_a3f2b1",   // Unique element ID
                "content": "...",
                "role": "paragraph",
                "boundingRegions": [...]
                // ... other Azure DI fields
              }
            ],
            "tables": [
              {
                "_id": "table_5_2_d4e5f6",
                "cells": [
                  {
                    "_id": "cell_5_2_0_0_b7c8d9",
                    "content": "..."
                    // ... other cell fields
                  }
                ]
                // ... other table fields
              }
            ]
            // ... other elements with _id fields
          },
          "segments": [   // Only present when enable_segmentation=true
            {
              "segment_id": 1,
              "source_file": "document.pdf",
              "token_count": 12543,
              "structural_context": {
                "h1": "Chapter 1",
                "h2": "Section A",
                "h3": null, "h4": null, "h5": null, "h6": null
              },
              "elements": [
                // Azure DI elements in this segment
              ]
            }
            // ... more segments
          ],
          "metadata": {
            "page_count": 10,
            "processing_type": "azure_di",
            "processing_time": 2.5,
            "file_size": 1048576,
            "filename": "document.pdf",
            "batch_size": 1500,
            "element_ids_included": true,
            // Segmentation metadata (when enabled)
            "segmentation_enabled": true,
            "segment_count": 3,
            "segment_config": { "min_tokens": 10000, "max_tokens": 30000 }
          }
        }

  - When `include_element_ids=true` (default): Returns Azure DI format with added `_id` fields
  - When `include_element_ids=false`: Returns pure Azure DI format without IDs
  - When `return_both=true`: Returns both `json_content` (with IDs) and `json_content_original` (without IDs)
  - When `enable_segmentation=true`: Includes a `segments` array with the document broken into semantically meaningful chunks
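A minimal client sketch for this endpoint, using the requests library (assumed installed) against a local server, with segmentation enabled:

```python
# Sketch: upload a PDF to /extract with segmentation enabled.
import requests

with open("document.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/extract",
        files={"file": ("document.pdf", f, "application/pdf")},
        data={
            "batch_size": 1500,
            "include_element_ids": "true",
            "enable_segmentation": "true",
            "segment_min_tokens": 10000,
            "segment_max_tokens": 30000,
        },
        timeout=600,  # large documents can take a while
    )
resp.raise_for_status()
result = resp.json()
print(result["metadata"]["page_count"], "pages,", len(result.get("segments", [])), "segments")
```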
Docling conversion is performed directly via docling-serve’s API (for example: /v1/convert/file or /v1/convert/source). This service expects standard docling-serve responses (with document.md_content and document.json_content) and provides downstream post-processing endpoints for those outputs.
- Description: Chunks a docling-serve response into rich segments using docling-core chunking.
- Request:
  - A JSON object with the following structure:

        {
          "source_file": "document.pdf",
          "docling_response": { "... full docling-serve response ..." },
          "min_segment_tokens": 1000,
          "max_segment_tokens": 30000,
          "merge_peers": true
        }

  - Parameters:
    - `source_file`: Name of the original document (required)
    - `docling_response`: Full docling-serve response (required)
    - `min_segment_tokens`: Minimum tokens per segment (optional, default: 1,000)
    - `max_segment_tokens`: Maximum tokens per segment (optional, default: 30,000)
    - `merge_peers`: Merge adjacent peers when chunking (optional, default: true)
- Response:
  - A JSON array of "Rich Segment" objects (same schema as `/segment`).
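A sketch of the two-step flow (convert with docling-serve, then chunk the response here), assuming docling-serve runs on port 5001. The chunking route shown below (`/segment-docling`) is a hypothetical placeholder; check the Swagger UI for the actual path in your deployment:

```python
# Sketch: convert with docling-serve, then chunk its response with this API.
# The docling-serve port (5001) and the "/segment-docling" route are assumptions;
# substitute the real values from your deployment.
import requests

with open("document.pdf", "rb") as f:
    docling_resp = requests.post(
        "http://localhost:5001/v1/convert/file",
        files={"files": ("document.pdf", f, "application/pdf")},
        data={"to_formats": ["md", "json"], "do_ocr": "true"},
        timeout=600,
    )
docling_resp.raise_for_status()

chunk_resp = requests.post(
    "http://localhost:8000/segment-docling",  # hypothetical route; adjust to your deployment
    json={
        "source_file": "document.pdf",
        "docling_response": docling_resp.json(),
        "min_segment_tokens": 1000,
        "max_segment_tokens": 30000,
        "merge_peers": True,
    },
)
chunk_resp.raise_for_status()
segments = chunk_resp.json()  # list of Rich Segment objects
```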
- Description: Transforms complete Azure Document Intelligence analysis results into rich, structurally-aware segments with configurable token thresholds. This creates large, coherent document chunks suitable for advanced analysis.
- Request:
  - A JSON object with the following structure:

        {
          "source_file": "document.pdf",
          "json_content": { "... complete Azure DI analysis result ..." },
          "min_segment_tokens": 10000,
          "max_segment_tokens": 30000
        }

  - Parameters:
    - `source_file`: Name of the original document (required)
    - `json_content`: Complete Azure DI analysis result from the `/extract` endpoint (required)
    - `min_segment_tokens`: Minimum tokens per segment (optional, default: 10,000)
    - `max_segment_tokens`: Maximum tokens per segment - soft limit (optional, default: 30,000)
- Features:
- Configurable token thresholds for different use cases
- Intelligent boundary detection at heading levels (H1/H2)
- Preserves full Azure DI metadata (bounding boxes, page numbers, etc.)
- Maintains hierarchical context (current H1-H6 headings)
- Processes all Azure DI element types (paragraphs, tables, figures, formulas, keyValuePairs)
- Response:
  - A JSON array of "Rich Segment" objects with the following structure:

        [
          {
            "segment_id": 1,
            "source_file": "document.pdf",
            "token_count": 12543,
            "structural_context": {
              "h1": "Chapter 1",
              "h2": "Section A",
              "h3": null, "h4": null, "h5": null, "h6": null
            },
            "elements": [
              {
                "role": "paragraph",
                "content": "...",
                "bounding_regions": [...],
                "page_number": 1
              }
            ]
          }
        ]
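A short client sketch (requests library assumed installed) that feeds the `json_content` returned by `/extract` into the segmentation endpoint:

```python
# Sketch: segment an extraction result with /segment.
import json
import requests

with open("extracted_result.json") as f:
    extracted = json.load(f)

resp = requests.post(
    "http://localhost:8000/segment",
    json={
        "source_file": "document.pdf",
        "json_content": extracted["json_content"],
        "min_segment_tokens": 10000,
        "max_segment_tokens": 30000,
    },
)
resp.raise_for_status()
for seg in resp.json():
    print(seg["segment_id"], seg["token_count"], seg["structural_context"]["h1"])
```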
- Description: Combines filtering and segmentation to prepare documents for LLM processing with significantly reduced token usage. Applies configurable filters to remove unnecessary fields while preserving element IDs for traceability.
- Request:
  - A JSON object with the following structure:

        {
          "source_file": "document.pdf",
          "json_content": { "... Azure DI result with _id fields ..." },
          "filter_config": {
            "filter_preset": "llm_ready",
            "include_element_ids": true
          },
          "min_segment_tokens": 10000,
          "max_segment_tokens": 30000
        }

  - Parameters:
    - `filter_config.filter_preset`: Name of preset or "custom" (optional, default: "llm_ready")
    - `filter_config.fields`: Custom list of fields to include when using the "custom" preset (optional)
    - `filter_config.include_element_ids`: Whether to include `_id` fields (optional, default: true)
- Filter Presets:
  - `no_filter`: Preserves all original fields (returns raw dictionary format)
  - `llm_ready`: Optimal balance - includes content, structure, and headers/footers for citations (default)
  - `forensic_extraction`: Includes document metadata for complex multi-document analysis
  - `citation_optimized`: Minimal fields - content, page numbers, and IDs only
- Simplified Allowlist Filtering:
  - Single Field List: Each preset defines a simple list of fields to include
  - No Complex Rules: Removed confusing include/exclude patterns in favor of explicit field lists
  - Exact Field Definitions:
    - `no_filter`: `["*"]` - includes all fields from Azure DI
    - `citation_optimized`: `["_id", "content", "pageNumber", "elementIndex", "pageFooter"]`
    - `llm_ready`: `["_id", "content", "pageNumber", "role", "elementType", "elementIndex", "pageHeader", "pageFooter", "parentSection"]`
    - `forensic_extraction`: `["_id", "content", "pageNumber", "role", "elementType", "elementIndex", "pageHeader", "pageFooter", "parentSection", "documentMetadata"]`
- Custom Filtering Example:

      {
        "filter_config": {
          "filter_preset": "custom",
          "fields": ["_id", "content", "pageNumber", "myCustomField"],
          "include_element_ids": true
        }
      }

- Features:
  - Element ID Preservation: The `_id` field is included based on the filter preset
  - Hybrid Return Types: `no_filter` returns raw dictionaries, other presets return typed FilteredElement objects
  - Token Optimization: Typically achieves 50-75% reduction in token usage
  - Metrics Tracking: Reports size reduction, element counts, and excluded fields
- Response:
  - A JSON object containing:

        {
          "segments": [
            {
              "segment_id": 1,
              "source_file": "document.pdf",
              "token_count": 12543,
              "structural_context": { "h1": "Chapter 1", "h2": "Section A" },
              "elements": [
                {
                  "_id": "para_1_0_a3f2b1",   // Preserved from extraction
                  "content": "...",
                  "pageNumber": 1,
                  "role": "paragraph"
                  // Only fields allowed by the filter preset
                }
              ]
            }
          ],
          "element_mappings": [
            [ /* mappings for segment 1 */ ],
            [ /* mappings for segment 2 */ ]
          ],
          "metrics": {
            "original_size_bytes": 500000,
            "filtered_size_bytes": 150000,
            "reduction_percentage": 70.0,
            "excluded_fields": ["boundingBox", "polygon", "confidence", ...]
          }
        }
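A short client sketch (requests library assumed installed) that runs an extraction result through `/segment-filtered` with the default `llm_ready` preset:

```python
# Sketch: filter and segment an extraction result for LLM use.
import json
import requests

with open("extracted_result.json") as f:
    extracted = json.load(f)

resp = requests.post(
    "http://localhost:8000/segment-filtered",
    json={
        "source_file": "document.pdf",
        "json_content": extracted["json_content"],
        "filter_config": {"filter_preset": "llm_ready", "include_element_ids": True},
        "min_segment_tokens": 10000,
        "max_segment_tokens": 30000,
    },
)
resp.raise_for_status()
result = resp.json()
print(f"Reduced size by {result['metrics']['reduction_percentage']}%")
```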
- Description: Returns all available filter presets and their descriptions for use with the filtering and segmentation endpoints.
- Response:
  - JSON object containing preset names as keys and their configurations as values:

        {
          "no_filter": {
            "description": "Preserves all original fields from Azure DI",
            "fields": ["*"]
          },
          "llm_ready": {
            "description": "Optimal balance for LLM processing - includes content, structure, and headers/footers",
            "fields": ["_id", "content", "pageNumber", "role", "elementType", "elementIndex", "pageHeader", "pageFooter", "parentSection"]
          },
          "forensic_extraction": {
            "description": "Includes document metadata for complex multi-document analysis",
            "fields": ["_id", "content", "pageNumber", "role", "elementType", "elementIndex", "pageHeader", "pageFooter", "parentSection", "documentMetadata"]
          },
          "citation_optimized": {
            "description": "Minimal fields - content, page numbers, and IDs only",
            "fields": ["_id", "content", "pageNumber", "elementIndex", "pageFooter"]
          }
        }
- Description: Anonymizes sensitive information in Azure Document Intelligence output using LLM-Guard with the AI4Privacy BERT model. Supports stateless operation by accepting and returning vault data for consistent anonymization across requests.
- Request:
  - A JSON object with the following structure:

        {
          "azure_di_json": { /* Azure DI analysis result */ },
          "config": {
            "entity_types": [
              "PERSON", "DATE_TIME", "LOCATION", "PHONE_NUMBER",
              "EMAIL_ADDRESS", "US_SSN", "MEDICAL_LICENSE"
            ],
            "score_threshold": 0.5,
            "anonymize_all_strings": true,
            "date_shift_days": 365,
            "return_decision_process": false
          },
          "vault_data": [
            /* Optional: Previous vault data for consistent replacements */
            ["John Doe", "Jane Smith"],
            ["_date_offset", "-365"]
          ]
        }
- Features:
- Uses LLM-Guard with AI4Privacy BERT model (Isotonic/distilbert_finetuned_ai4privacy_v2)
- Broad PII coverage with strong F1 score (see LLM-Guard/AI4Privacy model documentation)
- Advanced pattern recognition beyond basic NER
- Configurable confidence threshold to reduce false positives
- Realistic fake data generation using Faker library
- Cryptographically secure random generation for sensitive IDs
- Session-isolated replacements for security
- Preserves document structure and element IDs
- Optional decision process debugging
- Supported Entity Types:
  - Use `entity_types` to constrain detection to specific types.
  - Use `["all"]` to include all entity types supported by the installed LLM-Guard configuration.
- Response:
  - JSON object containing:

        {
          "anonymized_json": { /* Anonymized Azure DI JSON */ },
          "statistics": { "PERSON": 5, "DATE_TIME": 2, ... },
          "vault_data": [
            /* Updated vault with all anonymization mappings */
            ["John Doe", "Jane Smith"],
            ["john@example.com", "jane@example.com"],
            ["_date_offset", "-365"]
          ]
        }
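A sketch (requests library assumed installed) of reusing `vault_data` across two anonymization calls so related documents receive consistent replacements; the input file names are placeholders:

```python
# Sketch: anonymize two related extraction results with a shared vault so the
# same original values map to the same replacements in both documents.
import json
import requests

URL = "http://localhost:8000/anonymization/anonymize-azure-di"
config = {"entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"], "score_threshold": 0.6}

with open("doc1_extracted.json") as f:
    doc1 = json.load(f)["json_content"]
first = requests.post(URL, json={"azure_di_json": doc1, "config": config})
first.raise_for_status()
vault = first.json()["vault_data"]  # carry the mappings forward

with open("doc2_extracted.json") as f:
    doc2 = json.load(f)["json_content"]
second = requests.post(URL, json={"azure_di_json": doc2, "config": config, "vault_data": vault})
second.raise_for_status()
```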
- Description: Anonymizes sensitive information in markdown or plain text while preserving formatting. Supports stateless operation with vault data.
- Request:
  - A JSON object with the following structure:

        {
          "markdown_text": "Your markdown or plain text content...",
          "config": {
            "entity_types": ["PERSON", "DATE_TIME", ...],
            "score_threshold": 0.5,
            "anonymize_all_strings": true,
            "return_decision_process": false
          },
          "vault_data": [
            /* Optional: Previous vault data */
            ["placeholder", "original"],
            ...
          ]
        }
- Features:
- Same powerful anonymization engine as the Azure DI endpoint
- Preserves markdown formatting (headers, lists, code blocks, etc.)
- Configurable entity detection with score threshold
- Consistent replacements across the document
- Optional decision process for debugging
- Response:
  - JSON object containing:

        {
          "anonymized_text": "Anonymized markdown content...",
          "statistics": { "PERSON": 3, "EMAIL_ADDRESS": 2, ... },
          "decision_process": [ /* optional debugging info */ ],
          "vault_data": [ /* Updated vault data */ ]
        }
- Description: Health check endpoint for the anonymization service. Verifies that the LLM-Guard scanner with AI4Privacy BERT model is ready.
- Response:
  - JSON object with service status and model information:

        {
          "status": "healthy",
          "service": "anonymization",
          "engines_initialized": true,
          "recognizers": "LLM-Guard with AI4Privacy model (supported entity list depends on version)",
          "model": "Isotonic/distilbert_finetuned_ai4privacy_v2"
        }

  - Returns `"status": "unhealthy"` with error details if the service is not ready.
- Description: Pseudonymize text with consistent replacements designed for reversibility. Uses vault state for maintaining mappings across multiple documents.
- Request:

      {
        "text": "Text to pseudonymize",
        "config": {
          "entity_types": ["PERSON", "EMAIL_ADDRESS", ...],
          "date_shift_days": 365
        },
        "vault_data": [ /* Optional: Previous vault data */ ]
      }

- Response:

      {
        "pseudonymized_text": "Text with consistent pseudonyms",
        "statistics": { "PERSON": 2, ... },
        "vault_data": [ /* Updated vault with all mappings */ ]
      }
- Description: Reverse pseudonymization using vault mappings to restore original values.
- Request:

      {
        "text": "Pseudonymized text",
        "vault_data": [ /* Required: Vault data from pseudonymization */ ]
      }

- Response:

      {
        "deanonymized_text": "Original text restored",
        "statistics": { "PERSON": 2, ... }
      }
The FastAPI backend implements a stable element identification system that enables tracking elements throughout the processing pipeline:
Element IDs are generated during the extraction phase (`/extract`) and follow this format:

- Pattern: `{element_type}_{page}_{global_index}_{content_hash}`
- Examples:
  - `para_1_0_a3f2b1` - First paragraph on page 1
  - `table_5_2_d4e5f6` - Third table on page 5
  - `cell_5_2_0_0_b7c8d9` - Cell at row 0, column 0 in the third table on page 5
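Because the format is positional, an ID can be split back into its parts; a small sketch (the helper name is illustrative, not part of the API):

```python
# Sketch: split an element _id into its documented components
# ({element_type}_{page}_{global_index}_{content_hash}).
def parse_element_id(element_id: str) -> dict:
    parts = element_id.split("_")
    # Cell IDs carry extra row/column indices, so keep everything between the
    # page number and the trailing hash as index components.
    return {
        "element_type": parts[0],
        "page": int(parts[1]),
        "indices": [int(p) for p in parts[2:-1]],
        "content_hash": parts[-1],
    }

print(parse_element_id("para_1_0_a3f2b1"))
# {'element_type': 'para', 'page': 1, 'indices': [0], 'content_hash': 'a3f2b1'}
print(parse_element_id("cell_5_2_0_0_b7c8d9"))
# {'element_type': 'cell', 'page': 5, 'indices': [2, 0, 0], 'content_hash': 'b7c8d9'}
```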
1. Extraction (/extract)
├── Stitches batches from Azure DI
└── Adds _id to all elements
2. Filtering (/segment-filtered)
├── Removes unwanted fields based on preset
└── PRESERVES _id fields
3. Segmentation
├── Groups filtered elements into chunks
└── Elements keep their original _id
4. LLM Processing
└── Can reference specific elements by _id
- Traceability: Track any element from LLM output back to its exact location in the original document
- Stability: IDs remain constant regardless of filtering or segmentation choices
- Debugging: Easy to correlate elements across different processing stages
- Caching: Can cache processed elements by ID for efficiency
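One way to exploit this stability is to index segmented output by `_id` so LLM citations can be traced back to their source elements; a minimal sketch (the helper name is illustrative):

```python
# Sketch: build a lookup from _id to element across all segments so that an
# element cited by an LLM can be traced back to its source.
def index_elements_by_id(segments: list[dict]) -> dict[str, dict]:
    index = {}
    for segment in segments:
        for element in segment.get("elements", []):
            if "_id" in element:
                index[element["_id"]] = element
    return index

# Usage: element = index_elements_by_id(result["segments"])["para_1_0_a3f2b1"]
```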
- PDF Processing:
- Maximum tested: 353+ pages with perfect reconstruction
- Default batch size: 1500 pages per Azure DI request
- Concurrent batch processing for optimal speed
- Sub-second execution for most operations
- Segmentation:
- Token limits: Configurable 10k-30k tokens per segment
- Intelligent boundary detection at H1/H2 headings
- Minimal memory usage through streaming architecture
- Anonymization:
- Sub-second processing for typical documents
- BERT model initialization: ~2-3 seconds on first request
- High accuracy for common PII types
- API Limits:
- Request size: Limited by web server configuration (typically 100MB)
- Timeout: Default 120 seconds, configurable
- Concurrent requests: Handled by multiple workers (default: 4)
The anonymization endpoints support the following configuration parameters:
- entity_types: List of entity types to detect and anonymize
  - Default: Basic types like `PERSON`, `DATE_TIME`, `LOCATION`, `PHONE_NUMBER`, `EMAIL_ADDRESS`, `US_SSN`, `MEDICAL_LICENSE`
  - Use `null` or `[]` to use the default list
  - Use `["all"]` to include all entity types supported by the installed LLM-Guard configuration
  - Note: US_SSN detection requires valid SSN patterns (not test patterns like 123-45-6789)
- pattern_sets: Enable predefined pattern sets (list of strings)
  - `"legal"`: Bates numbers, case numbers, docket numbers, court filings
  - `"medical"`: Medical record numbers, insurance IDs, provider numbers
- custom_patterns: Define your own regex patterns (list of pattern objects)
  - Each pattern needs: `name`, `expressions` (regex list)
  - Optional: `examples`, `context`, `score`, `languages`
- score_threshold: Minimum confidence score (0.0-1.0, default: 0.5)
- Higher values reduce false positives but may miss some entities
- Recommended range: 0.5-0.7
- anonymize_all_strings: Anonymize all string fields (true) or only known PII fields (false) (default: true)
- date_shift_days: Maximum days to shift dates for anonymization (default: 365)
- return_decision_process: Include debugging information about detection reasoning (default: false)
  - Note: Not currently supported with LLM-Guard
- Custom Regex Pattern Support: Allow users to define domain-specific entity patterns
- Multi-language Support: Currently English-only, planning to add other languages
- Batch Processing: Anonymize multiple documents in a single request
Here's a complete workflow showing how to process a sensitive PDF document:
# Extract structured data from PDF
curl -X POST http://localhost:8000/extract \
-F "file=@confidential_document.pdf" \
-F "batch_size=1500" \
-F "include_element_ids=true" \
> extracted_result.json

# Extract and segment PDF for direct LLM processing
curl -X POST http://localhost:8000/extract \
-F "file=@large_document.pdf" \
-F "enable_segmentation=true" \
-F "segment_min_tokens=10000" \
-F "segment_max_tokens=30000" \
> segmented_result.json
# Result includes both full content and segments:
# {
# "markdown_content": "Full document...",
# "json_content": { ... },
# "segments": [
# {
# "segment_id": 1,
# "token_count": 15234,
# "structural_context": { "h1": "Chapter 1", ... },
# "elements": [ ... ]
# }
# ]
# }

Docling conversion should be performed directly against docling-serve (not this API):
curl -X POST http://localhost:5001/v1/convert/file \
-F "files=@document.pdf" \
-F "to_formats=md" \
-F "to_formats=json" \
-F "do_ocr=true" \
> docling_result.json

# Prepare document for LLM with optimized token usage
curl -X POST http://localhost:8000/segment-filtered \
-H "Content-Type: application/json" \
-d '{
"source_file": "confidential_document.pdf",
"json_content": '$(cat extracted_result.json | jq .json_content)',
"filter_config": {
"filter_preset": "llm_ready",
"include_element_ids": true
},
"min_segment_tokens": 10000,
"max_segment_tokens": 30000
}' \
> segmented_result.json

# Remove PII before sending to LLM
curl -X POST http://localhost:8000/anonymization/anonymize-azure-di \
-H "Content-Type: application/json" \
-d '{
"azure_di_json": '$(cat extracted_result.json | jq .json_content)',
"config": {
"entity_types": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN"],
"pattern_sets": ["legal", "medical"], # Enable domain-specific patterns
"score_threshold": 0.6,
"anonymize_all_strings": true
}
}' \
> anonymized_result.json

# Anonymize with custom patterns
curl -X POST http://localhost:8000/anonymization/anonymize-markdown \
-H "Content-Type: application/json" \
-d '{
"markdown_text": "Case No. 1:23-cv-45678 references BATES-001234",
"config": {
"pattern_sets": ["legal"],
"custom_patterns": [
{
"name": "INTERNAL_ID",
"expressions": ["\\bID-\\d{8}\\b"],
"examples": ["ID-12345678"]
}
]
}
}'

# Create structured prompt with instructions
curl -X POST http://localhost:8000/compose-prompt \
-F 'mapping={"instructions":"Summarize the key findings","document":"@segmented_result.json"}' \
-F "document=@segmented_result.json" \
> final_prompt.txt-
- 413 Request Entity Too Large
  - Solution: Reduce the `batch_size` parameter in `/extract`
  - Default file size limit can be increased in server configuration

- Azure DI Timeout (504 Gateway Timeout)
  - Large PDFs may exceed Azure's processing time
  - Solution: Use smaller batch sizes (e.g., 500-1000 pages)

- Memory Errors
  - For documents with many tables or complex layouts
  - Solution: Process in smaller segments or increase server memory

- AI4Privacy Model Loading Errors
  - LLM-Guard will download the AI4Privacy model on first use
  - Solution: Ensure internet connectivity for model download (~134MB)

- Invalid Azure Credentials
  - Check your `.env` file configuration
  - Verify the endpoint URL includes `https://` and a trailing `/`
- Enable debug logging: Set `LOG_LEVEL=DEBUG` in `.env`
- Check element IDs for tracking issues through the pipeline
- Use `/anonymization/health` to verify service status
- Test with smaller documents first
- Use HTTPS in Production: Always deploy with TLS/SSL certificates
- Secure Credentials:
- Store Azure keys in environment variables, never in code
- Use Azure Key Vault or similar for production deployments
- Rotate API keys regularly
- One-Way Anonymization: Mappings are not stored server-side; reversal is only possible with the vault_data returned by the pseudonymization endpoints
- Review Output: Always verify anonymized content before sharing
- Session Isolation: Each anonymization request uses isolated replacement mappings
- Score Threshold: Adjust based on your security requirements (higher = fewer false positives)
- API Authentication: Consider adding authentication middleware for production
- Network Isolation: Deploy in a private network for sensitive documents
- Rate Limiting: Implement to prevent abuse
- CORS Configuration: Restrict to trusted domains only
- Azure Document Intelligence is HIPAA compliant with proper configuration
- Anonymization helps meet GDPR/CCPA requirements
- Audit logs should be implemented for forensic use cases
- Consider data residency requirements for your jurisdiction