Module 05 · Document Processing

Automate enterprise document intake: extract, classify, and structure data from PDFs, Word docs, and scanned images.

Files

| File | Description |
|------|-------------|
| document_pipeline.py | Full pipeline: extract → classify → structure → PII detect |

Quick Start

python 05-document-processing/document_pipeline.py

Pipeline Stages

  1. Text Extraction — PDF (pypdf), DOCX (python-docx), Images (Tesseract OCR)
  2. Classification — Zero-shot document type detection (HuggingFace)
  3. Structured Extraction — LLM-powered field extraction (Grok/OpenRouter)
  4. PII Detection — NER-based entity detection before storage
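
As an illustration of stage 1, the sketch below routes a file to an extractor based on its extension. It is a minimal sketch and not the code in document_pipeline.py: the function names (`extract_pdf`, `extract_docx`, `extract_image`, `pick_extractor`, `extract_text`) and the extension table are assumptions made for this example. The third-party imports are deferred into each function, so only the libraries for the formats actually processed need to be installed.

```python
from pathlib import Path
from typing import Callable, Dict


def extract_pdf(path: Path) -> str:
    """Extract text from a PDF with pypdf (hypothetical helper)."""
    from pypdf import PdfReader  # imported lazily: only needed for PDFs

    reader = PdfReader(str(path))
    # extract_text() can return an empty result for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path: Path) -> str:
    """Extract paragraph text from a Word file with python-docx."""
    from docx import Document  # provided by the python-docx package

    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def extract_image(path: Path) -> str:
    """OCR a scanned image with Tesseract via pytesseract."""
    from PIL import Image
    import pytesseract  # requires the Tesseract binary to be installed

    return pytesseract.image_to_string(Image.open(path))


# Assumed mapping from file extension to extractor; adjust as needed.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
    ".tiff": extract_image,
}


def pick_extractor(path) -> Callable[[Path], str]:
    """Return the extractor for a path, matching the suffix case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"Unsupported document type: {suffix or '(no extension)'}")
    return EXTRACTORS[suffix]


def extract_text(path) -> str:
    """Run the matching extractor on the given file."""
    return pick_extractor(path)(Path(path))
```

Keeping the mapping in a dictionary means a new format can be supported by registering one more function, without touching the dispatch logic.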

Supported Document Types

  • Invoices → vendor, line items, totals
  • Contracts → parties, dates, obligations
  • Policies → requirements, owners, review dates
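
The field lists above can be treated as a small schema for checking what the LLM returns in stage 3. The following is a hedged sketch, assuming the model is asked to reply with a JSON object; the key names (`line_items`, `review_dates`, and so on), the `REQUIRED_FIELDS` table, and `parse_extraction` are hypothetical and chosen only to mirror the bullets above.

```python
import json
from typing import Any, Dict, Tuple

# Assumed key names, derived from the supported document types listed above.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "invoice": ("vendor", "line_items", "totals"),
    "contract": ("parties", "dates", "obligations"),
    "policy": ("requirements", "owners", "review_dates"),
}


def parse_extraction(doc_type: str, raw_json: str) -> Dict[str, Any]:
    """Parse an LLM's JSON reply and keep only the fields expected for doc_type.

    Raises ValueError for an unknown document type, a reply that is not a
    JSON object, or a reply that is missing one of the expected fields.
    """
    if doc_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown document type: {doc_type}")

    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Reply must be a JSON object")

    expected = REQUIRED_FIELDS[doc_type]
    missing = [key for key in expected if key not in data]
    if missing:
        raise ValueError(f"Missing fields for {doc_type}: {', '.join(missing)}")

    # Drop any keys the schema does not name, so storage stays predictable.
    return {key: data[key] for key in expected}
```

For example, `parse_extraction("invoice", reply)` returns a dictionary with exactly the `vendor`, `line_items`, and `totals` keys, or raises `ValueError` so the document can be routed to manual review.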

Enterprise Use Cases

  • Accounts payable automation (invoice processing)
  • Contract lifecycle management
  • Compliance document review
  • HR document digitization