Automate enterprise document intake: extract, classify, and structure data from PDFs, Word docs, and scanned images.

| File | Description |
|---|---|
| `document_pipeline.py` | Full pipeline: extract → classify → structure → PII detect |

Run the pipeline with `python 05-document-processing/document_pipeline.py`.

- Text Extraction — PDF (pypdf), DOCX (python-docx), Images (Tesseract OCR)
- Classification — Zero-shot document type detection (HuggingFace)
- Structured Extraction — LLM-powered field extraction (Grok/OpenRouter)
- PII Detection — NER-based entity detection before storage
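
The first two stages above can be sketched roughly as follows. This is a minimal illustration, not the contents of `document_pipeline.py`: the function names, the extension table, the use of `pytesseract` with Pillow as the Tesseract wrapper, and the `facebook/bart-large-mnli` model are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable


def extract_pdf(path: str) -> str:
    # Imported lazily so the other formats work without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Image-only pages yield no text; fall back to an empty string
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_docx(path: str) -> str:
    import docx  # provided by the python-docx package

    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def extract_image(path: str) -> str:
    # Assumption: pytesseract + Pillow are used to call the Tesseract binary
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(path))


# Hypothetical mapping from file extension to extractor
EXTRACTORS: dict[str, Callable[[str], str]] = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
    ".tif": extract_image,
    ".tiff": extract_image,
}


def pick_extractor(path: str) -> Callable[[str], str]:
    """Choose an extractor from the (case-insensitive) file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


def extract_text(path: str) -> str:
    return pick_extractor(path)(path)


def classify(text: str, labels: list[str]) -> str:
    """Zero-shot classification; returns the highest-scoring label."""
    from transformers import pipeline

    # Assumption: any NLI-based zero-shot model could be substituted here
    classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
    # Truncate long documents; the 2000-character cutoff is arbitrary
    result = classifier(text[:2000], candidate_labels=labels)
    return result["labels"][0]  # labels are sorted by descending score
```

A call such as `classify(extract_text("invoice.pdf"), ["invoice", "contract", "policy"])` would chain the two stages. The third-party imports sit inside the functions so that a missing optional dependency only affects the format that needs it.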

Fields extracted per document type:

- Invoices → vendor, line items, totals
- Contracts → parties, dates, obligations
- Policies → requirements, owners, review dates
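
One way to turn the field lists above into an LLM request is to keep a per-type schema, build the prompt from it, and normalize the model's JSON reply against the same schema. The sketch below is illustrative only: the snake_case keys, the prompt wording, and the helper names `build_prompt` and `parse_response` are assumptions, and the actual call to the LLM provider is deliberately left out.

```python
import json

# Field names follow the list above; the snake_case keys are an assumption
FIELDS_BY_TYPE = {
    "invoice": ["vendor", "line_items", "totals"],
    "contract": ["parties", "dates", "obligations"],
    "policy": ["requirements", "owners", "review_dates"],
}


def build_prompt(doc_type: str, text: str) -> str:
    """Build an extraction prompt listing the expected JSON keys."""
    fields = FIELDS_BY_TYPE[doc_type]
    return (
        f"Extract the following fields from this {doc_type} and reply with "
        f"a single JSON object with exactly these keys: {', '.join(fields)}.\n"
        "Use null for any field that is not present.\n\n"
        f"Document:\n{text}"
    )


def parse_response(doc_type: str, raw: str) -> dict:
    """Parse the model's reply and keep only the expected fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Drop unexpected keys and fill missing ones with None
    return {field: data.get(field) for field in FIELDS_BY_TYPE[doc_type]}
```

For example, `parse_response("invoice", '{"vendor": "ACME", "totals": 120.5, "extra": 1}')` keeps `vendor` and `totals`, sets `line_items` to `None`, and discards `extra`, so downstream code always sees the same keys for a given document type.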

Use cases:

- Accounts payable automation (invoice processing)
- Contract lifecycle management
- Compliance document review
- HR document digitization