A Streamlit-powered application to extract structured purchase order (PO) data—including vendor details and line items—from scanned or digital PDFs using intelligent parsing and regex techniques.
Organizations deal with thousands of purchase orders (POs) daily, many of which are unstructured PDFs—scanned, poorly formatted, or handwritten. Manual entry or parsing of such documents is time-consuming, error-prone, and inefficient.
This tool automates the extraction of key fields and line items from single or multi-PO PDFs, saving hours of manual labor while improving accuracy.
The goal is to build a lightweight, interactive PO extraction system that:
- Processes multiple POs from a single PDF file.
- Extracts key metadata (PO number, vendor, address, date, total).
- Parses line items using robust pattern-matching techniques.
- Supports non-tabular PDF formats, with an OCR-ready architecture that can be extended to scanned PDFs.
- Allows easy download of structured results (Excel, JSON, Annotated PDF).
| Domain | Use Case |
|---|---|
| 🏢 Enterprises | Automate invoice/PO processing in procurement & finance teams. |
| 📦 Supply Chain | Quickly parse bulk POs to match items with inventory or shipment data. |
| 🏛️ Government | Speed up document digitization and archival of procurement files. |
| 📊 Data Entry Automation | Reduce cost and increase throughput for data entry BPOs. |
| 🧾 Audit & Compliance | Extract structured logs for analysis and cross-verification. |
- ✅ Multi-PO Extraction: Supports one or many purchase orders in a single PDF.
- 🧠 Regex + NLP Parsing: Extracts line items from raw text without relying on tables.
- 📸 PDF Annotation: Highlights extracted values in the PDF using PyMuPDF.
- 💾 Download Outputs: Exports structured JSON, Excel files, and annotated PDF.
- 🎛️ Streamlit UI: Clean, user-friendly interface for drag-and-drop uploads.
- 📜 OCR-Ready: Can be extended to support scanned PDFs using `pytesseract`.
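The OCR extension mentioned above is not part of the tool as described; the following is only a minimal sketch of how such a fallback could look. The helper names (`needs_ocr`, `extract_page_text`, `extract_pdf_text`) and the `min_chars` threshold are assumptions for illustration, and `pytesseract` additionally needs the Tesseract binary installed on the system.

```python
def needs_ocr(text, min_chars=20):
    """Heuristic (assumed threshold): a page with almost no embedded text is likely a scan."""
    return text is None or len(text.strip()) < min_chars


def extract_page_text(page, resolution=300):
    """Return the embedded text of a pdfplumber page, falling back to OCR if it is (nearly) empty."""
    text = page.extract_text()
    if not needs_ocr(text):
        return text
    # Lazy import: pytesseract is only required when a scanned page is found.
    import pytesseract
    image = page.to_image(resolution=resolution).original  # underlying PIL image
    return pytesseract.image_to_string(image)


def extract_pdf_text(path):
    """Concatenate the text of every page in the PDF at `path`."""
    import pdfplumber
    with pdfplumber.open(path) as pdf:
        return "\n".join(extract_page_text(page) for page in pdf.pages)
```

Pages that already carry a text layer never touch the OCR path, so digital PDFs keep the faster `pdfplumber` extraction.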
- Python 3.8+
- Streamlit
- pdfplumber (for text extraction)
- PyMuPDF (fitz) (for PDF annotation)
- re (Regex) (for field & item parsing)
- pandas (for structured data)
- `zipfile`, `io`, `json` (for file handling and downloads)
- Upload PDF – Drag and drop a purchase order PDF (with single or multiple POs).
- Text Parsing – The PDF is parsed using `pdfplumber`; text is split into PO blocks.
- Field Extraction – Key metadata is extracted using regex:
  - `PO_Number`
  - `Vendor`
  - `Address`
  - `Date`
  - `Total_Amount`
- Line Item Extraction – Each block is scanned for itemized purchases using flexible regex:
  - e.g., `HP Printer 123 - Qty: 2 - Price: 1200.00`
- Annotation – Values are highlighted in the original PDF using `fitz`.
- Download – Outputs available in:
  - 📄 `All_PO_Main_Fields.xlsx`
  - 📦 `All_PO_Line_Items.xlsx`
  - 🔖 `Annotated_PO.pdf`
  - 🧾 `All_PO_Structured_Data.json`
  - 🗜️ All bundled in a downloadable ZIP.
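As an illustration of the final download step, the sketch below bundles output files into an in-memory ZIP using only `io`, `json`, and `zipfile`. The `build_zip` helper and the sample record are hypothetical and are not taken from the tool's source; only the output file name comes from the list above.

```python
import io
import json
import zipfile


def build_zip(files):
    """Pack a {filename: bytes} mapping into an in-memory ZIP archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Hypothetical extraction result; real records would come from the parsing steps.
records = [
    {
        "PO_Number": "PO-1001",
        "Vendor": "Acme Supplies",
        "Line_Items": [{"Item": "HP Printer 123", "Quantity": 2, "Unit_Price": 1200.0}],
    }
]

json_bytes = json.dumps(records, indent=2).encode("utf-8")
zip_bytes = build_zip({"All_PO_Structured_Data.json": json_bytes})
```

The Excel files and the annotated PDF could be added to the same mapping as bytes, for example by writing a pandas DataFrame to a `BytesIO` buffer with `to_excel`. In a Streamlit app the resulting `zip_bytes` could then be passed to `st.download_button` with `mime="application/zip"`.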
To ensure accurate parsing, your POs should follow this ideal structure:
```
Purchase Order
PO Number:
Vendor:
Address:
Date:
Total Amount:
Item - Quantity: - Unit Price:
```
You can include multiple POs in a single PDF. Each PO should begin with the "Purchase Order" keyword.
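To make the expected layout concrete, here is a minimal sketch of how text in this structure could be split into PO blocks and parsed with `re`. The patterns, function names, and sample values are assumptions for illustration and may differ from the tool's actual regexes. The line-item pattern accepts both the `Quantity`/`Unit Price` and the `Qty`/`Price` label styles.

```python
import re

# Hypothetical sample text in the layout described above; all values are made up.
SAMPLE = """Purchase Order
PO Number: PO-1001
Vendor: Acme Supplies
Address: 12 Market Road, Kolkata
Date: 2024-01-15
Total Amount: 2400.00
HP Printer 123 - Quantity: 2 - Unit Price: 1200.00
Purchase Order
PO Number: PO-1002
Vendor: Globex Traders
Address: 5 Lake View, Pune
Date: 2024-02-01
Total Amount: 150.50
USB Cable - Qty: 3 - Price: 50.00
Desk Lamp - Qty: 1 - Price: 0.50
"""

# One pattern per metadata field; each captures the rest of its line.
FIELD_PATTERNS = {
    "PO_Number": r"PO Number:[ \t]*(.+)",
    "Vendor": r"Vendor:[ \t]*(.+)",
    "Address": r"Address:[ \t]*(.+)",
    "Date": r"Date:[ \t]*(.+)",
    "Total_Amount": r"Total Amount:[ \t]*([\d,]+(?:\.\d+)?)",
}

# "<item> - Qty|Quantity: <n> - Price|Unit Price: <amount>" on a single line.
ITEM_PATTERN = re.compile(
    r"^(?P<item>.+?)[ \t]*-[ \t]*(?:Qty|Quantity):[ \t]*(?P<qty>\d+)"
    r"[ \t]*-[ \t]*(?:Unit[ \t]+)?Price:[ \t]*(?P<price>[\d,]+(?:\.\d+)?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_po_blocks(text):
    """Split raw text into blocks, each starting at a 'Purchase Order' line."""
    parts = re.split(r"(?im)^(?=[ \t]*Purchase Order\b)", text)
    return [p.strip() for p in parts if p.strip().lower().startswith("purchase order")]


def extract_fields(block):
    """Return a dict of metadata fields; missing fields map to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, block, re.IGNORECASE)
        fields[name] = match.group(1).strip() if match else None
    return fields


def extract_line_items(block):
    """Return every line item found in the block as a list of dicts."""
    return [
        {
            "Item": m["item"].strip(),
            "Quantity": int(m["qty"]),
            "Unit_Price": float(m["price"].replace(",", "")),
        }
        for m in ITEM_PATTERN.finditer(block)
    ]


for block in split_po_blocks(SAMPLE):
    print(extract_fields(block)["PO_Number"], extract_line_items(block))
```

Any text that appears before the first "Purchase Order" line is discarded by `split_po_blocks`, which is why each PO needs to start with that keyword.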
```bash
git clone https://github.com/yourusername/po-automation-tool.git
cd po-automation-tool
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
```

```bash
pip install -r requirements.txt
streamlit run app.py
```
- 🧾 Support table-based line items with layout-parser.
- 🧠 Use LLMs for more accurate description/line-item segmentation.
- 👩‍💻 Ankita Ghosh
- Postgraduate in CSE (Data Science) | IEEE Researcher | AI/ML Developer
- Feel free to connect for collaboration or contributions!
- Pull requests are welcome. For major changes, open an issue first.
- Please ensure tests are updated as appropriate.