A hybrid invoice OCR pipeline that extracts structured JSON data from invoices. This project demonstrates an end-to-end intelligent document processing pipeline.
📹 (https://drive.google.com/file/d/17MpFdzs3mrm-jd801CzsPZ-NdsQOmw6s/view?usp=drive_link)
- Accepts PDF or image uploads (pdf2image used to convert PDFs).
- Multi-page PDFs supported (each page converted to an image and processed).
- Extracts structured data (invoice_number, vendor_name, invoice_date, line_items, grand_total, etc.).
- Switches between a paid API and open-source OCR.
- Returns structured JSON.
- Sample invoices included for testing/demo.
- Backend: FastAPI
- Frontend: Gradio
- OCR: pdf2image + Tesseract (open-source) OR Together AI API
- Language Model (API): Qwen2.5-VL-72B-Instruct (Together AI)
Modes:
- paid: Uses Together AI (Qwen2.5-VL-72B-Instruct) to extract structured JSON from invoice images.
- open_source: Uses pytesseract OCR + heuristics for a free fallback option.
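To make the "OCR + heuristics" idea in open_source mode concrete, here is a minimal sketch of regex-based field extraction from OCR text. The patterns and the function name `parse_invoice_text` are illustrative assumptions, not the project's actual heuristics; real invoices vary widely and would need tuning.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; not taken from the project.
INVOICE_NO_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*([A-Z0-9][A-Z0-9\-/]*)",
    re.IGNORECASE,
)
# \b keeps "Subtotal" from being mistaken for "Total".
TOTAL_RE = re.compile(
    r"\b(?:grand\s+total|total)\s*[:\-]?\s*[$€£]?\s*([\d,]+\.\d{2})",
    re.IGNORECASE,
)
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")


def parse_invoice_text(text: str) -> dict:
    """Pull a few fields out of raw OCR text; missing fields become None."""

    def first(pattern: "re.Pattern[str]") -> Optional[str]:
        match = pattern.search(text)
        return match.group(1) if match else None

    total = first(TOTAL_RE)
    return {
        "invoice_number": first(INVOICE_NO_RE),
        "invoice_date": first(DATE_RE),
        "grand_total": float(total.replace(",", "")) if total else None,
    }
```

In practice the input text would come from something like `pytesseract.image_to_string(image)` for each page image.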
On Windows PowerShell:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install -r requirements.txt
On macOS/Linux (bash/zsh):
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
- Poppler (required for PDF → image conversion via pdf2image)
  - Windows: Download a Poppler build (e.g., from https://github.com/oschwartz10612/poppler-windows/releases), unzip to C:\poppler\poppler-23.08.0, and ensure the Library\bin folder exists.
  - macOS: brew install poppler
  - Ubuntu/Debian: sudo apt install poppler-utils
- Tesseract (recommended for open-source OCR mode)
  - Windows: Install from https://github.com/tesseract-ocr/tesseract (ensure the installer adds Tesseract to PATH, or note the install path, e.g., C:\Program Files\Tesseract-OCR\tesseract.exe).
  - macOS: brew install tesseract
  - Ubuntu/Debian: sudo apt install tesseract-ocr
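After installing the system dependencies, a quick PATH check can confirm they are discoverable. The sketch below is not part of the project; `find_missing_tools` is a hypothetical helper. It looks for `pdftoppm` (a binary shipped with Poppler that pdf2image calls) and `tesseract`.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# pdftoppm ships with Poppler; tesseract is the OCR command-line tool.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def find_missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("Poppler and Tesseract were found on PATH.")
```

A tool reported as missing here can still work if its location is supplied explicitly (for example via POPPLER_PATH on Windows).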
Copy .env.example to .env and set values as needed (only the API key is required for paid mode):
TOGETHER_API_KEY="your_api_key_here"
TOGETHER_MODEL="Qwen/Qwen2.5-VL-72B-Instruct"
TOGETHER_INFERENCE_URL="https://api.together.xyz/v1/chat/completions"
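As a rough illustration of how these variables might be read, the sketch below treats the API key as required and falls back to the values shown above for the other two. The `TogetherSettings` class and `load_together_settings` function are hypothetical names, and whether the backend applies the same defaults is an assumption. Loading the .env file into the process environment (e.g. with python-dotenv) is not shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TogetherSettings:
    api_key: str
    model: str
    inference_url: str


def load_together_settings(env: Mapping[str, str] = os.environ) -> TogetherSettings:
    """Read paid-mode settings; only TOGETHER_API_KEY is mandatory."""
    api_key = env.get("TOGETHER_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("TOGETHER_API_KEY is not set; paid mode needs it.")
    return TogetherSettings(
        api_key=api_key,
        # Assumed defaults, copied from the example values in this README.
        model=env.get("TOGETHER_MODEL", "Qwen/Qwen2.5-VL-72B-Instruct"),
        inference_url=env.get(
            "TOGETHER_INFERENCE_URL",
            "https://api.together.xyz/v1/chat/completions",
        ),
    )
```

Failing fast on a missing key gives a clearer message at startup than a 401 from the API later.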
Windows-specific optional settings:
- Set POPPLER_PATH if Poppler is not on PATH. The backend uses this for PDF conversion on Windows and defaults to C:\poppler\poppler-23.08.0\Library\bin if not set.
  - PowerShell (current session): $env:POPPLER_PATH = "C:\poppler\poppler-23.08.0\Library\bin"
  - Or set permanently via System Properties → Environment Variables.
- Ensure Tesseract is on PATH for open_source mode. If not, either add it to PATH or configure pytesseract.pytesseract.tesseract_cmd in code to the full path, e.g. C:\Program Files\Tesseract-OCR\tesseract.exe.
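The two Windows settings above could be wired up roughly as follows. This is a sketch under assumptions: `resolve_poppler_path` and `configure_tesseract` are hypothetical helpers, and returning `None` on non-Windows systems (so pdf2image falls back to PATH) is a guess at reasonable behavior rather than a description of the backend.

```python
import os
from typing import Mapping, Optional

# Default taken from the text above; a raw string avoids escaping backslashes.
DEFAULT_POPPLER_PATH = r"C:\poppler\poppler-23.08.0\Library\bin"


def resolve_poppler_path(
    env: Mapping[str, str] = os.environ,
    platform: str = os.name,
) -> Optional[str]:
    """Return a folder for pdf2image's poppler_path, or None to rely on PATH."""
    if platform != "nt":  # os.name is "nt" on Windows
        return None
    return env.get("POPPLER_PATH") or DEFAULT_POPPLER_PATH


def configure_tesseract(cmd_path: str) -> None:
    """Point pytesseract at an explicit tesseract executable."""
    import pytesseract  # imported lazily so the path helper works without it

    pytesseract.pytesseract.tesseract_cmd = cmd_path
```

For example, `configure_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")` would be called once at startup, before any OCR call.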
PowerShell:
uvicorn src.backend.main:app --reload
In a second terminal (with the same venv activated):
python src/frontend/gradio_app.py
Open the Gradio UI at the URL printed in the terminal (typically http://127.0.0.1:7860, may vary). Try files from sample_invoices/.
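Because the frontend depends on the backend being up, it can help to check that something is listening before launching the UI. The sketch below assumes the backend is on uvicorn's default address, 127.0.0.1:8000; `backend_is_listening` is a hypothetical helper and only tests that the TCP port accepts connections, not that the API is healthy.

```python
import socket


def backend_is_listening(
    host: str = "127.0.0.1",
    port: int = 8000,
    timeout: float = 1.0,
) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if backend_is_listening():
        print("Backend port is open.")
    else:
        print("Nothing is listening on 127.0.0.1:8000; start uvicorn first.")
```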
- Poppler not found / 500 error converting PDFs:
  - Ensure Poppler is installed and POPPLER_PATH points to its Library\bin folder (Windows). The backend uses this path when converting PDFs via pdf2image.
- Tesseract not found in open_source mode:
  - Add Tesseract to PATH or set pytesseract.pytesseract.tesseract_cmd to its full path.
- 401/403 errors in paid mode:
  - Verify TOGETHER_API_KEY in .env. Restart the backend after changes.
- Connection failed between frontend and backend:
  - Confirm the backend at http://127.0.0.1:8000 is running before starting the frontend. The frontend posts to that URL.
- Poor OCR quality (open_source):
  - Try higher DPI for PDFs or better scans. The backend uses 300 DPI for PDF conversion by default.
- Multi-page PDFs:
  - Each page is processed and aggregated. The response includes a pages field to confirm the page count.
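The multi-page flow described above (convert each page at 300 DPI, process it, aggregate, report a page count) might look roughly like this. `pdf_to_images` wraps the real `pdf2image.convert_from_path` call; `aggregate_pages` and its merge rule (first non-empty value wins, line items concatenated) are assumptions for illustration, not the backend's actual aggregation logic.

```python
from typing import Any, Dict, List, Optional


def pdf_to_images(
    pdf_path: str,
    dpi: int = 300,
    poppler_path: Optional[str] = None,
) -> list:
    """Convert every PDF page to a PIL image using pdf2image."""
    from pdf2image import convert_from_path  # lazy import; requires Poppler

    return convert_from_path(pdf_path, dpi=dpi, poppler_path=poppler_path)


def aggregate_pages(page_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-page dicts into one result (hypothetical merge rule)."""
    merged: Dict[str, Any] = {"pages": len(page_results), "line_items": []}
    for result in page_results:
        for key, value in result.items():
            if key == "line_items":
                # Line items from all pages are kept in page order.
                merged["line_items"].extend(value or [])
            elif value not in (None, "") and key not in merged:
                # The first page that supplies a field wins.
                merged[key] = value
    return merged
```

Raising `dpi` above 300 can help on small print at the cost of slower conversion and larger images.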