A lightweight app that extracts structured information from PDF invoices using local LLMs (like LLaMA3.2) and semantic search with sentence-transformers—upload an invoice, ask a question, and get accurate answers.

Prayesh13/Invoice-Data-Extraction

🧾 Invoice Data Extraction with LLM RAG on CPU

This project is a local Retrieval-Augmented Generation (RAG) pipeline built with LangChain, ChromaDB, and Ollama to extract structured information from invoice PDFs using LLaMA3.2:latest running on CPU.

✅ No GPU required. ✅ Runs entirely offline and keeps invoice data private once the model and embedding weights have been downloaded. ✅ Powered by sentence-transformers (all-mpnet-base-v2) for embeddings and ChromaDB for vector storage.


⚙️ Features

  • Upload any text-based PDF invoice

  • Ask structured or natural language questions like "What is the invoice number?" or "Who is the client?"

  • Built on LLaMA3.2:latest (3B model via Ollama)

  • Uses Sentence Transformers for semantic chunking and vector similarity

  • Full offline support, privacy-preserving, and works on CPU

  • UI built with Streamlit for PDF upload and querying


🧪 Tech Stack

  • 🧠 LLM: LLaMA3.2:latest via Ollama

  • 🧲 Embedding Model: sentence-transformers/all-mpnet-base-v2

  • 📚 LangChain: For RAG pipeline and chaining

  • 📦 ChromaDB: Fast and simple vector store

  • 📄 PDF Loader: LangChain’s PyPDFLoader

  • 🎛️ Streamlit: Web UI
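
To illustrate how the embedding model and the vector store work together, here is a minimal sketch of similarity-based retrieval in plain Python. It is not the project's code: ChromaDB performs this lookup internally, and the three-dimensional vectors below are made-up stand-ins for the 768-dimensional embeddings that all-mpnet-base-v2 produces. The function names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings (assumed values, for illustration only).
chunk_vecs = [
    [1.0, 0.0, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.9, 0.1, 0.0],  # chunk 2
]
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, chunk_vecs, k=2)
print(best)
```

The chunks whose indices come back are the ones that would be passed to the LLM as context for the question.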


🛠️ Setup Instructions

1. Install dependencies

pip install -r requirements.txt

2. Install and set up Ollama

Follow the instructions at https://ollama.com, then pull the model:

ollama pull llama3.2:latest
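
Before moving on, it can help to confirm that the local Ollama server is running and has the model. The sketch below is an optional check, not part of the repository. It assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/tags` endpoint, which lists locally available models. The helper names `model_names` and `is_model_available` are hypothetical.

```python
import json
import urllib.request

# Default local Ollama address (assumption: the port has not been changed).
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_names(tags_payload):
    """Extract the model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def is_model_available(model_name, url=OLLAMA_TAGS_URL, timeout=5):
    """Return True if the server lists model_name; False if it does not or is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers malformed JSON.
        return False
    return model_name in model_names(payload)
```

For example, `is_model_available("llama3.2:latest")` should return `True` once the model tag named in the Tech Stack has been pulled and the server is running.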

3. Prepare PDF data

Place your text-based invoice PDFs in the data/ directory.


4. Generate vector store from PDF

python ingest.py
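
The contents of ingest.py are not shown here, but an ingestion step of this kind typically splits the extracted PDF text into overlapping chunks before embedding them. The following is a minimal sketch of that idea in plain Python. The function name `split_into_chunks` and the default sizes are assumptions for illustration; a LangChain text splitter would normally take this role in the real pipeline.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into character chunks of at most chunk_size, each sharing
    `overlap` characters with the previous chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a field such as an invoice number from being cut in half at a chunk boundary, at the cost of storing some text twice.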

5. Run the app with a question

python main.py "What is the client name?"
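
Behind a command like this, a RAG pipeline usually retrieves the most relevant chunks and combines them with the question into a single prompt for the LLM. The sketch below shows one way that assembly step might look. The prompt wording, the function name `build_prompt`, and the sample invoice text are all hypothetical, not taken from rag/pipeline.py.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved invoice chunks and the user's question into one prompt string."""
    context = "\n\n".join(
        f"[chunk {i}]\n{chunk.strip()}"
        for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the invoice text below. "
        "Reply with a single JSON object.\n\n"
        f"Invoice text:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Made-up chunks standing in for text retrieved from the vector store.
prompt = build_prompt(
    "What is the client name?",
    ["Bill To: Acme Corp ", "Invoice #: 1001"],
)
print(prompt)
```

Restricting the model to the supplied text is a common way to reduce answers that are not grounded in the invoice.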

📁 Folder Structure

.
├── data/                # Folder with PDF invoices
├── vectorstore/         # Persisted ChromaDB vectors
├── rag/
│   └── pipeline.py      # RAG chain builder
├── app.py               # Streamlit interface
├── ingest.py            # PDF to vector processor
├── main.py              # CLI for querying
├── config.yml           # Config settings
└── requirements.txt

📌 Example Query

python main.py "What is the invoice date?"

Output:

{
  "invoice_date": "2024-03-18"
}
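
Local models sometimes wrap the JSON in extra prose, so a caller that consumes this output may want to extract the object defensively. The following is a minimal sketch using only the standard library. The function name `extract_json_object` is hypothetical and the sample reply is invented for illustration.

```python
import json


def extract_json_object(reply):
    """Return the first JSON object found in reply as a dict, or None if there is none."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None


reply = 'Sure! {"invoice_date": "2024-03-18"} Let me know if you need more.'
print(extract_json_object(reply))
```

If the function returns `None`, the reply contained no parseable JSON object and the question may need to be retried or rephrased.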
