This project is a local Retrieval-Augmented Generation (RAG) pipeline built with LangChain, ChromaDB, and Ollama to extract structured information from invoice PDFs using LLaMA3.2:latest running on CPU.
✅ No GPU required.
✅ Entirely offline and private.
✅ Powered by sentence-transformers (all-mpnet-base-v2) for embeddings and ChromaDB for vector storage.
- Upload any text-based PDF invoice
- Ask structured or natural-language questions like "What is the invoice number?" or "Who is the client?"
- Built on LLaMA3.2:latest (3B by default) via Ollama
- Uses Sentence Transformers for semantic chunking and vector similarity
- Fully offline, privacy-preserving, and works on CPU
- UI built with Streamlit for PDF upload and querying
- 🧠 LLM: LLaMA3.2:latest via Ollama
- 🧲 Embedding model: sentence-transformers/all-mpnet-base-v2
- 📚 LangChain: RAG pipeline and chaining
- 📦 ChromaDB: fast and simple vector store
- 📄 PDF loader: LangChain’s PyPDFLoader
- 🎛️ Streamlit: UI interface
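As a concrete picture of how these pieces chain together, here is a minimal sketch of the retrieval chain in the style of rag/pipeline.py. It assumes the langchain-community integrations are installed; the persist directory mirrors the repo layout, and the top-k value is illustrative.

```python
# Minimal RAG-chain sketch (assumes langchain + langchain-community installed).
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Same embedding model used at ingest time, so query vectors match the store.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# Re-open the persisted ChromaDB collection ("vectorstore" mirrors the repo layout).
vectordb = Chroma(persist_directory="vectorstore", embedding_function=embeddings)

# Local LLM served by Ollama; runs entirely on CPU.
llm = Ollama(model="llama3.2:latest")

# Stuff the top-k retrieved chunks into the prompt and ask the LLM.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),  # k=4 is illustrative
)

result = qa_chain.invoke({"query": "What is the invoice number?"})
print(result["result"])
```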
Install the Python dependencies:

```bash
pip install -r requirements.txt
```

Install Ollama by following the instructions at https://ollama.com, then pull the model:

```bash
ollama pull llama3.2:latest
```

Place your text-based invoice PDFs in the data/ directory, then build the vector store and query it from the command line:

```bash
python ingest.py
python main.py "What is the client name?"
```
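Under the hood, ingest.py plausibly follows the standard LangChain ingestion recipe: load each PDF with PyPDFLoader, split pages into overlapping chunks, embed them with all-mpnet-base-v2, and persist the vectors to ChromaDB. The chunk sizes and paths below are illustrative assumptions, not the repo's exact values.

```python
# Sketch of the ingestion step; chunk sizes and paths are assumptions.
from pathlib import Path

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load every PDF in data/ into LangChain Document objects (one per page).
docs = []
for pdf in sorted(Path("data").glob("*.pdf")):
    docs.extend(PyPDFLoader(str(pdf)).load())

# Split pages into overlapping chunks so retrieval stays focused.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Embed the chunks and persist them to the local Chroma store.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
Chroma.from_documents(chunks, embeddings, persist_directory="vectorstore")
```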
Project layout:

```
├── data/              # Folder with PDF invoices
├── vectorstore/       # Persisted ChromaDB vectors
├── rag/
│   ├── pipeline.py    # RAG chain builder
├── app.py             # Streamlit interface
├── ingest.py          # PDF-to-vector processor
├── main.py            # CLI for querying
├── config.yml         # Config settings
├── requirements.txt
```
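config.yml centralizes the knobs used above. The keys shown here are a hypothetical shape for the file, not its confirmed contents:

```yaml
# Hypothetical config.yml shape — key names are illustrative, not the repo's actual schema.
llm_model: "llama3.2:latest"
embedding_model: "sentence-transformers/all-mpnet-base-v2"
data_dir: "data"
vectorstore_dir: "vectorstore"
chunk_size: 1000
chunk_overlap: 100
top_k: 4
```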
Example query:

```bash
python main.py "What is the invoice date?"
```

Output:

```json
{
  "invoice_date": "2024-03-18"
}
```
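main.py itself can be a thin CLI wrapper around the chain. In the sketch below, build_chain is a hypothetical name for the constructor exported by rag/pipeline.py, not a confirmed API of this repo:

```python
# Hypothetical shape of main.py; build_chain is an assumed name for the
# chain builder in rag/pipeline.py.
import sys

from rag.pipeline import build_chain  # hypothetical helper


def main() -> None:
    # First CLI argument is the question to ask about the ingested invoices.
    question = sys.argv[1] if len(sys.argv) > 1 else "What is the invoice number?"
    chain = build_chain()
    result = chain.invoke({"query": question})
    print(result["result"])


if __name__ == "__main__":
    main()
```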