A comprehensive RAG (Retrieval-Augmented Generation) application suite for interacting with PDF documents using Large Language Models. This repository contains two implementations: a text-only PDF chat application and an advanced multimodal PDF chat that can understand text, images, and tables.
- Text-only PDF processing: Extract and process text content from PDFs
- RAG pipeline: Retrieve relevant context and generate answers
- Streamlit interface: User-friendly web interface for chatting with PDFs
- Vector search: Semantic search using ChromaDB and embeddings
- Multimodal processing: Handles text, images, and tables from PDFs
- Vision model support: Uses vision LLMs (llava, bakllava, moondream) to understand images
- Table extraction: Automatically extracts and processes tables using pdfplumber
- Image analysis: Visual understanding of charts, diagrams, and figures
- Conversation history: Maintains chat history with visual context
- Comprehensive RAG: Retrieves and uses text, images, and tables for answering questions
```
ChatWithPDFs/
├── TextPDFRag/
│   ├── app.py                        # Text-only PDF chat application
│   └── chroma_db/                    # Vector database storage
├── MultiModalPDFChat/
│   ├── app.py                        # Multimodal PDF chat application
│   ├── generate_multi_modal_pdf.py   # Utility to generate test PDFs
│   ├── chroma_db/                    # Vector database storage
│   └── pdfs/                         # Sample PDF files
└── README.md
```
- Python 3.8+
- Ollama installed and running
- Download from ollama.ai
- Install required models:
  ```bash
  # For TextPDFRag
  ollama pull llama3
  ollama pull nomic-embed-text

  # For MultiModalPDFChat (vision model)
  ollama pull llava
  ollama pull nomic-embed-text
  ```
1. Clone the repository (or navigate to the project directory)

2. Install Python dependencies:

   ```bash
   pip install streamlit pypdf langchain-ollama langchain-community langchain-core chromadb pymupdf pillow pdfplumber pandas
   ```

   Or create a `requirements.txt` and install:

   ```bash
   pip install -r requirements.txt
   ```
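For reference, a `requirements.txt` listing the same packages (versions left unpinned, mirroring the pip command above):

```text
streamlit
pypdf
langchain-ollama
langchain-community
langchain-core
chromadb
pymupdf
pillow
pdfplumber
pandas
```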
1. Start the application:

   ```bash
   cd TextPDFRag
   streamlit run app.py
   ```

2. Use the application:

   - Upload a PDF file through the web interface
   - Wait for the PDF to be processed (text extraction and vectorization)
   - Ask questions about the PDF content
   - Type 'quit' or 'exit' to end the conversation
1. Start the application:

   ```bash
   cd MultiModalPDFChat
   streamlit run app.py
   ```

2. Use the application:

   - Upload a PDF file (it can contain text, images, and tables)
   - Wait for processing (extracts text, images, and tables)
   - Ask questions about the PDF
   - The system will retrieve relevant text, images, and tables
   - View the conversation history with images and tables displayed
   - Type 'quit' or 'exit' to end the conversation
3. Generate test PDFs (optional):

   ```bash
   cd MultiModalPDFChat
   python generate_multi_modal_pdf.py
   ```

   This creates a sample PDF with text, images, and tables for testing.
- Streamlit: Web application framework
- LangChain: LLM application framework
  - `langchain-ollama`: Ollama integration
  - `langchain-community`: Community integrations (ChromaDB)
  - `langchain-core`: Core LangChain functionality
- Ollama: Local LLM inference server
  - Models: `llama3`, `llava`, `nomic-embed-text`
- ChromaDB: Vector database for embeddings storage
- PyPDF: PDF text extraction
- PyMuPDF (fitz): PDF image extraction
- pdfplumber: PDF table extraction
- Pillow (PIL): Image processing
- Pandas: Table data manipulation
A simple RAG implementation that:
- Extracts text from PDFs
- Splits text into chunks
- Creates embeddings and stores in ChromaDB
- Retrieves relevant chunks based on queries
- Generates answers using the retrieved context
Best for: Text-heavy PDFs like research papers, articles, documentation
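The chunking step at the heart of this pipeline can be sketched without any dependencies. The function below is a simplified stand-in for LangChain's `RecursiveCharacterTextSplitter`; the chunk size and overlap values are illustrative, not the app's actual settings:

```python
def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into overlapping chunks so that context spanning a
    chunk boundary is not lost between neighbouring chunks."""
    chunks = []
    start = 0
    step = chunk_size - overlap  # advance less than a full chunk to overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk would then be embedded (e.g. with nomic-embed-text) and
# stored in ChromaDB along with metadata such as the source page number.
doc = "x" * 1000
print(len(split_text(doc)))  # 3 chunks: [0:500], [450:950], [900:1000]
```

The overlap matters: a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, so retrieval can find it.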
An advanced RAG implementation that:
- Extracts text, images, and tables from PDFs
- Creates separate document chunks for each modality
- Uses vision models to understand images
- Retrieves relevant text, images, and tables
- Generates comprehensive answers using all modalities
Best for: PDFs with charts, diagrams, tables, technical reports, presentations
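Internally, each modality becomes its own tagged chunk before embedding. A minimal sketch of that bookkeeping, using plain dicts as stand-ins for LangChain `Document` objects (the field names are illustrative, not the app's exact schema):

```python
def make_chunks(page_num, text="", image_desc="", table_rows=None):
    """Wrap extracted page content as typed chunks. An image is stored via
    the vision model's textual description of it, so it can be embedded
    and searched like any other text."""
    chunks = []
    if text:
        chunks.append({"content": text,
                       "metadata": {"page": page_num, "type": "text"}})
    if image_desc:
        chunks.append({"content": image_desc,
                       "metadata": {"page": page_num, "type": "image"}})
    if table_rows:
        # Serialise the table so its cells are visible to the embedder.
        body = "\n".join(" | ".join(str(c) for c in row) for row in table_rows)
        chunks.append({"content": body,
                       "metadata": {"page": page_num, "type": "table"}})
    return chunks

chunks = make_chunks(1, text="Revenue grew 12%.",
                     image_desc="Bar chart of quarterly revenue.",
                     table_rows=[["Q1", 100], ["Q2", 112]])
print([c["metadata"]["type"] for c in chunks])  # ['text', 'image', 'table']
```

The `type` metadata is what lets the UI later render a retrieved table as a table and a retrieved image description alongside its image, rather than as undifferentiated text.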
1. Document Processing:
   - Extract content (text/images/tables) from PDF
   - Split into manageable chunks
   - Create embeddings for each chunk
2. Storage:
   - Store embeddings in ChromaDB vector database
   - Maintain metadata (page numbers, content type, etc.)
3. Retrieval:
   - Convert user query to embedding
   - Find similar chunks using vector similarity search
   - Retrieve top-k most relevant chunks
4. Generation:
   - Augment LLM prompt with retrieved context
   - Generate answer using the context
   - For multimodal: Include images in vision model input
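The retrieval step reduces to nearest-neighbour search over embedding vectors. A toy, dependency-free version of what ChromaDB does under the hood (the 2-D vectors below are stand-ins for real, high-dimensional embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k chunk texts whose embeddings are most similar to the
    query embedding -- the core of vector similarity search."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [text for _, text in ranked[:k]]

store = [
    ([0.9, 0.1], "Table 2: quarterly revenue"),
    ([0.1, 0.9], "Figure 1: system architecture"),
    ([0.8, 0.2], "Revenue grew 12% year over year"),
]
# A query embedding close to the 'revenue' chunks:
print(top_k([1.0, 0.0], store))
```

The texts returned here are what gets spliced into the LLM prompt in the generation step; in the multimodal app, retrieved image chunks additionally carry the image itself for the vision model.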
- The vector databases are stored locally in `chroma_db/` directories
- Each application maintains its own separate vector database
- Vision models require more computational resources than text-only models
- Processing time depends on PDF size and complexity
Feel free to submit issues, fork the repository, and create pull requests for any improvements.
This project is for educational purposes.