hilabs_rag

Submission for hackathon

PDF Chatbot with LLM and Table Extraction

This project implements a PDF chatbot that uses Large Language Model (LLM) integration and table extraction to provide intelligent responses to questions about PDF content. It combines vector search, keyword extraction, and table analysis to offer comprehensive answers.

Features

PDF text and table extraction
Vector store creation for efficient similarity search
Integration with Ollama for LLM-powered responses
Dynamic query generation and multi-query processing
Gradio-based user interface for easy interaction

Requirements

Python 3.7+
PyPDF2
pytesseract
pdf2image
langchain
langchain-community
sentence-transformers
gradio
pandas
ollama
tabula-py
chromadb

Installation

Clone this repository:

git clone https://github.com/Readyaddy/hilabs_rag.git
cd hilabs_rag
Install the required packages:

pip install PyPDF2 pytesseract pdf2image langchain gradio pandas ollama tabula-py chromadb sentence-transformers -U langchain-community
Install Java on your system if not already installed (tabula-py requires it), from (https://www.java.com/en/download/manual.jsp)
Install Tesseract OCR on your system if not already installed:
- For Ubuntu: sudo apt-get install tesseract-ocr
- For macOS: brew install tesseract
- For Windows: Download and install from (https://github.com/UB-Mannheim/tesseract/wiki)
Install Poppler-utils on your system if not already installed:
- For Ubuntu: sudo apt-get install poppler-utils
- For macOS: brew install poppler
- For Windows: Download and install from (https://github.com/oschwartz10612/poppler-windows/releases), extract the zip and add the bin directory of the extracted folder to your system’s PATH.
Install Ollama following the instructions on the official website.
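
Before running the chatbot, it can help to confirm that the external tools are reachable from Python. The following is a minimal sketch, not part of the project: the helper name find_missing_tools is hypothetical, and the executable names (java, tesseract, pdftoppm from Poppler, ollama) are assumptions about how the tools are exposed on your PATH.

```python
import shutil

# Executable names are assumptions: "java" is needed by tabula-py,
# "tesseract" by pytesseract, "pdftoppm" ships with Poppler (used by
# pdf2image), and "ollama" is the Ollama command-line tool.
REQUIRED_TOOLS = ["java", "tesseract", "pdftoppm", "ollama"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool is reported missing on Windows, the usual cause is that its bin directory has not been added to PATH.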

Usage

In a terminal, start the model:

ollama run gemma2:2b
Run the script: python pdf_chatbot.py
Open the Gradio interface in your web browser (the URL will be displayed in the console).
In the "Process PDF" tab:
- Upload a PDF file
- Click "Process PDF"
- Wait for the "PDF processed successfully!" message
Switch to the "Ask Questions" tab:
- Enter your question about the PDF content
- Click "Ask"
- View the JSON response containing the answer and related information

How It Works

PDF Processing:
- Extracts text using PyPDF2
- Extracts tables using tabula
- Creates a vector store using Chroma and HuggingFace embeddings
Question Answering:
- Checks for specific codes or names in the question
- Extracts keywords from the question
- Performs unified retrieval combining similarity search and keyword-based search
- Generates an initial answer using the LLM
Answer Quality Evaluation:
- Evaluates the quality of the initial answer
- If the quality is insufficient, generates multiple related queries
- Processes each query and combines the results for a comprehensive answer
LLM Integration:
- Uses Ollama to generate responses, extract keywords, and evaluate answer quality
User Interface:
- Provides a Gradio-based interface for easy PDF processing and question answering
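
The unified retrieval step can be illustrated with a small sketch. This is not the project's code: the function names are hypothetical, semantic_search stands in for the vector store's similarity search, and plain substring matching stands in for the keyword-based search. It only shows one plausible way to merge the two result lists without duplicates.

```python
from typing import Callable, List, Sequence


def keyword_search(keywords: Sequence[str], chunks: Sequence[str]) -> List[str]:
    """Return chunks containing at least one keyword, most matches first."""
    lowered = [k.lower() for k in keywords]
    scored = []
    for chunk in chunks:
        text = chunk.lower()
        hits = sum(1 for k in lowered if k in text)
        if hits:
            scored.append((hits, chunk))
    # sorted() is stable, so chunks with equal hit counts keep their order.
    return [chunk for _, chunk in sorted(scored, key=lambda pair: -pair[0])]


def unified_retrieve(
    question: str,
    keywords: Sequence[str],
    semantic_search: Callable[[str], List[str]],
    chunks: Sequence[str],
    k: int = 4,
) -> List[str]:
    """Merge semantic hits and keyword hits, dropping duplicates."""
    merged: List[str] = []
    candidates = list(semantic_search(question)) + keyword_search(keywords, chunks)
    for chunk in candidates:
        if chunk not in merged:
            merged.append(chunk)
    return merged[:k]
```

Placing the semantic hits first is a design choice made for this sketch; a real implementation might instead interleave or re-rank the two lists.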

Why This Approach

Most of the important data in the PDF was in the form of tables. We initially extracted everything as plain text, but the tables were not formatted correctly in the extracted output.

We then tried OCR, which formatted the tables correctly but took too much time. To speed up the process, we moved to Tabula for table extraction and PyPDF2 for regular text extraction.

Our approach works as follows:

Code Detection: Codes are the most important entries in the tables, since they are used to define things. We first check whether the question contains any specific codes. If a code is found, we extract it, search for it within the tables, and put the matching entries in the prompt sent to the LLM.
Unified Retrieval: If no code is detected, we look for specific names like diseases or other entities that might be present in the tables. We then perform a unified retrieval, combining both semantic retrieval and keyword retrieval.
LLM Integration: All relevant documents retrieved are passed to the LLM for generating a response.
Multi-Query Retrieval: If the initial response is not satisfactory, we perform multi-query retrieval on the PDF to improve the answer quality.
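
The routing described in the steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the code pattern is a placeholder (the real codes may look different), tables are modelled as lists of dictionaries rather than DataFrames, and retrieve, generate, is_satisfactory, and related_queries are hypothetical callables standing in for the vector store and the LLM.

```python
import re
from typing import Callable, Dict, List

# Placeholder pattern (an assumption): one uppercase letter followed by
# two or more digits, e.g. "A123". Adjust to the codes in your PDF.
CODE_PATTERN = re.compile(r"\b[A-Z]\d{2,}\b")

Row = Dict[str, str]


def find_codes(question: str) -> List[str]:
    """Return every code-like token found in the question."""
    return CODE_PATTERN.findall(question)


def lookup_rows(code: str, tables: List[List[Row]]) -> List[Row]:
    """Return every row, from any table, in which some cell equals the code."""
    return [row for table in tables for row in table if code in row.values()]


def answer_question(
    question: str,
    tables: List[List[Row]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_satisfactory: Callable[[str, str], bool],
    related_queries: Callable[[str], List[str]],
) -> str:
    codes = find_codes(question)
    if codes:
        # Code path: use the matching table rows as context.
        rows = [row for code in codes for row in lookup_rows(code, tables)]
        context = [", ".join(f"{k}: {v}" for k, v in row.items()) for row in rows]
    else:
        # No code found: fall back to retrieval over the document.
        context = list(retrieve(question))

    answer = generate(question, context)
    if is_satisfactory(question, answer):
        return answer

    # Multi-query fallback: widen the context with related queries.
    for query in related_queries(question):
        for doc in retrieve(query):
            if doc not in context:
                context.append(doc)
    return generate(question, context)
```

Passing the retrieval and LLM steps in as callables keeps the control flow easy to test with simple stubs before wiring in Ollama.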

Customization

To use a different LLM, modify the LLMServer class in the script
Adjust the chunk_size and chunk_overlap in the create_vectorstore function for different text splitting behavior
Modify the prompts in various functions to change the LLM's behavior
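
To get a feel for how chunk_size and chunk_overlap interact, here is a simplified character-based splitter. It is a stand-in written for illustration only (split_text is a hypothetical name) and does not reproduce the behavior of the text splitter used in create_vectorstore, which may also split on separators.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap keeps more shared context between neighboring chunks at the cost of more chunks to embed and store.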

Limitations

The current implementation assumes English language content
Performance may vary depending on the PDF structure and content complexity
Large PDFs may require significant processing time and memory

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is open-source and available under the MIT License.
