Submission for hackathon
This project implements a PDF chatbot that uses Large Language Model (LLM) integration and table extraction to provide intelligent responses to questions about PDF content. It combines vector search, keyword extraction, and table analysis to offer comprehensive answers.
- PDF text and table extraction
- Vector store creation for efficient similarity search
- Integration with Ollama for LLM-powered responses
- Dynamic query generation and multi-query processing
- Gradio-based user interface for easy interaction
- Python 3.7+
- PyPDF2
- pytesseract
- pdf2image
- langchain
- langchain-community
- sentence-transformers
- gradio
- pandas
- ollama
- tabula-py
- chromadb
- Clone this repository and change into its directory:
  git clone https://github.com/Readyaddy/hilabs_rag.git
  cd hilabs_rag
- Install the required packages:
  pip install -U PyPDF2 pytesseract pdf2image langchain langchain-community gradio pandas ollama tabula-py chromadb sentence-transformers
- Install Java if it is not already installed (tabula-py requires it): https://www.java.com/en/download/manual.jsp
- Install Tesseract OCR on your system if not already installed:
  - For Ubuntu: sudo apt-get install tesseract-ocr
  - For macOS: brew install tesseract
  - For Windows: download and install from https://github.com/UB-Mannheim/tesseract/wiki
- Install Poppler-utils on your system if not already installed:
  - For Ubuntu: sudo apt-get install poppler-utils
  - For macOS: brew install poppler
  - For Windows: download the zip from https://github.com/oschwartz10612/poppler-windows/releases, extract it, and add the bin directory of the extracted folder to your system’s PATH.
- Install Ollama following the instructions on the official website.
- Start the model from a terminal (CMD on Windows):
  ollama run gemma2:2b
- Run the script:
  python pdf_chatbot.py
- Open the Gradio interface in your web browser (the URL will be displayed in the console).
- In the "Process PDF" tab:
  - Upload a PDF file
  - Click "Process PDF"
  - Wait for the "PDF processed successfully!" message
- Switch to the "Ask Questions" tab:
  - Enter your question about the PDF content
  - Click "Ask"
  - View the JSON response containing the answer and related information
- PDF Processing:
  - Extracts text using PyPDF2
  - Extracts tables using tabula
  - Creates a vector store using Chroma and HuggingFace embeddings
- Question Answering:
  - Checks for specific codes or names in the question
  - Extracts keywords from the question
  - Performs unified retrieval combining similarity search and keyword-based search
  - Generates an initial answer using the LLM
- Answer Quality Evaluation:
  - Evaluates the quality of the initial answer
  - If the quality is insufficient, generates multiple related queries
  - Processes each query and combines the results for a comprehensive answer
- LLM Integration:
  - Uses Ollama to generate responses, extract keywords, and evaluate answer quality
- User Interface:
  - Provides a Gradio-based interface for easy PDF processing and question answering
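The answer quality evaluation step can be sketched as a small control loop. This is a minimal sketch, not the code in pdf_chatbot.py: the function name `answer_with_fallback` and its four callables (`retrieve`, `generate`, `is_good_enough`, `expand_queries`) are hypothetical stand-ins for the retrieval and LLM calls, and the sample strings are made up.

```python
from typing import Callable, List


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_good_enough: Callable[[str, str], bool],
    expand_queries: Callable[[str], List[str]],
) -> str:
    """Answer once; if the answer is judged weak, retry with related queries."""
    docs = retrieve(question)
    answer = generate(question, docs)
    if is_good_enough(question, answer):
        return answer

    # Fallback: collect documents for each related query, skipping duplicates
    # and keeping first-seen order, then answer the original question again.
    merged = list(docs)
    for query in expand_queries(question):
        for doc in retrieve(query):
            if doc not in merged:
                merged.append(doc)
    return generate(question, merged)


# Stub components (made-up data) so the sketch runs without an LLM or vector store.
index = {
    "What does X12 mean?": [],
    "X12 definition": ["X12: sample entry"],
    "X12 table row": ["X12: sample entry", "X12 appears in a table"],
}
result = answer_with_fallback(
    "What does X12 mean?",
    retrieve=lambda q: index.get(q, []),
    generate=lambda q, docs: " | ".join(docs) if docs else "I don't know",
    is_good_enough=lambda q, a: a != "I don't know",
    expand_queries=lambda q: ["X12 definition", "X12 table row"],
)
print(result)
```

Passing the components in as callables keeps the loop independent of any particular LLM or vector store, which makes it easy to try with stubs as above.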
Most of the important data in the PDF was in tables. We started with plain text extraction, but the tables lost their formatting when extracted that way.
We then tried OCR, which preserved the table layout but took too much time. To speed up the process, we switched to Tabula for table extraction and PyPDF2 for regular text extraction.
Our approach works as follows:
- Code Detection: Codes were the most important items in the tables because they are used to define the entries listed there. We first check whether the question contains a specific code. If one is found, we extract it, search for it in the tables, and put the matches in the prompt sent to the LLM.
- Unified Retrieval: If no code is detected, we look for specific names like diseases or other entities that might be present in the tables. We then perform a unified retrieval, combining both semantic retrieval and keyword retrieval.
- LLM Integration: All relevant documents retrieved are passed to the LLM for generating a response.
- Multi-Query Retrieval: If the initial response is not satisfactory, we perform multi-query retrieval on the PDF to improve the answer quality.
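The routing between code detection and unified retrieval described above could look roughly like this. It is a sketch under stated assumptions: the code pattern (one to three capital letters followed by digits), the table layout (lists of row dictionaries), and all function names are hypothetical, and falling back to unified retrieval when a detected code has no matching row is also an assumption.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumption: a code is 1-3 capital letters followed by 2-5 digits, e.g. "HB204".
# The real format depends on the PDF being processed.
CODE_PATTERN = re.compile(r"\b[A-Z]{1,3}\d{2,5}\b")

Row = Dict[str, str]


def find_code(question: str) -> Optional[str]:
    """Return the first code-like token in the question, or None."""
    match = CODE_PATTERN.search(question)
    return match.group(0) if match else None


def rows_matching_code(tables: List[List[Row]], code: str) -> List[Row]:
    """Return every row, across all tables, that has a cell equal to the code."""
    return [row for table in tables for row in table if code in row.values()]


def unified_retrieval(semantic_hits: List[str], keyword_hits: List[str], limit: int = 5) -> List[str]:
    """Merge both result lists, dropping duplicates and keeping first-seen order."""
    return list(dict.fromkeys(semantic_hits + keyword_hits))[:limit]


def select_context(
    question: str,
    tables: List[List[Row]],
    semantic_search: Callable[[str], List[str]],
    keyword_search: Callable[[str], List[str]],
) -> List[str]:
    """Prefer table rows for a detected code; otherwise use unified retrieval."""
    code = find_code(question)
    if code:
        rows = rows_matching_code(tables, code)
        if rows:
            return ["; ".join(f"{k}={v}" for k, v in row.items()) for row in rows]
    return unified_retrieval(semantic_search(question), keyword_search(question))
```

The returned strings are what would be placed in the LLM prompt as context.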
- To use a different LLM, modify the LLMServer class in the script
- Adjust the chunk_size and chunk_overlap in the create_vectorstore function for different text splitting behavior
- Modify the prompts in various functions to change the LLM's behavior
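To get a feel for what chunk_size and chunk_overlap control, here is a plain-Python illustration of fixed-size chunking with overlap. It is not the splitter used in create_vectorstore, which may behave differently (for example by preferring to break on paragraph or sentence boundaries); the function name `split_text` is hypothetical.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters; neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
print(split_text("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
```

A larger overlap repeats more text between neighbouring chunks, which helps when an answer straddles a chunk boundary but increases the number of chunks to embed and store.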
- The current implementation assumes English language content
- Performance may vary depending on the PDF structure and content complexity
- Large PDFs may require significant processing time and memory
Contributions are welcome! Please feel free to submit a Pull Request.
This project is open-source and available under the MIT License.
