A Python chatbot that uses Retrieval-Augmented Generation (RAG) to answer questions about a folder of PDF documents. It combines semantic search over sentence-transformer embeddings with a T5 language model to generate answers grounded in the document content.
- Process multiple PDF documents from a folder
- Extract and process text from PDFs efficiently
- Split text into semantic chunks for better context
- Use sentence transformers for semantic search
- Generate accurate answers using a T5 model
- Command-line interface for easy interaction
- Detailed logging for monitoring and debugging
- Python 3.9+
- PyPDF2
- sentence-transformers
- transformers
- torch
- Other dependencies listed in requirements.txt
- Clone this repository:

      git clone <repository-url>
      cd 02-rag-bot

- Create and activate a virtual environment (recommended):

      python -m venv .venv
      source .venv/bin/activate # On Windows, use: .venv\Scripts\activate

- Install the required packages:

      pip install -r requirements.txt

You can use the RAG bot in two ways:

- With command-line arguments:

      python main.py --folder_path "path/to/your/pdf/folder" --question "Your question here?"

- Interactively. Simply run the script and follow the prompts:

      python main.py

The program will ask you to:
- Enter the path to your folder containing PDF files
- Enter your question about the PDF content
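The two modes above could share one entry point. The following is a minimal sketch, not the actual contents of `main.py`: the `--folder_path` and `--question` flags come from the usage shown above, while the function names `parse_args` and `resolve_inputs` and the prompt wording are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional flags; both may be omitted."""
    parser = argparse.ArgumentParser(
        description="Ask questions about a folder of PDFs"
    )
    parser.add_argument("--folder_path", help="Folder containing PDF files")
    parser.add_argument("--question", help="Question about the PDF content")
    return parser.parse_args(argv)


def resolve_inputs(argv=None, prompt=input):
    """Return (folder, question), prompting for whichever flag is missing.

    `prompt` is injectable so the interactive path can be exercised
    without a terminal (hypothetical design choice).
    """
    args = parse_args(argv)
    folder = args.folder_path or prompt("Enter the path to your PDF folder: ").strip()
    question = args.question or prompt("Enter your question: ").strip()
    return folder, question


if __name__ == "__main__":
    folder, question = resolve_inputs()
    print(f"Folder: {folder}")
    print(f"Question: {question}")
```

Falling back to prompts only for missing values means a user can also supply just one flag and be asked for the other.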
- PDF Processing: The system processes all PDF files in the specified folder using PyPDF2.
- Text Extraction: Text is extracted from each PDF and combined for processing.
- Text Chunking: The combined text is split into semantic chunks for efficient processing.
- Embedding Generation: Each text chunk is converted into embeddings using sentence transformers.
- Semantic Search: When a question is asked, the system finds the most relevant chunks using cosine similarity.
- Answer Generation: The relevant context and question are processed through a T5 model to generate an accurate answer.
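The chunking and semantic search steps above can be sketched with the standard library alone. This is an illustration of the idea, not the project's code: `toy_embed` is a word-count stand-in for the sentence-transformer embeddings, the chunker splits on word windows rather than semantic boundaries, and all function names and default sizes are assumptions.

```python
import math
import re
from collections import Counter


def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text):
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(question, chunks, k=3):
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    docs = ["the cat sat on the mat", "stock prices rose sharply", "a dog barked"]
    print(top_k_chunks("where did the cat sit", docs, k=1))
```

With a real embedding model the ranking logic stays the same; only `toy_embed` and the vector representation change.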
- Uses `google/flan-t5-xl` for question answering
- Employs `all-MiniLM-L6-v2` for semantic text similarity
- Implements efficient text chunking with configurable chunk sizes
- Processes multiple PDFs in a single folder
- Includes comprehensive logging for monitoring and debugging
- Handles PDF reading errors gracefully
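For orientation, here is one way the two models named above might be wired together. Treat it as a hedged sketch rather than the project's implementation: the function names, the prompt wording, `top_k`, and the token limits are assumptions, and the models are loaded on every call only to keep the example short. Note that `google/flan-t5-xl` has roughly 3 billion parameters, so the first run involves a large download.

```python
def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question (prompt wording is assumed)."""
    context = "\n".join(context_chunks)
    return (
        "Answer the question using the context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def rank_chunks(question, chunks, top_k=3):
    """Rank chunks by cosine similarity to the question."""
    # Imported lazily so the module loads without the heavy dependencies.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_emb = model.encode(chunks, convert_to_tensor=True)
    question_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(question_emb, chunk_emb)[0]  # one score per chunk
    best = scores.argsort(descending=True)[:top_k]
    return [chunks[int(i)] for i in best]


def generate_answer(question, context_chunks, max_new_tokens=128):
    """Generate an answer from the prompt with a T5-family model."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "google/flan-t5-xl"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    prompt = build_prompt(question, context_chunks)
    # Truncation length is an assumption; long contexts are cut off here.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

In a long-running session it would be more economical to load both models once and reuse them across questions.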
The system includes robust error handling for:
- Missing folders
- Invalid folder paths
- PDF reading errors
- No PDF files found in folder
- Processing errors
- Model generation errors
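The folder and PDF cases in this list might be handled along the following lines. This is a sketch under assumptions, not the project's code: `find_pdfs` and `extract_text` are hypothetical names, and the choice of exception types and of skipping unreadable files with a warning is illustrative.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def find_pdfs(folder_path):
    """Return the PDF files in a folder, or raise a descriptive error."""
    folder = Path(folder_path)
    if not folder.exists():
        raise FileNotFoundError(f"Folder not found: {folder}")
    if not folder.is_dir():
        raise NotADirectoryError(f"Not a folder: {folder}")
    pdfs = sorted(p for p in folder.iterdir() if p.suffix.lower() == ".pdf")
    if not pdfs:
        raise ValueError(f"No PDF files found in {folder}")
    return pdfs


def extract_text(pdf_paths):
    """Extract text from each PDF, skipping files that cannot be read."""
    from PyPDF2 import PdfReader  # imported lazily; needs PyPDF2 installed

    texts = []
    for path in pdf_paths:
        try:
            reader = PdfReader(str(path))
            # extract_text() can return None for pages without a text layer
            pages = [page.extract_text() or "" for page in reader.pages]
            texts.append("\n".join(pages))
        except Exception as exc:
            # One unreadable file should not abort the whole run
            logger.warning("Skipping %s: %s", path, exc)
    return "\n".join(texts)
```

Raising distinct exceptions for a missing folder, a non-folder path, and an empty folder lets the caller log a specific message for each case.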
Contributions are welcome! Please feel free to submit a Pull Request.
License: MIT
- PyPDF2 for PDF processing
- Hugging Face for the transformer models
- Sentence Transformers for semantic search capabilities