A Python-based chatbot that answers questions about the content of PDF documents using semantic search and a transformer-based question-answering model.
- Extract text from PDF documents
- Process and chunk text for efficient analysis
- Use semantic search to find relevant content
- Generate accurate answers to questions about the PDF content
- Command-line interface for easy interaction
- Python 3.7+
- PyPDF2
- sentence-transformers
- transformers
- torch
- Other dependencies listed in requirements.txt
- Clone this repository:
    git clone <repository-url>
    cd 01-pdf-chatbot
- Create and activate a virtual environment (recommended):
    python -m venv .venv
    source .venv/bin/activate # On Windows, use: .venv\Scripts\activate
- Install the required packages:
    pip install -r requirements.txt
You can use the chatbot in two ways:
Pass the PDF path and the question as command-line arguments:
    python main.py --pdf_path "path/to/your/document.pdf" --question "Your question here?"
Or simply run the script and follow the prompts:
    python main.py
The program will ask you to:
- Enter the path to your PDF file
- Enter your question about the PDF content
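The two usage modes could be wired together roughly as follows. This is a minimal sketch, not the actual main.py: the flag names come from the example command above, while `parse_args`, `resolve_inputs`, `main`, and the prompt wording are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional flags; both may be omitted."""
    parser = argparse.ArgumentParser(description="Ask questions about a PDF document.")
    parser.add_argument("--pdf_path", help="Path to the PDF file")
    parser.add_argument("--question", help="Question about the PDF content")
    return parser.parse_args(argv)


def resolve_inputs(args, prompt=input):
    """Return (pdf_path, question), prompting for whichever value is missing."""
    pdf_path = args.pdf_path or prompt("Enter the path to your PDF file: ").strip()
    question = args.question or prompt("Enter your question about the PDF content: ").strip()
    return pdf_path, question


def main(argv=None):
    """Hypothetical entry point: collect inputs, then hand them to the pipeline."""
    pdf_path, question = resolve_inputs(parse_args(argv))
    print(f"PDF: {pdf_path}")
    print(f"Question: {question}")
```

Passing `prompt` as a parameter keeps the interactive path testable without a terminal.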
- Text Extraction: The system extracts text from the provided PDF file using PyPDF2.
- Text Processing: The extracted text is split into manageable chunks for efficient processing.
- Semantic Search: When a question is asked, the system uses sentence transformers to find the most relevant text chunk.
- Answer Generation: The most relevant chunk and the question are passed to a FLAN-T5 model, which generates the answer.
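The chunking and semantic search steps can be sketched as below. This is an illustration under stated assumptions, not the project's actual code: word-based chunks with overlap, cosine similarity as the relevance score, and the names `chunk_text`, `most_relevant_chunk`, and `sentence_transformer_embed` are all hypothetical. The embedding function is passed in so the search logic can be tried without downloading a model.

```python
import math


def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into chunks of at most chunk_size words (assumed strategy)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
        start += chunk_size - overlap
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_relevant_chunk(question, chunks, embed):
    """Return the chunk whose embedding is most similar to the question's.

    `embed` maps a list of strings to a list of equal-length vectors.
    """
    if not chunks:
        raise ValueError("no chunks to search")
    vectors = embed([question] + chunks)
    query, candidates = vectors[0], vectors[1:]
    scores = [cosine_similarity(query, vec) for vec in candidates]
    return chunks[scores.index(max(scores))]


def sentence_transformer_embed(model_name="all-MiniLM-L6-v2"):
    """Build an `embed` callable backed by sentence-transformers."""
    from sentence_transformers import SentenceTransformer  # heavy, so imported lazily

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(texts).tolist()
```

With the real model, usage would look like `most_relevant_chunk(question, chunk_text(pdf_text), sentence_transformer_embed())`.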
- Uses google/flan-t5-xl for question answering
- Employs all-MiniLM-L6-v2 for semantic text similarity
- Implements efficient text chunking with configurable chunk sizes
- Handles PDF reading errors gracefully
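For reference, answer generation with google/flan-t5-xl might look like the sketch below. The prompt template, the function names `build_prompt` and `generate_answer`, and the length limits are assumptions, since the project's actual prompt and generation settings are not documented here. Only standard Hugging Face `transformers` calls are used.

```python
def build_prompt(context, question):
    """Combine the retrieved chunk and the question into one model input (assumed template)."""
    return (
        "Answer the question using the context.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )


def generate_answer(context, question, model_name="google/flan-t5-xl", max_new_tokens=128):
    """Generate an answer with a FLAN-T5 checkpoint.

    The first call downloads the model; flan-t5-xl has roughly 3 billion parameters.
    """
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # heavy, so imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(
        build_prompt(context, question),
        return_tensors="pt",
        truncation=True,   # assumed limit: cut inputs that exceed max_length
        max_length=512,
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

For a quick trial on modest hardware, a smaller checkpoint such as google/flan-t5-small can be passed as `model_name`, at the cost of answer quality.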
The system includes robust error handling for:
- Missing PDF files
- PDF reading errors
- Invalid input
- Processing errors
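One way these cases could be handled during text extraction is sketched below. It assumes PyPDF2 2.x or later (which provides `PdfReader` and `page.extract_text()`); the `PDFExtractionError` class and the `extract_text` function are hypothetical names, not necessarily those used in this project.

```python
import os


class PDFExtractionError(Exception):
    """Raised when a PDF cannot be read or yields no extractable text."""


def extract_text(pdf_path):
    """Return the text of all pages, raising a specific error for each failure case."""
    # Invalid input: empty or whitespace-only path.
    if not pdf_path or not pdf_path.strip():
        raise ValueError("PDF path must not be empty")
    # Missing PDF file.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")

    from PyPDF2 import PdfReader  # imported lazily so the checks above need no dependency

    # PDF reading errors: corrupt, encrypted, or otherwise unreadable files.
    try:
        reader = PdfReader(pdf_path)
        pages = [page.extract_text() or "" for page in reader.pages]
    except Exception as exc:
        raise PDFExtractionError(f"Could not read {pdf_path}: {exc}") from exc

    # Processing errors: for example, a scanned PDF with no text layer.
    text = "\n".join(pages).strip()
    if not text:
        raise PDFExtractionError(f"No extractable text found in {pdf_path}")
    return text
```

A caller can then catch `ValueError`, `FileNotFoundError`, and `PDFExtractionError` separately and print a clear message for each instead of a traceback.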
Contributions are welcome! Please feel free to submit a Pull Request.
MIT
- PyPDF2 for PDF processing
- Hugging Face for the transformer models
- Sentence Transformers for semantic search capabilities