PDFAssistant is an interactive tool that lets users ask questions about the content of PDF documents. It uses Streamlit, PyPDF2, and langchain to extract text from PDFs and answer queries against it.
- streamlit: For creating the web application
- dotenv: For loading environment variables
- PyPDF2: For PDF text extraction
- requests: For downloading PDFs from URLs
- langchain: For text splitting, embeddings, and the conversation chain
- Downloads a PDF from a given URL and stores it in memory.
- Appends failed URLs to the `failed_urls` list.
- Extracts text content from the PDF pages using a `PdfReader` object.
- Processes uploaded PDF documents and extracts text.
- Processes PDFs from given URLs and extracts text.
- Splits the PDF text into manageable chunks for embedding and retrieval.
- Initializes OpenAIEmbeddings and creates a vector store for text chunks.
- Sets up a conversation chain, utilizing a language model and a vector store.
- Displays messages in the Streamlit app, differentiating between user and bot.
- Handles user queries and displays the conversation in the Streamlit app.
- Main function that initializes the Streamlit app and orchestrates the functions above.
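The chunking step in the list above can be illustrated with a minimal sketch. It is a plain-Python stand-in, not the langchain text splitter the application uses; the function name `split_text` and the default `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
from typing import List


def split_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters.

    Hypothetical helper: a simplified stand-in for a library text splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each iteration
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which tends to help retrieval; the right sizes depend on the embedding model and are left as parameters here.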
- Running the Application:
  - Open a web browser and go to http://localhost:8501/.
  - You should see the PDFAssistant interface ready for interaction.
- Adding Core PDFs:
  - Open the file `core_pdfs.txt` from the project directory.
  - Copy the links listed in the file.
  - Paste these links into the "PDF URLs" box in the PDFAssistant interface.
- Adding Additional PDFs:
  - Users can add more PDFs by pasting additional links into the "PDF URLs" box.
  - Users can also upload PDF files directly through the user interface.
- Processing PDFs:
  - After adding the PDFs, click the "Process" button. PDFAssistant will extract and process text from the added PDFs.
- Querying PDFs:
  - Enter your questions in the query box to receive responses based on the content of the processed PDFs.
- The application includes error handling mechanisms to manage failed downloads and PDF processing issues.
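One way the failed-download handling could look is sketched below. This is not the application's actual code: the function names, the `fetch` parameter, the 30-second timeout, and the `%PDF` header check are all assumptions made for illustration. Only the use of `requests` and the `failed_urls` list come from the description above.

```python
import io
from typing import Callable, Dict, Iterable, List, Tuple


def _fetch_with_requests(url: str) -> bytes:
    """Default fetcher (hypothetical helper): download the body of `url` with requests."""
    import requests  # imported lazily so the sketch can be tried with a fake fetcher

    response = requests.get(url, timeout=30)  # timeout value is an assumption
    response.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    return response.content


def download_pdfs(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = _fetch_with_requests,
) -> Tuple[Dict[str, io.BytesIO], List[str]]:
    """Download each URL into memory and collect the URLs that could not be used."""
    downloaded: Dict[str, io.BytesIO] = {}
    failed_urls: List[str] = []
    for url in urls:
        try:
            data = fetch(url)
            # Simplistic sanity check (assumption): PDF files begin with "%PDF".
            if not data.startswith(b"%PDF"):
                raise ValueError("response does not look like a PDF")
            downloaded[url] = io.BytesIO(data)
        except Exception:
            # Deliberately broad: any failure marks the URL as failed so the
            # remaining URLs are still processed.
            failed_urls.append(url)
    return downloaded, failed_urls
```

Returning `failed_urls` alongside the in-memory files lets the interface report which links were skipped instead of aborting the whole batch on the first error.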
- Clone the Repository:

      git clone https://github.com/yourusername/PDFAssistant.git
      cd PDFAssistant

- Create a Virtual Environment:

      python3 -m venv env
      source env/bin/activate  # For Windows, use `env\Scripts\activate`

- Install Dependencies:

      pip install -r requirements.txt

  Ensure the `requirements.txt` file includes the following:

      streamlit
      python-dotenv
      PyPDF2
      requests
      langchain

- Set Up Environment Variables:
  - Create a `.env` file in the project directory.
  - Add necessary environment variables, such as API keys, if needed.

- Run the Application:

      streamlit run your_app_script.py

  The application should now be running at http://localhost:8501/.
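For the environment-variable step above, a small startup check can make a missing key fail early with a clear message. The sketch below is illustrative only: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions (the source only says "API keys"), and the `.env` file is assumed to sit in the directory the app is started from.

```python
import os


def load_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the value of environment variable `name`, loading .env first if possible.

    Hypothetical helper; the default variable name is an assumption.
    """
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package

        # Reads KEY=value pairs from ./.env into os.environ. Variables that are
        # already set are left unchanged, and a missing file is not an error.
        load_dotenv(dotenv_path=".env")
    except ImportError:
        # python-dotenv is not installed: rely on the existing environment.
        pass

    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to the .env file in the project directory"
        )
    return value
```

Calling a check like this once at the top of the main function surfaces a configuration problem before any PDF is processed.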