TutorAI is an AI-driven document retrieval and question-answering system that uses OpenAI's GPT models, LangChain, and ChromaDB to answer questions about uploaded PDF documents. It combines text chunking, embedding generation, and similarity-based retrieval so that users can query their documents directly.
- Introduction
- Features
- Technologies Used
- Setup Instructions
- How It Works
- Usage
- Visualization
- Future Enhancements
- License
TutorAI is designed to help users query their documents in natural language. Upload academic papers, reports, or books, and TutorAI splits them into manageable chunks, embeds them, and answers questions about them. Answers are generated from the retrieved context and reference the original document.
- PDF Upload and Parsing: Load and process multiple PDFs seamlessly.
- Text Chunking: Splits large documents into manageable and context-rich chunks.
- Embeddings with OpenAI: Utilizes `text-embedding-3-small` for robust embedding generation.
- Cosine Similarity: Measures the similarity between document chunks for high-quality retrieval.
- Document Visualization: Visualizes embeddings in 2D using PCA for better understanding.
- Multi-Query Retrieval: Generates diverse question formulations to improve search accuracy.
- Context-Based QA: Answers questions strictly based on the provided document context, ensuring precision.
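To make the chunking step concrete, here is a minimal, dependency-free sketch of splitting text into overlapping character windows. It is a simplified stand-in, not the project's code: LangChain's RecursiveCharacterTextSplitter additionally prefers to break on separators such as paragraph and line breaks. The function name `split_text` and the default sizes are illustrative assumptions.

```python
def split_text(text, chunk_size=200, chunk_overlap=40):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; the sizes are assumptions,
    not values taken from TutorAI.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means that a sentence cut at a chunk boundary still appears intact in at least one neighboring chunk, which tends to help retrieval.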
- Python: Core programming language.
- LangChain: Framework for building modular applications using LLMs.
- OpenAI GPT Models: `gpt-4o-mini` and `text-embedding-3-small`.
- ChromaDB: Vector database for efficient storage and retrieval of document embeddings.
- Scikit-Learn: PCA for dimensionality reduction and cosine similarity calculations.
- Matplotlib: For data visualization.
Follow these steps to set up and run the project on your local system:
- Clone the Repository: Run `git clone https://github.com/yourusername/TutorAI.git`, then `cd TutorAI`.
- Install Dependencies: Ensure Python 3.8+ is installed, then run `pip install -r requirements.txt`.
- Set Up OpenAI API Key: Create a `.env` file in the project root and add your OpenAI API key: `OPENAI_API_KEY=your_openai_api_key`.
- Load PDF Files: Place your PDFs in the `TutorAI_Data/` directory.
- Run the Project: Execute the script with `python main.py`.
- Document Loading:
- Loads all PDFs from the specified directory.
- Text Splitting:
- Uses LangChain's RecursiveCharacterTextSplitter to split text into chunks.
- Embedding Generation:
- Embeds chunks using OpenAI's `text-embedding-3-small`.
- Similarity Computation:
- Applies cosine similarity for efficient information retrieval.
- Query Handling:
- Processes user questions, formulates diverse variations, and retrieves the best-matching context.
- Answer Generation:
- Answers are generated strictly based on the retrieved context.
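The similarity and query-handling steps above can be sketched with toy vectors in plain Python. This is an illustration under stated assumptions, not TutorAI's implementation: in the project, embeddings come from OpenAI and are stored in ChromaDB, and the helper names (`cosine_similarity`, `top_k`, `multi_query_retrieve`) are hypothetical. Merging the results of several query variants as a de-duplicated union is also an assumption about how the variants are combined.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return indices of the k chunks most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), idx)
         for idx, vec in enumerate(chunk_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


def multi_query_retrieve(query_vecs, chunk_vecs, k=2):
    """Retrieve for each query variant and keep the unique union, in order."""
    merged = []
    for query_vec in query_vecs:
        for idx in top_k(query_vec, chunk_vecs, k):
            if idx not in merged:
                merged.append(idx)
    return merged
```

For example, with chunk vectors `[[1, 0], [0, 1], [1, 1]]` and two query variants `[1, 0]` and `[0, 1]`, each variant retrieves a different nearest chunk, and the union covers all three chunks.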
- Upload PDFs to `TutorAI_Data/`.
- Modify and run `chain.invoke()` to start querying your documents.
- Visualize embeddings using the provided PCA functionality.
- Retrieve answers with document references for traceability.
This project includes a visualization of document embeddings using PCA for a 2D projection. Below is an example scatter plot showing embeddings distributed across two principal components:
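To show what a 2D PCA projection computes, here is a small dependency-free sketch. It is not the project's code: TutorAI uses scikit-learn's PCA, which relies on SVD-based solvers, whereas this sketch uses power iteration with deflation on the covariance matrix purely for illustration. The function names are hypothetical.

```python
import math


def _mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def _top_eigenpair(matrix, iterations=200):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    dim = len(matrix)
    # Deterministic, non-uniform start vector. A start that is exactly
    # orthogonal to the dominant eigenvector would not converge to it.
    vec = [1.0 / (i + 1) for i in range(dim)]
    start_norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / start_norm for x in vec]
    for _ in range(iterations):
        nxt = _mat_vec(matrix, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # matrix maps the vector to zero; nothing left to find
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * w for v, w in zip(vec, _mat_vec(matrix, vec)))
    return vec, eigenvalue


def pca_2d(rows):
    """Project rows (at least two equal-length vectors) onto two components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]

    components = []
    for _ in range(2):
        vec, val = _top_eigenpair(cov)
        components.append(vec)
        # Deflation: remove the found component so the next pass
        # converges to the next-largest direction of variance.
        cov = [[cov[i][j] - val * vec[i] * vec[j]
                for j in range(dim)] for i in range(dim)]

    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]
```

Each returned pair is one point of the scatter plot: the first coordinate lies along the direction of greatest variance, the second along the next one.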
- Add support for more file formats (e.g., DOCX, TXT).
- Enable real-time querying through a web-based interface.
- Expand multi-query retrieval with more capable LLMs.
- Implement user-friendly feedback loops for iterative improvements.
