DocuMind is an intelligent, client-side web application that leverages the power of Google's state-of-the-art Large Language Models (LLMs) to unlock insights from your documents. Upload PDFs or images and engage in a dynamic conversation to summarize, query, and understand your content like never before.
- ✨ Core AI Capabilities
- 📸 Application Preview
- 🚀 How It Works: The AI Pipeline
- 🛠️ Technology Showcase
- ⚙️ Getting Started
- 🧑‍💻 Usage Guide
- 👤 Author
## ✨ Core AI Capabilities

DocuMind's intelligence is driven entirely by Google's `gemini-2.5-flash` model. This section details how we harness its advanced LLM capabilities to deliver a powerful user experience.
At its core, DocuMind utilizes a Vision Language Model (VLM). This means it doesn't just read text; it can see and interpret visual information.
- For PDFs: We extract the raw text client-side and feed it to the LLM as a rich textual context.
- For Images: Whether it's a scanned document, an invoice with a chart, or a diagram, the image is sent directly to the Gemini API. The LLM's vision capabilities perform on-the-fly Optical Character Recognition (OCR) and spatial understanding to interpret the content.
This multimodal approach allows you to ask questions about a bar chart in an annual report or get the total from a photographed invoice as seamlessly as querying a text-based document.
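As a rough sketch of how such a multimodal request might be assembled — the `text`/`inlineData` part shapes follow the Gemini API's content-part convention, but the `buildParts` helper and its types are illustrative, not taken from the project:

```typescript
// Shapes loosely mirroring the Gemini API's content parts: a text
// part for extracted PDF text, an inlineData part for image bytes.
type TextPart = { text: string };
type ImagePart = { inlineData: { mimeType: string; data: string } };
type Part = TextPart | ImagePart;

// Build the content parts for a document plus a user question.
// PDFs contribute their extracted text; images contribute base64 data.
function buildParts(
  doc:
    | { kind: "pdf"; text: string }
    | { kind: "image"; mimeType: string; base64: string },
  question: string
): Part[] {
  const docPart: Part =
    doc.kind === "pdf"
      ? { text: `Document content:\n${doc.text}` }
      : { inlineData: { mimeType: doc.mimeType, data: doc.base64 } };
  return [docPart, { text: question }];
}

const parts = buildParts(
  { kind: "image", mimeType: "image/png", base64: "iVBORw0KGgo..." },
  "What is the total on this invoice?"
);
console.log(parts.length); // 2
```

Because the model receives the image itself rather than pre-extracted text, the same request shape covers charts, scans, and diagrams.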
Instead of using basic extraction algorithms, DocuMind prompts the LLM to perform abstractive summarization.
We instruct the model:

> "You are an expert summarizer. Provide a concise, easy-to-read summary... capturing the key points and main ideas."
The LLM reads the entire document context (text or image) and generates a new, coherent summary in natural language, formatted beautifully in markdown for maximum readability.
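A minimal sketch of how that instruction might be combined with the document before the API call — the system instruction is quoted from above; the `buildSummaryPrompt` function and its wiring are illustrative assumptions:

```typescript
// Compose the abstractive-summarization prompt: a fixed system
// instruction followed by the full document text as context.
const SUMMARY_INSTRUCTION =
  "You are an expert summarizer. Provide a concise, easy-to-read " +
  "summary in markdown, capturing the key points and main ideas.";

function buildSummaryPrompt(documentText: string): string {
  return `${SUMMARY_INSTRUCTION}\n\nDocument:\n${documentText}`;
}

const prompt = buildSummaryPrompt("Q3 revenue grew 12% year over year...");
console.log(prompt.startsWith("You are an expert summarizer")); // true
```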
The chat interface is the most powerful feature, employing a Retrieval-Augmented Generation (RAG) pattern to ensure factual accuracy and prevent AI "hallucinations."
How our RAG approach works:
- Grounding in Context: The entire content of your uploaded document serves as the "single source of truth."
- Strict Prompt Engineering: Every question you ask is wrapped in a meticulously crafted prompt that instructs the LLM:
  > "You are a helpful assistant specialized in answering questions based only on the provided document context... If the answer is not found in the document, you MUST state: 'I could not find an answer...'"
- Contextual Conversation: The model is also fed the recent chat history, allowing it to understand follow-up questions ("What about the second point you mentioned?") while still being strictly bound to the document's content for its answers.
This turns the LLM from a general-purpose knowledge engine into a dedicated expert on your specific document, providing trustworthy, context-aware answers.
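The grounding described above can be sketched as a prompt-wrapping function — the refusal instruction is from the project's description, while `buildRagPrompt`, the `Turn` type, and the section labels are illustrative:

```typescript
// RAG-style prompt wrapping: the uploaded document is the sole
// context, recent chat turns are replayed for follow-up questions,
// and the refusal instruction guards against hallucination.
type Turn = { role: "user" | "model"; text: string };

const RAG_INSTRUCTION =
  "You are a helpful assistant specialized in answering questions " +
  "based only on the provided document context. If the answer is not " +
  "found in the document, you MUST state: 'I could not find an answer " +
  "in the document.'";

function buildRagPrompt(context: string, history: Turn[], question: string): string {
  const transcript = history
    .map((t) => `${t.role === "user" ? "User" : "Assistant"}: ${t.text}`)
    .join("\n");
  return [
    RAG_INSTRUCTION,
    `Document context:\n${context}`,
    transcript ? `Recent conversation:\n${transcript}` : "",
    `Question: ${question}`,
  ]
    .filter(Boolean)
    .join("\n\n");
}

const p = buildRagPrompt(
  "The warranty period is 24 months.",
  [
    { role: "user", text: "Is there a warranty?" },
    { role: "model", text: "Yes, one is mentioned." },
  ],
  "How long is it?"
);
```

Keeping the document in every request (rather than relying on the model's general knowledge) is what makes the answers auditable against the source.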
## 📸 Application Preview

*The main interface*
## 🚀 How It Works: The AI Pipeline

DocuMind operates entirely client-side, ensuring your documents remain private. Here's a step-by-step breakdown of the data flow from upload to answer.
1. **File Upload:** The user uploads a PDF or an image (`.png`, `.jpg`, `.webp`).
2. **Client-Side Pre-processing:**
   - **PDF:** Mozilla's `pdf.js` library is used in the browser to parse the PDF and extract its complete textual content. No data is sent to a server for this step.
   - **Image:** The browser's `FileReader` API converts the image into a base64 string, preparing it for the API request.
3. **LLM Interaction (The Gemini API Call):**
   - The extracted text or the base64 image data is packaged into a request payload.
   - The user's prompt (e.g., "Summarize this" or "What is the total cost?") is combined with our RAG instructions.
   - This payload is sent securely to the Google Gemini API endpoint.
4. **Response Generation & Display:**
   - The Gemini model processes the request and generates a response based only on the provided context.
   - The response (a summary or an answer) is sent back to the application.
   - The UI updates in real time to display the generated text in the appropriate panel.
*Process Flow Chart*
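The image branch of the pre-processing step can be sketched as follows. `FileReader.readAsDataURL` yields a data URL (`data:image/png;base64,...`), while the API expects the raw base64 payload, so the prefix must be stripped; the helper names here are illustrative, not project code:

```typescript
// FileReader is a browser global; declared here so the sketch
// type-checks outside the DOM. Calling fileToBase64 requires a browser.
declare const FileReader: any;

// Drop the "data:<mime>;base64," prefix from a data URL,
// leaving only the raw base64 payload the API expects.
function stripDataUrlPrefix(dataUrl: string): string {
  const comma = dataUrl.indexOf(",");
  return comma === -1 ? dataUrl : dataUrl.slice(comma + 1);
}

// Promise wrapper around FileReader: resolves with an
// uploaded file's raw base64 content.
function fileToBase64(file: any): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onload = () => resolve(stripDataUrlPrefix(reader.result as string));
    reader.onerror = () => reject(reader.error);
    reader.readAsDataURL(file);
  });
}

console.log(stripDataUrlPrefix("data:image/png;base64,iVBOR")); // iVBOR
```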
## 🛠️ Technology Showcase

- **Frontend:** React 18 with TypeScript for a modern, type-safe, and component-based architecture.
- **AI Engine:** Google Gemini API (`gemini-2.5-flash`). We chose this model for its exceptional balance of speed, intelligence, and powerful multimodal (text and vision) capabilities.
- **Styling:** Tailwind CSS for a utility-first, responsive, and highly customizable design system that enables rapid development of a beautiful UI.
- **Document Processing:** Mozilla's `pdf.js` for robust, client-side PDF text extraction without server dependencies.
- **File Handling:** React Dropzone for a seamless and accessible drag-and-drop file upload experience.
## ⚙️ Getting Started

- A modern web browser (Chrome, Firefox, Safari, Edge).
- A Google Gemini API key.
The application requires a Google Gemini API key to communicate with the language model.
- Obtain your API key from Google AI Studio.
- The application is architected to securely access this key via the `process.env.API_KEY` environment variable. In a deployed environment (such as Vercel, Netlify, or a custom server), this key must be configured in the environment settings.
**Security Note:** Never expose your API key in client-side code. This project assumes the build environment correctly handles and secures environment variables.
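As an illustrative guard for this configuration — only the `API_KEY` variable name comes from the project; the function and its error message are hypothetical — the build can fail fast when the key is missing:

```typescript
// Read the API key from an environment map and fail loudly
// if it was never configured (e.g. in Vercel/Netlify settings).
function getApiKey(env: Record<string, string | undefined>): string {
  const key = env.API_KEY;
  if (!key) {
    throw new Error(
      "API_KEY is not set; configure it in your deployment environment."
    );
  }
  return key;
}

console.log(getApiKey({ API_KEY: "test-key" })); // test-key
```

Passing the environment map in as a parameter keeps the helper testable; in the app it would be called with `process.env`.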
## 🧑‍💻 Usage Guide

1. **Upload Your Document:** Drag and drop a supported file into the upload zone, or click to browse your local files.
2. **Generate a Summary:** Once the file is processed, click the "Summarize Document" button. The AI will read the entire document and generate a high-quality summary in the left panel.
3. **Ask Questions:** Use the chat interface on the right to ask specific questions about the document. Press Enter or click the send button to get an AI-powered answer grounded in the document's content.
## 👤 Author

Made by Vaibhav Shikhar Singh.