📄 DocQA - RAG Document Q&A

A full-stack Retrieval-Augmented Generation (RAG) system that lets you upload PDF documents and ask questions about them in natural language. Answers are grounded in your documents with source citations - no hallucination.

100% free and local. No API keys, no cloud services. Everything runs on your machine.

Python React FastAPI Docker


✨ Features

  • Upload PDFs - Regular or scanned (OCR support via Tesseract)
  • Ask questions in natural language - Get grounded answers from your documents
  • Source citations - See exactly which document chunks informed the answer
  • Streaming responses - Real-time token-by-token answers (like ChatGPT)
  • Anti-hallucination - The LLM only answers from document context and admits when it doesn't know
  • Semantic search - Finds relevant content by meaning, not just keywords ("money back" matches "refund policy"; see the snippet after this list)
  • Dark-themed chat UI - Clean React frontend with smooth animations
  • Dockerized - Full Docker Compose setup with backend, frontend, and nginx proxy
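
The semantic-search bullet is easy to sanity-check with the same embedding model this project uses. This is a standalone snippet, not code from this repo, assuming only that sentence-transformers is installed:

# Standalone check that "money back" and "refund policy" land close together
# in embedding space (all-MiniLM-L6-v2 is the model this project uses).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["money back", "refund policy", "shipping times"])

print(util.cos_sim(emb[0], emb[1]))  # high similarity: related by meaning
print(util.cos_sim(emb[0], emb[2]))  # noticeably lower: unrelated topic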

πŸ—οΈ Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   React      │     │  FastAPI     │     │   Ollama     │
│   Frontend   │────▶│  Backend     │────▶│  (Local LLM) │
│   (Chat UI)  │     │  + RAG Engine│     │  Llama 3.2   │
└──────────────┘     └──────┬───────┘     └──────────────┘
                            │
                    ┌───────┴───────┐
                    │               │
               ┌────▼──────┐  ┌─────▼────┐
               │ Document  │  │ ChromaDB │
               │ Processor │  │ (Vector  │
               │ (Chunker) │─▶│  Store)  │
               └───────────┘  └──────────┘

RAG Pipeline

Ingestion (once per document):

Upload PDF → Extract text (PyMuPDF / OCR) → Chunk (recursive splitting) → Embed (sentence-transformers) → Store (ChromaDB)
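
As a rough illustration of that flow, here is a minimal standalone sketch built directly on PyMuPDF, sentence-transformers, and ChromaDB. It mirrors the pipeline but is not the repo's actual code; the file name, collection name, and naive chunking are simplifying assumptions:

# Minimal ingestion sketch: PDF text -> fixed-size chunks -> embeddings -> ChromaDB.
# Simplified stand-in for the real pipeline (no OCR fallback, naive chunking).
import fitz                     # PyMuPDF
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_or_create_collection("documents", metadata={"hnsw:space": "cosine"})

text = "".join(page.get_text() for page in fitz.open("sample.pdf"))

# Naive chunking with overlap (the project uses recursive splitting instead).
size, overlap = 512, 50
chunks = [text[i:i + size] for i in range(0, len(text), size - overlap)]

collection.add(
    ids=[f"sample-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=model.encode(chunks).tolist(),
)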

Querying (per question):

Question → Embed → Similarity search → Retrieve top-K chunks → LLM generates grounded answer → Stream to frontend
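
The query side, sketched the same way. The prompt wording and the Ollama call are illustrative assumptions (the repo's retriever and LLM services may differ); the ollama Python client is used here for brevity:

# Minimal query sketch: embed the question, retrieve top-K chunks from ChromaDB,
# then ask a local Ollama model to answer only from that context.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./data/chroma").get_or_create_collection("documents")

question = "What is the refund policy?"
hits = collection.query(query_embeddings=model.encode([question]).tolist(), n_results=5)

context = "\n\n".join(hits["documents"][0])
scores = [1 - d for d in hits["distances"][0]]   # cosine distance -> similarity

prompt = (
    "Answer ONLY from the context below. If the context is insufficient, "
    "say you don't have enough information.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
answer = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
print(answer["message"]["content"], scores)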

🛠️ Tech Stack

Component    | Technology                                | Purpose
Backend      | FastAPI + Uvicorn                         | Async API with auto-generated docs
Vector Store | ChromaDB                                  | Local persistent vector database
Embeddings   | sentence-transformers (all-MiniLM-L6-v2)  | 384-dim semantic embeddings, runs locally
LLM          | Ollama + Llama 3.2                        | Local inference, no API keys needed
PDF Parsing  | PyMuPDF + pytesseract                     | Regular + scanned PDF support
Frontend     | React 18                                  | Streaming chat UI with dark theme
Deployment   | Docker Compose + Nginx                    | Containerized multi-service setup
Validation   | Pydantic v2                               | Request/response validation + settings

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Ollama installed
  • Tesseract installed (optional, for scanned PDFs)

Option 1: Run with Docker (Recommended)

# Clone the repo
git clone https://github.com/YOUR_USERNAME/rag-doc-qa.git
cd rag-doc-qa

# Make sure Ollama is running and has the model
ollama pull llama3.2

# Build and start all services
docker compose up --build -d

Open http://localhost:3000, upload a PDF, and start asking questions!

Option 2: Run Locally

1. Backend setup:

git clone https://github.com/YOUR_USERNAME/rag-doc-qa.git
cd rag-doc-qa

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Copy environment config
cp .env.example .env

2. Pull the LLM model:

ollama pull llama3.2

3. Start the backend:

uvicorn app.main:app --reload --port 8001

API docs available at http://localhost:8001/docs

4. Start the frontend (new terminal):

cd frontend
npm install
npm start

Open http://localhost:3000, upload a PDF, and start asking questions!


📑 API Endpoints

Method | Path                       | Description
POST   | /api/v1/documents/upload   | Upload and ingest a PDF
GET    | /api/v1/documents          | List ingested documents
DELETE | /api/v1/documents/{doc_id} | Remove a document and its chunks
POST   | /api/v1/query              | Ask a question, get a full answer
POST   | /api/v1/query/stream       | Stream an answer token-by-token (SSE)
GET    | /api/v1/health             | Health check with dependency status

Example: Ask a question via curl

curl -X POST http://localhost:8001/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund policy?", "top_k": 5}'

📁 Project Structure

rag-doc-qa/
├── app/
│   ├── main.py              # FastAPI app + service initialization
│   ├── config.py            # Pydantic settings from .env
│   ├── api/                 # HTTP endpoints
│   │   ├── router.py        # Route aggregator (/api/v1)
│   │   ├── documents.py     # Upload / list / delete
│   │   ├── query.py         # Question answering + streaming
│   │   └── health.py        # Health check
│   ├── core/                # RAG building blocks
│   │   ├── chunker.py       # Recursive text splitting with overlap
│   │   ├── embeddings.py    # sentence-transformers wrapper
│   │   └── vector_store.py  # ChromaDB wrapper
│   ├── services/            # Business logic orchestration
│   │   ├── ingestion.py     # Full upload pipeline
│   │   ├── retriever.py     # Full query pipeline
│   │   └── llm.py           # Ollama LLM with anti-hallucination prompt
│   ├── models/
│   │   └── schemas.py       # Pydantic request/response models
│   └── utils/
│       └── pdf_parser.py    # PDF text extraction + OCR fallback
├── frontend/                # React chat UI
│   ├── src/
│   │   ├── App.js           # Main chat component
│   │   └── App.css          # Dark theme styling
│   ├── Dockerfile           # Multi-stage build (Node → Nginx)
│   └── nginx.conf           # SPA routing + API proxy
├── docker-compose.yml       # Multi-service orchestration
├── Dockerfile               # Backend container
├── requirements.txt
├── .env.example
└── .gitignore

⚙️ Design Decisions

Why these matter (interview context)

  1. Fully local stack - No API keys or paid services. sentence-transformers for embeddings, Ollama for the LLM, ChromaDB for storage. Everything runs on the user's machine for privacy and zero cost.

  2. Provider abstraction - LLM and embedding providers are isolated in separate classes. Swapping Ollama for OpenAI or Claude is a one-file change, with no refactoring needed.

  3. Recursive chunking with overlap - Splits text at natural boundaries (paragraphs → sentences → words) with configurable overlap (~50 chars). Preserves semantic coherence while ensuring no context is lost at chunk edges (a simplified sketch follows this list).

  4. Direct extraction + OCR fallback - Tries fast text extraction first via PyMuPDF and falls back to Tesseract OCR only for scanned pages. Regular PDFs stay fast (~1ms/page); scanned pages pay the OCR cost only when needed.

  5. Anti-hallucination system prompt - The LLM is explicitly instructed to answer only from the provided context chunks and to say "I don't have enough information" when the context is insufficient. Tested and verified.

  6. Cosine distance → similarity conversion - ChromaDB returns distance (0 = identical); we convert it to similarity (1 = identical) for intuitive scoring in the API response.

  7. Streaming via SSE - Server-Sent Events deliver tokens in real time. Sources are sent first so the frontend can display them immediately while the answer streams in.

  8. Model loaded once at startup - The embedding model loads in __init__ (~3-5 sec) and is reused for all requests (milliseconds each), avoiding the performance hit of loading per request.

  9. Dependency injection - Services receive their dependencies through constructors, making them testable and swappable. The ingestion service doesn't know or care which embedding provider it's using.

  10. Docker Compose - Three-service architecture (backend, frontend, nginx) with persistent volumes for ChromaDB data. Nginx proxies API calls and handles SPA routing.
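
To make point 3 concrete, here is a simplified sketch of the recursive-splitting idea. It is an illustration, not the repo's chunker.py, and it omits the overlap handling to stay short:

# Illustrative recursive splitter: try paragraph breaks first, then sentences,
# then words, and only hard-cut when nothing else fits. Overlap is omitted here;
# the real chunker also carries ~50 chars of overlap between adjacent chunks.
def recursive_split(text, max_len=512, separators=("\n\n", ". ", " ")):
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                if len(current) + len(piece) > max_len and current:
                    chunks.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                chunks.append(current.strip())
            # Re-split any piece that is still too long using the finer separators.
            return [c for chunk in chunks for c in recursive_split(chunk, max_len, separators)]
    # No separator worked: fall back to a hard character cut.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]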


🔧 Configuration

All settings are loaded from .env via Pydantic Settings with validation:

Variable           | Default                | Description
OLLAMA_MODEL       | llama3.2               | Which Ollama model to use
OLLAMA_BASE_URL    | http://localhost:11434 | Ollama server URL
CHROMA_PERSIST_DIR | ./data/chroma          | ChromaDB storage location
CHUNK_SIZE         | 512                    | Max characters per chunk (100-4096)
CHUNK_OVERLAP      | 50                     | Overlap between chunks (0-512)
TOP_K              | 5                      | Number of chunks to retrieve per query (1-20)
EMBEDDING_MODEL    | all-MiniLM-L6-v2       | sentence-transformers model
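
A minimal sketch of what such a settings class can look like with pydantic-settings (field names mirror the table above; the actual app/config.py may differ):

# Hypothetical settings sketch using pydantic-settings (Pydantic v2).
# Values come from .env / environment variables and are range-validated.
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    ollama_model: str = "llama3.2"
    ollama_base_url: str = "http://localhost:11434"
    chroma_persist_dir: str = "./data/chroma"
    chunk_size: int = Field(512, ge=100, le=4096)
    chunk_overlap: int = Field(50, ge=0, le=512)
    top_k: int = Field(5, ge=1, le=20)
    embedding_model: str = "all-MiniLM-L6-v2"

settings = Settings()  # reads OLLAMA_MODEL, CHUNK_SIZE, ... (matching is case-insensitive)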

📈 Scaling Considerations

This project is designed for local use. For production scale:

Current                       | Production Alternative
ChromaDB (local)              | Pinecone / Weaviate (managed, distributed)
Ollama (local LLM)            | OpenAI / Claude API (faster, higher quality)
sentence-transformers (local) | OpenAI embeddings API (higher quality)
Single process                | Multiple Uvicorn workers + async task queue
File-based storage            | Redis cache for repeated queries

The provider abstraction pattern makes these swaps straightforward: change one class, not the whole app.
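
As an illustration of that pattern (class and method names here are hypothetical, not the repo's):

# Illustrative provider abstraction: the rest of the app depends on the
# LLMProvider interface, so swapping Ollama for a hosted API is one new class.
from abc import ABC, abstractmethod
import ollama

class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.2"):
        self.model = model

    def generate(self, prompt: str) -> str:
        resp = ollama.chat(model=self.model, messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]

# A hosted alternative (e.g. an OpenAI-backed provider) would implement the same
# generate() method, and the services that receive an LLMProvider never change.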

