A full-stack Retrieval-Augmented Generation (RAG) system that lets you upload PDF documents and ask questions about them in natural language. Answers are grounded in your documents with source citations – no hallucination.
100% free and local. No API keys, no cloud services. Everything runs on your machine.
- Upload PDFs – Regular or scanned (OCR support via Tesseract)
- Ask questions in natural language – Get grounded answers from your documents
- Source citations – See exactly which document chunks informed the answer
- Streaming responses – Real-time token-by-token answers (like ChatGPT)
- Anti-hallucination – LLM only answers from document context, admits when it doesn't know
- Semantic search – Finds relevant content by meaning, not just keywords ("money back" matches "refund policy"); see the sketch after this list
- Dark-themed chat UI – Clean React frontend with smooth animations
- Dockerized – Full Docker Compose setup with backend, frontend, and nginx proxy
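The semantic-search claim can be illustrated with a few lines of `sentence-transformers` code. This is a standalone sketch using the same `all-MiniLM-L6-v2` model as the project, not code from the repo:

```python
# Sketch: why "money back" can match "refund policy" without sharing keywords.
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings, runs locally

query = "Can I get my money back?"
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "The warranty covers manufacturing defects for one year.",
]

query_emb = model.encode(query, convert_to_tensor=True)
chunk_embs = model.encode(chunks, convert_to_tensor=True)

# Cosine similarity: higher = more semantically related
scores = util.cos_sim(query_emb, chunk_embs)[0]
for chunk, score in zip(chunks, scores):
    print(f"{score.item():.3f}  {chunk}")
```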

    ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
    │    React     │      │   FastAPI    │      │    Ollama    │
    │   Frontend   │─────▶│   Backend    │─────▶│ (Local LLM)  │
    │  (Chat UI)   │      │ + RAG Engine │      │  Llama 3.2   │
    └──────────────┘      └──────┬───────┘      └──────────────┘
                                 │
                         ┌───────┴───────┐
                         │               │
                    ┌────▼─────┐    ┌────▼─────┐
                    │ Document │    │ ChromaDB │
                    │ Processor│    │ (Vector  │
                    │ (Chunker)│───▶│  Store)  │
                    └──────────┘    └──────────┘

Ingestion (once per document):
Upload PDF → Extract text (PyMuPDF / OCR) → Chunk (recursive splitting) → Embed (sentence-transformers) → Store (ChromaDB)
Querying (per question):
Question → Embed → Similarity search → Retrieve top-K chunks → LLM generates grounded answer → Stream to frontend
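Put together, the query path amounts to a few library calls. The sketch below assumes the standard `chromadb`, `sentence-transformers`, and `ollama` Python clients; the collection name and the `answer_question` helper are illustrative, not the repo's actual API:

```python
# Illustrative query pipeline: embed -> retrieve -> grounded prompt -> LLM.
# Assumes: pip install chromadb sentence-transformers ollama, and a local Ollama server.
import chromadb
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_or_create_collection("documents")  # collection name is an assumption

def answer_question(question: str, top_k: int = 5) -> str:
    # 1. Embed the question into the same 384-dim space as the stored chunks
    query_embedding = embedder.encode(question).tolist()

    # 2. Similarity search: retrieve the top-K most relevant chunks
    results = collection.query(query_embeddings=[query_embedding], n_results=top_k)
    chunks = results["documents"][0]

    # 3. Build a grounded prompt: answer only from the retrieved context
    context = "\n\n".join(chunks)
    system_prompt = (
        "Answer ONLY from the provided context. If the context does not "
        "contain the answer, say: \"I don't have enough information.\""
    )
    user_prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 4. Generate the answer with the local Ollama model
    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response["message"]["content"]
```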
| Component | Technology | Purpose |
|---|---|---|
| Backend | FastAPI + Uvicorn | Async API with auto-generated docs |
| Vector Store | ChromaDB | Local persistent vector database |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2) | 384-dim semantic embeddings, runs locally |
| LLM | Ollama + Llama 3.2 | Local inference, no API keys needed |
| PDF Parsing | PyMuPDF + pytesseract | Regular + scanned PDF support |
| Frontend | React 18 | Streaming chat UI with dark theme |
| Deployment | Docker Compose + Nginx | Containerized multi-service setup |
| Validation | Pydantic v2 | Request/response validation + settings |

    # Clone the repo
    git clone https://github.com/YOUR_USERNAME/rag-doc-qa.git
    cd rag-doc-qa

    # Make sure Ollama is running and has the model
    ollama pull llama3.2

    # Build and start all services
    docker compose up --build -d

Open http://localhost:3000, upload a PDF, and start asking questions!

1. Backend setup:

        git clone https://github.com/YOUR_USERNAME/rag-doc-qa.git
        cd rag-doc-qa

        # Create virtual environment
        python -m venv .venv
        source .venv/bin/activate   # On Windows: .venv\Scripts\activate

        # Install dependencies
        pip install -r requirements.txt

        # Copy environment config
        cp .env.example .env

2. Pull the LLM model:

        ollama pull llama3.2

3. Start the backend:

        uvicorn app.main:app --reload --port 8001

   API docs are available at http://localhost:8001/docs

4. Start the frontend (new terminal):

        cd frontend
        npm install
        npm start

   Open http://localhost:3000, upload a PDF, and start asking questions!
| Method | Path | Description |
|---|---|---|
| `POST` | `/api/v1/documents/upload` | Upload and ingest a PDF |
| `GET` | `/api/v1/documents` | List ingested documents |
| `DELETE` | `/api/v1/documents/{doc_id}` | Remove a document and its chunks |
| `POST` | `/api/v1/query` | Ask a question, get a full answer |
| `POST` | `/api/v1/query/stream` | Stream an answer token-by-token (SSE) |
| `GET` | `/api/v1/health` | Health check with dependency status |

    curl -X POST http://localhost:8001/api/v1/query \
      -H "Content-Type: application/json" \
      -d '{"question": "What is the refund policy?", "top_k": 5}'

    rag-doc-qa/
    ├── app/
    │   ├── main.py              # FastAPI app + service initialization
    │   ├── config.py            # Pydantic settings from .env
    │   ├── api/                 # HTTP endpoints
    │   │   ├── router.py        # Route aggregator (/api/v1)
    │   │   ├── documents.py     # Upload / list / delete
    │   │   ├── query.py         # Question answering + streaming
    │   │   └── health.py        # Health check
    │   ├── core/                # RAG building blocks
    │   │   ├── chunker.py       # Recursive text splitting with overlap
    │   │   ├── embeddings.py    # sentence-transformers wrapper
    │   │   └── vector_store.py  # ChromaDB wrapper
    │   ├── services/            # Business logic orchestration
    │   │   ├── ingestion.py     # Full upload pipeline
    │   │   ├── retriever.py     # Full query pipeline
    │   │   └── llm.py           # Ollama LLM with anti-hallucination prompt
    │   ├── models/
    │   │   └── schemas.py       # Pydantic request/response models
    │   └── utils/
    │       └── pdf_parser.py    # PDF text extraction + OCR fallback
    ├── frontend/                # React chat UI
    │   ├── src/
    │   │   ├── App.js           # Main chat component
    │   │   └── App.css          # Dark theme styling
    │   ├── Dockerfile           # Multi-stage build (Node → Nginx)
    │   └── nginx.conf           # SPA routing + API proxy
    ├── docker-compose.yml       # Multi-service orchestration
    ├── Dockerfile               # Backend container
    ├── requirements.txt
    ├── .env.example
    └── .gitignore
- Fully local stack – No API keys or paid services. sentence-transformers for embeddings, Ollama for the LLM, ChromaDB for storage. Everything runs on the user's machine for privacy and zero cost.
- Provider abstraction – LLM and embedding providers are isolated in separate classes. Swapping Ollama for OpenAI or Claude is a one-file change, with no refactoring needed.
- Recursive chunking with overlap – Splits text at natural boundaries (paragraphs → sentences → words) with configurable overlap (~50 chars). Preserves semantic coherence while ensuring no context is lost at chunk edges; see the sketch after this list.
- Direct extraction + OCR fallback – Tries fast text extraction via PyMuPDF first and falls back to Tesseract OCR only for scanned pages. Regular PDFs stay fast (~1 ms/page); scanned pages pay the OCR cost only when needed.
- Anti-hallucination system prompt – The LLM is explicitly instructed to answer only from the provided context chunks and to say "I don't have enough information" when the context is insufficient. Tested and verified.
- Cosine distance → similarity conversion – ChromaDB returns a distance (0 = identical); we convert it to a similarity (1 = identical, i.e. similarity = 1 - distance) for intuitive scoring in the API response.
- Streaming via SSE – Server-Sent Events deliver tokens in real time. Sources are sent first so the frontend can display them immediately while the answer streams in.
- Model loaded once at startup – The embedding model loads in `__init__` (~3-5 seconds) and is reused for all requests (milliseconds each), avoiding the cost of reloading it per request.
- Dependency injection – Services receive their dependencies through constructors, making them testable and swappable. The ingestion service doesn't know or care which embedding provider it's using.
- Docker Compose – Three-service architecture (backend, frontend, nginx) with persistent volumes for ChromaDB data. Nginx proxies API calls and handles SPA routing.
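To make the chunking decision concrete, here is a simplified sketch of a recursive splitter with overlap. It illustrates the idea only and is not the code in `app/core/chunker.py`:

```python
# Simplified recursive splitter: try paragraph breaks first, then sentences,
# then words, and keep a small overlap between consecutive chunks.
SEPARATORS = ["\n\n", ". ", " "]  # paragraph -> sentence -> word boundaries

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    def split(segment: str, separators: list[str]) -> list[str]:
        if len(segment) <= chunk_size:
            return [segment]
        if not separators:
            # No natural boundary left: hard-cut the text
            return [segment[i:i + chunk_size] for i in range(0, len(segment), chunk_size)]
        sep, rest = separators[0], separators[1:]
        pieces, chunks, current = segment.split(sep), [], ""
        for piece in pieces:
            candidate = (current + sep + piece) if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                if len(piece) <= chunk_size:
                    current = piece
                else:
                    # The piece itself is too long: recurse with finer separators
                    current = ""
                    chunks.extend(split(piece, rest))
        if current:
            chunks.append(current)
        return chunks

    chunks = split(text, SEPARATORS)
    # Add overlap: prefix each chunk with the tail of the previous one so that
    # context at chunk boundaries is not lost (may exceed chunk_size slightly).
    overlapped = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        tail = prev[-overlap:] if overlap > 0 else ""
        overlapped.append(tail + cur)
    return overlapped
```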
All settings are loaded from .env via Pydantic Settings with validation:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_MODEL` | `llama3.2` | Which Ollama model to use |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `CHROMA_PERSIST_DIR` | `./data/chroma` | ChromaDB storage location |
| `CHUNK_SIZE` | `512` | Max characters per chunk (100-4096) |
| `CHUNK_OVERLAP` | `50` | Overlap between chunks (0-512) |
| `TOP_K` | `5` | Number of chunks to retrieve per query (1-20) |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | sentence-transformers model |
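For reference, a settings class along these lines could be declared with Pydantic v2's `pydantic-settings`. The field names mirror the table above, but the exact bounds and layout of the repo's `config.py` are assumptions here:

```python
# Sketch of a Pydantic v2 settings class loading and validating values from .env.
# Requires: pip install pydantic-settings
from pydantic import Field
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ollama_model: str = "llama3.2"
    ollama_base_url: str = "http://localhost:11434"
    chroma_persist_dir: str = "./data/chroma"
    embedding_model: str = "all-MiniLM-L6-v2"

    # Bounds match the ranges documented in the table above
    chunk_size: int = Field(default=512, ge=100, le=4096)
    chunk_overlap: int = Field(default=50, ge=0, le=512)
    top_k: int = Field(default=5, ge=1, le=20)

settings = Settings()  # values from .env override the defaults above
```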
This project is designed for local use. For production scale:
| Current | Production Alternative |
|---|---|
| ChromaDB (local) | Pinecone / Weaviate (managed, distributed) |
| Ollama (local LLM) | OpenAI / Claude API (faster, higher quality) |
| sentence-transformers (local) | OpenAI embeddings API (higher quality) |
| Single process | Multiple Uvicorn workers + async task queue |
| File-based storage | Redis cache for repeated queries |
The provider abstraction pattern makes these swaps straightforward – change one class, not the whole app.
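As an illustration of that pattern, a minimal provider interface might look like the sketch below; the class and method names are hypothetical rather than taken from `app/services/llm.py`:

```python
# Hypothetical provider abstraction: the rest of the app depends only on LLMProvider,
# so swapping Ollama for a hosted API touches a single class.
from abc import ABC, abstractmethod
import ollama

class LLMProvider(ABC):
    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return a completion for the given prompt."""

class OllamaProvider(LLMProvider):
    def __init__(self, model: str = "llama3.2") -> None:
        self.model = model

    def generate(self, prompt: str) -> str:
        response = ollama.generate(model=self.model, prompt=prompt)
        return response["response"]

# A hosted alternative (e.g. an OpenAI-backed provider) would implement the same
# interface, so the retriever and ingestion code never change:
# class OpenAIProvider(LLMProvider): ...
```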