A local Retrieval-Augmented Generation (RAG) chatbot that answers insurance questions based on your own PDF documents. Built with TinyLlama, FAISS, and Gradio.
PDFs are static: users have to read and search manually. A chatbot is interactive: users can ask natural-language questions and get direct answers.
For insurance FAQs, this means: “What’s the claim process for car insurance?” → instant answer. No scrolling, no guessing, no searching multiple pages.
- PDF Ingestion — loads and chunks your insurance FAQ document
- Embedding — converts chunks into vector embeddings using all-MiniLM-L6-v2
- Vector Store — stores embeddings in a FAISS index for fast similarity search
- RAG Retrieval — on each query, retrieves the top-3 most relevant chunks
- LLM Generation — feeds retrieved context + question to TinyLlama to generate an answer
- Gradio UI — serves a chat interface accessible via browser
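The PDF ingestion step above can be sketched as a simple sliding-window chunker. This is a minimal illustration, not the project's actual `chunk_text()`, though the `chunk_size` and `overlap` defaults match those listed in the configuration table:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks.

    Each chunk is chunk_size characters long, and consecutive chunks
    share `overlap` characters so sentences aren't cut off at boundaries.
    """
    chunks = []
    start = 0
    step = chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```

The overlap matters because an answer that straddles a chunk boundary would otherwise be split across two chunks and retrieved only partially.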
```
User Query
    │
    ▼
Embedding Model ──► FAISS Index ──► Top-K Chunks
    │
    ▼
TinyLlama LLM
    │
    ▼
Answer
```
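Conceptually, the retrieval step is a top-k nearest-neighbor search over the stored embeddings. FAISS does this efficiently at scale; the NumPy stand-in below shows the idea only and is not the project's actual `retrieve_relevant()`:

```python
import numpy as np

def top_k_chunks(query_vec, index_vecs, top_k=3):
    # Cosine similarity reduces to a dot product once vectors are
    # L2-normalized (equivalent to FAISS inner-product search on
    # normalized embeddings).
    q = query_vec / np.linalg.norm(query_vec)
    idx = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    scores = idx @ q
    # Indices of the top_k highest-scoring chunks, best first.
    return np.argsort(scores)[::-1][:top_k]
```

The returned indices are then used to look up the original text chunks, which become the context passed to the LLM.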
```
llm_insurance-chatbot/
├── main.py                  # Main application
├── requirements.txt         # Python dependencies
├── data/
│   └── Insurance_FAQs.pdf   # Your source document (add this yourself)
└── embeddings/
    └── vector_store.pkl     # Auto-generated on first run
```
```
cd llm_insurance-chatbot
python -m venv venv

# Windows
venv\Scripts\activate

# macOS/Linux
source venv/bin/activate

pip install -r requirements.txt
```

Place your insurance FAQ document at `data/Insurance_FAQs.pdf`, then start the app:

```
python main.py
```

On first run, the vector store will be built automatically. Subsequent runs will load it from cache.
After running, you'll see output like:
```
* Running on local URL:  http://127.0.0.1:7860
* Running on public URL: https://xxxxxx.gradio.live
```
- Local URL — open in your browser on the same machine
- Public URL — share with anyone for 72 hours (enabled via `share=True`)
| Component | Minimum |
|---|---|
| RAM | 6GB free |
| Python | 3.10+ |
| Disk | ~3GB (for model cache) |
| GPU | Optional (runs on CPU) |
Note for Windows users: You may see a symlinks warning from HuggingFace. This is harmless. To fix it, enable Developer Mode in Windows Settings or run Python as Administrator.
Key parameters you can tweak in main.py:
| Parameter | Location | Default | Description |
|---|---|---|---|
| `chunk_size` | `chunk_text()` | 500 | Characters per chunk |
| `overlap` | `chunk_text()` | 50 | Overlap between chunks |
| `top_k` | `retrieve_relevant()` | 3 | Number of chunks retrieved |
| `max_new_tokens` | `generate_answer()` | 300 | Max response length |
| `temperature` | `generate_answer()` | 0.7 | Response creativity (0 = deterministic) |
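The `temperature` parameter scales the model's output logits before sampling: lower values concentrate probability mass on the most likely token, approaching deterministic (greedy) decoding at 0. A small NumPy illustration of the effect (not code from main.py):

```python
import numpy as np

def token_probs(logits, temperature=0.7):
    """Softmax over temperature-scaled logits.

    Lower temperature sharpens the distribution (more deterministic);
    higher temperature flattens it (more varied, 'creative' output).
    """
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

With logits `[2.0, 1.0]`, a low temperature pushes almost all probability onto the first token, while a high temperature makes the two nearly equally likely.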
- Segmentation fault during model load → Not enough RAM. The model needs ~4GB free; close other applications and retry.
- `localhost is not accessible` error → Add `share=True` to `iface.launch()` and use the public URL instead.
- Slow responses → Expected on CPU; TinyLlama takes 30–90 seconds per response without a GPU.
- Poor answer quality → Check that your PDF text is extractable (not a scanned image). Try increasing `top_k` to 5.
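A quick way to check extractability is to pull the first page's text and see whether anything meaningful comes back; scanned-image PDFs typically yield `None` or a near-empty string. The helper below is hypothetical (not part of main.py), with the pdfplumber call shown commented:

```python
def looks_extractable(page_text, min_chars=50):
    # Scanned-image PDFs usually return None or almost no characters
    # from text extraction; genuine text PDFs return plenty.
    return page_text is not None and len(page_text.strip()) >= min_chars

# Example check against your own document (requires pdfplumber):
# import pdfplumber
# with pdfplumber.open("data/Insurance_FAQs.pdf") as pdf:
#     print(looks_extractable(pdf.pages[0].extract_text()))
```

If the check fails, the PDF likely needs OCR before it can be used as a knowledge source.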
| Package | Purpose |
|---|---|
| `pdfplumber` | PDF text extraction |
| `sentence-transformers` | Generating embeddings |
| `faiss-cpu` | Vector similarity search |
| `transformers` | Loading the TinyLlama LLM |
| `torch` | Model inference backend |
| `gradio` | Chat web interface |
| `accelerate` | Optimized model loading |