A tiny, no-magic Retrieval-Augmented Generation (RAG) pipeline in pure Python. It uses Sentence-Transformers for embeddings, FAISS for retrieval, and talks to an Ollama LLM for generation. Clean, minimal, and easy to extend—perfect for learning or as a foundation for your own projects.
- Pure Python, no LangChain
- Sentence-Transformers embeddings + FAISS vector search
- Simple text/PDF ingestion with chunking and overlap
- Streams responses from a local Ollama server
- Small, readable codebase designed for teaching and hacking
```
simple-RAG/
├─ main.py             # CLI entrypoint; wires retrieval to Ollama
├─ rag.py              # Chunking, embeddings, FAISS store & retrieval
├─ fileutils.py        # File loading utilities (PDF/TXT/MD)
├─ knowledge_base/     # Your source documents live here
│  ├─ shrek.txt
│  └─ bee_movie_script.txt
└─ requirements.txt    # Python dependencies
```
- Load documents from `knowledge_base/` (PDF, TXT, MD).
- Split them into overlapping chunks for better context.
- Embed chunks with Sentence-Transformers.
- Build a FAISS index for fast similarity search.
- At query time, retrieve the top-k chunks and pass them to the LLM via Ollama (a minimal sketch of this flow follows below).
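The snippet below is a minimal, simplified sketch of that flow using the same libraries (Sentence-Transformers + FAISS). It is illustrative only, not the actual code in `rag.py`; the sample file and question are just examples based on the bundled documents.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

CHUNK_SIZE, CHUNK_OVERLAP = 1000, 100

def chunk_text(text: str) -> list[str]:
    """Split text into fixed-size chunks that overlap by CHUNK_OVERLAP characters."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunks = chunk_text(open("knowledge_base/shrek.txt", encoding="utf-8").read())

# Embed every chunk and index the vectors for L2 similarity search.
embeddings = np.asarray(model.encode(chunks), dtype=np.float32)
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# At query time: embed the question and fetch the 3 nearest chunks.
query_vec = np.asarray(model.encode(["Why do ogres have layers?"]), dtype=np.float32)
_, ids = index.search(query_vec, 3)
top_chunks = [chunks[i] for i in ids[0]]
print(top_chunks[0][:200])  # the app prints the first 200 chars of each source chunk
```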
- Python 3.10+
- A working Ollama installation running locally (default: `http://localhost:11434`).
- An Ollama model downloaded (e.g., `mistral`).
Create and activate a virtual environment, then install the dependencies (the activation command below is for Windows PowerShell; on macOS/Linux use `source env/bin/activate`):

```
python -m venv env
./env/Scripts/Activate.ps1
pip install -r requirements.txt
```

Pull a model for Ollama (example: `mistral`):
```
ollama pull mistral
```

- Put your documents into `knowledge_base/` as `.txt`, `.md`, or `.pdf` files.
- Start Ollama (if it isn't already running).
- Run the app:

```
python main.py
```

Ask questions interactively. Type `exit` to quit.
Key constants you may want to tweak:
- In `rag.py`:
  - `CHUNK_SIZE` (default: 1000)
  - `CHUNK_OVERLAP` (default: 100)
  - `MODEL_NAME` (default: `sentence-transformers/all-MiniLM-L6-v2`)
  - `FAISS_INDEX_PATH` / `DOCS_PATH`
- In `main.py`:
  - `OLLAMA_URL` (default: `http://localhost:11434/api/generate`)
  - `OLLAMA_MODEL` (default: `mistral`)
- Start an interactive RAG session:

```
python main.py
```
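For reference, this is roughly what a streaming call to Ollama's `/api/generate` endpoint looks like. The function name and prompt handling are illustrative, not the exact code in `main.py`:

```python
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def stream_answer(prompt: str, model: str = "mistral") -> str:
    """Stream a completion from a local Ollama server, printing tokens as they arrive."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    parts = []
    with requests.post(OLLAMA_URL, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)  # each line is a JSON object with a "response" token
            parts.append(chunk.get("response", ""))
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()
    return "".join(parts)
```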
- Style: keep functions small and well documented.
- Tests: this repo is tiny; consider adding smoke tests as you extend it.
- Contributions: PRs and issues are welcome.
- Add persistence: call `VectorStore.save()` / `load()` to reuse the index (a sketch of one possible on-disk layout follows this list).
- Swap embedding model: change `MODEL_NAME` in `rag.py`.
- Change retriever behavior: adjust `k` in `store.search(query, k=3)`.
- Add sources formatting: the app currently prints the first 200 chars of each retrieved chunk.
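If you add persistence, one possible on-disk layout is sketched below. These are assumed helper functions, not the repo's actual `VectorStore` API; check `rag.py` (and `FAISS_INDEX_PATH` / `DOCS_PATH`) for the real interface.

```python
import json
import faiss

def save_store(index, chunks, index_path="faiss.index", docs_path="docs.json"):
    """Write the FAISS index and its chunk texts to disk (illustrative helper)."""
    faiss.write_index(index, index_path)
    with open(docs_path, "w", encoding="utf-8") as f:
        json.dump(chunks, f)

def load_store(index_path="faiss.index", docs_path="docs.json"):
    """Reload a previously saved index and chunk list (illustrative helper)."""
    index = faiss.read_index(index_path)
    with open(docs_path, encoding="utf-8") as f:
        return index, json.load(f)
```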
- Import errors for packages (`requests`, `pypdf`, `sentence-transformers`, `faiss-cpu`, `numpy`):
  - Ensure your virtual environment is active and run `pip install -r requirements.txt`.
- Ollama connection errors (a quick connectivity check is sketched after this list):
  - Verify the service is running and reachable at `OLLAMA_URL`.
  - Confirm the model is available (`ollama list`) and pulled (`ollama pull mistral`).
- GPU vs CPU FAISS:
  - This project pins CPU FAISS via `faiss-cpu`. If you have a GPU and want acceleration, install a suitable FAISS build manually.
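A quick way to confirm the server is reachable (a sketch; adjust the URL if you changed `OLLAMA_URL`):

```python
import requests

try:
    resp = requests.get("http://localhost:11434", timeout=5)
    print(resp.status_code, resp.text)  # a healthy server typically replies "Ollama is running"
except requests.exceptions.ConnectionError:
    print("Ollama is not reachable on localhost:11434; is the service running?")
```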
Q: Can I use another LLM provider?
A: Yes. Replace `query_ollama()` in `main.py` with a function that calls your provider, keeping the same input/output signature.
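For example, a drop-in replacement targeting an OpenAI-compatible endpoint might look like this. The URL, model name, and the prompt-in/text-out signature are assumptions for illustration, not part of this repo:

```python
import requests

def query_llm(prompt: str, api_url: str, api_key: str, model: str = "example-model") -> str:
    """Send the RAG prompt to an OpenAI-compatible chat-completions API and return the answer text."""
    resp = requests.post(
        f"{api_url}/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```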
Q: How big can my documents be?
A: As big as your memory allows. The index holds embeddings for each chunk; large corpora will use more RAM and take time to build. Consider batching or on-disk indices for very large datasets.
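For instance, Sentence-Transformers can embed a large corpus in batches to keep memory use and progress predictable (a sketch; the chunk list here is a placeholder):

```python
from sentence_transformers import SentenceTransformer

chunks = ["example chunk one", "example chunk two"]  # placeholder; use your real chunks

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Encode 64 chunks at a time and keep vectors in float32 for FAISS.
embeddings = model.encode(chunks, batch_size=64, show_progress_bar=True).astype("float32")
```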
Q: Why chunk overlap?
A: Overlap helps preserve context that might otherwise be split between chunks, improving retrieval quality.
By default, all data stays local: files, embeddings, and LLM calls (with Ollama). Review your model’s behavior and logs before sharing outputs.
MIT License. See LICENSE for details.