This guide will help you set up the RAG system to use local models (Llama 3.2 or Qwen) instead of OpenAI's API.
First, install Ollama and start its server:

```bash
# Download and install from https://ollama.ai
# Or use Homebrew:
brew install ollama

# Start the server (keep this running in a separate terminal window)
ollama serve
```
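Before going further, it helps to confirm the server is actually up. Ollama exposes an HTTP API on port 11434 by default, and `GET /api/tags` lists installed models. A small stdlib-only check (the helper name is ours):

```python
import json
import urllib.request

def ollama_models(base_url="http://localhost:11434"):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return [m["name"] for m in data.get("models", [])]
    except OSError:  # connection refused, timeout, DNS failure, ...
        return None

models = ollama_models()
if models is None:
    print("Ollama is not running - start it with `ollama serve`")
else:
    print("Installed models:", models)
```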
Choose one or more models to download:
```bash
# Llama 3.2 (1B - fastest, lowest quality)
ollama pull llama3.2:1b

# Llama 3.2 (3B - balanced)
ollama pull llama3.2:3b

# Llama 3.2 (default, usually 3B)
ollama pull llama3.2

# Qwen 2.5 (various sizes)
ollama pull qwen2.5:0.5b  # Smallest
ollama pull qwen2.5:1.5b  # Small
ollama pull qwen2.5:3b    # Medium
ollama pull qwen2.5:7b    # Larger (needs more RAM)
```

Then install the Python dependencies:

```bash
pip install -r requirements.txt
```

The key changes made:
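The exact contents of `requirements.txt` depend on your script; for the stack described below, a plausible version might look like this (package names are our assumptions — pin versions as your project requires):

```text
langchain
langchain-ollama
langchain-huggingface
sentence-transformers
# plus your vector store, e.g.:
# faiss-cpu
# chromadb
```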
- Replaced `ChatOpenAI` with `ChatOllama`:

```python
from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="llama3.2",  # or "qwen2.5"
    temperature=0.7,
)
```
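Whichever model you pick, the RAG flow itself is model-agnostic: retrieved chunks are stuffed into the prompt before it reaches the local LLM. A minimal illustration of that prompt assembly (the helper and template are ours, not part of LangChain):

```python
def build_rag_prompt(question, retrieved_chunks):
    """Assemble a RAG prompt: retrieved context first, then the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "What port does Ollama use?",
    ["Ollama serves its HTTP API on port 11434 by default."],
)
# The assembled prompt is then passed to the local model,
# e.g. llm.invoke(prompt) with a ChatOllama instance.
print(prompt)
```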
- Replaced `OpenAIEmbeddings` with `HuggingFaceEmbeddings` (from the `langchain-huggingface` package, which replaces the deprecated `langchain.embeddings` import):

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
```
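Retrieval then ranks stored document vectors by cosine similarity to the query vector. A toy sketch with hand-made 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings are 384-dimensional; the vectors and names here are invented):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
docs = {"doc_a": [0.9, 0.1, 0.8], "doc_b": [0.0, 1.0, 0.0]}

# Pick the document whose vector points most nearly the same way as the query.
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # doc_a - it is far closer to the query than doc_b
```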
- Removed the OpenAI API key requirement - everything runs locally!
Run the script:

```bash
python local_llm_rag.py
```

Model recommendations:

- `llama3.2:1b` - Fast, good for simple queries
- `qwen2.5:0.5b` or `qwen2.5:1.5b` - Very fast
- `llama3.2:3b` - Recommended balance
- `qwen2.5:3b` - Also excellent
- `qwen2.5:7b` - High quality responses
- `llama3.1:8b` - Very capable
To switch models, just change the MODEL variable in your code:
```python
MODEL = "qwen2.5:3b"  # Change this line
```

Then restart your application.
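If several places in your code construct the LLM, it helps to funnel them through one spot so changing `MODEL` really is a one-line edit. A small sketch (the `llm_config` helper is ours, not from the guide):

```python
MODEL = "qwen2.5:3b"

def llm_config(model=None, temperature=0.7):
    """Keyword arguments for ChatOllama, defaulting to the module-level MODEL."""
    return {"model": model or MODEL, "temperature": temperature}

# Usage with the ChatOllama class described earlier:
#   llm = ChatOllama(**llm_config())                 # uses MODEL
#   llm = ChatOllama(**llm_config("llama3.2:3b"))    # one-off override
print(llm_config())
```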
- Make sure Ollama is running: `ollama serve`
- Pull the model first: `ollama pull llama3.2`
- Try a smaller model (1b or 0.5b versions)
- Close other applications
- HuggingFace will download the embedding model on first run
- This is a one-time download (~80MB)
| Model | Size | Speed | Quality | RAM Needed |
|---|---|---|---|---|
| llama3.2:1b | 1.3GB | ⚡⚡⚡ | ⭐⭐ | 4GB |
| llama3.2:3b | 2GB | ⚡⚡ | ⭐⭐⭐ | 8GB |
| qwen2.5:3b | 2.5GB | ⚡⚡ | ⭐⭐⭐⭐ | 8GB |
| qwen2.5:7b | 4.7GB | ⚡ | ⭐⭐⭐⭐⭐ | 16GB |
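The RAM column above can drive an automatic pick: given the machine's memory, choose the highest-quality model that fits. The thresholds below are copied from the comparison table; the helper itself is illustrative:

```python
# (model, minimum RAM in GB), ordered best quality first,
# per the comparison table above.
MODELS_BY_QUALITY = [
    ("qwen2.5:7b", 16),
    ("qwen2.5:3b", 8),
    ("llama3.2:3b", 8),
    ("llama3.2:1b", 4),
]

def pick_model(ram_gb):
    """Return the best model that fits in ram_gb, or None if none fit."""
    for name, needed in MODELS_BY_QUALITY:
        if ram_gb >= needed:
            return name
    return None

print(pick_model(16))  # qwen2.5:7b
print(pick_model(4))   # llama3.2:1b
```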
✅ No API costs - runs completely free
✅ Privacy - your data never leaves your machine
✅ No rate limits - use as much as you want
✅ Works offline - no internet required after setup
✅ Fast - no network latency
You can customize Ollama settings:
```python
llm = ChatOllama(
    model="llama3.2",
    temperature=0.7,  # Creativity (0-1)
    top_p=0.9,        # Nucleus sampling
    top_k=40,         # Top-k sampling
    num_ctx=2048,     # Context window size
)
```
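These same sampling parameters exist outside LangChain too: Ollama's `/api/generate` endpoint accepts them in an `options` object, so you can sanity-check a setting with a raw HTTP request. A sketch of the request body (the prompt is just an example):

```json
{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_ctx": 2048
  }
}
```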