A lightweight, extensible Python chat assistant that talks to local Large Language Models (LLMs) via Ollama.
It supports dynamic model selection with fallback, (basic) streaming, conversational history, prompt engineering utilities, and optional quality‑repair hooks.
Run fully offline once models are pulled. Ideal for prototyping agents, internal tools, or experimentation without sending data to external APIs.
- Key Features
- Architecture Overview
- Quick Start
- Requirements
- Installation
- Environment Variables
- Usage Examples
- Model Management (Ollama)
- Prompt & Quality Tips
- Streaming Mode
- Fallback Logic
- Extending (HTTP API Option)
- Troubleshooting
- Roadmap
- Contributing
- License
- ✅ Local inference via Ollama CLI
- ✅ Automatic model fallback (e.g. llama3 → mistral → phi3)
- ✅ Optional on‑demand model pull (ensure model present)
- ✅ Conversation history merging
- ✅ Blocking or basic streaming execution
- ✅ Configurable generation parameters (temperature, top_p, num_predict)
- ✅ Debug logging toggle
- ✅ Prompt sanitation (removal of control chars; see the sketch after this list)
- 🧪 Optional post‑generation “repair” function (quality enforcement)
- 🧩 Designed to be easily upgraded to Ollama’s HTTP API
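The sanitation step is conceptually just a control-character strip. A minimal sketch, for illustration only (the function name `sanitize_prompt` and the exact character set are assumptions, not the client's actual implementation):

```python
import re

# Strip ASCII control characters (keeping newline and tab) before sending a prompt.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")

def sanitize_prompt(text: str) -> str:
    return _CONTROL_CHARS.sub("", text)
```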
+----------------------+
| Your App / UI | (CLI script, FastAPI, Discord bot, etc.)
+-----------+----------+
|
v
+--------------+
| ollama_client| <-- model selection, fallback, prompt build,
+------+-------+ history injection, streaming, timeouts
|
v
+---------------+
| Ollama Daemon| <-- runs / pulls local LLMs
+---------------+
|
Local Disk (model blobs)
Main entrypoint function: ask_ollama(...) inside ollama_client.py.
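The examples in this README assume a call signature roughly like the following; parameter defaults and type hints are illustrative, not authoritative:

```python
from typing import Dict, List, Optional

def ask_ollama(
    prompt: str,
    model: Optional[str] = None,                      # defaults to OLLAMA_MODEL, then the fallback list
    history: Optional[List[Dict[str, str]]] = None,   # [{"role": "user"|"assistant", "content": "..."}]
    use_history: bool = False,
    stream: bool = False,
) -> str:
    """Send a prompt to a local Ollama model and return the reply text."""
    ...
```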
# Pull at least one model
ollama pull phi3
# Test the model quickly
ollama run phi3 "Hello"
# Run a Python one‑liner using the client
python -c "from ollama_client import ask_ollama; print(ask_ollama('Give me a friendly greeting.', model='phi3'))"| Component | Minimum |
|---|---|
| Python | 3.9+ |
| Ollama | 0.11.0+ |
| Disk | Depends on models (e.g. mistral ≈ 4.4 GB) |
| RAM | 8 GB minimum for small models; more recommended for larger |
git clone https://github.com/<YOUR_USER>/<YOUR_REPO>.git
cd <YOUR_REPO>
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# (Add dependencies if you introduce any; base version may have none)
pip install -r requirements.txt  # if the file exists

Check Ollama:
ollama --version
ollama list

| Variable | Purpose | Example |
|---|---|---|
| OLLAMA_MODEL | Preferred default model | phi3 |
| OLLAMA_FALLBACK_MODELS | Comma-separated list of fallback models | mistral,phi3 |
| SYSTEM_PROMPT | System role / behavior | "You are a concise helpful assistant." |
| GEN_TEMPERATURE | Sampling temperature | 0.6 |
| GEN_TOP_P | Nucleus sampling | 0.9 |
| GEN_NUM_PREDICT | Max tokens (-1 = model default) | 256 |
| OLLAMA_TIMEOUT | Inference inactivity timeout (seconds) | 180 |
| OLLAMA_DEBUG | Debug logs (1/0) | 1 |
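Inside ollama_client.py these would typically be read with os.getenv. A minimal sketch, assuming straightforward parsing (the constant names and defaults here are illustrative, not the client's actual internals):

```python
import os

# Illustrative defaults only -- the real client may differ.
MODEL = os.getenv("OLLAMA_MODEL", "phi3")
FALLBACK_MODELS = [m.strip() for m in os.getenv("OLLAMA_FALLBACK_MODELS", "").split(",") if m.strip()]
SYSTEM_PROMPT = os.getenv("SYSTEM_PROMPT", "You are a concise helpful assistant.")
TEMPERATURE = float(os.getenv("GEN_TEMPERATURE", "0.6"))
TOP_P = float(os.getenv("GEN_TOP_P", "0.9"))
NUM_PREDICT = int(os.getenv("GEN_NUM_PREDICT", "-1"))
TIMEOUT = int(os.getenv("OLLAMA_TIMEOUT", "180"))
DEBUG = os.getenv("OLLAMA_DEBUG", "0") == "1"
```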
Example:
export OLLAMA_MODEL=phi3
export OLLAMA_FALLBACK_MODELS=mistral,phi3
export SYSTEM_PROMPT="You are an accurate, polite assistant. Avoid hallucination."
export GEN_TEMPERATURE=0.5
export GEN_TOP_P=0.9
export OLLAMA_DEBUG=1

Basic call:

from ollama_client import ask_ollama
print(ask_ollama("Give me a short inspirational sentence.", model="phi3"))history = [
{"role": "user", "content": "Hi, who are you?"},
{"role": "assistant", "content": "I'm a local AI assistant running on your machine."}
]
response = ask_ollama(
"Summarize your capabilities in one sentence.",
model="phi3",
history=history,
use_history=True
)
print(response)

Basic streaming:

resp = ask_ollama(
"Explain what a hash map is in simple terms.",
model="phi3",
stream=True
)

Constraining length and format:

ask_ollama(
"Write a helpful 3-sentence explanation of recursion. Do not answer in bullet points.",
model="mistral"
)

If OLLAMA_MODEL=llama3 but it is not downloaded and OLLAMA_FALLBACK_MODELS=mistral,phi3:
- The client will pick the first installed model from that list automatically.
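To observe this fallback in practice, a sketch like the following should work, assuming llama3 is not installed locally while mistral or phi3 is, and assuming ask_ollama uses OLLAMA_MODEL when no model argument is given:

```python
import os

# Assumes llama3 is NOT installed locally, but mistral or phi3 is.
os.environ["OLLAMA_MODEL"] = "llama3"
os.environ["OLLAMA_FALLBACK_MODELS"] = "mistral,phi3"
os.environ["OLLAMA_DEBUG"] = "1"   # log which model was actually chosen

from ollama_client import ask_ollama
print(ask_ollama("Hello!"))        # transparently answered by mistral or phi3
```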
| Action | Command |
|---|---|
| List installed models | ollama list |
| Pull a model | ollama pull mistral |
| Run a quick test | ollama run phi3 "Hello" |
| Remove a model | ollama rm mistral |
| Daemon logs (Linux systemd) | journalctl -u ollama -f |
First run with a new model may be slow (download + unpack). Subsequent runs are much faster.
| Problem | Tip |
|---|---|
| Very short answers | Explicit length: “Answer in at least 2 sentences.” |
| Hallucinations | Add: “If unsure, say you are unsure.” |
| Unstable formatting | Provide exact template scaffold in prompt. |
| Language drift | System prompt: “Always answer in English.” |
| Repetition | Lower temperature / add “Avoid repeating phrases.” |
Few‑shot example:
history = [
{"role": "user", "content": "Give me a cheerful greeting."},
{"role": "assistant", "content": "Hello! I hope you're having a great day. How can I assist you?"},
{"role": "user", "content": "Another one."},
{"role": "assistant", "content": "Hi there! Ready whenever you are—what would you like to explore today?"}
]
ask_ollama("Give me a third unique greeting.", model="phi3", history=history, use_history=True)Current streaming is a simple stdout line reader (not token‑level structured events). It helps avoid “blank screen” during longer generations or downloads.
To upgrade:
- Use the Ollama HTTP API (/api/generate with stream: true) and parse the JSON events (a sketch follows below).
- Pipe them through a WebSocket for a web UI.
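A minimal sketch of that upgrade, assuming the Ollama daemon is listening on its default port (11434). With stream: true, /api/generate returns newline-delimited JSON objects, each carrying a partial response chunk; stream_generate is a hypothetical helper name, not part of the client:

```python
import json
import requests

def stream_generate(prompt, model="phi3"):
    """Yield response text chunks from Ollama's streaming /api/generate endpoint."""
    url = "http://localhost:11434/api/generate"
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True, timeout=180) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            event = json.loads(line)        # one JSON object per line
            if event.get("done"):
                break
            yield event.get("response", "")

# Print tokens as they arrive:
for chunk in stream_generate("Explain what a hash map is in simple terms."):
    print(chunk, end="", flush=True)
print()
```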
- Try the preferred model (OLLAMA_MODEL or the provided argument).
- If it is not installed:
  - Iterate over OLLAMA_FALLBACK_MODELS and use the first installed one.
- If none are installed:
  - Optionally auto‑pull the preferred model (controlled by the client).
- Proceed with generation.

A sketch of this selection logic is shown below.
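A minimal sketch of that selection, assuming installed models are detected by parsing ollama list output; the function names and the auto_pull flag are illustrative, not the client's actual internals:

```python
import os
import subprocess

def installed_models():
    """Return base names of locally installed models by parsing `ollama list`."""
    out = subprocess.run(["ollama", "list"], capture_output=True, text=True, check=True).stdout
    # Skip the header row; the first column is the model name (e.g. "phi3:latest").
    return {line.split()[0].split(":")[0] for line in out.splitlines()[1:] if line.strip()}

def select_model(preferred=None, auto_pull=False):
    preferred = preferred or os.getenv("OLLAMA_MODEL", "phi3")
    fallbacks = [m.strip() for m in os.getenv("OLLAMA_FALLBACK_MODELS", "").split(",") if m.strip()]
    local = installed_models()
    if preferred in local:
        return preferred
    for candidate in fallbacks:          # first installed fallback wins
        if candidate in local:
            return candidate
    if auto_pull:                        # optionally pull the preferred model
        subprocess.run(["ollama", "pull", preferred], check=True)
        return preferred
    raise RuntimeError("No requested model is installed and auto-pull is disabled.")
```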
Upgrading to Ollama's HTTP API gains you:
- True parameter control (temperature, top_k, repeat_penalty, stop tokens)
- Proper streaming events
- Structured JSON
Sketch:
import requests

def http_generate(prompt, model="mistral"):
    """Blocking (non-streaming) call to Ollama's HTTP generate endpoint."""
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.6, "top_p": 0.9},
    }
    r = requests.post(url, json=payload, timeout=180)
    r.raise_for_status()
    return r.json().get("response")

| Symptom | Cause | Fix |
|---|---|---|
| Long silence, no answer | Model downloading | Run ollama pull <model> first; show progress |
| Empty / nonsense output | Weak small model | Use better prompt or switch to mistral / llama3 |
| Timeout error | Slow hardware / large model | Increase OLLAMA_TIMEOUT |
| CLI not found | PATH issue | Reinstall or add Ollama to PATH |
| High RAM usage | Large model loaded | Use a smaller model (phi3, qwen2:0.5b) |
| Mixed language output | Not constrained | Add explicit language rule in system prompt |
| Very short answer | Under-specified prompt | Add minimum length requirement |
Debug:
export OLLAMA_DEBUG=1
python -c "from ollama_client import ask_ollama; ask_ollama('Test', model='phi3')"- Native HTTP API client (JSON streaming)
- Web UI (FastAPI + WebSocket)
- Response cache layer
- Automatic quality repair module (config-toggle)
- Benchmark script for latency & token throughput
- Multi-user session isolation
- Prompt template registry
- Fork the repo
- Create a feature branch: git checkout -b feat/improve-stream
- Commit with clear messages
- Open a Pull Request describing changes (screenshots if UI-related)
Choose and add a license file (e.g. MIT or Apache-2.0).
Example (MIT):
Create a LICENSE file containing the MIT license text.
Local models may hallucinate or produce inaccurate information. Always verify critical outputs before use in production or decision pipelines.
ollama pull phi3
python -c "from ollama_client import ask_ollama; print(ask_ollama('Say something encouraging.', model='phi3'))"Happy hacking! 🚀