An interactive web demo that showcases LLaVA (Large Language and Vision Assistant) — a multimodal large language model that can understand both images and natural language instructions. The model runs locally via Ollama, so no API keys or cloud services are required.
LLaVA demonstrates instruction following in a multimodal setting:
- You provide an image (e.g. a street scene, a chart, a handwritten note).
- You give a natural-language instruction (e.g. "What objects are in this image?").
- LLaVA generates a detailed response by jointly reasoning over the visual and textual input.
This is a core capability studied in multimodal foundation models — the ability to ground language understanding in visual perception.
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.9+ | Check with `python --version` |
| Ollama | latest | Local LLM runtime — install from ollama.com/download |
| pip | any | Comes with Python |
Hardware: LLaVA (~4.7 GB) runs comfortably on machines with 8 GB+ RAM. A GPU is helpful but not required — Ollama automatically uses the GPU if available.
Follow these three steps to get the demo running.
If you haven't installed Ollama yet, download it from ollama.com/download (macOS / Linux / Windows).
Then open a terminal and pull the LLaVA model:
```bash
ollama pull llava
```

This downloads ~4.7 GB on first run. You only need to do this once.
Ollama runs as a background service automatically after installation. If it's not running, start it manually:
```bash
ollama serve
```

Tip: If you see `Error: listen tcp 127.0.0.1:11434: bind: address already in use`, that means Ollama is already running — you're good to go.
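To confirm both that the Ollama service is reachable and that the llava model has been pulled, you can query Ollama's local API from Python. This is a minimal sketch assuming the default port 11434 and Ollama's standard model-listing endpoint (it needs the `requests` package, installed in the next step):

```python
# check_ollama.py: quick sanity check (assumes Ollama's default port 11434)
import requests

try:
    # Ollama's model-listing endpoint; each entry has a "name" like "llava:latest"
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running. Installed models:", models)
    if not any(name.startswith("llava") for name in models):
        print("llava not found; run: ollama pull llava")
except requests.exceptions.ConnectionError:
    print("Cannot reach Ollama on port 11434; start it with: ollama serve")
```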
```bash
cd demo_llava
pip install -r requirements.txt
```

This installs two lightweight packages: flask (web server) and requests (HTTP client).
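For reference, requirements.txt is expected to contain little more than those two packages (the actual file may pin specific versions):

```text
flask
requests
```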
```bash
python app.py
```

Then open your browser and navigate to:

```
http://localhost:5001
```
You should see the demo interface with a green status pill that reads "Ollama connected · llava ready".
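The same connectivity check can be run outside the browser. The sketch below simply prints whatever the demo's /api/models endpoint returns; its exact response format is defined by app.py, so no particular JSON shape is assumed here:

```python
# Assumes the demo is running on port 5001
import requests

resp = requests.get("http://localhost:5001/api/models", timeout=5)
print(resp.status_code, resp.text)  # should indicate that Ollama is reachable and llava is ready
```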
- Upload an image — Drag and drop an image onto the left card, or click to browse your files.
- Write an instruction — Type a question or instruction in the text area, or click one of the preset chips:

  | Preset | What it does |
  |---|---|
  | Describe | Asks for a detailed description of the image |
  | Objects | Lists all visible objects |
  | Interesting | Highlights unusual or notable aspects |
  | Read text | Transcribes any visible text (OCR-style) |
  | Depth | Estimates relative depth of objects in the scene |
  | ELI5 | Explains the image in simple language |

- Send — Click the blue Send button or press `Cmd+Enter` (macOS) / `Ctrl+Enter` (Windows/Linux).
- Watch the response stream — LLaVA's response appears token by token in the bottom response card.

You can upload a new image at any time by clicking the × button on the preview and dropping a new one.
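The browser UI is only a thin layer over Ollama. If you want to script the same interaction, you can call Ollama's /api/generate endpoint directly, just as the Flask backend does. A minimal sketch follows; the image filename is a placeholder, and the payload fields follow Ollama's generate API:

```python
# ollama_llava_client.py: call Ollama's /api/generate directly, bypassing the web UI.
# Mirrors what the demo does end to end: base64 image + prompt in, streamed tokens out.
import base64
import json
import requests

with open("street_scene.jpg", "rb") as f:           # any local image file (placeholder name)
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava",
    "prompt": "Count the number of cars and describe their colors.",
    "images": [image_b64],                           # Ollama accepts base64-encoded images
    "stream": True,                                  # tokens arrive as newline-delimited JSON
}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break
```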
```
demo_llava/
├── app.py               # Flask backend — proxies requests to Ollama
├── requirements.txt     # Python dependencies (flask, requests)
├── README.md            # This file
└── templates/
    └── index.html       # Frontend — single-page app with embedded CSS/JS
```
| File | Purpose |
|---|---|
| `app.py` | Serves the web UI and exposes two API endpoints: `/api/generate` (streams LLaVA responses) and `/api/models` (checks Ollama connectivity). |
| `index.html` | Self-contained frontend with drag-and-drop image upload, preset instruction chips, and Server-Sent Events (SSE) for real-time token streaming. |
```
┌──────────────┐  POST /api/generate   ┌──────────────┐  POST /api/generate   ┌──────────────┐
│              │ ────────────────────► │              │ ────────────────────► │              │
│   Browser    │   (image + prompt)    │  Flask App   │   (image + prompt)    │    Ollama    │
│ (index.html) │ ◄──────────────────── │   (app.py)   │ ◄──────────────────── │   (llava)    │
│              │   SSE token stream    │  port 5001   │     NDJSON stream     │  port 11434  │
└──────────────┘                       └──────────────┘                       └──────────────┘
```
- The browser sends the base64-encoded image and the text prompt to Flask.
- Flask forwards the request to Ollama's `/api/generate` endpoint with `stream: true`.
- Ollama streams tokens back as newline-delimited JSON.
- Flask re-packages each token as an SSE event and streams it to the browser (sketched below).
- The frontend appends each token to the response area in real time.
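For concreteness, here is a rough sketch of what that proxy loop might look like. It is illustrative rather than a copy of the real app.py: the request field names (`image`, `prompt`) and the SSE payload key (`token`) are assumptions, and error handling is omitted.

```python
# sketch_app.py: illustrative version of the streaming proxy (not the actual app.py)
import json
import requests
from flask import Flask, Response, request

app = Flask(__name__, template_folder="templates")
OLLAMA_URL = "http://localhost:11434/api/generate"

@app.route("/api/generate", methods=["POST"])
def generate():
    data = request.get_json()

    def stream():
        payload = {
            "model": "llava",
            "prompt": data["prompt"],     # field name assumed
            "images": [data["image"]],    # base64 image from the browser; field name assumed
            "stream": True,
        }
        # Forward to Ollama and re-emit each NDJSON token as a Server-Sent Event
        with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
            for line in r.iter_lines():
                if not line:
                    continue
                token = json.loads(line).get("response", "")
                yield f"data: {json.dumps({'token': token})}\n\n"

    return Response(stream(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=5001)
```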
| Problem | Solution |
|---|---|
| Status pill shows "Ollama not reachable" | Make sure Ollama is running: `ollama serve`. If you get "address already in use", it's already running. |
| Status pill shows "llava not found" | Pull the model: `ollama pull llava` |
| Response says "Cannot connect to Ollama" | Ollama may have stopped — restart it with `ollama serve` |
| Very slow responses | LLaVA is running on CPU. For faster inference, ensure your GPU is detected by Ollama (`ollama ps`) |
| Port 5001 already in use | Edit `app.py` line 75 and change the port number, e.g. `port=5002` |
Here are some interesting prompts to demonstrate LLaVA's instruction-following capabilities:
- Upload a street scene → "Count the number of cars and describe their colors."
- Upload a chart or graph → "Summarize the trend shown in this chart."
- Upload a handwritten note → "Transcribe this text and correct any spelling errors."
- Upload a meme → "Explain why this image is funny."
- Upload an aerial photo → "Describe this scene as if you were giving driving directions."
- Upload any image without a prompt → Try the ELI5 preset to see how it simplifies complex scenes.
- LLaVA: Liu et al., "Visual Instruction Tuning", NeurIPS 2023 — arXiv:2304.08485
- Ollama: ollama.com — Run LLMs locally with a single command
- Flask: flask.palletsprojects.com — Lightweight Python web framework
INFS4205/7205 Advanced Techniques for High Dimensional Data — The University of Queensland