An interactive web demo that showcases LLaVA (Large Language and Vision Assistant) — a multimodal large language model that can understand both images and natural language instructions. The model runs locally via Ollama, so no API keys or cloud services are required.
LLaVA demonstrates instruction following in a multimodal setting:
- You provide an image (e.g. a street scene, a chart, a handwritten note).
- You give a natural-language instruction (e.g. "What objects are in this image?").
- LLaVA generates a detailed response by jointly reasoning over the visual and textual input.
This is a core capability studied in multimodal foundation models — the ability to ground language understanding in visual perception.
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.9+ | Check with `python --version` |
| Ollama | latest | Local LLM runtime — install from ollama.com/download |
| pip | any | Comes with Python |
Hardware: LLaVA (~4.7 GB) runs comfortably on machines with 8 GB+ RAM. A GPU is helpful but not required — Ollama automatically uses the GPU if available.
Follow these three steps to get the demo running.
If you haven't installed Ollama yet, download it from ollama.com/download (macOS / Linux / Windows).
Then open a terminal and pull the LLaVA model:
```bash
ollama pull llava
```

This downloads ~4.7 GB on first run. You only need to do this once.
Ollama runs as a background service automatically after installation. If it's not running, start it manually:
```bash
ollama serve
```

Tip: If you see `Error: listen tcp 127.0.0.1:11434: bind: address already in use`, that means Ollama is already running — you're good to go.
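To confirm both that the Ollama service is reachable and that the llava model has been pulled, you can query Ollama's local API from Python. This is a minimal sketch assuming the default port 11434 and Ollama's standard model-listing endpoint (it needs the `requests` package, installed in the next step):

```python
# check_ollama.py: quick sanity check (assumes Ollama's default port 11434)
import requests

try:
    # Ollama's model-listing endpoint; each entry has a "name" like "llava:latest"
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    models = [m["name"] for m in resp.json().get("models", [])]
    print("Ollama is running. Installed models:", models)
    if not any(name.startswith("llava") for name in models):
        print("llava not found; run: ollama pull llava")
except requests.exceptions.ConnectionError:
    print("Cannot reach Ollama on port 11434; start it with: ollama serve")
```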
```bash
cd demo_llava
pip install -r requirements.txt
```

This installs two lightweight packages: flask (web server) and requests (HTTP client).
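For reference, requirements.txt is expected to contain little more than those two packages (the actual file may pin specific versions):

```text
flask
requests
```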
```bash
python app.py
```

Then open your browser and navigate to:

```
http://localhost:5001
```
You should see the demo interface with a green status pill that reads "Ollama connected · llava ready".
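The same connectivity check can be run outside the browser. The sketch below simply prints whatever the demo's /api/models endpoint returns; its exact response format is defined by app.py, so no particular JSON shape is assumed here:

```python
# Assumes the demo is running on port 5001
import requests

resp = requests.get("http://localhost:5001/api/models", timeout=5)
print(resp.status_code, resp.text)  # should indicate that Ollama is reachable and llava is ready
```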
- Upload an image — Drag and drop an image onto the left card, or click to browse your files.
- Write an instruction — Type a question or instruction in the text area, or click one of the preset chips:

  | Preset | What it does |
  |---|---|
  | Describe | Asks for a detailed description of the image |
  | Objects | Lists all visible objects |
  | Interesting | Highlights unusual or notable aspects |
  | Read text | Transcribes any visible text (OCR-style) |
  | Depth | Estimates relative depth of objects in the scene |
  | ELI5 | Explains the image in simple language |

- Send — Click the blue Send button or press `Cmd+Enter` (macOS) / `Ctrl+Enter` (Windows/Linux).
- Watch the response stream — LLaVA's response appears token by token in the bottom response card.

You can upload a new image at any time by clicking the × button on the preview and dropping a new one.
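The browser UI is only a thin layer over Ollama. If you want to script the same interaction, you can call Ollama's /api/generate endpoint directly, just as the Flask backend does. A minimal sketch follows; the image filename is a placeholder, and the payload fields follow Ollama's generate API:

```python
# ollama_llava_client.py: call Ollama's /api/generate directly, bypassing the web UI.
# Mirrors what the demo does end to end: base64 image + prompt in, streamed tokens out.
import base64
import json
import requests

with open("street_scene.jpg", "rb") as f:           # any local image file (placeholder name)
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava",
    "prompt": "Count the number of cars and describe their colors.",
    "images": [image_b64],                           # Ollama accepts base64-encoded images
    "stream": True,                                  # tokens arrive as newline-delimited JSON
}

with requests.post("http://localhost:11434/api/generate", json=payload, stream=True) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break
```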
```
demo_llava/
├── app.py               # Flask backend — proxies requests to Ollama
├── requirements.txt     # Python dependencies (flask, requests)
├── README.md            # This file
└── templates/
    └── index.html       # Frontend — single-page app with embedded CSS/JS
```
| File | Purpose |
|---|---|
| `app.py` | Serves the web UI and exposes two API endpoints: `/api/generate` (streams LLaVA responses) and `/api/models` (checks Ollama connectivity). |
| `index.html` | Self-contained frontend with drag-and-drop image upload, preset instruction chips, and Server-Sent Events (SSE) for real-time token streaming. |
```
┌──────────────┐  POST /api/generate   ┌──────────────┐  POST /api/generate   ┌──────────────┐
│              │ ────────────────────► │              │ ────────────────────► │              │
│   Browser    │   (image + prompt)    │  Flask App   │   (image + prompt)    │    Ollama    │
│ (index.html) │ ◄──────────────────── │   (app.py)   │ ◄──────────────────── │   (llava)    │
│              │   SSE token stream    │  port 5001   │     NDJSON stream     │  port 11434  │
└──────────────┘                       └──────────────┘                       └──────────────┘
```
- The browser sends the base64-encoded image and the text prompt to Flask.
- Flask forwards the request to Ollama's `/api/generate` endpoint with `stream: true`.
- Ollama streams tokens back as newline-delimited JSON.
- Flask re-packages each token as an SSE event and streams it to the browser (sketched below).
- The frontend appends each token to the response area in real time.
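For concreteness, here is a rough sketch of what that proxy loop might look like. It is illustrative rather than a copy of the real app.py: the request field names (`image`, `prompt`) and the SSE payload key (`token`) are assumptions, and error handling is omitted.

```python
# sketch_app.py: illustrative version of the streaming proxy (not the actual app.py)
import json
import requests
from flask import Flask, Response, request

app = Flask(__name__, template_folder="templates")
OLLAMA_URL = "http://localhost:11434/api/generate"

@app.route("/api/generate", methods=["POST"])
def generate():
    data = request.get_json()

    def stream():
        payload = {
            "model": "llava",
            "prompt": data["prompt"],     # field name assumed
            "images": [data["image"]],    # base64 image from the browser; field name assumed
            "stream": True,
        }
        # Forward to Ollama and re-emit each NDJSON token as a Server-Sent Event
        with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
            for line in r.iter_lines():
                if not line:
                    continue
                token = json.loads(line).get("response", "")
                yield f"data: {json.dumps({'token': token})}\n\n"

    return Response(stream(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(port=5001)
```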
| Problem | Solution |
|---|---|
| Status pill shows "Ollama not reachable" | Make sure Ollama is running: `ollama serve`. If you get "address already in use", it's already running. |
| Status pill shows "llava not found" | Pull the model: `ollama pull llava` |
| Response says "Cannot connect to Ollama" | Ollama may have stopped — restart it with `ollama serve` |
| Very slow responses | LLaVA is running on CPU. For faster inference, ensure your GPU is detected by Ollama (`ollama ps`) |
| Port 5001 already in use | Edit `app.py` line 75 and change the port number, e.g. `port=5002` |
Here are some interesting prompts to demonstrate LLaVA's instruction-following capabilities:
- Upload a street scene → "Count the number of cars and describe their colors."
- Upload a chart or graph → "Summarize the trend shown in this chart."
- Upload a handwritten note → "Transcribe this text and correct any spelling errors."
- Upload a meme → "Explain why this image is funny."
- Upload an aerial photo → "Describe this scene as if you were giving driving directions."
- Upload any image without a prompt → Try the ELI5 preset to see how it simplifies complex scenes.
- LLaVA: Liu et al., "Visual Instruction Tuning", NeurIPS 2023 — arXiv:2304.08485
- Ollama: ollama.com — Run LLMs locally with a single command
- Flask: flask.palletsprojects.com — Lightweight Python web framework
INFS4205/7205 Advanced Techniques for High Dimensional Data — The University of Queensland