INFS4205-7205/2026_Week4_LLaVA_Demo

INFS4205/7205 — Week 4 Demo

Multimodal Instruction Following with LLaVA

An interactive web demo that showcases LLaVA (Large Language and Vision Assistant) — a multimodal large language model that can understand both images and natural language instructions. The model runs locally via Ollama, so no API keys or cloud services are required.


What This Demo Shows

LLaVA demonstrates instruction following in a multimodal setting:

  1. You provide an image (e.g. a street scene, a chart, a handwritten note).
  2. You give a natural-language instruction (e.g. "What objects are in this image?").
  3. LLaVA generates a detailed response by jointly reasoning over the visual and textual input.

This is a core capability studied in multimodal foundation models — the ability to ground language understanding in visual perception.
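The three-step loop above can also be exercised programmatically against Ollama's HTTP API, without the web UI. Below is a minimal sketch; it assumes Ollama is serving on its default port 11434 with the llava model already pulled, and the image filename is a placeholder:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, image_bytes: bytes) -> dict:
    """Build the JSON body Ollama's /api/generate expects: images travel base64-encoded."""
    return {
        "model": "llava",
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for one complete response instead of a token stream
    }

def ask_llava(prompt: str, image_path: str) -> str:
    """POST the payload to a locally running Ollama and return LLaVA's answer."""
    with open(image_path, "rb") as f:
        body = json.dumps(build_payload(prompt, f.read())).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running and an image on disk):
# print(ask_llava("What objects are in this image?", "street_scene.jpg"))
```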


Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Python | 3.9+ | Check with `python --version` |
| Ollama | latest | Local LLM runtime — install from ollama.com/download |
| pip | any | Comes with Python |

Hardware: LLaVA (~4.7 GB) runs comfortably on machines with 8 GB+ RAM. A GPU is helpful but not required — Ollama automatically uses the GPU if available.


Quick Start

Follow these three steps to get the demo running.

Step 1 — Install Ollama & pull the model

If you haven't installed Ollama yet, download it from ollama.com/download (macOS / Linux / Windows).

Then open a terminal and pull the LLaVA model:

ollama pull llava

This downloads ~4.7 GB on first run. You only need to do this once.

Ollama runs as a background service automatically after installation. If it's not running, start it manually:

ollama serve

Tip: If you see Error: listen tcp 127.0.0.1:11434: bind: address already in use, that means Ollama is already running — you're good to go.
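You can also confirm the model was pulled without opening the demo, by querying Ollama's /api/tags endpoint. A sketch (it assumes the default port and the documented response shape, a "models" list of objects with "name" fields):

```python
import json
import urllib.request

def has_model(tags: dict, name: str) -> bool:
    """True if any installed model's name starts with `name` (e.g. "llava:latest")."""
    return any(m["name"].startswith(name) for m in tags.get("models", []))

def check_llava(base_url: str = "http://localhost:11434") -> bool:
    """Ask the local Ollama service whether llava is available."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return has_model(json.loads(resp.read()), "llava")

# Example (requires Ollama running):
# print("llava ready" if check_llava() else "run: ollama pull llava")
```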

Step 2 — Install Python dependencies

cd demo_llava
pip install -r requirements.txt

This installs two lightweight packages: flask (web server) and requests (HTTP client).

Step 3 — Launch the demo

python app.py

Then open your browser and navigate to:

http://localhost:5001

You should see the demo interface with a green status pill that reads "Ollama connected · llava ready".


How to Use the Demo

  1. Upload an image — Drag and drop an image onto the left card, or click to browse your files.

  2. Write an instruction — Type a question or instruction in the text area, or click one of the preset chips:

    | Preset | What it does |
    | --- | --- |
    | Describe | Asks for a detailed description of the image |
    | Objects | Lists all visible objects |
    | Interesting | Highlights unusual or notable aspects |
    | Read text | Transcribes any visible text (OCR-style) |
    | Depth | Estimates relative depth of objects in the scene |
    | ELI5 | Explains the image in simple language |
  3. Send — Click the blue Send button or press Cmd+Enter (macOS) / Ctrl+Enter (Windows/Linux).

  4. Watch the response stream — LLaVA's response appears token by token in the bottom response card.

You can upload a new image at any time by clicking the × button on the preview and dropping a new one.


Project Structure

demo_llava/
├── app.py                 # Flask backend — proxies requests to Ollama
├── requirements.txt       # Python dependencies (flask, requests)
├── README.md              # This file
└── templates/
    └── index.html         # Frontend — single-page app with embedded CSS/JS

| File | Purpose |
| --- | --- |
| app.py | Serves the web UI and exposes two API endpoints: /api/generate (streams LLaVA responses) and /api/models (checks Ollama connectivity). |
| index.html | Self-contained frontend with drag-and-drop image upload, preset instruction chips, and Server-Sent Events (SSE) for real-time token streaming. |

Architecture Overview

┌──────────────┐       POST /api/generate       ┌──────────────┐       POST /api/generate       ┌──────────────┐
│              │  ──────────────────────────────► │              │  ──────────────────────────────► │              │
│   Browser    │       (image + prompt)           │  Flask App   │       (image + prompt)           │   Ollama     │
│  (index.html)│  ◄────────────────────────────── │  (app.py)    │  ◄────────────────────────────── │   (llava)    │
│              │       SSE token stream           │  port 5001   │       NDJSON stream              │  port 11434  │
└──────────────┘                                  └──────────────┘                                  └──────────────┘
  1. The browser sends the base64-encoded image and the text prompt to Flask.
  2. Flask forwards the request to Ollama's /api/generate endpoint with stream: true.
  3. Ollama streams tokens back as newline-delimited JSON.
  4. Flask re-packages each token as an SSE event and streams it to the browser.
  5. The frontend appends each token to the response area in real time.
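The repackaging in step 4 boils down to translating Ollama's newline-delimited JSON chunks into SSE frames. An illustrative sketch (not the actual app.py; function names are hypothetical):

```python
import json

def ndjson_to_sse(line: bytes) -> str:
    """Convert one NDJSON chunk from Ollama into a Server-Sent Events frame.

    Ollama emits lines like {"response": "tok", "done": false}; the browser's
    EventSource expects `data: ...` frames terminated by a blank line.
    """
    chunk = json.loads(line)
    payload = {"token": chunk.get("response", ""), "done": chunk.get("done", False)}
    return f"data: {json.dumps(payload)}\n\n"

def sse_stream(ollama_lines):
    """Generator a Flask view could return as Response(..., mimetype='text/event-stream')."""
    for line in ollama_lines:
        if line.strip():  # skip blank lines between chunks
            yield ndjson_to_sse(line)
```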

Troubleshooting

| Problem | Solution |
| --- | --- |
| Status pill shows "Ollama not reachable" | Make sure Ollama is running: `ollama serve`. If you get "address already in use", it's already running. |
| Status pill shows "llava not found" | Pull the model: `ollama pull llava` |
| Response says "Cannot connect to Ollama" | Ollama may have stopped — restart it with `ollama serve` |
| Very slow responses | LLaVA is running on CPU. For faster inference, ensure your GPU is detected by Ollama (`ollama ps`) |
| Port 5001 already in use | Edit `app.py` line 75 and change the port number, e.g. `port=5002` |

Things to Try

Here are some interesting prompts to demonstrate LLaVA's instruction-following capabilities:

  • Upload a street scene → "Count the number of cars and describe their colors."
  • Upload a chart or graph → "Summarize the trend shown in this chart."
  • Upload a handwritten note → "Transcribe this text and correct any spelling errors."
  • Upload a meme → "Explain why this image is funny."
  • Upload an aerial photo → "Describe this scene as if you were giving driving directions."
  • Upload any image without a prompt → Try the ELI5 preset to see how it simplifies complex scenes.

References


INFS4205/7205 Advanced Techniques for High Dimensional Data — The University of Queensland

About

Vibe coding demo for visualising LLaVA's instruction-following capability
