Basic Multimodal RAG Tutorial

This repository contains a small, self-contained notebook that demonstrates a basic multimodal RAG pipeline using local text chunks, image assets, precomputed embeddings, and an Ollama vision model for the final response step.

Overview

The notebook walks through a simple Multimodal Retrieval-Augmented Generation workflow:

  1. Load a prepared corpus and saved embeddings
  2. Embed a text query into a shared CLIP embedding space
  3. Retrieve relevant text chunks and images
  4. Send the retrieved context to an Ollama multimodal model

The repo already includes the prepared demo assets, so you can focus on understanding the retrieval and prompting flow rather than building the dataset from scratch.
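The retrieval portion of this flow (steps 2–3) reduces to cosine similarity in the shared embedding space. A minimal sketch using torch, with random dummy vectors standing in for the real CLIP embeddings (512 dimensions is typical for CLIP ViT-B/32, but this is an assumption here):

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_emb: torch.Tensor, corpus_embs: torch.Tensor, k: int = 3):
    """Return indices of the k corpus items closest to the query by cosine similarity."""
    # Normalize so a plain dot product equals cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_embs, dim=-1)
    scores = c @ q  # one similarity score per corpus item
    return torch.topk(scores, k=min(k, corpus_embs.shape[0])).indices.tolist()

# Dummy stand-ins for the precomputed CLIP embeddings.
corpus = torch.randn(10, 512)
query = corpus[4] + 0.01 * torch.randn(512)  # a query very close to item 4
top = retrieve_top_k(query, corpus, k=3)     # item 4 should rank first
```

The same function works for both text and image embeddings, since CLIP places them in one vector space.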

Repository Contents

  • mrag_tutorial.ipynb - main tutorial notebook
  • data/text_content.json - text corpus records
  • data/image_content.json - image metadata records
  • data/text_embeddings.pt - saved text embeddings
  • data/image_embeddings.pt - saved image embeddings
  • images/ - local image files used by the notebook

Included Demo Assets

The notebook loads a ready-made local corpus containing:

  • 86 text records
  • 16 image records

Because the embeddings are already stored in the repository, you do not need to regenerate them to run the demo.
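Loading the bundled assets amounts to two `json` reads and two `torch.load` calls. A sketch of what the notebook's setup does, assuming the file names listed above (the fields inside each JSON record are not specified here, so treat them as opaque):

```python
import json
from pathlib import Path

import torch

def load_demo_assets(data_dir="data"):
    """Load the bundled corpus records and their precomputed embeddings."""
    d = Path(data_dir)
    texts = json.loads((d / "text_content.json").read_text())
    images = json.loads((d / "image_content.json").read_text())
    text_embs = torch.load(d / "text_embeddings.pt")
    image_embs = torch.load(d / "image_embeddings.pt")
    return texts, images, text_embs, image_embs
```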

Prerequisites

  • Python with Jupyter Notebook or JupyterLab
  • Ollama installed locally
  • Access to the llava-phi3:3.8b model for the final generation step

The notebook installs these Python packages in a setup cell:

  • transformers
  • ollama
  • torch

Quick Start

  1. Open Jupyter and launch mrag_tutorial.ipynb.
  2. Run the notebook cells from top to bottom.
  3. Before running the final generation cells, start Ollama:

ollama serve

  4. Pull the required model if it is not already available:

ollama pull llava-phi3:3.8b

What the Notebook Covers

  • loading local text and image records
  • loading precomputed CLIP embeddings
  • embedding a user question in the same vector space
  • retrieving top matching text chunks and images
  • prompting a multimodal model with both retrieved text and image paths
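The final prompting step combines the retrieved text chunks and image paths into one chat request. The sketch below only builds the request dictionary; passing it to `ollama.chat(**req)` is what the notebook's generation cell would do, and the prompt wording here is illustrative rather than the notebook's exact template:

```python
def build_mllm_request(question, text_chunks, image_paths, model="llava-phi3:3.8b"):
    """Assemble an Ollama chat request from retrieved text and image paths.

    The ollama Python client accepts an "images" list of local file paths
    inside a message dict alongside the text content.
    """
    context = "\n\n".join(text_chunks)
    prompt = (
        "Answer the question using the context and images below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": list(image_paths)}],
    }
```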

Useful Ollama Checks

ollama list
curl http://127.0.0.1:11434/api/tags
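The curl check above can also be done from inside the notebook before the generation cells run. A small sketch using only the standard library (the `/api/tags` endpoint is the same one queried above; the function name is illustrative):

```python
import json
import urllib.error
import urllib.request

def ollama_reachable(base_url="http://127.0.0.1:11434"):
    """Return True if the Ollama server answers /api/tags, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=2) as resp:
            json.load(resp)  # the endpoint returns the locally available models
        return True
    except (urllib.error.URLError, OSError, ValueError):
        return False
```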

Troubleshooting

If Jupyter shows Failed to connect to Ollama, the Ollama server is usually not running yet. Start it with:

ollama serve

If the notebook runs through retrieval but fails at generation, check that:

  • Ollama is running
  • llava-phi3:3.8b has been pulled locally

Notes

You can adapt this demo to your own multimodal dataset, but that is optional. The included assets are already enough to run the tutorial and understand the core MRAG pipeline end to end.
