This repository contains a small, self-contained notebook that demonstrates a basic multimodal RAG pipeline using local text chunks, image assets, precomputed embeddings, and an Ollama vision model for the final response step.
The notebook walks through a simple Multimodal Retrieval-Augmented Generation workflow:
- Load a prepared corpus and saved embeddings
- Embed a text query into a shared CLIP embedding space
- Retrieve relevant text chunks and images
- Send the retrieved context to an Ollama multimodal model
The repo already includes the prepared demo assets, so you can focus on understanding the retrieval and prompting flow rather than building the dataset from scratch.
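The query-and-retrieve steps above reduce to cosine similarity in a shared embedding space. The sketch below is illustrative, not the notebook's exact code; the function name, dimensions, and toy vectors are made up for the example:

```python
import torch

def retrieve_top_k(query_emb, corpus_embs, k=3):
    """Return indices of the k corpus rows most similar to the query.

    Both sides are L2-normalized, so a dot product equals cosine similarity.
    """
    q = query_emb / query_emb.norm()
    c = corpus_embs / corpus_embs.norm(dim=1, keepdim=True)
    scores = c @ q
    return scores.topk(k).indices.tolist()

# Toy 4-dimensional "CLIP" space with 5 corpus vectors.
corpus = torch.eye(5, 4) + 0.01
query = torch.tensor([0.0, 1.0, 0.0, 0.0])
print(retrieve_top_k(query, corpus, k=2))  # → [1, 4]
```

Because CLIP places text and images in the same space, the same similarity search can rank both text chunks and images against a single query embedding.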
- `mrag_tutorial.ipynb` - main tutorial notebook
- `data/text_content.json` - text corpus records
- `data/image_content.json` - image metadata records
- `data/text_embeddings.pt` - saved text embeddings
- `data/image_embeddings.pt` - saved image embeddings
- `images/` - local image files used by the notebook
The notebook loads a ready-made local corpus containing:
- 86 text records
- 16 image records
Because the embeddings are already stored in the repository, you do not need to regenerate them to run the demo.
- Python with Jupyter Notebook or JupyterLab
- Ollama installed locally
- Access to the `llava-phi3:3.8b` model for the final generation step
The notebook installs these Python packages in a setup cell:
- `transformers`
- `ollama`
- `torch`
- Open Jupyter and launch `mrag_tutorial.ipynb`.
- Run the notebook cells from top to bottom.
- Before running the final generation cells, start Ollama:

```shell
ollama serve
```

- Pull the required model if it is not already available:

```shell
ollama pull llava-phi3:3.8b
```

The notebook demonstrates:

- loading local text and image records
- loading precomputed CLIP embeddings
- embedding a user question in the same vector space
- retrieving top matching text chunks and images
- prompting a multimodal model with both retrieved text and image paths
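The final prompting step can be sketched as follows. `build_mm_prompt` is a hypothetical helper (not the notebook's code); it relies on the Ollama Python client accepting local image paths in a message's `images` field:

```python
def build_mm_prompt(question, text_chunks, image_paths):
    """Assemble a chat message combining retrieved text chunks and image paths."""
    context = "\n\n".join(f"[chunk {i}] {c}" for i, c in enumerate(text_chunks))
    content = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return [{"role": "user", "content": content, "images": list(image_paths)}]

messages = build_mm_prompt("What is shown?", ["A cat on a mat."], ["images/cat.png"])

# With the Ollama server running, generation would look like:
# import ollama
# reply = ollama.chat(model="llava-phi3:3.8b", messages=messages)
# print(reply["message"]["content"])
```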
You can verify that Ollama is running and the model is available with:

```shell
ollama list
curl http://127.0.0.1:11434/api/tags
```

If Jupyter shows `Failed to connect to Ollama`, the Ollama server is usually not running yet. Start it with:

```shell
ollama serve
```

If the notebook runs through retrieval but fails at generation, check that:
- Ollama is running
- `llava-phi3:3.8b` has been pulled locally
You can adapt this demo to your own multimodal dataset, but that is optional. The included assets are already enough to run the tutorial and understand the core MRAG pipeline end to end.
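If you do swap in your own dataset, you will need to regenerate the embeddings. A hedged sketch using the Hugging Face `transformers` CLIP API follows; the checkpoint `openai/clip-vit-base-patch32` is an assumption (the notebook may use a different CLIP variant), and `embed_texts` is a made-up helper name:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: any CLIP checkpoint works, as long as text and images
# are embedded with the same model.
MODEL_ID = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_texts(texts):
    """Embed a list of strings into the shared CLIP space, L2-normalized."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=1, keepdim=True)

# Images would go through processor(images=...) and model.get_image_features(...)
# in the same way, then be saved with torch.save(...) alongside the records.
```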