This repository contains a small, self-contained notebook that demonstrates a basic multimodal RAG pipeline using local text chunks, image assets, precomputed embeddings, and an Ollama vision model for the final response step.
The notebook walks through a simple Multimodal Retrieval-Augmented Generation workflow:
- Load a prepared corpus and saved embeddings
- Embed a text query into a shared CLIP embedding space
- Retrieve relevant text chunks and images
- Send the retrieved context to an Ollama multimodal model
The repo already includes the prepared demo assets, so you can focus on understanding the retrieval and prompting flow rather than building the dataset from scratch.
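The query-and-retrieve steps above reduce to cosine similarity in a shared embedding space. The sketch below is illustrative, not the notebook's exact code; the function name, dimensions, and toy vectors are made up for the example:

```python
import torch

def retrieve_top_k(query_emb, corpus_embs, k=3):
    """Return indices of the k corpus rows most similar to the query.

    Both sides are L2-normalized, so a dot product equals cosine similarity.
    """
    q = query_emb / query_emb.norm()
    c = corpus_embs / corpus_embs.norm(dim=1, keepdim=True)
    scores = c @ q
    return scores.topk(k).indices.tolist()

# Toy 4-dimensional "CLIP" space with 5 corpus vectors.
corpus = torch.eye(5, 4) + 0.01
query = torch.tensor([0.0, 1.0, 0.0, 0.0])
print(retrieve_top_k(query, corpus, k=2))  # → [1, 4]
```

Because CLIP places text and images in the same space, the same similarity search can rank both text chunks and images against a single query embedding.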
- `mrag_tutorial.ipynb` - main tutorial notebook
- `data/text_content.json` - text corpus records
- `data/image_content.json` - image metadata records
- `data/text_embeddings.pt` - saved text embeddings
- `data/image_embeddings.pt` - saved image embeddings
- `images/` - local image files used by the notebook
The notebook loads a ready-made local corpus containing:
- 86 text records
- 16 image records
Because the embeddings are already stored in the repository, you do not need to regenerate them to run the demo.
- Python with Jupyter Notebook or JupyterLab
- Ollama installed locally
- Access to the `llava-phi3:3.8b` model for the final generation step
The notebook installs these Python packages in a setup cell:
- `transformers`
- `ollama`
- `torch`
- Open Jupyter and launch `mrag_tutorial.ipynb`.
- Run the notebook cells from top to bottom.
- Before running the final generation cells, start Ollama:

```shell
ollama serve
```

- Pull the required model if it is not already available:

```shell
ollama pull llava-phi3:3.8b
```

The notebook demonstrates:

- loading local text and image records
- loading precomputed CLIP embeddings
- embedding a user question in the same vector space
- retrieving top matching text chunks and images
- prompting a multimodal model with both retrieved text and image paths
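The final prompting step can be sketched as follows. `build_mm_prompt` is a hypothetical helper (not the notebook's code); it relies on the Ollama Python client accepting local image paths in a message's `images` field:

```python
def build_mm_prompt(question, text_chunks, image_paths):
    """Assemble a chat message combining retrieved text chunks and image paths."""
    context = "\n\n".join(f"[chunk {i}] {c}" for i, c in enumerate(text_chunks))
    content = (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return [{"role": "user", "content": content, "images": list(image_paths)}]

messages = build_mm_prompt("What is shown?", ["A cat on a mat."], ["images/cat.png"])

# With the Ollama server running, generation would look like:
# import ollama
# reply = ollama.chat(model="llava-phi3:3.8b", messages=messages)
# print(reply["message"]["content"])
```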
You can verify that Ollama is running and the model is available with:

```shell
ollama list
curl http://127.0.0.1:11434/api/tags
```

If Jupyter shows `Failed to connect to Ollama`, the Ollama server is usually not running yet. Start it with:

```shell
ollama serve
```

If the notebook runs through retrieval but fails at generation, check that:
- Ollama is running
- `llava-phi3:3.8b` has been pulled locally
You can adapt this demo to your own multimodal dataset, but that is optional. The included assets are already enough to run the tutorial and understand the core MRAG pipeline end to end.
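If you do swap in your own dataset, you will need to regenerate the embeddings. A hedged sketch using the Hugging Face `transformers` CLIP API follows; the checkpoint `openai/clip-vit-base-patch32` is an assumption (the notebook may use a different CLIP variant), and `embed_texts` is a made-up helper name:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: any CLIP checkpoint works, as long as text and images
# are embedded with the same model.
MODEL_ID = "openai/clip-vit-base-patch32"

model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed_texts(texts):
    """Embed a list of strings into the shared CLIP space, L2-normalized."""
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=1, keepdim=True)

# Images would go through processor(images=...) and model.get_image_features(...)
# in the same way, then be saved with torch.save(...) alongside the records.
```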