
Multi-Modal RAG over Scientific Papers

Retrieve and understand text, figures, and tables from research papers using AI — 100% local, zero API cost.



What This Does

Drop a scientific paper (PDF) into the system and ask questions in natural language. The system:

  1. Parses the PDF — extracts text, figures, and tables
  2. Embeds text with sentence-transformers, images with CLIP
  3. Retrieves the most relevant text passages, figures, and tables for your query
  4. Analyzes retrieved figures using a vision model (LLaVA)
  5. Generates a comprehensive answer grounded in the paper

Example

Query: "Show me papers with phase diagrams of supercooled liquids"

→ Retrieves relevant text passages about phase behavior
→ Finds figure showing temperature vs. density phase diagram
→ LLaVA analyzes: "This is a phase diagram showing liquid, glass,
  and crystalline regions with a glass transition at Tg = 350K..."
→ Generates answer citing specific figures and data
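The final step of this flow — retrieved passages plus LLaVA's figure analyses fused into one grounded prompt — can be sketched as plain string assembly (the function and field names are illustrative, not the project's actual API):

```python
def build_grounded_prompt(question, passages, figure_analyses):
    """Assemble a prompt that constrains the LLM to the retrieved context."""
    context = []
    for i, p in enumerate(passages, 1):
        context.append(f"[Passage {i}, page {p['page']}]\n{p['text']}")
    for fig in figure_analyses:
        context.append(f"[Figure {fig['label']}, page {fig['page']}] {fig['analysis']}")
    return (
        "Answer the question using ONLY the context below. "
        "Cite passages and figures by their labels.\n\n"
        + "\n\n".join(context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

Labeling each snippet lets the generated answer cite specific figures and pages, as in the example above.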

Architecture

                    ┌─────────────┐
                    │   PDF Paper │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  PDF Parser │  PyMuPDF + pdfplumber
                    │  (extract)  │
                    └──┬───┬───┬──┘
                       │   │   │
              ┌────────┘   │   └────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │   Text   │ │  Figures │ │  Tables  │
        │  Chunks  │ │  (PNG)   │ │  (CSV)   │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
             │            │            │
     sentence-       CLIP ViT-B/32  sentence-
     transformers                   transformers
             │            │            │
             ▼            ▼            ▼
        ┌─────────────────────────────────────┐
        │       NumPy Vector Store            │
        │  (text + image + table collections) │
        └───────────────┬─────────────────────┘
                        │
                 ┌──────▼──────┐
                 │  Retriever  │  Multi-modal ranked retrieval
                 └──────┬──────┘
                        │
              ┌─────────┼─────────┐
              ▼         ▼         ▼
         ┌────────┐ ┌────────┐ ┌────────┐
         │  Text  │ │ Images │ │ Tables │
         │Results │ │Results │ │Results │
         └────┬───┘ └───┬────┘ └───┬────┘
              │         │          │
              │    ┌────▼────┐     │
              │    │  LLaVA  │     │
              │    │ (vision)│     │
              │    └────┬────┘     │
              │         │          │
               └────┬────┴────┬────┘
                   ▼         ▼
              ┌──────────────────┐
              │   LLaVA (text)   │
              │ Answer Generation│
              └────────┬─────────┘
                       ▼
                ┌──────────────┐
                │   Answer +   │
                │   Sources    │
                └──────────────┘

Quick Start

Prerequisites

  • Python 3.9+
  • Ollama installed
  • ~8 GB RAM recommended

Setup

# Clone the repo
git clone https://github.com/tomidiy/multimodal-rag-papers.git
cd multimodal-rag-papers

# One-command setup
make setup

# Activate venv
source venv/bin/activate

Start Ollama (separate terminal)

OLLAMA_KEEP_ALIVE=30s ollama serve

Run

# 1. Put PDFs in data/papers/
cp ~/Downloads/your_paper.pdf data/papers/

# 2. Ingest
make ingest

# 3. Query (interactive CLI)
make query

# 4. Or launch web UI
make ui

Screenshots

Gradio Web UI (screenshot in the repository)

CLI Query with Figure Analysis (terminal output screenshot in the repository)


Example Outputs

See examples/SHOWCASE.md for full example outputs with actual results.

Query: "What is the main contribution of this paper?"

The paper introduces a novel framework for analyzing glass transition behavior in supercooled liquids using molecular dynamics simulations. The key contribution is a modified mode-coupling theory that accounts for...

Sources: 5 text passages, 2 figures analyzed, 1 data table

Query: "Show me phase diagrams"

Retrieved Figure 3 (page 10) — LLaVA analysis: "This is a temperature-density phase diagram showing three distinct regions: liquid (high T), supercooled liquid (metastable), and glass (low T). The glass transition line Tg(ρ) is marked with circles..."


Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| PDF Parsing | PyMuPDF + pdfplumber | Extract text, figures, tables |
| Image Embeddings | OpenCLIP (ViT-B/32) | Embed figures for visual search |
| Text Embeddings | sentence-transformers | Embed text chunks for semantic search |
| Vector Store | NumPy (custom) | Store and retrieve embeddings |
| LLM + Vision | LLaVA 7B via Ollama | Text generation + figure analysis |
| Web UI | Gradio | Interactive browser interface |

Everything runs locally. No API keys. No cloud. No cost.
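The LLaVA calls go through Ollama's local HTTP API (`POST /api/generate`), which accepts images as base64 strings alongside the prompt. A minimal sketch of building such a vision request — the helper name is illustrative:

```python
import base64

def ollama_vision_request(prompt, image_bytes, model="llava:7b"):
    """Build the JSON body for Ollama's /api/generate endpoint with one image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response instead of a token stream
    }
```

The resulting body would be POSTed to `http://localhost:11434/api/generate`, e.g. with `requests.post(url, json=body)`.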


Project Structure

multimodal-rag-papers/
├── src/
│   ├── pdf_parser.py         # PDF → text + images + metadata
│   ├── table_extractor.py    # PDF → structured tables
│   ├── image_embedder.py     # CLIP image embeddings
│   ├── text_embedder.py      # Sentence-transformer text embeddings
│   ├── vector_store.py       # NumPy vector store (3 collections)
│   ├── retriever.py          # Multi-modal ranked retrieval
│   ├── figure_analyzer.py    # LLaVA figure understanding
│   ├── llm.py                # Ollama API wrapper (text + vision)
│   └── rag_pipeline.py       # Main orchestrator (phased memory mgmt)
├── app.py                    # Gradio web UI
├── ingest.py                 # CLI ingestion script
├── query.py                  # CLI query interface
├── analyze_paper.py          # Deep paper analysis (13 queries)
├── generate_examples.py      # Generate showcase outputs
├── Makefile                  # One-command operations
└── requirements.txt

Configuration

Copy .env.example to .env:

cp .env.example .env

| Variable | Default | Description |
|---|---|---|
| OLLAMA_LLM_MODEL | llava:7b | Ollama model for text generation |
| OLLAMA_VISION_MODEL | llava:7b | Ollama model for figure analysis |
| CLIP_MODEL | ViT-B-32 | CLIP model for image embeddings |
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | Text embedding model |
| CHUNK_SIZE | 1000 | Characters per text chunk |
| CHUNK_OVERLAP | 200 | Overlap between chunks |
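CHUNK_SIZE and CHUNK_OVERLAP describe a sliding-window character chunker. A minimal sketch of that scheme (the project's actual splitter may additionally respect sentence boundaries):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk so no idea is cut cold."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 10,000-character paper yields roughly a dozen overlapping chunks, each independently embedded and retrievable.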

Design Decisions

Why multi-modal?

Scientific papers communicate through text, figures, AND tables. A text-only RAG misses 40%+ of the information — especially phase diagrams, plots, and data tables.

Why local?

  • Privacy: Papers under review shouldn't go to external APIs
  • Cost: Zero ongoing cost vs. $0.01–0.03 per GPT-4V call
  • Speed: No network latency for repeated queries
  • Reproducibility: Same model, same results, every time

Why a custom vector store instead of ChromaDB?

ChromaDB requires compiling hnswlib (C++ dependency) which fails on many systems. Our custom NumPy vector store uses brute-force cosine similarity — at our scale (<1,000 vectors per paper), queries run in under 5ms. Zero compilation, zero external dependencies, works everywhere.
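A minimal sketch of that design — unit-normalized vectors stacked in a NumPy matrix, each query scored with a single matrix-vector product (the class and method names are illustrative, not the project's actual API):

```python
import numpy as np

class NumpyVectorStore:
    """Brute-force cosine-similarity store: plenty fast below ~1,000 vectors."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.items = []

    def add(self, vector, item):
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)  # normalize once at insert time
        self.vectors = np.vstack([self.vectors, v])
        self.items.append(item)

    def query(self, vector, k=5):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q  # cosine similarity = dot of unit vectors
        top = np.argsort(scores)[::-1][:k]
        return [(self.items[i], float(scores[i])) for i in top]
```

One such store per modality (text, image, table) gives the three collections the retriever ranks across.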


Performance

Tested on a MacBook Pro (M-series, 8 GB RAM):

| Operation | Time | RAM Peak |
|---|---|---|
| Ingest 15-page paper | ~45 sec | ~1.5 GB |
| Text-only query | ~8 sec | ~3 GB |
| Query with figure analysis | ~35 sec | ~5.5 GB |
| Full 13-query deep analysis | ~6 min | ~5.5 GB |

Future Improvements

  • Citation graph analysis (find related papers automatically)
  • Equation extraction with LaTeX parsing
  • Multi-paper comparative analysis
  • Export analysis reports as PDF
  • Fine-tuned embedding model for scientific text
  • Hybrid search (dense + sparse retrieval)

License

MIT License — see LICENSE.

