
Multi-Modal RAG over Scientific Papers

Retrieve and understand text, figures, and tables from research papers using AI — 100% local, zero API cost.



What This Does

Drop a scientific paper (PDF) into the system and ask questions in natural language. The system:

  1. Parses the PDF — extracts text, figures, and tables
  2. Embeds text with sentence-transformers, images with CLIP
  3. Retrieves the most relevant text passages, figures, and tables for your query
  4. Analyzes retrieved figures using a vision model (LLaVA)
  5. Generates a comprehensive answer grounded in the paper

Example

Query: "Show me papers with phase diagrams of supercooled liquids"

→ Retrieves relevant text passages about phase behavior
→ Finds figure showing temperature vs. density phase diagram
→ LLaVA analyzes: "This is a phase diagram showing liquid, glass,
  and crystalline regions with a glass transition at Tg = 350K..."
→ Generates answer citing specific figures and data
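The final step of this flow — retrieved passages plus LLaVA's figure analyses fused into one grounded prompt — can be sketched as plain string assembly (the function and field names are illustrative, not the project's actual API):

```python
def build_grounded_prompt(question, passages, figure_analyses):
    """Assemble a prompt that constrains the LLM to the retrieved context."""
    context = []
    for i, p in enumerate(passages, 1):
        context.append(f"[Passage {i}, page {p['page']}]\n{p['text']}")
    for fig in figure_analyses:
        context.append(f"[Figure {fig['label']}, page {fig['page']}] {fig['analysis']}")
    return (
        "Answer the question using ONLY the context below. "
        "Cite passages and figures by their labels.\n\n"
        + "\n\n".join(context)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

Labeling each snippet lets the generated answer cite specific figures and pages, as in the example above.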

Architecture

                    ┌─────────────┐
                    │   PDF Paper │
                    └──────┬──────┘
                           │
                    ┌──────▼──────┐
                    │  PDF Parser │  PyMuPDF + pdfplumber
                    │  (extract)  │
                    └──┬───┬───┬──┘
                       │   │   │
              ┌────────┘   │   └────────┐
              ▼            ▼            ▼
        ┌──────────┐ ┌──────────┐ ┌──────────┐
        │   Text   │ │  Figures │ │  Tables  │
        │  Chunks  │ │  (PNG)   │ │  (CSV)   │
        └────┬─────┘ └────┬─────┘ └────┬─────┘
             │            │            │
     sentence-       CLIP ViT-B/32  sentence-
     transformers                   transformers
             │            │            │
             ▼            ▼            ▼
        ┌─────────────────────────────────────┐
        │       NumPy Vector Store            │
        │  (text + image + table collections) │
        └───────────────┬─────────────────────┘
                        │
                 ┌──────▼──────┐
                 │  Retriever  │  Multi-modal ranked retrieval
                 └──────┬──────┘
                        │
              ┌─────────┼─────────┐
              ▼         ▼         ▼
         ┌────────┐ ┌────────┐ ┌────────┐
         │  Text  │ │ Images │ │ Tables │
         │Results │ │Results │ │Results │
         └────┬───┘ └───┬────┘ └───┬────┘
              │         │          │
              │    ┌────▼────┐     │
              │    │  LLaVA  │     │
              │    │ (vision)│     │
              │    └────┬────┘     │
              │         │          │
               └────┬────┴────┬────┘
                   ▼         ▼
              ┌──────────────────┐
              │   LLaVA (text)   │
              │ Answer Generation│
              └────────┬─────────┘
                       ▼
                ┌──────────────┐
                │   Answer +   │
                │   Sources    │
                └──────────────┘

Quick Start

Prerequisites

  • Python 3.9+
  • Ollama installed
  • ~8 GB RAM recommended

Setup

# Clone the repo
git clone https://github.com/tomidiy/multimodal-rag-papers.git
cd multimodal-rag-papers

# One-command setup
make setup

# Activate venv
source venv/bin/activate

Start Ollama (separate terminal)

OLLAMA_KEEP_ALIVE=30s ollama serve

Run

# 1. Put PDFs in data/papers/
cp ~/Downloads/your_paper.pdf data/papers/

# 2. Ingest
make ingest

# 3. Query (interactive CLI)
make query

# 4. Or launch web UI
make ui

Screenshots

Gradio Web UI (screenshot in the repository)

CLI Query with Figure Analysis (terminal output screenshot in the repository)


Example Outputs

See examples/SHOWCASE.md for full example outputs with actual results.

Query: "What is the main contribution of this paper?"

The paper introduces a novel framework for analyzing glass transition behavior in supercooled liquids using molecular dynamics simulations. The key contribution is a modified mode-coupling theory that accounts for...

Sources: 5 text passages, 2 figures analyzed, 1 data table

Query: "Show me phase diagrams"

Retrieved Figure 3 (page 10) — LLaVA analysis: "This is a temperature-density phase diagram showing three distinct regions: liquid (high T), supercooled liquid (metastable), and glass (low T). The glass transition line Tg(ρ) is marked with circles..."


Tech Stack

| Component | Technology | Purpose |
|---|---|---|
| PDF Parsing | PyMuPDF + pdfplumber | Extract text, figures, tables |
| Image Embeddings | OpenCLIP (ViT-B/32) | Embed figures for visual search |
| Text Embeddings | sentence-transformers | Embed text chunks for semantic search |
| Vector Store | NumPy (custom) | Store and retrieve embeddings |
| LLM + Vision | LLaVA 7B via Ollama | Text generation + figure analysis |
| Web UI | Gradio | Interactive browser interface |

Everything runs locally. No API keys. No cloud. No cost.
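The LLaVA calls go through Ollama's local HTTP API (`POST /api/generate`), which accepts images as base64 strings alongside the prompt. A minimal sketch of building such a vision request — the helper name is illustrative:

```python
import base64

def ollama_vision_request(prompt, image_bytes, model="llava:7b"):
    """Build the JSON body for Ollama's /api/generate endpoint with one image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response instead of a token stream
    }
```

The resulting body would be POSTed to `http://localhost:11434/api/generate`, e.g. with `requests.post(url, json=body)`.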


Project Structure

multimodal-rag-papers/
├── src/
│   ├── pdf_parser.py         # PDF → text + images + metadata
│   ├── table_extractor.py    # PDF → structured tables
│   ├── image_embedder.py     # CLIP image embeddings
│   ├── text_embedder.py      # Sentence-transformer text embeddings
│   ├── vector_store.py       # NumPy vector store (3 collections)
│   ├── retriever.py          # Multi-modal ranked retrieval
│   ├── figure_analyzer.py    # LLaVA figure understanding
│   ├── llm.py                # Ollama API wrapper (text + vision)
│   └── rag_pipeline.py       # Main orchestrator (phased memory mgmt)
├── app.py                    # Gradio web UI
├── ingest.py                 # CLI ingestion script
├── query.py                  # CLI query interface
├── analyze_paper.py          # Deep paper analysis (13 queries)
├── generate_examples.py      # Generate showcase outputs
├── Makefile                  # One-command operations
└── requirements.txt

Configuration

Copy .env.example to .env:

cp .env.example .env

| Variable | Default | Description |
|---|---|---|
| OLLAMA_LLM_MODEL | llava:7b | Ollama model for text generation |
| OLLAMA_VISION_MODEL | llava:7b | Ollama model for figure analysis |
| CLIP_MODEL | ViT-B-32 | CLIP model for image embeddings |
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | Text embedding model |
| CHUNK_SIZE | 1000 | Characters per text chunk |
| CHUNK_OVERLAP | 200 | Overlap between chunks |
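CHUNK_SIZE and CHUNK_OVERLAP describe a sliding-window character chunker. A minimal sketch of that scheme (the project's actual splitter may additionally respect sentence boundaries):

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into chunks of `chunk_size` characters, each sharing
    `overlap` characters with the previous chunk so no idea is cut cold."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 10,000-character paper yields roughly a dozen overlapping chunks, each independently embedded and retrievable.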

Design Decisions

Why multi-modal?

Scientific papers communicate through text, figures, AND tables. A text-only RAG misses 40%+ of the information — especially phase diagrams, plots, and data tables.

Why local?

  • Privacy: Papers under review shouldn't go to external APIs
  • Cost: Zero ongoing cost vs. $0.01–0.03 per GPT-4V call
  • Speed: No network latency for repeated queries
  • Reproducibility: Same model, same results, every time

Why a custom vector store instead of ChromaDB?

ChromaDB requires compiling hnswlib (C++ dependency) which fails on many systems. Our custom NumPy vector store uses brute-force cosine similarity — at our scale (<1,000 vectors per paper), queries run in under 5ms. Zero compilation, zero external dependencies, works everywhere.
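A minimal sketch of that design — unit-normalized vectors stacked in a NumPy matrix, each query scored with a single matrix-vector product (the class and method names are illustrative, not the project's actual API):

```python
import numpy as np

class NumpyVectorStore:
    """Brute-force cosine-similarity store: plenty fast below ~1,000 vectors."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.items = []

    def add(self, vector, item):
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)  # normalize once at insert time
        self.vectors = np.vstack([self.vectors, v])
        self.items.append(item)

    def query(self, vector, k=5):
        q = np.asarray(vector, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q  # cosine similarity = dot of unit vectors
        top = np.argsort(scores)[::-1][:k]
        return [(self.items[i], float(scores[i])) for i in top]
```

One such store per modality (text, image, table) gives the three collections the retriever ranks across.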


Performance

Tested on a MacBook Pro (M-series, 8 GB RAM):

| Operation | Time | RAM Peak |
|---|---|---|
| Ingest 15-page paper | ~45 sec | ~1.5 GB |
| Text-only query | ~8 sec | ~3 GB |
| Query with figure analysis | ~35 sec | ~5.5 GB |
| Full 13-query deep analysis | ~6 min | ~5.5 GB |

Future Improvements

  • Citation graph analysis (find related papers automatically)
  • Equation extraction with LaTeX parsing
  • Multi-paper comparative analysis
  • Export analysis reports as PDF
  • Fine-tuned embedding model for scientific text
  • Hybrid search (dense + sparse retrieval)

License

MIT License — see LICENSE.

