Retrieve and understand text, figures, and tables from research papers using AI — 100% local, zero API cost.
Drop a scientific paper (PDF) into the system and ask questions in natural language. The system:
- Parses the PDF — extracts text, figures, and tables
- Embeds text with sentence-transformers, images with CLIP
- Retrieves the most relevant text passages, figures, and tables for your query
- Analyzes retrieved figures using a vision model (LLaVA)
- Generates a comprehensive answer grounded in the paper
Query: "Show me papers with phase diagrams of supercooled liquids"
→ Retrieves relevant text passages about phase behavior
→ Finds figure showing temperature vs. density phase diagram
→ LLaVA analyzes: "This is a phase diagram showing liquid, glass,
and crystalline regions with a glass transition at Tg = 350K..."
→ Generates answer citing specific figures and data
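The flow above can be condensed into one orchestration function. The stage callables below stand in for the real modules (`pdf_parser`, `retriever`, `figure_analyzer`, `llm`), so every name here is illustrative rather than the project's actual API:

```python
def answer_query(pdf_path, question, parse, embed, retrieve,
                 analyze_figures, generate):
    """Run the five pipeline stages in order (names are illustrative)."""
    text_chunks, figures, tables = parse(pdf_path)   # 1. parse the PDF
    index = embed(text_chunks, figures, tables)      # 2. embed each modality
    hits = retrieve(index, question)                 # 3. ranked retrieval
    notes = analyze_figures(hits["figures"])         # 4. vision model on figures
    return generate(question, hits, notes)           # 5. grounded answer
```

Each stage is a plain function, which keeps the orchestrator trivially testable with stub implementations.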
┌─────────────┐
│ PDF Paper │
└──────┬──────┘
│
┌──────▼──────┐
│ PDF Parser │ PyMuPDF + pdfplumber
│ (extract) │
└──┬───┬───┬──┘
│ │ │
┌────────┘ │ └────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Text │ │ Figures │ │ Tables │
│ Chunks │ │ (PNG) │ │ (CSV) │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
sentence- CLIP ViT-B/32 sentence-
transformers transformers
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────┐
│ NumPy Vector Store │
│ (text + image + table collections) │
└───────────────┬─────────────────────┘
│
┌──────▼──────┐
│ Retriever │ Multi-modal ranked retrieval
└──────┬──────┘
│
┌─────────┼─────────┐
▼ ▼ ▼
┌────────┐ ┌────────┐ ┌────────┐
│ Text │ │ Images │ │ Tables │
│Results │ │Results │ │Results │
└────┬───┘ └───┬────┘ └───┬────┘
│ │ │
│ ┌────▼────┐ │
│ │ LLaVA │ │
│ │ (vision)│ │
│ └────┬────┘ │
│ │ │
     └────┬────┴────┬────┘
▼ ▼
┌──────────────────┐
│ LLaVA (text) │
│ Answer Generation│
└────────┬─────────┘
▼
┌──────────────┐
│ Answer + │
│ Sources │
└──────────────┘
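The retriever fan-out in the diagram (one ranked list per modality, merged before generation) can be sketched as follows. The function name is illustrative, and comparing raw cosine scores across different embedding spaces (CLIP vs. sentence-transformers) is a simplification of what a real multi-modal ranker has to handle:

```python
def merge_results(text_hits, image_hits, table_hits, k=5):
    """Merge per-modality (item, score) lists into one ranked list,
    tagging each hit with its modality. Illustrative sketch only."""
    tagged = (
        [("text", item, s) for item, s in text_hits]
        + [("image", item, s) for item, s in image_hits]
        + [("table", item, s) for item, s in table_hits]
    )
    # Highest similarity first, truncated to the top-k hits overall
    return sorted(tagged, key=lambda t: t[2], reverse=True)[:k]
```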
- Python 3.9+
- Ollama installed
- ~8 GB RAM recommended
```bash
# Clone the repo
git clone https://github.com/tomidiy/multimodal-rag-papers.git
cd multimodal-rag-papers

# One-command setup
make setup

# Activate venv
source venv/bin/activate
```

Start Ollama in a separate terminal:

```bash
OLLAMA_KEEP_ALIVE=30s ollama serve
```

```bash
# 1. Put PDFs in data/papers/
cp ~/Downloads/your_paper.pdf data/papers/

# 2. Ingest
make ingest

# 3. Query (interactive CLI)
make query

# 4. Or launch web UI
make ui
```

See examples/SHOWCASE.md for full example outputs with actual results.
The paper introduces a novel framework for analyzing glass transition behavior in supercooled liquids using molecular dynamics simulations. The key contribution is a modified mode-coupling theory that accounts for...
Sources: 5 text passages, 2 figures analyzed, 1 data table
Retrieved Figure 3 (page 10) — LLaVA analysis: "This is a temperature-density phase diagram showing three distinct regions: liquid (high T), supercooled liquid (metastable), and glass (low T). The glass transition line Tg(ρ) is marked with circles..."
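Figure analysis goes through Ollama's HTTP API. The endpoint and payload fields below (`model`, `prompt`, `images` as a list of base64 strings, `stream`) follow Ollama's documented `/api/generate` interface; the helper names are illustrative, not the project's actual code:

```python
import base64
import json
from urllib.request import Request, urlopen

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model, prompt, image_bytes):
    """Assemble an /api/generate request with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one complete response, not a token stream
    }

def analyze_figure(png_path, model="llava:7b",
                   prompt="Describe this figure from a scientific paper."):
    """Send an extracted figure to LLaVA via a locally running Ollama server."""
    with open(png_path, "rb") as f:
        payload = build_payload(model, prompt, f.read())
    req = Request(OLLAMA_URL, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

`analyze_figure` requires `ollama serve` to be running; `build_payload` is pure and can be tested offline.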
| Component | Technology | Purpose |
|---|---|---|
| PDF Parsing | PyMuPDF + pdfplumber | Extract text, figures, tables |
| Image Embeddings | OpenCLIP (ViT-B/32) | Embed figures for visual search |
| Text Embeddings | sentence-transformers | Embed text chunks for semantic search |
| Vector Store | NumPy (custom) | Store and retrieve embeddings |
| LLM + Vision | LLaVA 7B via Ollama | Text generation + figure analysis |
| Web UI | Gradio | Interactive browser interface |
Everything runs locally. No API keys. No cloud. No cost.
multimodal-rag-papers/
├── src/
│ ├── pdf_parser.py # PDF → text + images + metadata
│ ├── table_extractor.py # PDF → structured tables
│ ├── image_embedder.py # CLIP image embeddings
│ ├── text_embedder.py # Sentence-transformer text embeddings
│ ├── vector_store.py # NumPy vector store (3 collections)
│ ├── retriever.py # Multi-modal ranked retrieval
│ ├── figure_analyzer.py # LLaVA figure understanding
│ ├── llm.py # Ollama API wrapper (text + vision)
│ └── rag_pipeline.py # Main orchestrator (phased memory mgmt)
├── app.py # Gradio web UI
├── ingest.py # CLI ingestion script
├── query.py # CLI query interface
├── analyze_paper.py # Deep paper analysis (13 queries)
├── generate_examples.py # Generate showcase outputs
├── Makefile # One-command operations
└── requirements.txt
Copy .env.example to .env:
```bash
cp .env.example .env
```

| Variable | Default | Description |
|---|---|---|
| `OLLAMA_LLM_MODEL` | `llava:7b` | Ollama model for text generation |
| `OLLAMA_VISION_MODEL` | `llava:7b` | Ollama model for figure analysis |
| `CLIP_MODEL` | `ViT-B-32` | CLIP model for image embeddings |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Text embedding model |
| `CHUNK_SIZE` | `1000` | Characters per text chunk |
| `CHUNK_OVERLAP` | `200` | Overlap between chunks |
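`CHUNK_SIZE` and `CHUNK_OVERLAP` drive the text splitter. A minimal character-based chunker with that behaviour might look like this (an illustrative sketch, not the project's exact implementation):

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks, where each chunk
    repeats the last `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the defaults, a 1,200-character document yields two chunks whose 200-character overlap preserves context across the boundary.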
Scientific papers communicate through text, figures, AND tables. A text-only RAG misses a large share of that information, especially phase diagrams, plots, and data tables.
- Privacy: Papers under review shouldn't go to external APIs
- Cost: Zero ongoing cost vs. $0.01–0.03 per GPT-4V call
- Speed: No network latency for repeated queries
- Reproducibility: Same model, same results, every time
ChromaDB depends on hnswlib, a C++ extension that fails to compile on many systems. Our custom NumPy vector store uses brute-force cosine similarity instead; at this scale (fewer than 1,000 vectors per paper), queries run in under 5 ms. Zero compilation, zero dependencies beyond NumPy, works everywhere.
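A brute-force cosine store of this kind needs only a few lines of NumPy. This is a hedged sketch of the idea, not the repository's `vector_store.py`:

```python
import numpy as np

class NumpyVectorStore:
    """Brute-force cosine-similarity store (illustrative sketch)."""

    def __init__(self, dim):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.items = []

    def add(self, vector, item):
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)  # normalise once at insert time
        self.vectors = np.vstack([self.vectors, v])
        self.items.append(item)

    def search(self, query, k=5):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q            # cosine similarity via dot product
        top = np.argsort(scores)[::-1][:k]   # highest-scoring indices first
        return [(self.items[i], float(scores[i])) for i in top]
```

Because vectors are normalised on insert, each query is a single matrix-vector product followed by a sort, which is why sub-millisecond-to-5 ms latencies are realistic at this scale.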
Tested on a MacBook Pro (M-series, 8 GB RAM):
| Operation | Time | RAM Peak |
|---|---|---|
| Ingest 15-page paper | ~45 sec | ~1.5 GB |
| Text-only query | ~8 sec | ~3 GB |
| Query with figure analysis | ~35 sec | ~5.5 GB |
| Full 13-query deep analysis | ~6 min | ~5.5 GB |
- Citation graph analysis (find related papers automatically)
- Equation extraction with LaTeX parsing
- Multi-paper comparative analysis
- Export analysis reports as PDF
- Fine-tuned embedding model for scientific text
- Hybrid search (dense + sparse retrieval)
MIT License — see LICENSE.

