parsemypdf/vlm_ocr at main · Ibrahim01110/parsemypdf

Name	Name	Last commit message	Last commit date
parent directory ..
anthropic	anthropic
gemini	gemini
mistral_ocr	mistral_ocr
msft_kosmos_2.5	msft_kosmos_2.5
ollama_models	ollama_models
omniai	omniai
openai	openai
smol_docling	smol_docling
README.md	README.md

👉 GenAI Roadmap - 2025

🖼️ OCR with Multimodal | Vision Language Models

Model Provider	Models	Open / Paid	Example Code	Doc
Anthropic	`claude-opus-4-20250514`, `claude-sonnet-4-20250514`, `claude-3-7-sonnet-20250219`, `claude-3-5-sonnet-20241022`	Paid	Code	Doc
Gemini	`gemini-2.5-pro`, `gemini-2.5-flash`, `gemini-2.5-flash-lite-preview-06-17`, `gemini-2.0-flash`, `gemini-2.0-flash-lite`, `gemini-2.0-pro-exp-02-05`	Paid	Code	Doc
OpenAI	`gpt-4.1-2025-04-14`, `gpt-4.1-mini-2025-04-14`, `gpt-4o`, `gpt-4o-mini`	Paid	Code	Doc
Mistral-OCR	`mistral-ocr`	Paid	Code	Doc
OmniAI	`omniai`	Paid	Code	Doc
Google & Meta	`gemma3:4b`, `gemma3:12b`, `gemma3:27b`, `x/llama3.2-vision:11b`	Open Weight	Code	Gemma Doc, Llama3.2 Doc
IBM	`SmolDocling-256M-preview`	Open Weight	Code	Doc

📊 OCR Benchmark

🔗 Dependencies

📚 Python Libraries

# UI
streamlit>=1.43.2 

# SmolDocling related
docling_core>=2.23.1

# LLM related Libraries
ollama>=0.4.7
openai>=1.66.3
anthropic>=0.49.0
google-genai>=1.5.0

# Huggingface library
transformers>=4.49.0

# Utilities
python-dotenv>=1.0.1
pillow>=11.1.0 
requests>=2.32.3
torch>=2.6.0

⚙️ Setup Instructions

Prerequisites
- Python 3.9 or higher
- pip (Python package installer)
Installation
1. Clone the repository:
```
git clone https://github.com/genieincodebottle/parsemypdf.git
cd parsemypdf
```
2. Create a virtual environment:
```
python -m venv venv
venv\Scripts\activate # On Linux -> source venv/bin/activate
```
3. Install dependencies:
```
pip install -r requirements.txt
```
4. Rename .env.example to .env and update required Environment Variables as per requirements
```
ANTHROPIC_API_KEY=your_key_here    # For Claude
OPENAI_API_KEY=your_key_here       # For OpenAI
GOOGLE_API_KEY=your_key_here   # For Google's Gemini models api key
MISTRAL_API_KEY=your_key_here # For Mistral API Key
OMNI_API_KEY=your_key_here # For Omniai API Key
```
  For ANTHROPIC_API_KEY follow this -> https://console.anthropic.com/settings/keys
  
  For OPENAI_API_KEY follow this -> https://platform.openai.com/api-keys
  
  For GOOGLE_API_KEY follow this -> https://ai.google.dev/gemini-api/docs/api-key
  
  For MISTRAL_API_KEY follow this -> https://console.mistral.ai/api-keys
  
  For OMNI_API_KEY follow this -> https://app.getomni.ai/settings/account
5. Install Ollama & Models (for local processing)
  - Install Ollama
    - For Window - Download the Ollama from following location (Requires Window 10 or later) -> https://ollama.com/download/windows
    - For Linux (command line) - curl https://ollama.ai/install.sh | sh
  - Pull required Vision Language Models as per your system capcity (command line)
    - ollama pull gemma3:4b
    - ollama pull gemma3:12b
    - ollama pull gemma3:27b
    - ollama pull x/llama3.2-vision:11b
6. To review each Vision Language Model powered OCR in the Web UI, navigate to parsemypdf/llm_ocr/<provider_folder> (e.g., claude) and run:
```
streamlit run main.py 
```
7. To review all the Vision Language Models powered OCR at single Web UI, navigate to root folder -> parsemypdf and run:
```
streamlit run vlm_ocr_app.py 
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

👉 GenAI Roadmap - 2025

🖼️ OCR with Multimodal | Vision Language Models

📊 OCR Benchmark

🔗 Dependencies

📚 Python Libraries

⚙️ Setup Instructions

Prerequisites

Installation

FilesExpand file tree

vlm_ocr

Directory actions

More options

Directory actions

More options

Latest commit

History

vlm_ocr

Folders and files

parent directory

README.md

👉 GenAI Roadmap - 2025

🖼️ OCR with Multimodal | Vision Language Models

📊 OCR Benchmark

🔗 Dependencies

📚 Python Libraries

⚙️ Setup Instructions

Prerequisites

Installation