🖼️ OCR with Multimodal | Vision Language Models

| Model Provider | Models | Open / Paid | Example Code | Doc |
|---|---|---|---|---|
| Anthropic | claude-opus-4-20250514, claude-sonnet-4-20250514, claude-3-7-sonnet-20250219, claude-3-5-sonnet-20241022 | Paid | Code | Doc |
| Gemini | gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite-preview-06-17, gemini-2.0-flash, gemini-2.0-flash-lite, gemini-2.0-pro-exp-02-05 | Paid | Code | Doc |
| OpenAI | gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, gpt-4o, gpt-4o-mini | Paid | Code | Doc |
| Mistral-OCR | mistral-ocr | Paid | Code | Doc |
| OmniAI | omniai | Paid | Code | Doc |
| Google & Meta | gemma3:4b, gemma3:12b, gemma3:27b, x/llama3.2-vision:11b | Open Weight | Code | Gemma Doc, Llama3.2 Doc |
| IBM | SmolDocling-256M-preview | Open Weight | Code | Doc |

🔗 Dependencies

📚 Python Libraries

# UI
streamlit>=1.43.2 

# SmolDocling related
docling_core>=2.23.1

# LLM related Libraries
ollama>=0.4.7
openai>=1.66.3
anthropic>=0.49.0
google-genai>=1.5.0

# Huggingface library
transformers>=4.49.0

# Utilities
python-dotenv>=1.0.1
pillow>=11.1.0 
requests>=2.32.3
torch>=2.6.0
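
To check an existing environment against these minimums, installed versions can be compared with the pins above. The following is a minimal sketch using only the standard library. The helper names (`parse_pin`, `version_tuple`, `meets_minimum`, `check_pins`) are hypothetical and not part of the repository, and the comparison is deliberately naive: it looks only at the leading numeric dotted components, so it does not treat pre-release tags the way `pip` does.

```python
from importlib import metadata


def parse_pin(line):
    """Return (name, minimum_version) for a 'name>=x.y.z' line, else None."""
    line = line.split("#", 1)[0].strip()  # drop comments and whitespace
    if not line or ">=" not in line:
        return None
    name, minimum = line.split(">=", 1)
    return name.strip(), minimum.strip()


def version_tuple(version):
    """Turn '2.6.0+cu124' into (2, 6, 0), keeping leading digits only."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Naive numeric comparison; shorter versions are padded with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_pins(text, get_version=metadata.version):
    """Map each pinned package to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for line in text.splitlines():
        pin = parse_pin(line)
        if pin is None:
            continue
        name, minimum = pin
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report
```

Calling `check_pins(open("requirements.txt").read())` from the repository root would produce one status per package, assuming the file contains the pins listed above.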

⚙️ Setup Instructions

  • Prerequisites

    • Python 3.9 or higher
    • pip (Python package installer)
  • Installation

    1. Clone the repository:

      git clone https://github.com/genieincodebottle/parsemypdf.git
      cd parsemypdf
    2. Create a virtual environment:

      python -m venv venv
      venv\Scripts\activate # On Windows; on Linux/macOS use: source venv/bin/activate
    3. Install dependencies:

      pip install -r requirements.txt
    4. Rename .env.example to .env and set the environment variables for the providers you plan to use:

      ANTHROPIC_API_KEY=your_key_here    # For Anthropic Claude models
      OPENAI_API_KEY=your_key_here       # For OpenAI models
      GOOGLE_API_KEY=your_key_here       # For Google Gemini models
      MISTRAL_API_KEY=your_key_here      # For Mistral OCR
      OMNI_API_KEY=your_key_here         # For OmniAI

      For ANTHROPIC_API_KEY, see https://console.anthropic.com/settings/keys

      For OPENAI_API_KEY, see https://platform.openai.com/api-keys

      For GOOGLE_API_KEY, see https://ai.google.dev/gemini-api/docs/api-key

      For MISTRAL_API_KEY, see https://console.mistral.ai/api-keys

      For OMNI_API_KEY, see https://app.getomni.ai/settings/account

    5. Install Ollama & Models (for local processing)

      • Install Ollama

      • Pull the Vision Language Models that fit your system's capacity (from the command line):

        • ollama pull gemma3:4b
        • ollama pull gemma3:12b
        • ollama pull gemma3:27b
        • ollama pull x/llama3.2-vision:11b
    6. To try a single provider's Vision Language Model OCR in its own Web UI, navigate to parsemypdf/llm_ocr/<provider_folder> (e.g., claude) and run:

      streamlit run main.py 
    7. To try all the Vision Language Model OCR options in a single Web UI, navigate to the root folder (parsemypdf) and run:

      streamlit run vlm_ocr_app.py
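
Before launching either app, it can help to confirm which hosted providers actually have a key configured. The sketch below is an illustration under stated assumptions, not code from the repository: `PROVIDER_KEYS` and `configured_providers` are hypothetical names, the provider-to-variable mapping is taken from step 4, and a value still set to the `your_key_here` placeholder is treated as missing. It reads from `os.environ` by default, so keys stored only in `.env` would first need to be loaded (for example with `load_dotenv()` from python-dotenv, which is in the dependency list).

```python
import os

# Environment variable expected for each hosted provider (from step 4).
PROVIDER_KEYS = {
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Gemini": "GOOGLE_API_KEY",
    "Mistral-OCR": "MISTRAL_API_KEY",
    "OmniAI": "OMNI_API_KEY",
}

# Placeholder value used in the .env example above.
PLACEHOLDER = "your_key_here"


def configured_providers(env=None):
    """Split providers into (ready, missing) based on their API key variable."""
    env = os.environ if env is None else env
    ready, missing = [], []
    for provider, variable in PROVIDER_KEYS.items():
        value = (env.get(variable) or "").strip()
        if value and value != PLACEHOLDER:
            ready.append(provider)
        else:
            missing.append(provider)
    return ready, missing


if __name__ == "__main__":
    ready, missing = configured_providers()
    print("Ready:", ", ".join(ready) or "none")
    print("Missing keys:", ", ".join(missing) or "none")
```

The open-weight models served through Ollama and SmolDocling are left out of this check because the setup steps above list no API key for them.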