Open-source vision-language models have matured beyond being "GPT-4o alternatives" into genuine specialists. DeepSeek-VL2 dominates OCR and document understanding (834 on OCRBench vs. GPT-4o's 736), while InternVL3 excels at complex reasoning and spatial understanding (72.2 on MMMU; it outperforms GPT-4o on 3D scene benchmarks).
This guide shows you when to use each model and provides production-ready code for both.
| Task | Best Model | Why |
|---|---|---|
| PDF/document extraction | DeepSeek-VL2 | 93.3% DocVQA, 20x token compression |
| Handwritten text, formulas | DeepSeek-VL2 | Purpose-built OCR pipeline |
| Video understanding | InternVL3 | Multi-frame temporal reasoning |
| 3D scene analysis | InternVL3 | Beats GPT-4o on VSI-Bench |
| GUI automation | InternVL3 | 11.7 vs GPT-4o's 1.9 on WebArena |
| Scientific diagrams | Either | Both strong, test on your data |
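The routing logic in the table above can be sketched as a small helper. Note that `pick_model`, `ROUTING`, and the task keys are illustrative names invented for this sketch, not part of either model's API or this repo's code:

```python
# Illustrative task-to-model router mirroring the table above.
# The ROUTING dict and pick_model helper are hypothetical, not a real API.
ROUTING = {
    "document": "deepseek-vl2",     # PDF/document extraction, OCR
    "handwriting": "deepseek-vl2",  # handwritten text, formulas
    "video": "internvl3",           # multi-frame temporal reasoning
    "3d_scene": "internvl3",        # spatial understanding
    "gui": "internvl3",             # GUI automation
}

def pick_model(task: str, default: str = "internvl3") -> str:
    """Return the suggested open-source VLM for a task category."""
    return ROUTING.get(task, default)

print(pick_model("document"))  # deepseek-vl2
```

For tasks outside the table (e.g. scientific diagrams), the default is an arbitrary choice here; as the table says, test both on your own data.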
```
├── src/
│   ├── deepseek_vl2/
│   │   ├── loader.py           # Model loading with vLLM
│   │   └── document_ocr.py     # PDF/image extraction pipeline
│   ├── internvl3/
│   │   ├── loader.py           # Model loading with LMDeploy
│   │   └── video_analysis.py   # Frame sampling and analysis
│   └── utils/
│       └── preprocessing.py    # Shared image/video utilities
├── examples/
│   ├── extract_invoice.py      # Document OCR example
│   └── detect_anomaly.py       # Video analysis example
├── benchmarks/
│   └── comparison.md           # Detailed benchmark analysis
└── requirements.txt
```
| Model | Total params | Activated params | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| DeepSeek-VL2-Tiny | 3.37B | 1.0B | 16GB | RTX 4090 |
| DeepSeek-VL2-Small | 16.1B | 2.8B | 80GB | A100 |
| InternVL3-8B | 8B | 8B | 24GB | RTX 4090 |
| InternVL3-38B | 38B | 38B | 2x 80GB | 2x A100 |
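The VRAM figures in the table follow from simple arithmetic: at bf16/fp16 precision, weights cost roughly 2 bytes per parameter, and for a MoE model like DeepSeek-VL2 the *total* parameter count must be resident even though only a fraction is activated per token. A rough back-of-envelope sketch (the helper name is ours, not a real API):

```python
def min_weight_vram_gb(total_params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just for model weights.

    Assumes bf16/fp16 (2 bytes/param). For MoE models, use *total* params,
    not activated params: all experts must fit in memory. Runtime overhead
    (KV cache, activations, CUDA context) comes on top of this figure.
    """
    return total_params_b * bytes_per_param

# InternVL3-8B: ~16 GB of weights alone, hence the 24 GB minimum above.
print(min_weight_vram_gb(8))     # 16.0
# DeepSeek-VL2-Small: ~32 GB of weights; the 80 GB row leaves headroom
# for the vision encoder, KV cache, and batching.
print(min_weight_vram_gb(16.1))
```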
This guide uses DeepSeek-VL2-Small and InternVL3-8B as the practical sweet spots.
See the Cloud Deployment Guide for running on Lambda Labs or Modal.
```bash
pip install -r requirements.txt
```

Document extraction:

```python
from src.deepseek_vl2 import DocumentExtractor

extractor = DocumentExtractor()
result = extractor.extract("invoice.pdf", output_format="json")
print(result)
```

Video analysis:

```python
from src.internvl3 import VideoAnalyzer

analyzer = VideoAnalyzer()
anomalies = analyzer.detect_anomalies("conveyor.mp4", check_interval=30)
for frame, description in anomalies:
    print(f"Frame {frame}: {description}")
```

MIT
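The `check_interval=30` argument in the video quick start suggests the analyzer inspects every 30th frame rather than all of them. A minimal sketch of that sampling logic, assuming the interval is measured in frames (the helper below is our own illustration, independent of the repo's actual `video_analysis.py`):

```python
def sample_frame_indices(total_frames: int, check_interval: int) -> list[int]:
    """Indices of frames to send to the model: every check_interval-th frame.

    Assumes check_interval is measured in frames; if it were seconds, you
    would first multiply by the video's FPS (e.g. cv2.CAP_PROP_FPS).
    """
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return list(range(0, total_frames, check_interval))

# A 10-second clip at 30 fps, checked every 30 frames -> 10 sampled frames.
print(sample_frame_indices(300, 30))  # [0, 30, 60, ..., 270]
```

Sampling keeps per-video cost proportional to `total_frames / check_interval` model calls, which is what makes multi-frame analysis affordable on a single GPU.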