Open-source vision-language models have matured beyond being "GPT-4o alternatives" into genuine specialists. DeepSeek-VL2 dominates OCR and document understanding (834 on OCRBench vs. GPT-4o's 736), while InternVL3 excels at complex reasoning and spatial understanding (72.2 on MMMU; it outperforms GPT-4o on 3D scene benchmarks).
This guide shows you when to use each model and provides production-ready code for both.
| Task | Best Model | Why |
|---|---|---|
| PDF/document extraction | DeepSeek-VL2 | 93.3% DocVQA, 20x token compression |
| Handwritten text, formulas | DeepSeek-VL2 | Purpose-built OCR pipeline |
| Video understanding | InternVL3 | Multi-frame temporal reasoning |
| 3D scene analysis | InternVL3 | Beats GPT-4o on VSI-Bench |
| GUI automation | InternVL3 | 11.7 vs GPT-4o's 1.9 on WebArena |
| Scientific diagrams | Either | Both strong, test on your data |
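The routing logic in the table above can be sketched as a small helper. Note that `pick_model`, `ROUTING`, and the task keys are illustrative names invented for this sketch, not part of either model's API or this repo's code:

```python
# Illustrative task-to-model router mirroring the table above.
# The ROUTING dict and pick_model helper are hypothetical, not a real API.
ROUTING = {
    "document": "deepseek-vl2",     # PDF/document extraction, OCR
    "handwriting": "deepseek-vl2",  # handwritten text, formulas
    "video": "internvl3",           # multi-frame temporal reasoning
    "3d_scene": "internvl3",        # spatial understanding
    "gui": "internvl3",             # GUI automation
}

def pick_model(task: str, default: str = "internvl3") -> str:
    """Return the suggested open-source VLM for a task category."""
    return ROUTING.get(task, default)

print(pick_model("document"))  # deepseek-vl2
```

For tasks outside the table (e.g. scientific diagrams), the default is an arbitrary choice here; as the table says, test both on your own data.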
```
├── src/
│   ├── deepseek_vl2/
│   │   ├── loader.py           # Model loading with vLLM
│   │   └── document_ocr.py     # PDF/image extraction pipeline
│   ├── internvl3/
│   │   ├── loader.py           # Model loading with LMDeploy
│   │   └── video_analysis.py   # Frame sampling and analysis
│   └── utils/
│       └── preprocessing.py    # Shared image/video utilities
├── examples/
│   ├── extract_invoice.py      # Document OCR example
│   └── detect_anomaly.py       # Video analysis example
├── benchmarks/
│   └── comparison.md           # Detailed benchmark analysis
└── requirements.txt
```
| Model | Total params | Activated params | Min VRAM | Recommended GPU |
|---|---|---|---|---|
| DeepSeek-VL2-Tiny | 3.37B | 1.0B | 16GB | RTX 4090 |
| DeepSeek-VL2-Small | 16.1B | 2.8B | 80GB | A100 |
| InternVL3-8B | 8B | 8B | 24GB | RTX 4090 |
| InternVL3-38B | 38B | 38B | 2x 80GB | 2x A100 |
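The VRAM figures in the table follow from simple arithmetic: at bf16/fp16 precision, weights cost roughly 2 bytes per parameter, and for a MoE model like DeepSeek-VL2 the *total* parameter count must be resident even though only a fraction is activated per token. A rough back-of-envelope sketch (the helper name is ours, not a real API):

```python
def min_weight_vram_gb(total_params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed just for model weights.

    Assumes bf16/fp16 (2 bytes/param). For MoE models, use *total* params,
    not activated params: all experts must fit in memory. Runtime overhead
    (KV cache, activations, CUDA context) comes on top of this figure.
    """
    return total_params_b * bytes_per_param

# InternVL3-8B: ~16 GB of weights alone, hence the 24 GB minimum above.
print(min_weight_vram_gb(8))     # 16.0
# DeepSeek-VL2-Small: ~32 GB of weights; the 80 GB row leaves headroom
# for the vision encoder, KV cache, and batching.
print(min_weight_vram_gb(16.1))
```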
This guide uses DeepSeek-VL2-Small and InternVL3-8B as the practical sweet spots.
See the Cloud Deployment Guide for running on Lambda Labs or Modal.
```bash
pip install -r requirements.txt
```

Document extraction:

```python
from src.deepseek_vl2 import DocumentExtractor

extractor = DocumentExtractor()
result = extractor.extract("invoice.pdf", output_format="json")
print(result)
```

Video analysis:

```python
from src.internvl3 import VideoAnalyzer

analyzer = VideoAnalyzer()
anomalies = analyzer.detect_anomalies("conveyor.mp4", check_interval=30)
for frame, description in anomalies:
    print(f"Frame {frame}: {description}")
```

MIT
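The `check_interval=30` argument in the video quick start suggests the analyzer inspects every 30th frame rather than all of them. A minimal sketch of that sampling logic, assuming the interval is measured in frames (the helper below is our own illustration, independent of the repo's actual `video_analysis.py`):

```python
def sample_frame_indices(total_frames: int, check_interval: int) -> list[int]:
    """Indices of frames to send to the model: every check_interval-th frame.

    Assumes check_interval is measured in frames; if it were seconds, you
    would first multiply by the video's FPS (e.g. cv2.CAP_PROP_FPS).
    """
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return list(range(0, total_frames, check_interval))

# A 10-second clip at 30 fps, checked every 30 frames -> 10 sampled frames.
print(sample_frame_indices(300, 30))  # [0, 30, 60, ..., 270]
```

Sampling keeps per-video cost proportional to `total_frames / check_interval` model calls, which is what makes multi-frame analysis affordable on a single GPU.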