argotdev/vision-reasoning-deepseek

Advanced Visual Reasoning with DeepSeek-VL2 and InternVL3

Open-source vision-language models have matured beyond "GPT-4o alternatives" into genuine specialists. DeepSeek-VL2 dominates OCR and document understanding (834 on OCRBench vs. GPT-4o's 736), while InternVL3 excels at complex reasoning and spatial understanding (72.2 on MMMU; beats GPT-4o on 3D scene benchmarks).

This guide shows you when to use each model and provides production-ready code for both.

Quick Decision Framework

| Task | Best Model | Why |
|---|---|---|
| PDF/document extraction | DeepSeek-VL2 | 93.3% DocVQA, 20x token compression |
| Handwritten text, formulas | DeepSeek-VL2 | Purpose-built OCR pipeline |
| Video understanding | InternVL3 | Multi-frame temporal reasoning |
| 3D scene analysis | InternVL3 | Beats GPT-4o on VSI-Bench |
| GUI automation | InternVL3 | 11.7 vs GPT-4o's 1.9 on WebArena |
| Scientific diagrams | Either | Both strong; test on your data |
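
The decision framework above can be sketched as a tiny task router. The task keys and the `choose_model` helper are illustrative only, not part of this repo:

```python
# Illustrative routing table derived from the decision framework above.
# Task names are hypothetical; adapt them to your own pipeline.
ROUTING = {
    "document_extraction": "deepseek-vl2",
    "handwriting_ocr": "deepseek-vl2",
    "video_understanding": "internvl3",
    "3d_scene_analysis": "internvl3",
    "gui_automation": "internvl3",
}

def choose_model(task: str) -> str:
    """Return the preferred model for a task; unknown tasks fall back to testing both."""
    return ROUTING.get(task, "benchmark-both")

print(choose_model("gui_automation"))       # internvl3
print(choose_model("scientific_diagrams"))  # benchmark-both
```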

Project Structure

```
├── src/
│   ├── deepseek_vl2/
│   │   ├── loader.py          # Model loading with vLLM
│   │   └── document_ocr.py    # PDF/image extraction pipeline
│   ├── internvl3/
│   │   ├── loader.py          # Model loading with LMDeploy
│   │   └── video_analysis.py  # Frame sampling and analysis
│   └── utils/
│       └── preprocessing.py   # Shared image/video utilities
├── examples/
│   ├── extract_invoice.py     # Document OCR example
│   └── detect_anomaly.py      # Video analysis example
├── benchmarks/
│   └── comparison.md          # Detailed benchmark analysis
└── requirements.txt
```

Hardware Requirements

| Model | Parameters | Activated | Min GPU | Recommended |
|---|---|---|---|---|
| DeepSeek-VL2-Tiny | 3.37B | 1.0B | 16GB | RTX 4090 |
| DeepSeek-VL2-Small | 16.1B | 2.8B | 80GB | A100 |
| InternVL3-8B | 8B | 8B | 24GB | RTX 4090 |
| InternVL3-38B | 38B | 38B | 2x 80GB | 2x A100 |

This guide uses DeepSeek-VL2-Small and InternVL3-8B as the practical sweet spots.
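
As a quick sanity check before picking a model, the minimum-VRAM column can be encoded in a small helper. The `fits` function below is a hypothetical sketch, not repo code:

```python
# Minimum VRAM (GB) per model, taken from the hardware table above.
MIN_VRAM_GB = {
    "DeepSeek-VL2-Tiny": 16,
    "DeepSeek-VL2-Small": 80,
    "InternVL3-8B": 24,
    "InternVL3-38B": 160,  # 2x 80GB
}

def fits(model: str, available_gb: float) -> bool:
    """True if the model's minimum VRAM requirement fits the available GPU memory."""
    return available_gb >= MIN_VRAM_GB[model]

print(fits("InternVL3-8B", 24))        # True on an RTX 4090
print(fits("DeepSeek-VL2-Small", 24))  # False; needs an A100-class card
```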

Installation

See the Cloud Deployment Guide for running on Lambda Labs or Modal.

```bash
pip install -r requirements.txt
```

Quick Start

Document Extraction (DeepSeek-VL2)

```python
from src.deepseek_vl2 import DocumentExtractor

extractor = DocumentExtractor()
result = extractor.extract("invoice.pdf", output_format="json")
print(result)
```
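
With `output_format="json"`, the result can be parsed with the standard library. The field names below (`vendor`, `total`) are hypothetical; the actual schema depends on `document_ocr.py`:

```python
import json

# Hypothetical extractor output; the real schema is defined by document_ocr.py.
raw = '{"vendor": "Acme Corp", "total": "1249.00"}'

invoice = json.loads(raw)
total = float(invoice["total"])
print(f"{invoice['vendor']}: ${total:.2f}")  # Acme Corp: $1249.00
```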

Video Analysis (InternVL3)

```python
from src.internvl3 import VideoAnalyzer

analyzer = VideoAnalyzer()
anomalies = analyzer.detect_anomalies("conveyor.mp4", check_interval=30)
for frame, description in anomalies:
    print(f"Frame {frame}: {description}")
```
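
Assuming `check_interval=30` means "analyze every 30th frame" (an assumption; confirm against `video_analysis.py`), the sampled frame indices work out like this:

```python
def sampled_frames(total_frames: int, check_interval: int) -> list[int]:
    """Indices of the frames analyzed when checking every `check_interval` frames."""
    return list(range(0, total_frames, check_interval))

# A 10-second clip at 30 fps, checked every 30 frames -> one check per second.
print(len(sampled_frames(300, 30)))  # 10
```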

Documentation

License

MIT
