A modular Python framework for running, switching, and benchmarking state-of-the-art Vision-Language Models (VLMs) for OCR tasks locally.
This project provides a unified interface to interact with various Transformer-based OCR engines (Florence-2, Qwen-VL, DeepSeek-OCR, etc.) without managing disparate API styles or conflicting dependencies. After model download, it operates entirely offline using PyTorch.
- **Unified API**: Interact with any model using a standard `reader.read(image)` call. The system handles preprocessing, prompting, and output parsing.
- **Multi-Architecture Support**: Seamless integration of Microsoft, Alibaba, Tencent, and DeepSeek architectures in a single environment.
- **Benchmarking Suite**: A built-in scientific testing rig that compares load times, inference speeds, and accuracy using dual metrics (Structural Precision vs. Content Similarity).
- **Global CLI**: Access the toolkit from any terminal window or external application via a simple `ocr` command.
- **Pure PyTorch**: All engines are implemented using the standard `transformers` library, avoiding C++ dependency conflicts (DLL hell).
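The unified-interface idea behind `reader.read(image)` can be sketched as follows. The class names, the `READERS` registry, and the `get_reader` factory are illustrative stand-ins, not the toolkit's confirmed internals:

```python
# Minimal sketch of the unified-interface pattern: every engine exposes the
# same read() method, so callers never touch model-specific APIs.
# All names below are illustrative stand-ins for the real wrappers.

class Florence2Reader:
    def read(self, image_path):
        # a real wrapper would run microsoft/Florence-2-large here
        return {"engine": "florence2", "text": "..."}

class Qwen2VLReader:
    def read(self, image_path):
        # a real wrapper would run Qwen/Qwen2-VL-2B-Instruct here
        return {"engine": "qwen2", "text": "..."}

READERS = {"florence2": Florence2Reader, "qwen2": Qwen2VLReader}

def get_reader(name):
    """Return an engine instance by its config name."""
    return READERS[name]()

result = get_reader("florence2").read("invoice.png")
print(result["engine"])  # florence2
```

Because every engine satisfies the same `read()` contract, switching models is a one-line change at the call site.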
The toolkit includes wrappers for the following models, configured in `config.yaml`:
| Engine | Model ID | Params | Role |
|---|---|---|---|
| Florence-2 | `microsoft/Florence-2-large` | 0.7B | Generalist. Fast, highly accurate for standard text and captions. |
| Qwen2-VL | `Qwen/Qwen2-VL-2B-Instruct` | 2B | Balanced. Excellent instruction following and speed. |
| Qwen3-VL | `Qwen/Qwen3-VL-8B-Instruct` | 8B | Reasoning. SOTA performance for complex visual reasoning (high VRAM). |
| DeepSeek-OCR | `unsloth/DeepSeek-OCR` | 3B | Dense Text. Specialized in document density and Markdown formatting. |
| HunyuanOCR | `lvyufeng/HunyuanOCR` | 1B | End-to-End. Strong multilingual support and full-page parsing. |
| PaddleOCR-VL | `lvyufeng/PaddleOCR-VL-0.9B` | 0.9B | Compact. Efficient VLM specialized for tables and charts. |
| GOT-OCR 2.0 | `stepfun-ai/GOT-OCR-2.0-hf` | 0.6B | Formatting. Specialized in LaTeX math and sheet music. |
| EasyOCR | `easyocr` | N/A | Baseline. Traditional CRAFT+ResNet pipeline (control variable). |
- Python 3.10 (Recommended)
- NVIDIA GPU with CUDA 12.1+ support
- ~20GB Disk Space (ONLY IF you are downloading ALL models)
1. **Clone or Create Directory**

   ```
   mkdir C:\Tools\OCR
   cd C:\Tools\OCR
   ```

2. **Create Environment**

   ```
   conda create -n local_ocr python=3.10 -y
   conda activate local_ocr
   ```

3. **Install PyTorch (Crucial Step)**

   We must install the CUDA-enabled build explicitly before the other dependencies.

   ```
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
   ```

4. **Install Toolkit Dependencies**

   ```
   pip install -r requirements.txt
   ```
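After installation, you can sanity-check the PyTorch build with a short snippet. This is only a status check (it fixes nothing), written defensively so it also works before `torch` is installed:

```python
# Quick sanity check for the PyTorch install: reports the installed version
# and whether a CUDA device is visible, without crashing if torch is absent.
import importlib.util

def cuda_status():
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check runs in any environment
    return f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}"

print(cuda_status())
```

If CUDA shows as unavailable, the CPU-only wheel was likely installed; re-run the PyTorch step with the `--index-url` shown above.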
If you prefer containerization, you can build the image locally:

```
docker build -t local-ocr .

# Run with a local volume mount to persist downloaded models
docker run --gpus all -it --rm -v ${PWD}:/app local-ocr
```

You can configure this tool to run from any terminal window (PowerShell, CMD, Git Bash) without manually activating Conda environments.
1. **Edit `ocr.bat`**: Open `ocr.bat` and ensure `PYTHON_EXE` points to your `local_ocr` Python path. (Run `python -c "import sys; print(sys.executable)"` inside your env to find it.)
2. **Install**: Move `ocr.bat` to a folder in your System PATH (e.g., `C:\Windows\System32` or a custom `C:\Bin`).
3. **Usage**: Now, from anywhere on your PC:

```
# Basic Read (uses default engine)
ocr "C:\Users\Desktop\invoice.png"

# Save to file
ocr image.jpg > output.txt

# Specific Engine & Task
ocr math.jpg --engine qwen2 --task formula
```
To use this OCR toolkit inside another Python project (or Node/C# app) without installing heavy dependencies in that project, use the Subprocess wrapper pattern.
Example `ocr_client.py` (safe to drop in any project):

```python
import subprocess
import json

OCR_BAT_PATH = "ocr"  # assumes ocr.bat is on PATH

def read_image(image_path, engine=None):
    cmd = [OCR_BAT_PATH, image_path, "--raw"]
    if engine:
        cmd.extend(["--engine", engine])
    # Run silently and capture stdout/stderr
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return json.loads(result.stdout)  # returns dict with 'text'
    raise RuntimeError(f"OCR failed: {result.stderr}")

# Usage
print(read_image("test.jpg")["text"])
```

The toolkit includes `benchmark.py` to scientifically compare models against a ground truth.
Metrics:
- Structure % (Levenshtein): Measures exact layout/character precision.
- Content % (Cosine): Measures "Bag of Words" semantic accuracy (ignoring layout).
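A rough sketch of how these two metrics can be computed, assuming Structure % is a normalized edit-similarity ratio and Content % is cosine similarity over word counts (the exact formulas in `benchmark.py` may differ):

```python
# Sketch of the dual metrics. SequenceMatcher.ratio() approximates a
# normalized Levenshtein similarity; content_pct ignores word order entirely.
from collections import Counter
from difflib import SequenceMatcher
import math

def structure_pct(pred, truth):
    # character-level similarity: sensitive to layout and ordering
    return 100 * SequenceMatcher(None, pred, truth).ratio()

def content_pct(pred, truth):
    # bag-of-words cosine similarity: insensitive to layout
    a, b = Counter(pred.lower().split()), Counter(truth.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 100 * dot / (na * nb) if na and nb else 0.0

print(content_pct("world hello", "hello world"))    # ~100: order ignored
print(structure_pct("world hello", "hello world"))  # lower: order matters
```

This is why the two columns diverge for EasyOCR in the results below: it recovers most words (high Content %) but scrambles the layout (low Struct %).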
Run Benchmark:

```
# Compare all enabled engines on a specific image
python benchmark.py inputs/test.jpg inputs/truth.txt
```

Example Results: Hardware: NVIDIA RTX 3070 | Input: Magic card (text + layout)
| Model | Load (s) | Infer (s) | Struct % | Content % |
|---|---|---|---|---|
| EasyOCR (Baseline) | 1.10 | 0.37 | 32.99 | 75.23 |
| Florence-2 | 3.31 | 1.14 | 98.22 | 95.71 |
| GOT-OCR 2.0 | 3.66 | 1.96 | 86.77 | 78.63 |
| HunyuanOCR 1B | 4.27 | 2.47 | 96.90 | 97.14 |
| DeepSeek-OCR 3B | 5.38 | 4.76 | 96.03 | 95.71 |
| Qwen2-VL 2B | 4.75 | 2.05 | 88.65 | 86.97 |
| Qwen3-VL 8B | 10.81 | 48.65 | 98.04 | 98.56 |
Note: in our testing, Qwen3 achieves the highest semantic accuracy but requires significantly more compute. Florence-2 offers the best balance of speed vs. structural accuracy.
To add a new model:
1. **Create Engine**: Add a new file in `engines/` inheriting from `BaseOCREngine`. Implement `load()` and `process()`.
2. **Update Config**: Add the model parameters to `config.yaml`.
3. **Register**: Add the import switch to `core/factory.py`.
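The engine contract described above can be sketched like this. The `BaseOCREngine` stub is a minimal stand-in for the real base class; only the `load()`/`process()` method names come from the steps above, and the signatures are assumptions:

```python
# Sketch of a custom engine. BaseOCREngine here is a stand-in stub for the
# toolkit's real base class; the method signatures are assumptions.
from abc import ABC, abstractmethod

class BaseOCREngine(ABC):
    @abstractmethod
    def load(self): ...

    @abstractmethod
    def process(self, image_path, task="ocr"): ...

class MyNewEngine(BaseOCREngine):
    def load(self):
        # a real engine would load Hugging Face weights into self.model here
        self.model = "stub-model"

    def process(self, image_path, task="ocr"):
        # a real engine would preprocess the image and run inference
        return {"text": f"<recognized text from {image_path}>"}

engine = MyNewEngine()
engine.load()
print(engine.process("page.png")["text"])
```

Keeping heavy imports inside `load()` (rather than at module top level) lets the factory enumerate engines without pulling in every model's dependencies.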
```
C:/Tools/OCR/
├── config.yaml       # User configuration (enable/disable engines here)
├── ocr.bat           # System launcher
├── cli.py            # CLI entry point
├── benchmark.py      # Benchmarking tool
├── ocr.py            # Main Python library
├── core/             # System logic
├── engines/          # Model wrappers
└── models_cache/     # Hugging Face model weights
```