Open
Labels: enhancement, medium-priority, phase-3, OCR Enhancement
Description
Task
Evaluate GLM-OCR (zai-org/GLM-OCR) as a candidate for the Phase 3 OCR enhancement.
Background
GLM-OCR is a new state-of-the-art multimodal OCR model that was not available when the original Phase 3 candidates (PaddleOCR, DeepSeek-OCR, Qwen2.5-VL) were identified.
Key Specs:
- 94.62 score on OmniDocBench V1.5 (ranked #1 overall)
- 0.9B parameters - efficient (1.86 pages/second for PDFs)
- Handles complex layouts, tables, formulas, seals
- MIT licensed, multiple deployment options (vLLM, SGLang, Ollama, Transformers)
- Optimized for real-world business scenarios
Evaluation Criteria
Test against sample Russian legal documents from pravo.gov.ru:
- Accuracy: Character-level accuracy for Cyrillic text
- Layout Handling: Multi-column text, tables, numbered articles
- Speed: Pages per minute on target hardware
- Memory: GPU/CPU memory requirements
- Ease of Integration: API quality, Python support
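The accuracy criterion could be measured as one minus the character error rate (CER) between a hand-checked reference transcription and the OCR output. The sketch below is one possible way to compute it using only the standard library; the function names (`levenshtein`, `normalize`, `character_accuracy`) are illustrative and not part of the existing codebase, and the whitespace and Unicode normalization choices are assumptions that should be agreed on before comparing models.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumed normalization: NFC (so composed and decomposed forms of
    letters such as 'й' compare equal) plus collapsed whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to [0, 1]."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        return 1.0 if not hyp else 0.0
    cer = levenshtein(ref, hyp) / len(ref)
    return max(0.0, 1.0 - cer)
```

For example, `character_accuracy("закон", "эакон")` counts one substitution over five reference characters. Applying the same function to every candidate keeps the Accuracy column of the comparison comparable across models.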
Comparison Matrix
| Model | Accuracy | Speed | Memory | Russian Support | Complexity |
|---|---|---|---|---|---|
| Tesseract (current) | Baseline | Medium | Low | rus+eng | Low |
| PaddleOCR | 82.5% | Fast | Medium | Good | Medium |
| DeepSeek-OCR | +25% vs Tesseract | Medium | High | Excellent | High |
| GLM-OCR | 94.62 on OmniDocBench V1.5 (SOTA) | 1.86 pg/s | Low-Medium | TBD | Medium |
| Qwen2.5-VL | 32 languages | Medium | High | Good | High |
Deliverables
- Benchmark GLM-OCR on 10-20 sample PDFs from pravo.gov.ru
- Document accuracy, speed, memory metrics
- Create comparison report with Phase 3 candidates
- Recommendation for Phase 3 implementation
- Update PHASE3_OCR.md with findings
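A minimal sketch of how the speed and memory numbers for these deliverables might be collected is shown here. It assumes each candidate model is wrapped in a hypothetical `ocr_page` callable that takes one rendered page as bytes and returns text; that wrapper, and the names `BenchmarkResult` and `benchmark`, are illustrative and do not exist in the repository. Note that `tracemalloc` only tracks Python heap allocations, so GPU memory would have to be measured separately with the tooling of whichever runtime is used.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class BenchmarkResult:
    pages: int
    seconds: float
    peak_python_mib: float  # Python heap only, not GPU memory

    @property
    def pages_per_minute(self) -> float:
        if self.seconds <= 0:
            return float("inf")
        return self.pages / self.seconds * 60


def benchmark(
    ocr_page: Callable[[bytes], str],  # hypothetical per-page OCR wrapper
    pages: Iterable[bytes],
) -> BenchmarkResult:
    """Run ocr_page over all pages, timing the loop and tracking peak memory."""
    tracemalloc.start()
    start = time.perf_counter()
    count = 0
    for page in pages:
        ocr_page(page)
        count += 1
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return BenchmarkResult(count, elapsed, peak_bytes / (1024 * 1024))
```

Reporting `pages_per_minute` matches the unit named in the evaluation criteria; for reference, the published 1.86 pages/second figure corresponds to about 111.6 pages per minute, although results on the target hardware may differ.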
Reference
- Phase 3 Plan: docs/PHASE3_OCR.md
- GLM-OCR: https://huggingface.co/zai-org/GLM-OCR
- Current OCR: scripts/country_modules/russia/parsers/html_parser.py (lines 86-98, 169-240)
- OCR Tests: scripts/test/parser/test_ocr_engine.py (skipped pending Phase 3)
Related Issues
- Closes [Phase 1.4] Create parser tests #6 (OCR tests created but skipped pending Phase 3)
- Related to [FEAT] Comprehensive Court Decision Fetching - All Court Types (2022-2024) #25 (court decisions - OCR disabled in court_sync.py)
- Related to Phase 7C: Priority 1 Enhancements - Regional, Courts, Ministry Data #22 (Phase 7C - court decision fetching)
Priority
MEDIUM - Phase 3 is not blocked, but this is valuable research for when Phase 3 becomes active.