
[Phase 3] Evaluate GLM-OCR for PDF text extraction #26

@mikhashev

Task

Evaluate GLM-OCR (zai-org/GLM-OCR) as a candidate for the Phase 3 OCR enhancement.

Background

GLM-OCR is a new state-of-the-art multimodal OCR model that was not available when the original Phase 3 candidates (PaddleOCR, DeepSeek-OCR, Qwen2.5-VL) were identified.

Key Specs:

  • 94.62 score on OmniDocBench V1.5 (ranked #1 overall)
  • 0.9B parameters, with a reported throughput of 1.86 pages/second for PDFs
  • Handles complex layouts, tables, formulas, seals
  • MIT licensed, multiple deployment options (vLLM, SGLang, Ollama, Transformers)
  • Optimized for real-world business scenarios

Evaluation Criteria

Test against sample Russian legal documents from pravo.gov.ru:

  1. Accuracy: Character-level accuracy for Cyrillic text
  2. Layout Handling: Multi-column text, tables, numbered articles
  3. Speed: Pages per minute on target hardware
  4. Memory: GPU/CPU memory requirements
  5. Ease of Integration: API quality, Python support
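Criterion 1 could be measured as one minus the character error rate (CER), where CER is the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The sketch below is one possible way to compute it, not an agreed project metric. The function names are hypothetical, and the NFC normalization step is an assumption, added so that composed and decomposed forms of Cyrillic letters such as "й" are not counted as errors.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute cost 1 each)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped at 0.0, after NFC normalization (assumed here)."""
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Averaging `character_accuracy` over the sample PDFs would give a single number per model. Whether whitespace and punctuation should be normalized before comparison is left open here.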

Comparison Matrix

| Model | Accuracy | Speed | Memory | Russian Support | Complexity |
|---|---|---|---|---|---|
| Tesseract (current) | Baseline | Medium | Low | rus+eng | Low |
| PaddleOCR | 82.5% | Fast | Medium | Good | Medium |
| DeepSeek-OCR | +25% vs Tesseract | Medium | High | Excellent | High |
| GLM-OCR | 94.62 (SOTA) | 1.86 pg/s | Low-Medium | TBD | Medium |
| Qwen2.5-VL | 32 languages | Medium | High | Good | High |

Deliverables

  • Benchmark GLM-OCR on 10-20 sample PDFs from pravo.gov.ru
  • Document accuracy, speed, memory metrics
  • Create comparison report with Phase 3 candidates
  • Recommendation for Phase 3 implementation
  • Update PHASE3_OCR.md with findings
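For the speed metric, a small model-agnostic timing harness might be enough to compare candidates on the same hardware. The following is a minimal sketch under stated assumptions: `ocr_page` is a hypothetical callable that wraps whichever backend is under test (GLM-OCR, Tesseract, and so on) and takes one rendered page as bytes. It does not use any real GLM-OCR API, and it does not measure GPU memory, which would need backend-specific tooling.

```python
import time
from typing import Callable, Dict, Iterable


def benchmark(ocr_page: Callable[[bytes], str], pages: Iterable[bytes]) -> Dict[str, float]:
    """Time a hypothetical per-page OCR callable and report throughput."""
    count = 0
    start = time.perf_counter()
    for page in pages:
        ocr_page(page)  # output is discarded; only wall-clock time is measured
        count += 1
    elapsed = time.perf_counter() - start
    per_minute = count * 60.0 / elapsed if count and elapsed > 0 else 0.0
    return {
        "pages": float(count),
        "seconds": elapsed,
        "pages_per_minute": per_minute,
    }
```

For comparison, the 1.86 pages/second quoted in the Key Specs corresponds to about 111.6 pages per minute. Model load time should be kept outside the timed loop so that it does not distort per-page throughput.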

Reference

Related Issues

Priority

MEDIUM - Phase 3 is not blocked, but this is valuable research for when Phase 3 becomes active.
