[Phase 3] Evaluate GLM-OCR for PDF text extraction #26

## Task

Evaluate GLM-OCR (zai-org/GLM-OCR) as a candidate for the Phase 3 OCR enhancement.

## Background

GLM-OCR is a new state-of-the-art multimodal OCR model that was not available when the original Phase 3 candidates (PaddleOCR, DeepSeek-OCR, Qwen2.5-VL) were identified.

**Key Specs:**
- Reported score of 94.62 on OmniDocBench V1.5 (ranked #1 overall)
- 0.9B parameters, with a reported throughput of 1.86 pages/second on PDFs
- Handles complex layouts, tables, formulas, seals
- MIT licensed, multiple deployment options (vLLM, SGLang, Ollama, Transformers)
- Optimized for real-world business scenarios

## Evaluation Criteria

Test against sample Russian legal documents from pravo.gov.ru:

1. **Accuracy**: Character-level accuracy for Cyrillic text
2. **Layout Handling**: Multi-column text, tables, numbered articles
3. **Speed**: Pages per second on target hardware (same unit as the published 1.86 pages/second figure)
4. **Memory**: GPU/CPU memory requirements
5. **Ease of Integration**: API quality, Python support
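One way to make the accuracy criterion measurable is character error rate (CER): the Levenshtein edit distance between the OCR output and a hand-checked reference transcription, divided by the reference length. The sketch below is a minimal, standard-library-only illustration, not part of the existing codebase. The function names (`levenshtein`, `normalize`, `character_accuracy`) are hypothetical, and the choice to apply NFC normalization and collapse whitespace before comparing is an assumption that the evaluation may want to revisit.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumed preprocessing: NFC-normalize and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to [0, 1]."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        return 1.0 if not hyp else 0.0
    cer = levenshtein(ref, hyp) / len(ref)
    return max(0.0, 1.0 - cer)
```

NFC normalization matters for Cyrillic because letters such as "й" can appear either as one code point or as a base letter plus a combining breve; without normalization, visually identical output would be counted as an error.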

## Comparison Matrix

| Model | Accuracy (as reported) | Speed | Memory | Russian Support | Complexity |
|-------|------------------------|-------|--------|-----------------|------------|
| Tesseract (current) | Baseline | Medium | Low | rus+eng | Low |
| PaddleOCR | 82.5% | Fast | Medium | Good | Medium |
| DeepSeek-OCR | +25% vs Tesseract | Medium | High | Excellent | High |
| GLM-OCR | 94.62 (OmniDocBench V1.5) | 1.86 pages/s | Low-Medium | TBD | Medium |
| Qwen2.5-VL | Not reported (supports 32 languages) | Medium | High | Good | High |

## Deliverables

- [ ] Benchmark GLM-OCR on 10-20 sample PDFs from pravo.gov.ru
- [ ] Document accuracy, speed, memory metrics
- [ ] Create comparison report with Phase 3 candidates
- [ ] Recommendation for Phase 3 implementation
- [ ] Update PHASE3_OCR.md with findings
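The speed and memory measurements could be collected with a small harness that treats each OCR engine as a plain callable, so that Tesseract and GLM-OCR run through the same loop. The following is a rough sketch under stated assumptions: `benchmark`, `BenchmarkResult`, and the `ocr_page` callable are hypothetical names, pages are assumed to be passed as raw bytes, and `tracemalloc` only tracks Python heap allocations, so GPU memory would need a separate measurement.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BenchmarkResult:
    pages: int
    seconds: float
    peak_python_heap_mb: float

    @property
    def pages_per_second(self) -> float:
        # Guard against a zero elapsed time on very small inputs.
        return self.pages / self.seconds if self.seconds > 0 else float("inf")


def benchmark(
    ocr_page: Callable[[bytes], str],
    pages: Sequence[bytes],
) -> BenchmarkResult:
    """Run ocr_page over every page, recording wall time and peak Python heap."""
    tracemalloc.start()
    start = time.perf_counter()
    for page in pages:
        ocr_page(page)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return BenchmarkResult(
        pages=len(pages),
        seconds=elapsed,
        peak_python_heap_mb=peak_bytes / 1_000_000,
    )
```

Note that `tracemalloc` adds overhead of its own, so timing and memory are probably best measured in separate runs when the numbers go into the comparison report.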

## Reference

- Phase 3 Plan: [docs/PHASE3_OCR.md](../blob/main/docs/PHASE3_OCR.md)
- GLM-OCR: https://huggingface.co/zai-org/GLM-OCR
- Current OCR: [scripts/country_modules/russia/parsers/html_parser.py](../blob/main/scripts/country_modules/russia/parsers/html_parser.py) (lines 86-98, 169-240)
- OCR Tests: [scripts/test/parser/test_ocr_engine.py](../blob/main/scripts/test/parser/test_ocr_engine.py) (skipped pending Phase 3)

## Related Issues

- Closes #6 (OCR tests created but skipped pending Phase 3)
- Related to #25 (court decisions - OCR disabled in court_sync.py)
- Related to #22 (Phase 7C - court decision fetching)

## Priority

**MEDIUM** - Phase 3 is not blocked, but this is valuable research for when Phase 3 becomes active.
