Arabic OCR presents unique challenges due to the script's cursive nature, diacritical marks (tashkeel), and diverse fonts and layouts. Alef-OCR-Image2Html addresses these challenges by converting Arabic document images into clean, semantic HTML output.
Built on top of Qwen2.5-VL-Instruct, the model was fine-tuned using:
- QLoRA with 4-bit quantization
- LoRA rank of 16 applied to all modules
- Unsloth for memory optimization and training speed
The base model's strong Arabic text understanding capabilities made it an ideal backbone for this task.
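Below is a minimal sketch of this setup using Unsloth's vision fine-tuning API. Only the 4-bit loading, rank 16, and "all modules" choices come from the description above; the 7B size variant and `lora_alpha` value are assumptions.

```python
# Hedged sketch of the QLoRA setup described above, via Unsloth's FastVisionModel.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed size variant of the base model
    load_in_4bit=True,                 # QLoRA: 4-bit quantized base weights
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # LoRA rank 16
    lora_alpha=16,                 # assumed; commonly set equal to the rank
    finetune_vision_layers=True,   # adapters applied to all module groups
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```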
The model was trained on a custom dataset of 28K image-HTML pairs, created through two approaches.
Approach 1: Wikipedia article extraction
- Collected Arabic article URLs from Wikipedia
- Analyzed and post-processed the HTML structure to extract semantic content
- Removed unnecessary tags and converted the remainder to semantic HTML elements
- Captured screenshots using Playwright with real styling applied (see the pipeline sketch below)
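A rough sketch of this pipeline, assuming BeautifulSoup for the tag cleanup and Playwright's sync API for the screenshots; the semantic-tag whitelist and viewport size are illustrative guesses, not the exact values used.

```python
# Illustrative Wikipedia pipeline: strip a page down to semantic HTML,
# then screenshot the styled page with Playwright.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Assumed whitelist of semantic elements to keep
SEMANTIC_TAGS = {"article", "section", "h1", "h2", "h3", "p",
                 "ul", "ol", "li", "table", "tr", "th", "td", "blockquote"}

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop non-content elements outright
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    # Unwrap anything outside the whitelist, keeping its text content
    for tag in soup.find_all(True):
        if tag.name not in SEMANTIC_TAGS:
            tag.unwrap()
    return str(soup)

def screenshot(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1024, "height": 1448})
        page.goto(url, wait_until="networkidle")  # wait for real styling to load
        page.screenshot(path=out_path, full_page=True)
        browser.close()
```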
Approach 2: Synthetic document rendering
- Built structured HTML documents with various semantic tags
- Rendered images using CSS to mimic real-world document types (~13 formats in total), including:
  - Historical manuscripts
  - Newspaper articles
  - Scientific papers
  - Invoices
  - Recipes
- Filled templates with plain Arabic text from open datasets
- Simulated different layouts, styles, noise levels, fonts, and text flows (see the generator sketch below)
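The sketch below illustrates the idea behind the synthetic generator: fill a template with Arabic text and randomized styling, producing matched (HTML, image) pairs once rendered. The font pool, layout names, and noise mechanism shown are assumptions, not the exact ones used.

```python
# Illustrative synthetic-sample generator; rendered output would be
# screenshotted with the same Playwright helper as in Approach 1.
import random

FONTS = ["Amiri", "Cairo", "Noto Naskh Arabic"]    # assumed font pool
LAYOUTS = ["manuscript", "newspaper", "invoice"]   # subset of the ~13 formats

TEMPLATE = """<!DOCTYPE html>
<html dir="rtl" lang="ar">
<head><style>
  body {{ font-family: '{font}'; font-size: {size}px; filter: blur({blur}px); }}
</style></head>
<body class="{layout}"><article><h1>{title}</h1><p>{body}</p></article></body>
</html>"""

def make_sample(title: str, body: str) -> str:
    """Return one randomized HTML document to be rendered into an image."""
    return TEMPLATE.format(
        font=random.choice(FONTS),
        size=random.randint(14, 22),
        blur=round(random.uniform(0.0, 0.6), 2),  # light noise via CSS blur (assumed)
        layout=random.choice(LAYOUTS),
        title=title,
        body=body,
    )
```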
Training was performed in two stages to manage memory constraints:
Epoch 1:
- Data: 40% of training dataset
- Learning rate: 5e-5
- LR scheduler: Linear
Epoch 2:
- Data: 30% of training dataset (different split)
- Learning rate: 1e-5
- LR scheduler: Cosine
This two-stage approach enabled smooth convergence within the limited compute of Kaggle's free tier.
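Expressed as Hugging Face `TrainingArguments`, the two stages would look roughly like the sketch below; the two data splits would be passed as separate training datasets. Batch size and accumulation steps are illustrative assumptions.

```python
# Hedged sketch of the two-stage schedule described above.
from transformers import TrainingArguments

stage1 = TrainingArguments(
    output_dir="alef-stage1",
    num_train_epochs=1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,   # assumed; kept small for Kaggle GPUs
    gradient_accumulation_steps=8,   # assumed
)

stage2 = TrainingArguments(
    output_dir="alef-stage2",
    num_train_epochs=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```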
Evaluation metrics were computed by the NAMAA community on a private, undisclosed benchmark dataset.
| Model | WER ↓ | CER ↓ | BLEU ↑ |
|---|---|---|---|
| Qari-OCR-v0.3 | 0.84 | 0.73 | 0.17 |
| Alef-OCR-Image2Html | 0.92 | 0.72 | 0.19 |
- Slightly better BLEU (0.19 vs. 0.17) and Character Error Rate (0.72 vs. 0.73) than Qari-OCR-v0.3
- Higher Word Error Rate (0.92 vs. 0.84) than Qari-OCR-v0.3
- The WER gap is attributed to limited diacritics handling, as the training dataset contained few examples with diacritical marks
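The exact tooling NAMAA used is not specified here; metrics of this kind are commonly computed with `jiwer` (WER/CER) and `sacrebleu` (BLEU), as in this illustration of the metric definitions:

```python
# Illustrative metric computation for a single reference/hypothesis pair.
import jiwer
import sacrebleu

reference = "نص مرجعي"    # ground-truth transcription (placeholder)
hypothesis = "نص متوقع"   # model output (placeholder)

wer = jiwer.wer(reference, hypothesis)  # word-level edit distance / reference words
cer = jiwer.cer(reference, hypothesis)  # char-level edit distance / reference chars
# sacrebleu reports BLEU on a 0-100 scale; divide to match the 0-1 table above
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]]).score / 100
print(f"WER={wer:.2f} CER={cer:.2f} BLEU={bleu:.2f}")
```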
- Dataset: arabic-image2html
- Model: Alef-OCR-Image2Html
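A usage sketch following the standard Qwen2.5-VL inference recipe in `transformers` is shown below. The hub id is a placeholder, and the Arabic prompt is an assumption, since the exact prompt format used in training is not specified here.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "your-namespace/Alef-OCR-Image2Html"  # placeholder hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document_page.png"},
        # "Convert this image to semantic HTML." (assumed prompt)
        {"type": "text", "text": "حوّل هذه الصورة إلى HTML دلالي."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
html = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(html)
```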
This is a first version with promising performance. There is significant room for optimization and improvement, particularly in:
- Enhanced diacritics handling
- Expanded dataset diversity
- Further model refinement
This project builds upon the excellent work of the NAMAA community and their state-of-the-art Qari-OCR model, which serves as the baseline for comparison.
2025 is the year of OCR - pushing the boundaries of Arabic visual understanding.