Arabic OCR presents unique challenges due to the script's cursive nature, diacritical marks (tashkeel), and diverse fonts and layouts. Alef-OCR-Image2Html addresses these challenges by converting Arabic document images into clean, semantic HTML output.
Built on top of Qwen2.5-VL-Instruct, the model was fine-tuned using:
- QLoRA with 4-bit quantization
- LoRA rank of 16 applied to all modules
- Unsloth for memory optimization and training speed
The base model's strong Arabic text understanding capabilities made it an ideal backbone for this task.
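Below is a minimal sketch of this setup using Unsloth's vision fine-tuning API. Only the 4-bit loading, rank 16, and "all modules" choices come from the description above; the 7B size variant and `lora_alpha` value are assumptions.

```python
# Hedged sketch of the QLoRA setup described above, via Unsloth's FastVisionModel.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed size variant of the base model
    load_in_4bit=True,                 # QLoRA: 4-bit quantized base weights
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,                          # LoRA rank 16
    lora_alpha=16,                 # assumed; commonly set equal to the rank
    finetune_vision_layers=True,   # adapters applied to all module groups
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```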
The model was trained on a custom dataset of 28K image-HTML pairs, created through two approaches.
Approach 1: Wikipedia article extraction
- Collected Arabic article URLs from Wikipedia
- Analyzed and post-processed the HTML structure to extract semantic content
- Removed unnecessary tags and converted the remainder to semantic HTML elements
- Captured screenshots using Playwright with real styling applied (see the pipeline sketch below)
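A rough sketch of this pipeline, assuming BeautifulSoup for the tag cleanup and Playwright's sync API for the screenshots; the semantic-tag whitelist and viewport size are illustrative guesses, not the exact values used.

```python
# Illustrative Wikipedia pipeline: strip a page down to semantic HTML,
# then screenshot the styled page with Playwright.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

# Assumed whitelist of semantic elements to keep
SEMANTIC_TAGS = {"article", "section", "h1", "h2", "h3", "p",
                 "ul", "ol", "li", "table", "tr", "th", "td", "blockquote"}

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Drop non-content elements outright
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    # Unwrap anything outside the whitelist, keeping its text content
    for tag in soup.find_all(True):
        if tag.name not in SEMANTIC_TAGS:
            tag.unwrap()
    return str(soup)

def screenshot(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1024, "height": 1448})
        page.goto(url, wait_until="networkidle")  # wait for real styling to load
        page.screenshot(path=out_path, full_page=True)
        browser.close()
```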
Approach 2: Synthetic document rendering
- Built structured HTML documents with various semantic tags
- Rendered images using CSS to mimic real-world document types (~13 formats in total), including:
  - Historical manuscripts
  - Newspaper articles
  - Scientific papers
  - Invoices
  - Recipes
- Filled templates with plain Arabic text from open datasets
- Simulated different layouts, styles, noise levels, fonts, and text flows (see the generator sketch below)
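The sketch below illustrates the idea behind the synthetic generator: fill a template with Arabic text and randomized styling, producing matched (HTML, image) pairs once rendered. The font pool, layout names, and noise mechanism shown are assumptions, not the exact ones used.

```python
# Illustrative synthetic-sample generator; rendered output would be
# screenshotted with the same Playwright helper as in Approach 1.
import random

FONTS = ["Amiri", "Cairo", "Noto Naskh Arabic"]    # assumed font pool
LAYOUTS = ["manuscript", "newspaper", "invoice"]   # subset of the ~13 formats

TEMPLATE = """<!DOCTYPE html>
<html dir="rtl" lang="ar">
<head><style>
  body {{ font-family: '{font}'; font-size: {size}px; filter: blur({blur}px); }}
</style></head>
<body class="{layout}"><article><h1>{title}</h1><p>{body}</p></article></body>
</html>"""

def make_sample(title: str, body: str) -> str:
    """Return one randomized HTML document to be rendered into an image."""
    return TEMPLATE.format(
        font=random.choice(FONTS),
        size=random.randint(14, 22),
        blur=round(random.uniform(0.0, 0.6), 2),  # light noise via CSS blur (assumed)
        layout=random.choice(LAYOUTS),
        title=title,
        body=body,
    )
```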
Training was performed in two stages to manage memory constraints:
Epoch 1:
- Data: 40% of training dataset
- Learning rate: 5e-5
- LR scheduler: Linear
Epoch 2:
- Data: 30% of training dataset (different split)
- Learning rate: 1e-5
- LR scheduler: Cosine
This two-stage approach enabled smooth convergence within the limited compute of Kaggle's free tier.
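Expressed as Hugging Face `TrainingArguments`, the two stages would look roughly like the sketch below; the two data splits would be passed as separate training datasets. Batch size and accumulation steps are illustrative assumptions.

```python
# Hedged sketch of the two-stage schedule described above.
from transformers import TrainingArguments

stage1 = TrainingArguments(
    output_dir="alef-stage1",
    num_train_epochs=1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    per_device_train_batch_size=1,   # assumed; kept small for Kaggle GPUs
    gradient_accumulation_steps=8,   # assumed
)

stage2 = TrainingArguments(
    output_dir="alef-stage2",
    num_train_epochs=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
)
```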
Evaluation metrics were computed by the NAMAA community on a private, undisclosed benchmark dataset.
| Model | WER ↓ | CER ↓ | BLEU ↑ |
|---|---|---|---|
| Qari-OCR-v0.3 | 0.84 | 0.73 | 0.17 |
| Alef-OCR-Image2Html | 0.92 | 0.72 | 0.19 |
- Slightly better BLEU (0.19 vs. 0.17) and Character Error Rate (0.72 vs. 0.73) than Qari-OCR-v0.3
- Higher Word Error Rate (0.92 vs. 0.84) than Qari-OCR-v0.3
- The WER gap is attributed to limited diacritics handling, as the training dataset contained few examples with diacritical marks
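The exact tooling NAMAA used is not specified here; metrics of this kind are commonly computed with `jiwer` (WER/CER) and `sacrebleu` (BLEU), as in this illustration of the metric definitions:

```python
# Illustrative metric computation for a single reference/hypothesis pair.
import jiwer
import sacrebleu

reference = "نص مرجعي"    # ground-truth transcription (placeholder)
hypothesis = "نص متوقع"   # model output (placeholder)

wer = jiwer.wer(reference, hypothesis)  # word-level edit distance / reference words
cer = jiwer.cer(reference, hypothesis)  # char-level edit distance / reference chars
# sacrebleu reports BLEU on a 0-100 scale; divide to match the 0-1 table above
bleu = sacrebleu.corpus_bleu([hypothesis], [[reference]]).score / 100
print(f"WER={wer:.2f} CER={cer:.2f} BLEU={bleu:.2f}")
```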
- Dataset: arabic-image2html
- Model: Alef-OCR-Image2Html
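A usage sketch following the standard Qwen2.5-VL inference recipe in `transformers` is shown below. The hub id is a placeholder, and the Arabic prompt is an assumption, since the exact prompt format used in training is not specified here.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "your-namespace/Alef-OCR-Image2Html"  # placeholder hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document_page.png"},
        # "Convert this image to semantic HTML." (assumed prompt)
        {"type": "text", "text": "حوّل هذه الصورة إلى HTML دلالي."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=2048)
html = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(html)
```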
This is a first version with promising performance. There is significant room for optimization and improvement, particularly in:
- Enhanced diacritics handling
- Expanded dataset diversity
- Further model refinement
This project builds upon the excellent work of the NAMAA community and their state-of-the-art Qari-OCR model, which serves as the baseline for comparison.
2025 is the year of OCR - pushing the boundaries of Arabic visual understanding.