Alef-OCR-Image2Html

Transforming Arabic document images, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML

Overview

Arabic OCR presents unique challenges due to the script's cursive nature, diacritical marks (tashkeel), and diverse fonts and layouts. Alef-OCR-Image2Html addresses these challenges by converting Arabic document images into clean, semantic HTML output.
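
For orientation, here is a minimal inference sketch using the standard transformers Qwen2.5-VL API. The Hub repo id, image path, and prompt are illustrative assumptions, not details confirmed by this README.

```python
# Minimal inference sketch (assumed repo id, image path, and prompt).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "OussamaBenSlama/Alef-OCR-Image2Html"  # hypothetical Hub id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("page.png")  # an Arabic document scan
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this document into semantic HTML."},
    ],
}]

# Build the chat prompt, bundle the image, and generate HTML.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=2048)
html = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(html)
```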

Model Architecture

Built on top of Qwen2.5-VL-Instruct, the model was fine-tuned using:

  • QLoRA with 4-bit quantization
  • LoRA rank of 16 applied to all modules
  • Unsloth for memory optimization and training speed

The base model's strong Arabic text understanding capabilities made it an ideal backbone for this task.
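
A minimal sketch of that setup, assuming Unsloth's FastVisionModel API; the exact base checkpoint name and any hyperparameters beyond the rank-16 LoRA are assumptions:

```python
# QLoRA fine-tuning setup sketch: 4-bit quantized base weights via Unsloth,
# LoRA rank 16 applied across all modules. Checkpoint name and values other
# than r=16 are assumptions, not taken from this README.
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",  # assumed base checkpoint
    load_in_4bit=True,                 # QLoRA: 4-bit quantization
)

# Attach rank-16 LoRA adapters to vision and language modules alike.
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
)
```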

Dataset

The model was trained on a custom dataset of 28K image-HTML pairs, created through two approaches:

1. Web Scraping (46% of dataset, ~13K samples)

  • Collected Arabic article URLs from Wikipedia
  • Analyzed and post-processed HTML structure to extract semantic content
  • Removed unnecessary tags and converted to semantic HTML elements
  • Captured screenshots using Playwright with real styling applied
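
A condensed sketch of such a pipeline, using BeautifulSoup for tag cleanup and Playwright for screenshots; the whitelist of semantic tags is an illustrative assumption:

```python
# Scraping pipeline sketch: strip a Wikipedia article's HTML down to semantic
# tags, and screenshot the live page with its real styling applied.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

SEMANTIC_TAGS = {"h1", "h2", "h3", "p", "ul", "ol", "li", "table", "tr", "th", "td"}

def clean_html(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in SEMANTIC_TAGS:
            tag.unwrap()  # drop the tag but keep its text content
    return str(soup)

def screenshot(url: str, out_path: str) -> None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)  # the site's own CSS provides the real styling
        page.screenshot(path=out_path, full_page=True)
        browser.close()
```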

2. Image Generation from HTML (54% of dataset, ~15K samples)

  • Built structured HTML documents with various semantic tags
  • Rendered images using CSS to mimic real-world document types:
    • Historical manuscripts
    • Newspaper articles
    • Scientific papers
    • Invoices
    • Recipes
    • And more (~13 formats total)
  • Filled templates with plain Arabic text from open datasets
  • Simulated different layouts, styles, noise levels, fonts, and text flows
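
A sketch of this synthetic path, rendering a filled HTML template to an image with Playwright; the template and styling below are stand-ins for the ~13 real formats:

```python
# Synthetic-data sketch: fill an RTL HTML template with Arabic text and render
# it to an image. Fonts, colors, and layout here are illustrative only.
from playwright.sync_api import sync_playwright

TEMPLATE = """
<html dir="rtl"><head><style>
  body {{ font-family: 'Amiri', serif; background: #f5ecd9; padding: 2em; }}
  h1 {{ text-align: center; }}
</style></head>
<body><h1>{title}</h1><p>{body}</p></body></html>
"""

def render(title: str, body: str, out_path: str) -> None:
    html = TEMPLATE.format(title=title, body=body)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 800, "height": 1100})
        page.set_content(html)  # render the HTML directly, no server needed
        page.screenshot(path=out_path, full_page=True)
        browser.close()

# Placeholder Arabic text standing in for samples from open datasets.
render("عنوان تجريبي", "نص عربي من مجموعة بيانات مفتوحة", "sample.png")
```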

Training

Training was performed in two stages to manage memory constraints:

Epoch 1:

  • Data: 40% of training dataset
  • Learning rate: 5e-5
  • LR scheduler: Linear

Epoch 2:

  • Data: 30% of training dataset (different split)
  • Learning rate: 1e-5
  • LR scheduler: Cosine

This staged approach enabled smooth convergence within the limited computational resources of Kaggle's free tier.
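
The two stages expressed as Hugging Face TrainingArguments; only the learning rates and scheduler types above come from this README, while output directories and everything else are assumptions:

```python
# Two-stage schedule sketch. Each stage trains for one pass over its own
# split (40% of the data, then a disjoint 30%), resuming from stage-1 weights.
from transformers import TrainingArguments

stage1_args = TrainingArguments(
    output_dir="alef-ocr-stage1",  # hypothetical
    num_train_epochs=1,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
)

stage2_args = TrainingArguments(
    output_dir="alef-ocr-stage2",  # hypothetical
    num_train_epochs=1,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
)
```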

Evaluation

Evaluation metrics were computed by the NAMAA community on a private benchmark dataset that has not been publicly released.

Model                 WER    CER    BLEU
Qari-OCR-v0.3         0.84   0.73   0.17
Alef-OCR-Image2Html   0.92   0.72   0.19
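
For reference, a sketch of how these metrics are conventionally computed, assuming the jiwer and sacrebleu libraries; the benchmark itself is not public, so the strings below are placeholders:

```python
# Conventional WER/CER/BLEU computation. sacrebleu reports BLEU on a 0-100
# scale, so it is divided by 100 to match the table above.
import jiwer
import sacrebleu

references = ["<h1>مثال</h1><p>نص عربي مشكول</p>"]  # ground-truth HTML (placeholder)
hypotheses = ["<h1>مثال</h1><p>نص عربي</p>"]        # model output (placeholder)

wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score / 100

print(f"WER={wer:.2f}  CER={cer:.2f}  BLEU={bleu:.2f}")
```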

Results Analysis

  • Slightly better BLEU and Character Error Rate (CER) than Qari-OCR-v0.3
  • Higher Word Error Rate (WER) than Qari-OCR-v0.3
  • The WER gap is attributed to limited diacritics handling, as the training dataset contained few examples with diacritical marks

Resources

Future Work

This is a first version with promising performance. There is significant room for optimization and improvement, particularly in:

  • Enhanced diacritics handling
  • Expanded dataset diversity
  • Further model refinement

Related Work

This project builds upon the excellent work of the NAMAA community and their state-of-the-art Qari-OCR model, which serves as the baseline for comparison.


2025 is the year of OCR - pushing the boundaries of Arabic visual understanding.
