GLOCR: GeezLab OCR Dataset

Overview

GLOCR is a Text Recognition (TR) and Optical Character Recognition (OCR) dataset for the Tigrinya language. It contains about 661K image-label pairs drawn from multiple data sources. Besides a characters-only subset, most of the dataset consists of multi-word text-line images with labels from three categories: news (from the Haddas Ertra newspaper), the Bible, and random trigrams built from the 150k most common words in Tigrinya.

Dataset Summary

Total samples: ~661K image-label pairs
Total size: >1.3GB (tar.gz archives)
DOI: 10.7910/DVN/RQTSD2

Examples

Components

News text-lines dataset:
- Subset name: news
- Samples: train (200k), dev (15k), and test (15k)
- Download
Bible text-lines dataset:
- Subset name: bible
- Samples: train (80k), dev (10k), and test (10k)
- Download
Top 150k text-lines dataset:
- Subset name: top150k
- Samples: train (150k), dev (15k), and test (15k)
- Download
Characters dataset:
- Subset name: characters
- Samples: train (120k), dev (15k), and test (15k)
- Download
Unsegmented full-page scanned dataset:
- Subset name: unsegmented
- Samples: 506 scanned pages with the corresponding text
- Download
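
The split sizes listed above can be tallied to check them against the total in the summary. This is a minimal sketch that uses the rounded counts from the list, so the exact per-split numbers in the published archives may differ slightly; the names `SPLIT_SIZES`, `subset_total`, and `grand_total` are illustrative only.

```python
# Rounded split sizes as listed in the Components section (exact counts may differ).
SPLIT_SIZES = {
    "news": {"train": 200_000, "dev": 15_000, "test": 15_000},
    "bible": {"train": 80_000, "dev": 10_000, "test": 10_000},
    "top150k": {"train": 150_000, "dev": 15_000, "test": 15_000},
    "characters": {"train": 120_000, "dev": 15_000, "test": 15_000},
}
# The unsegmented subset has no splits, only scanned pages.
UNSEGMENTED_PAGES = 506


def subset_total(name):
    """Sum the train, dev, and test sizes of one subset."""
    return sum(SPLIT_SIZES[name].values())


def grand_total():
    """Sum all split-based subsets plus the unsegmented pages."""
    return sum(subset_total(name) for name in SPLIT_SIZES) + UNSEGMENTED_PAGES


for name in SPLIT_SIZES:
    print(f"{name}: {subset_total(name):,}")
print(f"total: {grand_total():,}")
```

With these rounded figures the tally comes to 660,506, which is consistent with the ~661K stated in the summary.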

Download Dataset

The GLOCR dataset is available on the 🤗 Datasets hub under the ID fgaim/GLOCR-Tigrinya.

The raw dataset (>1.3GB) is published on Harvard Dataverse and can be downloaded from there.
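
Since the raw data ships as tar.gz archives, unpacking can be scripted with Python's standard library. The sketch below assumes an archive has already been downloaded manually; `extract_archive` is a hypothetical helper, and the internal layout of the real archives is not described here, so it only returns the member names it found. The path check is a basic guard against entries that point outside the destination, not a full security audit.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would land outside dest_dir (path traversal).
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest, members=members)
    return [member.name for member in members]
```

A call might look like extract_archive("news.tar.gz", "glocr/news"), where both the archive name and the target folder are placeholders for whatever was actually downloaded.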

Usage of HF Dataset

Loading a specific subset

from datasets import load_dataset

# Load a specific subset, one of: news, bible, top150k, characters, unsegmented
news = load_dataset("fgaim/GLOCR-Tigrinya", "news")

# Access samples
sample = news["train"][0]
print(sample["text"])
sample["image"].show()

Loading a specific split

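from datasets import load_dataset

# Load a specific split of a subset
bible_test = load_dataset("fgaim/GLOCR-Tigrinya", "bible", split="test")

# Access samples (index the row first so that only this sample's image is decoded)
sample = bible_test[0]
print(sample["text"])
sample["image"].show()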

Loading all text-line data combined

# Load all text-line data combined
all_data = load_dataset("fgaim/GLOCR-Tigrinya", "all")

# Access samples
sample = all_data["train"][0]
print(sample["text"])
sample["image"].show()
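
When preparing the labels for a text recognizer, a common first step is to collect the characters that occur in the text field. The following is a minimal sketch, assuming the labels are plain Python strings as in the samples above; `build_charset` is a hypothetical helper and not part of the dataset or the datasets library, and the two labels are stand-ins rather than real dataset entries.

```python
from collections import Counter


def build_charset(texts):
    """Count every character across an iterable of label strings."""
    counts = Counter()
    for text in texts:
        counts.update(text)  # a str is iterated character by character
    return counts


# Stand-in labels; with the real data this could be all_data["train"]["text"].
labels = ["ሰላም ዓለም", "ሰላም"]
charset = build_charset(labels)
print(sorted(charset))
print(charset.most_common(3))
```

The resulting counter can serve as the basis for a character vocabulary, and the counts help spot rare characters before training.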

Cite

If you use this dataset in your product or research, please cite as follows:

@data{gaim-2021-glocr,
  title     = {{GLOCR: GeezLab OCR Dataset}},
  author    = {Fitsum Gaim},
  year      = {2021},
  month     = {April},
  version   = {1.0},
  publisher = {Harvard Dataverse},
  doi       = {10.7910/DVN/RQTSD2},
  url       = {https://doi.org/10.7910/DVN/RQTSD2},
  dataverse = {https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/RQTSD2}
}

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
