OpenLID - fast natural language identification for 200+ languages

OpenLID-v3

Fast natural language identification for 184 languages, plus (almost) all the data to train the model (work in progress).

Note

The previous version, OpenLID-v2 (model and dataset), is available on Hugging Face: OpenLID-v2

Features

  • Supports 184 languages
  • High performance
  • Fast and easy to use
  • Fully transparent: training data and per-language performance openly available
  • Used by HPLT

Updates compared to OpenLID-v2

  • frp_Latn, lat_Latn, srp_Latn classes added
  • not-a-language class zxx_Zxxx added
  • dyu_Latn class merged with bam_Latn class (the classes are not distinguishable with this type of model trained on the data we were able to obtain)
  • for the same reason, Arabic dialects merged into the macrolanguage ara_Arab; pes_Arab and prs_Arab merged into the macrolanguage fas_Arab
  • pilar data for oci_Latn class not used (caused false positives)
  • for some languages, training data from glotlid-corpus and Wikipedia added
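
For anyone migrating v2 outputs, the merges above amount to a label mapping along these lines (an illustrative Python sketch; the Arabic dialect codes shown are only examples of the FLORES-style codes assumed to be in v2, not the complete set):

# Illustrative mapping from merged OpenLID-v2 labels to their v3 classes.
# The Arabic dialect entries are examples, not the full v2 label set.
V2_TO_V3 = {
    "dyu_Latn": "bam_Latn",   # Dyula merged into Bambara
    "pes_Arab": "fas_Arab",   # Iranian Persian -> Persian macrolanguage
    "prs_Arab": "fas_Arab",   # Dari -> Persian macrolanguage
    "arz_Arab": "ara_Arab",   # e.g. Egyptian Arabic -> Arabic macrolanguage
    "ary_Arab": "ara_Arab",   # e.g. Moroccan Arabic -> Arabic macrolanguage
}

def to_v3(label: str) -> str:
    # Labels that were not merged pass through unchanged.
    return V2_TO_V3.get(label, label)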

Get started

OpenLID is a fastText model.

To download:

wget https://zenodo.org/records/17601701/files/openlid-v3.bin

Example: get the most likely label for each line of $DATA:

fasttext predict openlid-v3.bin $DATA > output.fasttext
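
The model can also be used from Python via the fasttext package (a minimal sketch; OpenLID labels follow fastText's __label__<code> convention, e.g. __label__eng_Latn):

# pip install fasttext
import fasttext

model = fasttext.load_model("openlid-v3.bin")

# Top-3 labels with probabilities for one line of text
# (fastText expects the input to contain no newline characters).
labels, probs = model.predict("This is an English sentence.", k=3)
for label, prob in zip(labels, probs):
    print(label.removeprefix("__label__"), f"{prob:.3f}")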

Dataset

work in progress

Adding GlotLID data

cd add_data/glotlid

For v3, we used glotlid-corpus data for several languages. It is also possible to download the data using the script download_glotlid.py.

make_list_of_glotlid_sources.py creates the list of GlotLID sources for each language and shows the number of samples in the GlotLID data. There is no need to run it, since the resulting list is already in other.tsv in the root of this repository.

The script add_from_glotlid.py shows how to select only the data sources that are of reliable quality and not proprietary. (Beware of hardcoded paths...) The list of filters there covers only the languages we have worked with before; for Scandinavian and other languages, if there are other sources, check their quality and licenses against the GlotLID list. We also collected the licenses of the sources we used at the LangID sources sheet.

That script also ensures that the GlotLID Wikipedia data do not overlap with the OpenLID Wikipedia data.
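
The overlap check is essentially line-level set subtraction; a minimal sketch of the idea (not the actual script, and the file names are placeholders):

# Drop GlotLID Wikipedia lines that already occur in the OpenLID Wikipedia
# data, so the same sentence is not used twice in training.
openlid_lines = set()
with open("openlid_wikipedia.txt", encoding="utf-8") as f:
    for line in f:
        openlid_lines.add(line.strip())

with open("glotlid_wikipedia.txt", encoding="utf-8") as fin, \
     open("glotlid_wikipedia.dedup.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.strip() not in openlid_lines:
            fout.write(line)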

Adding Wikipedia data

We also used the most recent (as of fall 2025) Wikipedia data for some languages in v3.

cd add_data/wikipedia
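
As a rough illustration of where such data comes from (a hypothetical sketch, not this folder's actual tooling; it assumes Wikimedia's standard dump URL scheme):

import urllib.request

# "fr" is just an example language code; the dump is a bz2-compressed XML file.
lang = "fr"
url = (
    f"https://dumps.wikimedia.org/{lang}wiki/latest/"
    f"{lang}wiki-latest-pages-articles.xml.bz2"
)
urllib.request.urlretrieve(url, f"{lang}wiki-latest-pages-articles.xml.bz2")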

Training

cd retrain_openlid

This folder contains mostly the OpenLID author's scripts, with minor changes. The current cleaning is language-independent.

OpenLID pipeline

Following OpenLID's instructions (be cautious: they were not fully up to date), the pipeline is as follows:

  1. Find additional data and format it by the scheme <text>\t<language>\t<source>. If it is an addition to an existing language, it can be appended either from a *.parquet or a *.tsv using the script append_to_openlid_parquet.py. If the data are for a new language, just convert them to a parquet (see the sketch after this list).

  2. Data for all languages must be in the same directory.

  3. Cleaning, deduplication, up/downsampling, writing to fastText format, and shuffling are done by make_training_openlid.py. I was able to run that script on my laptop with only 16 GB of memory, except for shuffling. If you run out of memory when shuffling, run shuf.sh on LUMI.

When running from scratch, the command is

python3 make_training_openlid.py <output_dir> --data_dir <data_dir>

If the output of stage 2 of make_training_openlid.py, named openlid_stage2_prep.fasttext, is already in the <data_dir> directory and contains only the languages of interest, the command to run the preprocessing is:

python3 make_training_openlid.py <output_dir> --skip_clean --skip_sort

  4. The training on LUMI is run by lid.sh. Don't forget to pass a new path to the data/saved model instead of the default one. The hyperparameters are the same as in OpenLID-v2.
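
For step 1, converting a TSV in the <text>\t<language>\t<source> scheme into a parquet for a new language might look like the following (a sketch assuming pandas with a parquet engine such as pyarrow; file and column names are illustrative):

import pandas as pd

# Tab-separated data in the <text>\t<language>\t<source> scheme.
df = pd.read_csv(
    "new_language.tsv",
    sep="\t",
    names=["text", "language", "source"],
    quoting=3,  # csv.QUOTE_NONE: do not treat quote characters specially
    dtype=str,
)
df.to_parquet("new_language.parquet", index=False)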

Citations

If you use our model, please cite us. If you use the dataset, please cite us plus all the articles in the citations.bib file. Thank you to everyone who put in so much work creating these datasets!

Licenses

The model is licensed under the GNU General Public License v3.0. The individual datasets that make up the training dataset have different licenses, but all allow (at minimum) free use for research - a full list is available in this repo.


This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]. The contents of this publication are the sole responsibility of the HPLT consortium and do not necessarily reflect the opinion of the European Union.
