Fast natural language identification for 184 languages, plus (almost) all the data to train the model (work in progress).
Note
The previous version, the OpenLID-v2 model and dataset, is available on HuggingFace: OpenLID-v2
- Supports 184 languages
- High performance
- Fast and easy to use
- Fully transparent: training data and per-language performance openly available
- Used by HPLT
- frp_Latn, lat_Latn, srp_Latn classes added
- not-a-language class zxx_Zxxx added
- dyu_Latn class merged with bam_Latn class (the classes are not distinguishable with this type of model trained on the data we were able to obtain)
- for the same reason, Arabic dialects merged into the macrolanguage ara_Arab; pes_Arab and prs_Arab merged into the macrolanguage fas_Arab
- pilar data for oci_Latn class not used (caused false positives)
- for some languages, training data from glotlid-corpus and Wikipedia added
OpenLID is a fastText model.
To download:
wget https://zenodo.org/records/17601701/files/openlid-v3.bin

Example to get the most likely labels for $DATA:
fasttext predict openlid-v3.bin $DATA > output.fasttext
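The same model can also be used from Python through the fastText bindings. The snippet below is a minimal sketch, not part of this repo: the example sentence and the choice of k=3 are arbitrary, and it assumes openlid-v3.bin has been downloaded as shown above.

```python
# Minimal sketch: language identification with the Python fastText bindings.
# Assumes `pip install fasttext` and that openlid-v3.bin is in the working directory.
import fasttext

model = fasttext.load_model("openlid-v3.bin")

# Top-3 labels with probabilities for one line of text (labels look like __label__fra_Latn).
labels, probs = model.predict("Ceci est une phrase en français.", k=3)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), f"{prob:.3f}")
```

The CLI equivalent for label probabilities is fasttext predict-prob openlid-v3.bin $DATA 3.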
work in progress
cd add_data/glotlid
For v3, we used glotlid-corpus data for several languages.
It is also possible to download the data using the script download_glotlid.py.
make_list_of_glotlid_sources.py creates the list of GlotLID sources for each language and shows the number of samples in the GlotLID data.
There is no need to run it, since the resulting list is in other.tsv in the root of this repository.
The script add_from_glotlid.py shows how to select only the data sources that are of reliable quality and not proprietary. (Beware of hardcoded paths...)
The list of filters there covers only the languages we have worked with so far;
for Scandinavian and other new languages, if there are additional sources, check their quality and licenses against the GlotLID source list.
We also collected the licenses of the sources we used in the LangID sources sheet.
That script also ensures that the Wikipedia part of the GlotLID data does not overlap with the OpenLID Wikipedia data.
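For a rough idea of what this selection and deduplication looks like, here is a simplified sketch. It is not the actual add_from_glotlid.py: the file names, the column layout of the per-language TSV and the allow-list of sources are illustrative assumptions.

```python
# Simplified sketch: keep only reliable, non-proprietary sources and drop
# sentences already present in the OpenLID Wikipedia data.
# Illustrative only: paths, TSV layout and the allow-list are assumptions.
import csv

allowed_sources = {"wikipedia", "tatoeba"}  # illustrative per-language allow-list

# Sentences already in the OpenLID Wikipedia portion, to avoid overlap.
with open("openlid_wikipedia_sentences.txt", encoding="utf-8") as f:
    already_in_openlid = {line.rstrip("\n") for line in f}

kept = []
with open("glotlid_lang.tsv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) != 2:
            continue
        text, source = row
        if source in allowed_sources and text not in already_in_openlid:
            kept.append((text, "xxx_Xxxx", source))  # <text>\t<language>\t<source>; placeholder label

with open("glotlid_lang_filtered.tsv", "w", encoding="utf-8", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(kept)
```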
We also used the most recent (as of fall 2025) Wikipedia data for some languages in v3.
cd add_data/wikipedia
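As an illustration of the kind of step this folder covers, the sketch below pulls Wikipedia text for one language via the Hugging Face wikimedia/wikipedia dataset and writes it in the <text>\t<language>\t<source> format used by the retraining pipeline. The snapshot name, the language code and the crude length filter are illustrative assumptions, not what the scripts here necessarily do.

```python
# Rough sketch: collect Wikipedia text for one language as <text>\t<language>\t<source>.
# The dataset snapshot and the filtering are illustrative, not the repo's actual script.
from datasets import load_dataset

wiki_code = "frp"      # Wikipedia language code (example)
label = "frp_Latn"     # corresponding OpenLID label

ds = load_dataset("wikimedia/wikipedia", f"20231101.{wiki_code}", split="train")

with open(f"wikipedia_{label}.tsv", "w", encoding="utf-8") as out:
    for article in ds:
        for line in article["text"].split("\n"):
            line = line.strip()
            if len(line.split()) >= 5:   # drop headings and very short fragments
                out.write(f"{line}\t{label}\twikipedia\n")
```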
cd retrain_openlid
This folder mostly contains the OpenLID author's scripts with minor changes. The current cleaning is language-independent.
Following OpenLID's instructions (be cautious, they were not fully up-to-date), the pipeline is as follows:
- Find additional data and format it by the scheme <text>\t<language>\t<source>. If it is an addition to an existing language, it can be appended either from a *.parquet or a *.tsv using the script append_to_openlid_parquet.py. If the data are for a new language, just convert them to a parquet (a minimal conversion sketch is given after this list).
- Data for all languages must be in the same directory.
- Cleaning, deduplication, up/downsampling, writing to fastText format and shuffling are done by make_training_openlid.py. I was able to run that script on my laptop with only 16 GB of memory, except for shuffling. If you run out of memory when shuffling, run shuf.sh on LUMI.
  When running from scratch, the command is
  python3 make_training_openlid.py <output_dir> --data_dir <data_dir>
  If the output of stage 2 of make_training_openlid.py, named openlid_stage2_prep.fasttext, is already in the <data_dir> directory and contains only the languages of interest, the command to run the preprocessing is
  python3 make_training_openlid.py <output_dir> --skip_clean --skip_sort
- The training on LUMI is run by lid.sh (a sketch of the equivalent fastText call is given after this list). Don't forget to pass a new path to the data/saved model instead of the default one. The hyperparameters are the same as in OpenLID-v2.
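As referenced in the first step above, converting a <text>\t<language>\t<source> TSV for a new language into a parquet can be as simple as the sketch below. The file and column names are illustrative; pandas with pyarrow is assumed.

```python
# Minimal sketch: turn a <text>\t<language>\t<source> TSV into a parquet file
# in the shared data directory. File and column names are illustrative.
import csv
import pandas as pd

df = pd.read_csv(
    "wikipedia_frp_Latn.tsv",
    sep="\t",
    names=["text", "language", "source"],
    quoting=csv.QUOTE_NONE,  # keep the raw text untouched
    dtype=str,
)
df.to_parquet("data_dir/frp_Latn.parquet", index=False)  # requires pyarrow (or fastparquet)
```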
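For the last step, lid.sh wraps a standard fastText supervised training run. A hedged equivalent via the Python bindings is sketched below; the hyperparameter values and paths are illustrative placeholders, and the authoritative settings are the ones in lid.sh.

```python
# Sketch of the training step via the Python fastText bindings.
# Values are illustrative; lid.sh on LUMI defines the actual hyperparameters.
import fasttext

model = fasttext.train_supervised(
    input="openlid_train.fasttext",  # shuffled output of make_training_openlid.py
    loss="softmax",
    epoch=2,
    lr=0.8,
    dim=256,
    minCount=1000,
    minn=2,          # character n-grams of length 2-5
    maxn=5,
    wordNgrams=1,
    bucket=1_000_000,
    thread=16,
)
model.save_model("openlid-v3-retrained.bin")
```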
If you use our model, please cite us. If you use the dataset, please cite us plus all the articles in the citations.bib file. Thank you to everyone who put in so much work creating these datasets!
The model is licensed under the GNU General Public License v3.0. The individual datasets that make up the training dataset have different licenses but all allow (at minimum) free use for research - a full list is available in this repo.
This project has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]. The contents of this publication are the sole responsibility of the HPLT consortium and do not necessarily reflect the opinion of the European Union.
