The indigenous languages of Guyana are under threat. A pressing challenge is finding individuals who can read and write the languages spoken in these communities: the primary reason these languages are in decline is that they are spoken but rarely written. In response, we developed an AI-centered approach to support the linguistic heritage of communities whose languages are endangered. This initiative focuses on developing a language translation portal to revitalize the Luganda language.
- Develop a language translation portal to help preserve & sustain linguistic heritage
- Provide detailed documentation of our methodologies, tools, and decisions in establishing a translation framework
The first avenue we explored was fine-tuning an existing model, given that training a model from scratch was not feasible within our time constraints. During the exploration phase, we inspected Meta’s No Language Left Behind (NLLB) model, an open-source model that supports translation across 200 languages. It was a promising foundation, as it already supports Luganda under the name “Ganda.” Our next avenue was mBART-50 from Hugging Face, a pretrained multilingual sequence-to-sequence model. Rather than being fine-tuned in a single translation direction, this model is fine-tuned in multiple directions simultaneously; it extends the original mBART model with support for 50 languages. To build familiarity with the subject matter, we first fine-tuned a model on a high-resource language pair: specifically, we fine-tuned mBART-50 for French-English translation using an existing dataset from Hugging Face. This exercise gave us a better understanding of the requirements of translation, including how to use the tokenizer.
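One detail the tokenizer handles for mBART-50 is tagging each sequence with a language-id token (the format for both source and target is `[lang_code] tokens </s>`). The sketch below illustrates that convention only; it uses whitespace splitting as a toy stand-in for the real subword tokenizer, so the token lists are illustrative, not what `MBart50Tokenizer` would actually produce.

```python
# Toy illustration of mBART-50's language-tagging convention: both source and
# target sequences are formatted as "[lang_code] tokens </s>". Whitespace
# splitting stands in for the real subword tokenizer here.
def toy_tokenize(text, lang_code):
    return [lang_code] + text.split() + ["</s>"]

src = toy_tokenize("Bonjour le monde", "fr_XX")  # fr_XX is mBART-50's French code
tgt = toy_tokenize("Hello world", "en_XX")       # en_XX is its English code
print(src)  # ['fr_XX', 'Bonjour', 'le', 'monde', '</s>']
```

In practice the tokenizer's `src_lang` and `tgt_lang` settings take care of this tagging, which is why getting comfortable with the tokenizer was a prerequisite for fine-tuning.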
Given our time and resource constraints, we decided to fine-tune the NLLB model, since it already supports Luganda. NLLB is open source and has several checkpoints available for public use, which vary by size. Since we were training on our laptops, we opted for the smallest checkpoint, NLLB-200-Distilled-600M, with 600 million parameters. It is worth noting, however, that performance would likely improve with the larger NLLB models.
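For reference, the NLLB-200 checkpoints publicly available on the Hugging Face Hub span roughly the sizes below (parameter counts are approximate). A small sketch of the selection we made:

```python
# Publicly available NLLB-200 checkpoints on the Hugging Face Hub and their
# approximate parameter counts in billions (names are the hub identifiers).
NLLB_CHECKPOINTS = {
    "facebook/nllb-200-distilled-600M": 0.6,
    "facebook/nllb-200-distilled-1.3B": 1.3,
    "facebook/nllb-200-1.3B": 1.3,
    "facebook/nllb-200-3.3B": 3.3,
}

# Given limited hardware, pick the checkpoint with the fewest parameters.
smallest = min(NLLB_CHECKPOINTS, key=NLLB_CHECKPOINTS.get)
print(smallest)  # facebook/nllb-200-distilled-600M
```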
Rather than writing a custom training loop, we used Hugging Face’s Trainer class, which provides a simple API for training in PyTorch. We loaded the tokenizer from NLLB’s smallest checkpoint and tokenized both the English and Luganda sentences. After splitting the dataset into training and test sets with an 80/20 split, we attempted to train the model in Google Colab but could not: the runtime would automatically disconnect after a period of time. To resolve this, we froze all layers in the NLLB encoder, reducing the number of trainable parameters; this sped up training significantly. Later, we moved the model to our local devices to train for more epochs, but without access to GPUs, training time increased substantially. For reference, training for 12 epochs took around 11 hours.
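The encoder-freezing step follows the standard PyTorch pattern of setting `requires_grad = False` on the encoder’s parameters. Since loading the real 600M-parameter checkpoint is heavy, the sketch below uses a minimal stand-in for the model; with the actual Hugging Face model, the same loop would iterate over the encoder’s parameters (for NLLB-style Seq2Seq models, typically `model.model.encoder.parameters()` — an assumption about the attribute path, not something verified here).

```python
class Param:
    """Minimal stand-in for torch.nn.Parameter: just a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

class DummySeq2Seq:
    """Stand-in for a Seq2Seq model with separate encoder/decoder parameters."""
    def __init__(self, n_enc=4, n_dec=4):
        self.encoder_params = [Param() for _ in range(n_enc)]
        self.decoder_params = [Param() for _ in range(n_dec)]

model = DummySeq2Seq()

# Freeze every encoder parameter so gradients are only computed for (and the
# optimizer only updates) the decoder. With the real NLLB checkpoint, this
# loop would run over model.model.encoder.parameters() instead.
for p in model.encoder_params:
    p.requires_grad = False

trainable = sum(p.requires_grad for p in model.encoder_params + model.decoder_params)
print(trainable)  # only the 4 decoder parameters remain trainable
```

Note that freezing does not shrink the model on disk or in memory; it cuts the cost of the backward pass and optimizer updates, which is what made training tractable on our hardware.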
For the model’s hyperparameters, we mainly focused on extending the number of epochs. We initially explored the learning rate and weight decay, but given our time constraints we could not perform a grid search (testing every combination of learning rate and weight decay to find the best-performing pair). From our initial tests, however, the model performed best with a learning rate of around 0.0002 and a weight decay of 0.02.
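The grid search we skipped can be sketched with `itertools.product`. The candidate values below are illustrative (we did not test these grids), apart from the settings that worked best in our initial runs (learning rate 2e-4, weight decay 0.02); each combination would correspond to one full training run scored on the validation set.

```python
from itertools import product

# Illustrative candidate grids; only (2e-4, 0.02) reflects a setting we
# actually tried. A real sweep would train and evaluate once per combination.
learning_rates = [1e-4, 2e-4, 5e-4]
weight_decays = [0.0, 0.01, 0.02]

grid = list(product(learning_rates, weight_decays))
print(len(grid))  # 3 x 3 = 9 training runs
```

With 12 epochs taking ~11 hours on CPU, a 9-run grid was clearly out of reach for us, which is why this remains future work.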
We used the Bilingual Evaluation Understudy (BLEU) score to evaluate our model’s performance; BLEU measures the similarity of the model’s output translation to a reference translation. BLEU scores range from 0 to 1, and our model, at its best, reached a BLEU score of around 0.261 after training for 12 epochs. However, the evaluation loss began to increase after 5 epochs even as BLEU continued to improve, so the model may be overfitting. Ideally, a model producing good translations should achieve a BLEU score of at least 0.3.
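To illustrate what BLEU actually measures, here is a simplified single-sentence implementation: the geometric mean of modified n-gram precisions (n = 1..4) times a brevity penalty. This is a teaching sketch only — it has no smoothing, so any sentence with a zero n-gram match scores 0 — and real evaluation should use an established library such as sacreBLEU, which also aggregates counts over the whole corpus rather than per sentence.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Returns a score in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip candidate n-gram counts by the reference counts
        # ("modified" precision, so repeating a word is not rewarded).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

An exact match scores 1.0, a translation sharing no words scores 0.0, and partial overlaps fall in between, which is the scale on which our 0.261 sits.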
We have several suggestions for anyone who continues this project. First, computational resources were the biggest limiting factor in our ability to train models. With more resources, one could use NLLB’s larger checkpoints, which have billions of parameters rather than 600 million and would likely produce better translations. One could also train for more epochs, until the training and validation losses stop decreasing, which may improve performance. Finally, since we did not have time to run a grid search over hyperparameters (learning rate and weight decay), we recommend doing so in future work.