The indigenous languages of Guyana are under threat. A pressing challenge is finding individuals who can read and write the languages spoken in these communities: the primary reason these languages are in decline is that they are spoken but rarely written. In response, we developed an AI-centered approach to support the linguistic heritage of communities whose languages are endangered. This initiative focuses on developing a language translation portal to revitalize the Luganda language.
- Develop a language translation portal to help preserve & sustain linguistic heritage
- Provide detailed documentation of our methodologies, tools, and decisions in establishing a translation framework
The first avenue we explored was fine-tuning an existing model, given that training a model from scratch was not feasible within our time constraints. During the exploration phase, we inspected Meta’s No Language Left Behind (NLLB) model, an open-source model that supports translation across 200 languages. It was a promising foundation, as it already supports Luganda under the name “Ganda.” Our next avenue was mBART-50 from Hugging Face, a pretrained multilingual sequence-to-sequence model. Rather than being fine-tuned in a single translation direction, this model is fine-tuned in multiple directions simultaneously; it extends the original mBART model with support for 50 languages. To build familiarity with the subject matter, we first fine-tuned a model on a high-resource language pair: specifically, we fine-tuned mBART-50 for French-English translation using an existing dataset from Hugging Face. This exercise gave us a better understanding of the requirements of translation, including how to use the tokenizer.
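One detail the tokenizer handles for mBART-50 is tagging each sequence with a language-id token (the format for both source and target is `[lang_code] tokens </s>`). The sketch below illustrates that convention only; it uses whitespace splitting as a toy stand-in for the real subword tokenizer, so the token lists are illustrative, not what `MBart50Tokenizer` would actually produce.

```python
# Toy illustration of mBART-50's language-tagging convention: both source and
# target sequences are formatted as "[lang_code] tokens </s>". Whitespace
# splitting stands in for the real subword tokenizer here.
def toy_tokenize(text, lang_code):
    return [lang_code] + text.split() + ["</s>"]

src = toy_tokenize("Bonjour le monde", "fr_XX")  # fr_XX is mBART-50's French code
tgt = toy_tokenize("Hello world", "en_XX")       # en_XX is its English code
print(src)  # ['fr_XX', 'Bonjour', 'le', 'monde', '</s>']
```

In practice the tokenizer's `src_lang` and `tgt_lang` settings take care of this tagging, which is why getting comfortable with the tokenizer was a prerequisite for fine-tuning.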
Given our time and resource constraints, we decided to fine-tune the NLLB model, since it already supports Luganda. NLLB is open source and has several checkpoints available for public use, which vary by size. Since we were training on our laptops, we opted for the smallest checkpoint, NLLB-200-Distilled-600M, with 600 million parameters. It is worth noting, however, that performance would likely improve with the larger NLLB models.
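For reference, the NLLB-200 checkpoints publicly available on the Hugging Face Hub span roughly the sizes below (parameter counts are approximate). A small sketch of the selection we made:

```python
# Publicly available NLLB-200 checkpoints on the Hugging Face Hub and their
# approximate parameter counts in billions (names are the hub identifiers).
NLLB_CHECKPOINTS = {
    "facebook/nllb-200-distilled-600M": 0.6,
    "facebook/nllb-200-distilled-1.3B": 1.3,
    "facebook/nllb-200-1.3B": 1.3,
    "facebook/nllb-200-3.3B": 3.3,
}

# Given limited hardware, pick the checkpoint with the fewest parameters.
smallest = min(NLLB_CHECKPOINTS, key=NLLB_CHECKPOINTS.get)
print(smallest)  # facebook/nllb-200-distilled-600M
```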
Rather than writing a custom training loop, we used Hugging Face’s Trainer class, which provides a simple API for training in PyTorch. We loaded the tokenizer from NLLB’s smallest checkpoint and tokenized both the English and Luganda sentences. After splitting the dataset into training and test sets with an 80/20 split, we attempted to train the model in Google Colab but could not: the runtime would automatically disconnect after a period of time. To resolve this, we froze all layers in the NLLB encoder, reducing the number of trainable parameters; this sped up training significantly. Later, we moved the model to our local devices to train for more epochs, but without access to GPUs, training time increased substantially. For reference, training for 12 epochs took around 11 hours.
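The encoder-freezing step follows the standard PyTorch pattern of setting `requires_grad = False` on the encoder’s parameters. Since loading the real 600M-parameter checkpoint is heavy, the sketch below uses a minimal stand-in for the model; with the actual Hugging Face model, the same loop would iterate over the encoder’s parameters (for NLLB-style Seq2Seq models, typically `model.model.encoder.parameters()` — an assumption about the attribute path, not something verified here).

```python
class Param:
    """Minimal stand-in for torch.nn.Parameter: just a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

class DummySeq2Seq:
    """Stand-in for a Seq2Seq model with separate encoder/decoder parameters."""
    def __init__(self, n_enc=4, n_dec=4):
        self.encoder_params = [Param() for _ in range(n_enc)]
        self.decoder_params = [Param() for _ in range(n_dec)]

model = DummySeq2Seq()

# Freeze every encoder parameter so gradients are only computed for (and the
# optimizer only updates) the decoder. With the real NLLB checkpoint, this
# loop would run over model.model.encoder.parameters() instead.
for p in model.encoder_params:
    p.requires_grad = False

trainable = sum(p.requires_grad for p in model.encoder_params + model.decoder_params)
print(trainable)  # only the 4 decoder parameters remain trainable
```

Note that freezing does not shrink the model on disk or in memory; it cuts the cost of the backward pass and optimizer updates, which is what made training tractable on our hardware.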
For the model’s hyperparameters, we mainly focused on extending the number of epochs. We initially explored the learning rate and weight decay, but given our time constraints we could not perform a grid search (testing every combination of learning rate and weight decay to find the best-performing pair). From our initial tests, however, the model performed best with a learning rate of around 0.0002 and a weight decay of 0.02.
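The grid search we skipped can be sketched with `itertools.product`. The candidate values below are illustrative (we did not test these grids), apart from the settings that worked best in our initial runs (learning rate 2e-4, weight decay 0.02); each combination would correspond to one full training run scored on the validation set.

```python
from itertools import product

# Illustrative candidate grids; only (2e-4, 0.02) reflects a setting we
# actually tried. A real sweep would train and evaluate once per combination.
learning_rates = [1e-4, 2e-4, 5e-4]
weight_decays = [0.0, 0.01, 0.02]

grid = list(product(learning_rates, weight_decays))
print(len(grid))  # 3 x 3 = 9 training runs
```

With 12 epochs taking ~11 hours on CPU, a 9-run grid was clearly out of reach for us, which is why this remains future work.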
We used the Bilingual Evaluation Understudy (BLEU) score to evaluate our model’s performance; BLEU measures the similarity of the model’s output translation to a reference translation. BLEU scores range from 0 to 1, and our model, at its best, reached a BLEU score of around 0.261 after training for 12 epochs. However, the evaluation loss began to increase after 5 epochs even as BLEU continued to improve, so the model may be overfitting. Ideally, a model producing good translations should achieve a BLEU score of at least 0.3.
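To illustrate what BLEU actually measures, here is a simplified single-sentence implementation: the geometric mean of modified n-gram precisions (n = 1..4) times a brevity penalty. This is a teaching sketch only — it has no smoothing, so any sentence with a zero n-gram match scores 0 — and real evaluation should use an established library such as sacreBLEU, which also aggregates counts over the whole corpus rather than per sentence.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) times a brevity penalty. Returns a score in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip candidate n-gram counts by the reference counts
        # ("modified" precision, so repeating a word is not rewarded).
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_prec_sum += math.log(overlap / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)

print(sentence_bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

An exact match scores 1.0, a translation sharing no words scores 0.0, and partial overlaps fall in between, which is the scale on which our 0.261 sits.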
We have several suggestions for anyone who continues this project. First, computational resources were the biggest limiting factor in our ability to train models. With more resources, one could use NLLB’s larger checkpoints, which have billions of parameters rather than 600 million and would likely produce better translations. One could also train for more epochs, until the training and validation losses stop decreasing, which may improve performance. Finally, since we did not have time to run a grid search over hyperparameters (learning rate and weight decay), we recommend doing so in future work.