Mbashas/gandabert
# GandaBERT: Luganda News Classification using Fine-tuned mBERT

A final-year project that fine-tunes multilingual BERT (mBERT) for Luganda news topic classification across five categories: business, health, politics, religion, and sports.

## Overview

This project builds a Luganda-language news classifier by:

1. Combining multiple data sources — the MasakhaNEWS Luganda corpus, BBC News articles translated to Luganda via the Google Cloud Translation API, and GPT-4o-generated synthetic articles translated to Luganda.
2. Fine-tuning mBERT (`bert-base-multilingual-cased`) on the combined dataset using Hugging Face Transformers.
3. Evaluating the model on a stratified held-out test set with per-category precision, recall, and F1-score.
4. Running inference on unlabeled Luganda news articles to classify them.
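
The multi-source combination in step 1 can be sketched as a simple merge of the source files into one two-column TSV. This is an illustrative sketch, not the notebook's actual code: the file names come from the repository listing, but the `text`/`category` column names are assumptions.

```python
import csv

# Hypothetical column layout: each source file has "text" and "category"
# columns. The real notebook may use different headers; adjust accordingly.
SOURCES = [
    ("new_train_luganda.tsv", "\t"),    # MasakhaNEWS Luganda data
    ("translated_generated.csv", ","),  # GPT-4o articles translated to Luganda
]

def combine(sources, out_path="combined_final.tsv"):
    """Merge (text, category) rows from several delimited files into one TSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["text", "category"])
        for path, delim in sources:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f, delimiter=delim):
                    writer.writerow([row["text"], row["category"]])
```

With the merged TSV in hand, the stratified split in step 3 is then a matter of sampling per category (e.g. scikit-learn's `train_test_split` with `stratify=`).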

## Project Structure

```
FYP_2026/
├── gandabert_complete.ipynb        # Complete pipeline notebook (Colab)
├── news_classifier_model/          # Saved model artifacts
│   ├── config.json                 # Model architecture config
│   ├── label_mapping.json          # Label-to-ID mapping
│   ├── tokenizer.json              # Tokenizer vocabulary
│   └── tokenizer_config.json       # Tokenizer settings
├── combined_final.tsv              # Combined training dataset
├── split_train.tsv                 # Training split
├── split_val.tsv                   # Validation split
├── split_test.tsv                  # Test split
├── new_train_split.tsv             # MasakhaNEWS train split
├── new_test_split.tsv              # MasakhaNEWS test split
├── new_train_luganda.tsv           # MasakhaNEWS Luganda training data
├── train_Luganda.tsv               # Original Luganda training set
├── test_Luganda.tsv                # Original Luganda test set
├── generated_news_articles.csv     # GPT-4o generated articles (English)
├── translated_generated.csv        # Translated synthetic articles (Luganda)
├── training_curves.png             # Training/validation loss curves
├── confusion_matrix.png            # Test set confusion matrix
├── dataset_distribution.png        # Category distribution chart
├── article_lengths.png             # Article length distribution
├── eda_split_verification.png      # Train/val/test split verification
└── inference_results.png           # Inference results on unlabeled data
```

## Key Results

| Output | Description |
|---|---|
| Table 1 | Dataset composition by category & source |
| Table 2 | Per-category precision, recall, F1-score |
| Figure 1 | Training and validation loss curves |
| Figure 2 | Validation accuracy over training steps |
| Figure 3 | Confusion matrix |
| Figure 4 | Category distribution comparison |
| Figure 5 | Article length distribution |
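
The per-category metrics reported in Table 2 follow the standard definitions. The notebook uses scikit-learn for this; the sketch below recomputes the same quantities in plain Python to make the definitions concrete:

```python
from collections import Counter

def per_category_metrics(y_true, y_pred):
    """Precision, recall and F1 per category, from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for category t
        else:
            fp[p] += 1          # p was predicted but the truth was t
            fn[t] += 1          # t was missed
    metrics = {}
    for cat in set(y_true) | set(y_pred):
        prec = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        rec = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[cat] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics
```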

## Large Files (Not in This Repo)

The following files are excluded from the repository due to GitHub size limits. See instructions below to obtain them.

### Model Weights (~692 MB)

The trained model weights (`news_classifier_model/model.safetensors`) should be downloaded and placed in the `news_classifier_model/` directory.

TODO: Upload the model to Hugging Face Hub and add the download link here.

### Source Datasets

| File | Size | Description |
|---|---|---|
| `classified_news.csv` | 63 MB | Classified Luganda news articles |
| `luganda-news-articles.csv` | 32 MB | Raw Luganda news articles corpus |
| `translated.csv` | 11 MB | BBC articles translated to Luganda |
| `bbc.csv` | 4.9 MB | Original BBC News dataset (English) |

These datasets were used to build the combined training set. The processed splits (`split_train.tsv`, `split_val.tsv`, `split_test.tsv`) are included in the repo and are sufficient to reproduce training.

## Setup and Reproduction

### Requirements

The notebook is designed to run on Google Colab with GPU acceleration (T4 recommended).

```shell
pip install transformers datasets evaluate scikit-learn seaborn matplotlib wordcloud
```

### Running the Notebook

1. Open `gandabert_complete.ipynb` in Google Colab.
2. Mount Google Drive or upload the required data files.
3. Run all cells sequentially.

The notebook covers the complete pipeline:

- Environment setup and imports
- Data loading and exploratory analysis
- Translation pipeline (Google Cloud Translation API)
- Synthetic data generation (GPT-4o) and translation
- Dataset combination and preparation
- Stratified train/validation/test split
- Tokenization and model training
- Evaluation and visualization
- Inference on unlabeled articles
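
In the inference step, the model's output logits are mapped back to category names via the saved `news_classifier_model/label_mapping.json`. A minimal sketch of that mapping, assuming the file stores `{"business": 0, ...}` (the actual schema in the repo may differ):

```python
import json
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, mapping_path="news_classifier_model/label_mapping.json"):
    """Map a logit vector to (category, confidence) using the saved label mapping."""
    with open(mapping_path, encoding="utf-8") as f:
        label_to_id = json.load(f)  # assumed shape: {"business": 0, ...}
    id_to_label = {i: lab for lab, i in label_to_id.items()}
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_label[best], probs[best]
```

The logits themselves would come from the fine-tuned model, e.g. `AutoModelForSequenceClassification.from_pretrained("news_classifier_model")` once the weights are in place.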

## Tools and Libraries

- **Hugging Face Transformers** (Wolf et al., 2020) — model loading, tokenization, training
- **scikit-learn** (Pedregosa et al., 2011) — evaluation metrics, stratified splitting
- **Google Cloud Translation API v3** — English-to-Luganda neural machine translation
- **OpenAI API (GPT-4o)** — synthetic news article generation
- **pandas / NumPy** — data manipulation
- **matplotlib / seaborn** — visualization
- **PyTorch** — deep learning backend

## License

This project is part of a Final Year Project (2025).

## About

Fine-tuned mBERT for Luganda news classification across 5 categories (Politics, Business, Sports, Health, Religion) — advancing NLP for low-resource African languages.
