Mbashas/gandabert
# GandaBERT: Luganda News Classification using Fine-tuned mBERT

A final-year project that fine-tunes multilingual BERT (mBERT) for Luganda news topic classification across five categories: business, health, politics, religion, and sports.

## Overview

This project builds a Luganda-language news classifier by:

1. Combining multiple data sources — the MasakhaNEWS Luganda corpus, BBC News articles translated to Luganda via the Google Cloud Translation API, and GPT-4o-generated synthetic articles translated to Luganda.
2. Fine-tuning mBERT (`bert-base-multilingual-cased`) on the combined dataset using Hugging Face Transformers.
3. Evaluating the model on a stratified held-out test set with per-category precision, recall, and F1-score.
4. Running inference on unlabeled Luganda news articles to classify them.
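
The multi-source combination in step 1 can be sketched as a simple merge of the source files into one two-column TSV. This is an illustrative sketch, not the notebook's actual code: the file names come from the repository listing, but the `text`/`category` column names are assumptions.

```python
import csv

# Hypothetical column layout: each source file has "text" and "category"
# columns. The real notebook may use different headers; adjust accordingly.
SOURCES = [
    ("new_train_luganda.tsv", "\t"),    # MasakhaNEWS Luganda data
    ("translated_generated.csv", ","),  # GPT-4o articles translated to Luganda
]

def combine(sources, out_path="combined_final.tsv"):
    """Merge (text, category) rows from several delimited files into one TSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["text", "category"])
        for path, delim in sources:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f, delimiter=delim):
                    writer.writerow([row["text"], row["category"]])
```

With the merged TSV in hand, the stratified split in step 3 is then a matter of sampling per category (e.g. scikit-learn's `train_test_split` with `stratify=`).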

## Project Structure

```
FYP_2026/
├── gandabert_complete.ipynb        # Complete pipeline notebook (Colab)
├── news_classifier_model/          # Saved model artifacts
│   ├── config.json                 # Model architecture config
│   ├── label_mapping.json          # Label-to-ID mapping
│   ├── tokenizer.json              # Tokenizer vocabulary
│   └── tokenizer_config.json       # Tokenizer settings
├── combined_final.tsv              # Combined training dataset
├── split_train.tsv                 # Training split
├── split_val.tsv                   # Validation split
├── split_test.tsv                  # Test split
├── new_train_split.tsv             # MasakhaNEWS train split
├── new_test_split.tsv              # MasakhaNEWS test split
├── new_train_luganda.tsv           # MasakhaNEWS Luganda training data
├── train_Luganda.tsv               # Original Luganda training set
├── test_Luganda.tsv                # Original Luganda test set
├── generated_news_articles.csv     # GPT-4o generated articles (English)
├── translated_generated.csv        # Translated synthetic articles (Luganda)
├── training_curves.png             # Training/validation loss curves
├── confusion_matrix.png            # Test set confusion matrix
├── dataset_distribution.png        # Category distribution chart
├── article_lengths.png             # Article length distribution
├── eda_split_verification.png      # Train/val/test split verification
└── inference_results.png           # Inference results on unlabeled data
```

## Key Results

| Output | Description |
|---|---|
| Table 1 | Dataset composition by category & source |
| Table 2 | Per-category precision, recall, F1-score |
| Figure 1 | Training and validation loss curves |
| Figure 2 | Validation accuracy over training steps |
| Figure 3 | Confusion matrix |
| Figure 4 | Category distribution comparison |
| Figure 5 | Article length distribution |
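
The per-category metrics reported in Table 2 follow the standard definitions. The notebook uses scikit-learn for this; the sketch below recomputes the same quantities in plain Python to make the definitions concrete:

```python
from collections import Counter

def per_category_metrics(y_true, y_pred):
    """Precision, recall and F1 per category, from parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for category t
        else:
            fp[p] += 1          # p was predicted but the truth was t
            fn[t] += 1          # t was missed
    metrics = {}
    for cat in set(y_true) | set(y_pred):
        prec = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        rec = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics[cat] = {"precision": prec, "recall": rec, "f1": f1}
    return metrics
```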

## Large Files (Not in This Repo)

The following files are excluded from the repository due to GitHub size limits. See instructions below to obtain them.

### Model Weights (~692 MB)

The trained model weights (`news_classifier_model/model.safetensors`) should be downloaded and placed in the `news_classifier_model/` directory.

TODO: Upload the model to Hugging Face Hub and add the download link here.

### Source Datasets

| File | Size | Description |
|---|---|---|
| `classified_news.csv` | 63 MB | Classified Luganda news articles |
| `luganda-news-articles.csv` | 32 MB | Raw Luganda news articles corpus |
| `translated.csv` | 11 MB | BBC articles translated to Luganda |
| `bbc.csv` | 4.9 MB | Original BBC News dataset (English) |

These datasets were used to build the combined training set. The processed splits (`split_train.tsv`, `split_val.tsv`, `split_test.tsv`) are included in the repo and are sufficient to reproduce training.

## Setup and Reproduction

### Requirements

The notebook is designed to run on Google Colab with GPU acceleration (T4 recommended).

```shell
pip install transformers datasets evaluate scikit-learn seaborn matplotlib wordcloud
```

### Running the Notebook

1. Open `gandabert_complete.ipynb` in Google Colab.
2. Mount Google Drive or upload the required data files.
3. Run all cells sequentially.

The notebook covers the complete pipeline:

- Environment setup and imports
- Data loading and exploratory analysis
- Translation pipeline (Google Cloud Translation API)
- Synthetic data generation (GPT-4o) and translation
- Dataset combination and preparation
- Stratified train/validation/test split
- Tokenization and model training
- Evaluation and visualization
- Inference on unlabeled articles
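
In the inference step, the model's output logits are mapped back to category names via the saved `news_classifier_model/label_mapping.json`. A minimal sketch of that mapping, assuming the file stores `{"business": 0, ...}` (the actual schema in the repo may differ):

```python
import json
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, mapping_path="news_classifier_model/label_mapping.json"):
    """Map a logit vector to (category, confidence) using the saved label mapping."""
    with open(mapping_path, encoding="utf-8") as f:
        label_to_id = json.load(f)  # assumed shape: {"business": 0, ...}
    id_to_label = {i: lab for lab, i in label_to_id.items()}
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id_to_label[best], probs[best]
```

The logits themselves would come from the fine-tuned model, e.g. `AutoModelForSequenceClassification.from_pretrained("news_classifier_model")` once the weights are in place.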

## Tools and Libraries

- **Hugging Face Transformers** (Wolf et al., 2020) — model loading, tokenization, training
- **scikit-learn** (Pedregosa et al., 2011) — evaluation metrics, stratified splitting
- **Google Cloud Translation API v3** — English-to-Luganda neural machine translation
- **OpenAI API (GPT-4o)** — synthetic news article generation
- **pandas / NumPy** — data manipulation
- **matplotlib / seaborn** — visualization
- **PyTorch** — deep learning backend

## License

This project is part of a Final Year Project (2025).

## About

Fine-tuned mBERT for Luganda news classification across 5 categories (Politics, Business, Sports, Health, Religion) — advancing NLP for low-resource African languages.
