A final-year project that fine-tunes multilingual BERT (mBERT) for Luganda news topic classification across five categories: business, health, politics, religion, and sports.
This project builds a Luganda-language news classifier by:
- Combining multiple data sources — MasakhaNEWS Luganda corpus, BBC News articles translated to Luganda via Google Cloud Translation API, and GPT-4o-generated synthetic articles translated to Luganda.
- Fine-tuning mBERT (`bert-base-multilingual-cased`) on the combined dataset using Hugging Face Transformers.
- Evaluating the model on a stratified held-out test set with per-category precision, recall, and F1-score.
- Running inference on unlabeled Luganda news articles to classify them.
```
FYP_2026/
├── gandabert_complete.ipynb      # Complete pipeline notebook (Colab)
├── news_classifier_model/        # Saved model artifacts
│   ├── config.json               # Model architecture config
│   ├── label_mapping.json        # Label-to-ID mapping
│   ├── tokenizer.json            # Tokenizer vocabulary
│   └── tokenizer_config.json     # Tokenizer settings
├── combined_final.tsv            # Combined training dataset
├── split_train.tsv               # Training split
├── split_val.tsv                 # Validation split
├── split_test.tsv                # Test split
├── new_train_split.tsv           # MasakhaNEWS train split
├── new_test_split.tsv            # MasakhaNEWS test split
├── new_train_luganda.tsv         # MasakhaNEWS Luganda training data
├── train_Luganda.tsv             # Original Luganda training set
├── test_Luganda.tsv              # Original Luganda test set
├── generated_news_articles.csv   # GPT-4o generated articles (English)
├── translated_generated.csv      # Translated synthetic articles (Luganda)
├── training_curves.png           # Training/validation loss curves
├── confusion_matrix.png          # Test set confusion matrix
├── dataset_distribution.png      # Category distribution chart
├── article_lengths.png           # Article length distribution
├── eda_split_verification.png    # Train/val/test split verification
└── inference_results.png         # Inference results on unlabeled data
```
| Output | Description |
|---|---|
| Table 1 | Dataset composition by category & source |
| Table 2 | Per-category precision, recall, F1-score |
| Figure 1 | Training and validation loss curves |
| Figure 2 | Validation accuracy over training steps |
| Figure 3 | Confusion matrix |
| Figure 4 | Category distribution comparison |
| Figure 5 | Article length distribution |
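The per-category metrics in Table 2 are the kind produced by scikit-learn's `classification_report`. A toy sketch of the report shape — the labels and predictions below are invented for illustration, not the project's results:

```python
from sklearn.metrics import classification_report

LABELS = ["business", "health", "politics", "religion", "sports"]

# Illustrative ground-truth and predicted labels only.
y_true = ["sports", "health", "politics", "sports", "business", "religion"]
y_pred = ["sports", "health", "politics", "health", "business", "religion"]

# Human-readable per-category precision/recall/F1 table.
print(classification_report(y_true, y_pred, labels=LABELS, zero_division=0))

# The same numbers as a dict, convenient for building Table 2 programmatically.
metrics = classification_report(
    y_true, y_pred, labels=LABELS, zero_division=0, output_dict=True
)
```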
The following files are excluded from the repository due to GitHub size limits. See instructions below to obtain them.
The trained model weights (`news_classifier_model/model.safetensors`) should be downloaded and placed in the `news_classifier_model/` directory.
TODO: Upload the model to Hugging Face Hub and add the download link here.
| File | Size | Description |
|---|---|---|
| classified_news.csv | 63 MB | Classified Luganda news articles |
| luganda-news-articles.csv | 32 MB | Raw Luganda news articles corpus |
| translated.csv | 11 MB | BBC articles translated to Luganda |
| bbc.csv | 4.9 MB | Original BBC News dataset (English) |
These datasets were used to build the combined training set. The processed splits (`split_train.tsv`, `split_val.tsv`, `split_test.tsv`) are included in the repo and are sufficient to reproduce training.
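The splits were produced with stratified sampling so each category keeps its proportion across train/validation/test. A minimal scikit-learn sketch on a toy stand-in for `combined_final.tsv` — the 80/10/10 ratio here is an assumption, shown for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for combined_final.tsv: 20 rows per category, 100 total.
df = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "category": ["business", "health", "politics", "religion", "sports"] * 20,
})

# Carve off the training set first, then split the remainder into
# validation/test, stratifying on the label column at each step.
train, rest = train_test_split(
    df, test_size=0.2, stratify=df["category"], random_state=42
)
val, test = train_test_split(
    rest, test_size=0.5, stratify=rest["category"], random_state=42
)

print(len(train), len(val), len(test))  # 80 10 10
```

Stratifying both splits keeps every category at exactly its overall share in each subset, which is what the `eda_split_verification.png` chart checks.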
The notebook is designed to run on Google Colab with GPU acceleration (T4 recommended).
```
pip install transformers datasets evaluate scikit-learn seaborn matplotlib wordcloud
```

- Open `gandabert_complete.ipynb` in Google Colab.
- Mount Google Drive or upload the required data files.
- Run all cells sequentially.
The notebook covers the complete pipeline:
- Environment setup and imports
- Data loading and exploratory analysis
- Translation pipeline (Google Cloud Translation API)
- Synthetic data generation (GPT-4o) and translation
- Dataset combination and preparation
- Stratified train/validation/test split
- Tokenization and model training
- Evaluation and visualization
- Inference on unlabeled articles
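Once the model weights are downloaded, the inference step reduces to loading the saved directory with a `text-classification` pipeline. A hedged sketch — the helper name `classify` is ours, and the commented-out headline is an invented example, not repo data:

```python
from pathlib import Path

def classify(texts, model_dir="news_classifier_model"):
    """Classify Luganda articles with the saved fine-tuned model."""
    # model.safetensors is excluded from the repo; fail early if it is absent.
    if not Path(model_dir, "model.safetensors").exists():
        raise FileNotFoundError(
            f"{model_dir}/model.safetensors is missing; download the weights "
            "first (see the excluded-files section above)."
        )
    from transformers import pipeline  # deferred: only needed once weights exist
    clf = pipeline("text-classification", model=model_dir, truncation=True)
    return clf(texts)  # list of {"label": ..., "score": ...} dicts

# Example usage once the weights are in place (hypothetical headline):
# classify(["Ekibiina kya FUFA kitegese empaka empya."])
```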
- Hugging Face Transformers (Wolf et al., 2020) — model loading, tokenization, training
- scikit-learn (Pedregosa et al., 2011) — evaluation metrics, stratified splitting
- Google Cloud Translation API v3 — English-to-Luganda neural machine translation
- OpenAI API (GPT-4o) — synthetic news article generation
- pandas / NumPy — data manipulation
- matplotlib / seaborn — visualization
- PyTorch — deep learning backend
This repository was developed as part of a Final Year Project (2025).