This project evaluates and compares the fluency of various Large Language Models (LLMs) in under-represented languages, with a primary focus on Kinyarwanda. It helps assess which models perform best for specific languages, enabling better data governance and more inclusive AI applications.
The global AI landscape often overlooks languages with fewer digital resources. This project addresses this gap by:
- Evaluating multiple LLMs' ability to understand and generate fluent responses in under-represented languages
- Providing comparative analysis across different service providers and model sizes
- Creating a framework for systematic evaluation of language fluency
- Supporting data governance by identifying the most appropriate models for specific language tasks
The project currently supports evaluation across these LLM providers:
- OpenAI: GPT-4o, GPT-3.5-Turbo, GPT-4-Turbo
- Anthropic: Claude-3-Sonnet, Claude-3-Haiku
- Groq: Llama3-70B (via Groq API)
- Google: Gemini 2.0 Flash
- Google Translate: For baseline comparison
- Digital Umuganda MT: Locally developed machine translation service for Kinyarwanda
- Other local MT services: Can be integrated through the flexible input system
The project uses a dataset of Kinyarwanda questions across various service categories:
- Government services (Irembo)
- Land registration
- Passport services
- Permits and licenses
- National ID and documentation
- Tax services
- Marriage registration
- And more
Each question is evaluated for fluency on a scale of 1-10, with model-generated responses collected and compared.
The system is designed to work with any language or domain by providing custom data in the `data/input` directory. The input files should follow this format:

```
question,language,topic
"What is your question here?",language_code,topic_category
```

For example:

```
"Kubera iki mumaze igihe kinini mutaduha uburenganzira bwo gufungura Irembo ryacu?",kinyarwanda,Irembo Services
"How do I apply for a passport?",english,Passport Services
"Comment puis-je payer mes impôts?",french,Taxes
```

This flexible structure allows you to evaluate LLM fluency in any language or domain of interest.
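Loading and validating this input format might look like the following sketch. The required column names match the format above; the parsing helper itself (`load_questions`) is illustrative, not part of the project's scripts:

```python
import csv
import io

# Sample rows in the expected input format (question,language,topic).
SAMPLE = '''question,language,topic
"Kubera iki mumaze igihe kinini mutaduha uburenganzira bwo gufungura Irembo ryacu?",kinyarwanda,Irembo Services
"How do I apply for a passport?",english,Passport Services
'''

def load_questions(text):
    """Parse input rows and check that the required columns are present."""
    reader = csv.DictReader(io.StringIO(text))
    required = {"question", "language", "topic"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return list(reader)

rows = load_questions(SAMPLE)
print(len(rows), rows[0]["language"])  # 2 kinyarwanda
```

In a real run you would pass the contents of a file under `data/input` instead of the inline sample.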
1. Clone this repository to your local machine.

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a `.env` file from the provided `.env.example`:

   ```bash
   cp .env.example .env
   ```

5. Add your API keys for the LLM providers you want to use:
   - OpenAI: https://platform.openai.com/api-keys
   - Anthropic: https://console.anthropic.com/
   - Groq: https://console.groq.com/
   - Google (Gemini): https://aistudio.google.com/
   - RapidAPI: for certain translation services
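Since not every user will configure every provider, a run can detect which keys are set and skip the rest. The environment variable names below are assumptions based on common provider conventions; check `.env.example` for the names this project actually uses:

```python
import os

# Provider -> environment variable holding its API key.
# Variable names are assumptions; see .env.example for the real ones.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "groq": "GROQ_API_KEY",
    "google": "GOOGLE_API_KEY",
}

def available_providers(env=os.environ):
    """Return the providers whose API keys are set, so a run can skip the rest."""
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]

print(available_providers({"OPENAI_API_KEY": "sk-test"}))  # ['openai']
```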
To evaluate the fluency of questions in Kinyarwanda:

```bash
python evaluate_kinyarwanda_questions.py
```

This script:
- Processes a set of Kinyarwanda questions
- Sends them to multiple LLMs for fluency evaluation
- Collects responses and fluency scores
- Generates a CSV file with comparative results
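The evaluation loop behind these steps can be sketched roughly as below. This is a minimal, illustrative version: the stub lambdas stand in for real provider API clients, and the model names and column layout are assumptions, not the script's exact implementation:

```python
def evaluate(questions, models):
    """Score each question with each model.

    `models` maps a model name to a callable returning (response, fluency_score).
    A real run would call the provider APIs; stubs are used here.
    """
    rows = []
    for question in questions:
        row = {"question": question}
        for name, ask in models.items():
            response, score = ask(question)
            row[f"{name}_response"] = response
            row[f"{name}_fluency"] = score
        rows.append(row)
    return rows

# Stub models standing in for real API clients (names are illustrative).
models = {
    "gpt-4o": lambda q: ("...", 9),
    "claude-3-sonnet": lambda q: ("...", 8),
}
results = evaluate(["Nigute nasaba pasiporo?"], models)
print(results[0]["gpt-4o_fluency"])  # 9
```

Each row of the result maps directly onto one line of the output CSV, with one response and one fluency column per model.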
To create a new dataset of questions:

```bash
python create_questions_csv.py
```

To run a direct comparison of responses across different LLMs:

```bash
python compare_llms.py
```

To consolidate question-answer pairs from multiple sources:

```bash
python consolidate_qna.py
```

This project supports better data governance for under-represented languages by:
- Identifying which models provide the most fluent and accurate responses
- Highlighting gaps in language understanding across different LLMs
- Providing a framework for evaluating AI systems before deployment in multilingual contexts
- Supporting more inclusive AI development that respects linguistic diversity
- Benchmarking locally developed MT services against state-of-the-art commercial offerings
- Enabling data-driven decisions about which language models to deploy for specific communities
The evaluation produces CSV files containing:
- Original questions in the target language
- Topic categorization
- Responses from each LLM
- Fluency scores (1-10 scale)
- Comparative analysis across models
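A simple way to turn these per-question rows into a model-level comparison is to average each model's fluency column. The column naming convention (`<model>_fluency`) is an assumption about the output CSV layout, and the scores below are illustrative:

```python
from statistics import mean

# Illustrative per-question rows; column names follow an assumed
# "<model>_fluency" convention in the output CSV.
rows = [
    {"gpt-4o_fluency": 9, "claude-3-haiku_fluency": 7},
    {"gpt-4o_fluency": 8, "claude-3-haiku_fluency": 6},
]

def mean_fluency(rows):
    """Average each model's fluency scores across all questions."""
    models = [k for k in rows[0] if k.endswith("_fluency")]
    return {m.removesuffix("_fluency"): mean(r[m] for r in rows) for m in models}

print(mean_fluency(rows))  # {'gpt-4o': 8.5, 'claude-3-haiku': 6.5}
```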
Initial findings show that more advanced models (GPT-4o, Claude-3-Sonnet) typically outperform smaller models, but performance varies significantly by language and topic.
A key feature of this project is the ability to compare locally developed machine translation services (like Digital Umuganda for Kinyarwanda) against global commercial offerings. This comparison helps:
- Identify strengths and weaknesses of local MT services
- Provide quantitative evidence of where local services excel or need improvement
- Support the development of specialized language models for under-represented languages
- Create a feedback loop for improving local language technologies
The analysis can reveal cases where locally developed, specialized models outperform larger general-purpose models on specific languages or domains, highlighting the value of targeted language technology development.
Contributions to expand the project to other under-represented languages are welcome. Please feel free to submit pull requests or open issues for discussion.
This project is available under the MIT License.