A framework for creating high-quality instruction datasets for low-resource languages by combining large language model reasoning capabilities with structured linguistic knowledge. This repository contains the implementation used to create over 2 million Bambara conversations for language model training.
This project addresses the critical shortage of instruction datasets for Bambara through a novel methodology that combines linguistic knowledge injection with the reasoning capabilities of large language models. Rather than relying on direct translation, our approach enables deliberative linguistic transformation that respects Bambara's complex morphology and syntax.
- Knowledge-Enhanced Translation: Integrates glossaries, grammar rules, and annotated examples (a minimal prompt-construction sketch follows this list)
- Reasoning-Based Processing: Leverages LLM reasoning for complex linguistic transformations
- High-Performance Architecture: Concurrent processing with caching, fault tolerance, and checkpointing
- Scale: Successfully generated 2M+ Bambara conversations from diverse English/French datasets
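The knowledge-injection step can be illustrated with a minimal sketch. The names below (`build_prompt`, `glossary`, `grammar_rules`, `examples`) are illustrative assumptions, not this repository's actual API; the point is simply that the lexical entries, grammar rules, and annotated examples relevant to a source sentence are inserted into the prompt before the LLM is asked to reason about the transformation.

```python
# Hypothetical sketch of knowledge-enhanced prompt construction.
# Function and parameter names are illustrative, not the repository's actual API.

def build_prompt(source_text: str,
                 glossary: dict[str, str],
                 grammar_rules: list[str],
                 examples: list[tuple[str, str]]) -> str:
    """Assemble a reasoning prompt that injects linguistic knowledge."""
    # Keep only glossary entries whose source word appears in the input.
    relevant = {w: bam for w, bam in glossary.items()
                if w.lower() in source_text.lower()}
    glossary_block = "\n".join(f"- {w} -> {bam}" for w, bam in relevant.items())
    rules_block = "\n".join(f"- {r}" for r in grammar_rules)
    examples_block = "\n".join(f"EN: {en}\nBAM: {bam}" for en, bam in examples)

    return (
        "You are translating into Bambara. Reason step by step about "
        "morphology and SOV word order before giving the final answer.\n\n"
        f"Glossary:\n{glossary_block}\n\n"
        f"Grammar rules:\n{rules_block}\n\n"
        f"Annotated examples:\n{examples_block}\n\n"
        f"Source:\n{source_text}\n"
    )
```

In a pipeline like the one described here, the assembled prompt would then be sent to the LLM client, and the response validated before being written to the dataset.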
Our system generated over 2 million Bambara conversational samples from multiple source datasets, achieving a 98.5% processing success rate. The 93.4% reduction in validation loss during language model training suggests that the dataset exhibits strong internal consistency and linguistic coherence, enabling effective model generalization. The resulting outputs are grammatically well-formed, respecting Bambara's SOV word order and morphological rules.
[COMING SOON]
The framework can be adapted to other low-resource languages by providing the following resources (a minimal loading sketch follows this list):
- Lexical resources (glossary mapping source to target language)
- Grammatical rule specifications (structured rules covering morphology and syntax)
- Annotated examples (Universal Dependencies or similar annotations)
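As a rough illustration, these three resources could be supplied as a small bundle loaded from disk. The file names and the `LanguageResources` class below are hypothetical assumptions for the sketch, not files shipped with this repository.

```python
# Hypothetical resource bundle for adapting the framework to a new language.
# File names and the LanguageResources class are illustrative assumptions.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass
class LanguageResources:
    glossary: dict        # source word -> target word
    grammar_rules: list   # structured morphology/syntax rules
    examples: list        # annotated (source, target) pairs, e.g. from UD treebanks

def load_resources(base: Path) -> LanguageResources:
    """Load the three knowledge sources the framework expects."""
    return LanguageResources(
        glossary=json.loads((base / "glossary.json").read_text(encoding="utf-8")),
        grammar_rules=json.loads((base / "grammar_rules.json").read_text(encoding="utf-8")),
        examples=json.loads((base / "examples.json").read_text(encoding="utf-8")),
    )
```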
@article{diallo2025bambara,
  title={Linguistically-Informed Large Language Models for Low-Resource Instruction Dataset Creation},
  author={Diallo, Seydou},
  journal={Unpublished manuscript},
  year={2025},
  month={July}
}