Bambara Instruction Dataset Creation

License: MIT

A framework for creating high-quality instruction datasets for low-resource languages by combining large language model reasoning capabilities with structured linguistic knowledge. This repository contains the implementation used to create over 2 million Bambara conversations for language model training.

Overview

This project addresses the critical shortage of instruction datasets for Bambara through a novel methodology that combines linguistic knowledge injection with the reasoning capabilities of large language models. Rather than relying on direct translation, our approach enables deliberative linguistic transformation that respects Bambara's complex morphology and syntax.
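
As a rough illustration of the knowledge-injection step, the sketch below assembles glossary entries, grammar rules, and an annotated example into a single reasoning prompt. All names here (build_prompt, the resource arguments, the prompt wording) are hypothetical stand-ins for whatever templates and model client the repository actually uses.

# Hypothetical sketch of knowledge-enhanced prompt construction.
# Resource formats and prompt wording are illustrative, not the repo's API.

def build_prompt(source_text, glossary, grammar_rules, examples):
    """Assemble a knowledge-injected translation prompt for one sample."""
    # Inject only glossary entries whose source word actually occurs,
    # to keep the prompt compact.
    relevant = {src: tgt for src, tgt in glossary.items()
                if src.lower() in source_text.lower()}
    lines = ["Translate the text into Bambara. Reason step by step about",
             "word order and morphology, then output only the translation.", ""]
    if relevant:
        lines.append("Glossary (English -> Bambara):")
        lines += [f"  {src} -> {tgt}" for src, tgt in relevant.items()]
    lines.append("Grammar rules to respect:")
    lines += [f"  - {rule}" for rule in grammar_rules]
    lines.append("Annotated examples:")
    lines += [f"  EN: {en}\n  BM: {bm}" for en, bm in examples]
    lines += ["", f"Text: {source_text}"]
    return "\n".join(lines)

prompt = build_prompt(
    "The woman bought rice at the market.",
    glossary={"woman": "muso", "rice": "malo", "market": "sugu"},
    grammar_rules=["Bambara uses SOV word order.",
                   "The perfective marker 'ye' precedes the verb in transitive clauses."],
    examples=[("The child drank water.", "Den ye ji min.")],
)
print(prompt)

The assembled prompt would then go to a reasoning-capable LLM, with the intermediate reasoning discarded in favor of the final Bambara output.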

Key Features

  • Knowledge-Enhanced Translation: Integrates glossaries, grammar rules, and annotated examples
  • Reasoning-Based Processing: Leverages LLM reasoning for complex linguistic transformations
  • High-Performance Architecture: Concurrent processing with caching, fault tolerance, and checkpointing (a minimal sketch follows this list)
  • Scale: Successfully generated 2M+ Bambara conversations from diverse English/French datasets
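
A minimal sketch of what the concurrency and checkpointing pattern could look like, assuming a JSONL checkpoint file that doubles as a cache of completed samples; the repository's actual pipeline is likely organized differently.

# Hypothetical concurrency/checkpointing pattern, not the repo's pipeline code.
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

CHECKPOINT = Path("checkpoint.jsonl")   # doubles as a cache of finished work

def load_done_ids():
    """Collect sample IDs finished in a previous run so work can resume."""
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def translate(sample):
    """Stand-in for the knowledge-enhanced LLM call sketched in the Overview."""
    return {"bambara": f"<translation of: {sample['text']}>"}

def run(samples, workers=8):
    done = load_done_ids()                        # skip already-processed samples
    todo = [s for s in samples if s["id"] not in done]
    with CHECKPOINT.open("a", encoding="utf-8") as out, \
         ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(translate, s): s for s in todo}
        for fut in as_completed(futures):
            sample = futures[fut]
            try:
                result = fut.result()
            except Exception:
                continue                          # fault tolerance: skip failures;
                                                  # a retry queue could go here
            out.write(json.dumps({"id": sample["id"], **result},
                                 ensure_ascii=False) + "\n")
            out.flush()                           # checkpoint after every sample

if __name__ == "__main__":
    run([{"id": "ex-1", "text": "The farmer planted millet."}])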

Results

Our system generated over 2 million Bambara conversational samples from multiple source datasets, achieving a 98.5% processing success rate. The 93.4% reduction in validation loss during language model training suggests that the dataset exhibits strong internal consistency and linguistic coherence, enabling effective model generalization. The resulting outputs are grammatically coherent, adhering to Bambara’s SOV word order and morphological rules.

Evaluations

[COMING SOON]

[Screenshot: MALIBA-AI Chat]

Adaptation for Other Languages

The framework can be adapted to other low-resource languages by providing the following resources (a sketch of one possible layout follows the list):

  1. Lexical resources (glossary mapping source to target language)
  2. Grammatical rule specifications (structured rules covering morphology and syntax)
  3. Annotated examples (Universal Dependencies or similar annotations)
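
As one possible on-disk layout for those three resources, the sketch below loads a JSON glossary, a plain-text rule list, and a CoNLL-U example file; the file names and schemas are assumptions, not a format the framework is known to require.

# Illustrative resource layout for adapting the framework to a new language.
# File names and schemas are assumptions, not the framework's required format.
import json
from pathlib import Path

def load_resources(resource_dir):
    base = Path(resource_dir)
    return {
        # 1. Lexical resources: {"water": "ji", "woman": "muso", ...}
        "glossary": json.loads((base / "glossary.json").read_text(encoding="utf-8")),
        # 2. Grammatical rules: one structured rule per line, e.g.
        #    "Word order is SOV: subject, object, verb."
        "grammar_rules": (base / "grammar_rules.txt")
            .read_text(encoding="utf-8").splitlines(),
        # 3. Annotated examples: Universal Dependencies sentences in CoNLL-U,
        #    parsed into (source, target) pairs downstream.
        "examples": (base / "examples.conllu").read_text(encoding="utf-8"),
    }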

Citation

@unpublished{diallo2025bambara,
  title={Linguistically-Informed Large Language Models for Low-Resource Instruction Dataset Creation},
  author={Diallo, Seydou},
  note={Unpublished manuscript},
  month={July},
  year={2025}
}
