Bambara Instruction Dataset Creation

License: MIT

A framework for creating high-quality instruction datasets for low-resource languages by combining large language model reasoning capabilities with structured linguistic knowledge. This repository contains the implementation used to create over 2 million Bambara conversations for language model training.

Overview

This project addresses the critical shortage of instruction datasets for Bambara through a novel methodology that combines linguistic knowledge injection with the reasoning capabilities of large language models. Rather than relying on direct translation, our approach enables deliberative linguistic transformation that respects Bambara's complex morphology and syntax.
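
As a rough illustration of the knowledge-injection step, the sketch below assembles glossary entries, grammar rules, and an annotated example into a single reasoning prompt. All names here (build_prompt, the resource arguments, the prompt wording) are hypothetical stand-ins for whatever templates and model client the repository actually uses.

# Hypothetical sketch of knowledge-enhanced prompt construction.
# Resource formats and prompt wording are illustrative, not the repo's API.

def build_prompt(source_text, glossary, grammar_rules, examples):
    """Assemble a knowledge-injected translation prompt for one sample."""
    # Inject only glossary entries whose source word actually occurs,
    # to keep the prompt compact.
    relevant = {src: tgt for src, tgt in glossary.items()
                if src.lower() in source_text.lower()}
    lines = ["Translate the text into Bambara. Reason step by step about",
             "word order and morphology, then output only the translation.", ""]
    if relevant:
        lines.append("Glossary (English -> Bambara):")
        lines += [f"  {src} -> {tgt}" for src, tgt in relevant.items()]
    lines.append("Grammar rules to respect:")
    lines += [f"  - {rule}" for rule in grammar_rules]
    lines.append("Annotated examples:")
    lines += [f"  EN: {en}\n  BM: {bm}" for en, bm in examples]
    lines += ["", f"Text: {source_text}"]
    return "\n".join(lines)

prompt = build_prompt(
    "The woman bought rice at the market.",
    glossary={"woman": "muso", "rice": "malo", "market": "sugu"},
    grammar_rules=["Bambara uses SOV word order.",
                   "The perfective marker 'ye' precedes the verb in transitive clauses."],
    examples=[("The child drank water.", "Den ye ji min.")],
)
print(prompt)

The assembled prompt would then go to a reasoning-capable LLM, with the intermediate reasoning discarded in favor of the final Bambara output.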

Key Features

  • Knowledge-Enhanced Translation: Integrates glossaries, grammar rules, and annotated examples
  • Reasoning-Based Processing: Leverages LLM reasoning for complex linguistic transformations
  • High-Performance Architecture: Concurrent processing with caching, fault tolerance, and checkpointing (a minimal sketch follows this list)
  • Scale: Successfully generated 2M+ Bambara conversations from diverse English/French datasets
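
A minimal sketch of what the concurrency and checkpointing pattern could look like, assuming a JSONL checkpoint file that doubles as a cache of completed samples; the repository's actual pipeline is likely organized differently.

# Hypothetical concurrency/checkpointing pattern, not the repo's pipeline code.
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

CHECKPOINT = Path("checkpoint.jsonl")   # doubles as a cache of finished work

def load_done_ids():
    """Collect sample IDs finished in a previous run so work can resume."""
    if not CHECKPOINT.exists():
        return set()
    with CHECKPOINT.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def translate(sample):
    """Stand-in for the knowledge-enhanced LLM call sketched in the Overview."""
    return {"bambara": f"<translation of: {sample['text']}>"}

def run(samples, workers=8):
    done = load_done_ids()                        # skip already-processed samples
    todo = [s for s in samples if s["id"] not in done]
    with CHECKPOINT.open("a", encoding="utf-8") as out, \
         ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(translate, s): s for s in todo}
        for fut in as_completed(futures):
            sample = futures[fut]
            try:
                result = fut.result()
            except Exception:
                continue                          # fault tolerance: skip failures;
                                                  # a retry queue could go here
            out.write(json.dumps({"id": sample["id"], **result},
                                 ensure_ascii=False) + "\n")
            out.flush()                           # checkpoint after every sample

if __name__ == "__main__":
    run([{"id": "ex-1", "text": "The farmer planted millet."}])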

Results

Our system generated over 2 million Bambara conversational samples from multiple source datasets, achieving a 98.5% processing success rate. The 93.4% reduction in validation loss during language model training suggests that the dataset exhibits strong internal consistency and linguistic coherence, enabling effective model generalization. The resulting outputs are grammatically coherent, adhering to Bambara’s SOV word order and morphological rules.

Evaluations

[COMING SOON]

[Screenshot: MALIBA-AI Chat]

Adaptation for Other Languages

The framework can be adapted to other low-resource languages by providing the following resources (a sketch of one possible layout follows the list):

  1. Lexical resources (glossary mapping source to target language)
  2. Grammatical rule specifications (structured rules covering morphology and syntax)
  3. Annotated examples (Universal Dependencies or similar annotations)
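
As one possible on-disk layout for those three resources, the sketch below loads a JSON glossary, a plain-text rule list, and a CoNLL-U example file; the file names and schemas are assumptions, not a format the framework is known to require.

# Illustrative resource layout for adapting the framework to a new language.
# File names and schemas are assumptions, not the framework's required format.
import json
from pathlib import Path

def load_resources(resource_dir):
    base = Path(resource_dir)
    return {
        # 1. Lexical resources: {"water": "ji", "woman": "muso", ...}
        "glossary": json.loads((base / "glossary.json").read_text(encoding="utf-8")),
        # 2. Grammatical rules: one structured rule per line, e.g.
        #    "Word order is SOV: subject, object, verb."
        "grammar_rules": (base / "grammar_rules.txt")
            .read_text(encoding="utf-8").splitlines(),
        # 3. Annotated examples: Universal Dependencies sentences in CoNLL-U,
        #    parsed into (source, target) pairs downstream.
        "examples": (base / "examples.conllu").read_text(encoding="utf-8"),
    }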

Citation

@unpublished{diallo2025bambara,
  title={Linguistically-Informed Large Language Models for Low-Resource Instruction Dataset Creation},
  author={Diallo, Seydou},
  note={Unpublished manuscript},
  month={July},
  year={2025}
}
