
mabogunje/yorudi


Yorùdí

A Standardized, Accessible, & Downloadable Comprehensive Yoruba Multilingual Dictionary

The Yorùdí project aims to compile a complete multilingual lexical database with Yoruba as the pivot language. The project is modeled after the CC-CEDICT project by Paul Andrew Denisowski, which was itself modeled on the highly successful EDICT project by Jim Breen. The former is a Chinese-English electronic dictionary and the latter a Japanese-English dictionary.

This dictionary, in addition to being standardized and downloadable by humans, also aims to be accessible to machines. As such, it consists of 3 parts:

  1. A Command Line interface for programmers & programs to use directly
  2. A REST API, hosted on Azure, for access over the internet
  3. A Simple Front-End (modeled after Tangorin) hosted here on GitHub Pages

The combination of these 3 access points provides a tool that gives access to the Yoruba language in a way that appeals to both programmers (like me), and the ordinary human user... at least I hope it does.

Usage

1. Using the Command Line

The Command Line application can be accessed by running the Yorudi Class as main i.e.

sbt 'runMain net.mabogunje.yorudi.Yorudi [required arguments] [optional arguments] [yoruba word]'

1.1 Required Arguments
  • --dict (dictionary) This specifies which dictionary you want to query. Acceptable values are:
    • cms This refers to the Church Missionary Society Yoruba Dictionary (currently incomplete)
    • gpt This refers to the ChatGPT dictionary (a dictionary of the 100 most popular Yoruba words for each letter of the alphabet, according to ChatGPT)
    • names This refers to the Yoruba Personal Names dictionary based on the book by Adeboye Babalola & Olugboyega Alaba (currently incomplete)
    • sample This is a sample dictionary intended for testing
1.2 Optional Arguments
  • -s (strict) Return only results with exact tone matches
  • -g (glossary) Return a glossary of all words related to your query
  • -d (derivative) Return all derivative words of your query
  • --fmt (format) Return results in the specified format. Options are:
    • plain (standard command-line output)
    • xml (Extensible Markup Language)
    • json (JavaScript Object Notation)
1.3 Examples
  1. Find all words matching "aba" in the cms dictionary (tone-insensitive)

    sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms aba'

  2. Find all words matching "àbà" in the cms dictionary (tone-sensitive)

    sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -s àbà'

  3. Display a glossary of all words related to "àbà" in the cms dictionary

    sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -g àbà'

  4. Find all words derived from "àbà" in the cms dictionary

    sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -d àbà'
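To script these lookups from Scala, the same invocation can be assembled programmatically. The sketch below is illustrative only: `YorudiCli`, `sbtCommand`, and `run` are hypothetical names that are not part of Yorùdí, and it assumes that `--fmt` takes its value as the next argument, in the same way `--dict` does.

```scala
import scala.sys.process._

// Hypothetical helper, not part of Yorùdí: builds the sbt invocation
// shown in the examples above.
object YorudiCli {
  val MainClass = "net.mabogunje.yorudi.Yorudi"

  // flag: one of the optional switches ("-s", "-g", "-d"), if any.
  // fmt:  "plain", "xml" or "json" (assumed to follow --fmt as its value).
  def sbtCommand(
      dict: String,
      word: String,
      flag: Option[String] = None,
      fmt: Option[String] = None
  ): Seq[String] = {
    val options = flag.toList ++ fmt.toList.flatMap(f => List("--fmt", f))
    // sbt expects "runMain ..." as ONE argument, hence the quotes in the README.
    val runMain =
      ((List("runMain", MainClass, "--dict", dict) ++ options) :+ word).mkString(" ")
    Seq("sbt", runMain)
  }

  // Runs sbt in the current directory (which must be the project root)
  // and returns everything it printed to standard output.
  def run(
      dict: String,
      word: String,
      flag: Option[String] = None,
      fmt: Option[String] = None
  ): String =
    sbtCommand(dict, word, flag, fmt).!!
}
```

For instance, `YorudiCli.sbtCommand("cms", "àbà", flag = Some("-s"))` yields the same two arguments as example 2 above. Passing the command as a `Seq` rather than a single string avoids any shell quoting issues with accented characters.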

2. Using the REST Service

Self Hosted

Run:

    sbt 'runMain YorubaRestService'

The service will spring up at http://localhost:3330 with a basic webpage you can use to query the dictionaries. You can also check the RESTful responses directly by visiting: http://localhost:3330/word/YOURWORD. Parameters are the same as the Azure service detailed below.

Hosted on Azure

The service is already hosted for free on Azure as mentioned earlier. Below are the details:

Endpoints

The service currently has only one endpoint, which accepts 1 argument (the word to look up) and up to 2 parameters:

  1. https://yorudi.azurewebsites.net/word/
Parameters

When you query the word endpoint, you can supply up to 2 parameters:

  1. dictionary: Acceptable values for the dictionary parameter are as follows:

    • cms This refers to the Church Missionary Society Yoruba Dictionary (currently incomplete)
    • gpt This refers to the ChatGPT dictionary (a dictionary of the 100 most popular Yoruba words for each letter of the alphabet, according to ChatGPT)
    • names This refers to the Yoruba Personal Names dictionary based on the book by Adeboye Babalola & Olugboyega Alaba (currently incomplete)
    • sample This is a sample dictionary intended for testing
  2. mode: Acceptable values for the mode parameter are as follows:

    • match Returns any matching word in the dictionary (tone-insensitive)
    • strict Returns any matching word in the dictionary (tone-sensitive)
    • related Returns any word that contains the queried word in its decomposition
    • derivative Returns any word that has the queried word as its root

Tip

The default dictionary is cms, and the default mode is match. So hitting the endpoint without using any parameters will return matching results from the cms dictionary.
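As a rough illustration of calling the word endpoint from code, the sketch below builds a request URL and fetches the raw response. It rests on assumptions the text above does not confirm: that `dictionary` and `mode` are passed as query-string parameters, and that the word is appended to the `/word/` path as in the self-hosted example. `YorudiClient`, `wordUri`, and `lookup` are hypothetical names, and the code needs Java 11 or later for `java.net.http`.

```scala
import java.net.{URI, URLEncoder}
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical client, not part of Yorùdí.
object YorudiClient {
  val BaseUrl = "https://yorudi.azurewebsites.net/word/"

  // Percent-encode accented letters; URLEncoder writes spaces as "+",
  // which is only valid in query strings, so rewrite them as %20.
  private def enc(s: String): String =
    URLEncoder.encode(s, UTF_8).replace("+", "%20")

  // Defaults mirror the documented ones: cms dictionary, match mode.
  // ASSUMPTION: parameters travel in the query string.
  def wordUri(word: String, dictionary: String = "cms", mode: String = "match"): URI =
    URI.create(s"$BaseUrl${enc(word)}?dictionary=${enc(dictionary)}&mode=${enc(mode)}")

  // Returns the response body unparsed, since the response format
  // is not described in this README.
  def lookup(word: String, dictionary: String = "cms", mode: String = "match"): String = {
    val request = HttpRequest.newBuilder(wordUri(word, dictionary, mode)).GET().build()
    HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
      .body()
  }
}
```

To point the same code at a self-hosted instance, the base URL would become `http://localhost:3330/word/`.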

3. Using the Simple Front-End

If you are not looking to contribute to or self-host this project, and you just want to use the dictionary, you can access the Front-End at https://mabogunje.github.io/yorudi. It's also a good place to see if any changes committed to the dictionaries work as expected.

An Introduction to Yoruba & The Problem

Yoruba is the native tongue of the Yoruba people of West Africa. It is tonal (like Chinese), with a romanized writing system for marking tone and pronunciation. That is to say, like Chinese Pinyin and Japanese Romaji, Yoruba can be written entirely within the extended Latin alphabet.

That notwithstanding, the construction of words in Yoruba is still fundamentally different from other languages, and it is my belief that because existing databases do not take this into account, they fail to provide an adequate level of detail in their definitions. In particular, the way most Yoruba words are made up of other Yoruba words is not taken advantage of.

Contractions in Yoruba

At its core, Yoruba has very few self-contained words of more than 4 letters (if any at all). All other words are created through the combination and permutation of these shorter words, and as such the direct meaning of any word is little more than the sum of its parts.

Similarly, the spellings of words are always the result of merging their components. This merging may be done in any of 3 ways.

  1. Linking :- This is a simple joining of words

    bi + bọ = bibọ i.e "ask" + "to worship" = "that which is to be worshipped"

  2. Elision :- This is the deletion of a vowel when joining words

    ní + ilé = n'ílé i.e "in" + "house" = "in the house"

  3. Assimilation :- This is the inheritance by a vowel of another vowel sound when joining words

    kú + ilé = kúulé i.e "greet" + "house" = "greetings!"

To learn more about the Yoruba people and their language, see http://yorupedia.com/

Creating a Yorùdí File

Check out this sample dictionary and others in the dicts folder for examples.

Such files may be easily created with any text editor able to save to .txt. Once created, you can change the extension to .yor so it will be recognized as a translation file.

Understanding Yorùdí Entries

Given the unique properties of the Yoruba language (as detailed above), a specialized input format is used to accurately record words. Details of this format are below:

            yoruba decomposition (2)                   optional attribute list (4)
                    v                                             v
        gbogbo [gbó . gbó]  /all  /many  /every  <first: attribute | second: attribute>
           ^                              ^                       
    simplified yoruba (1)       glossary of definitions (3)

1. Simplified Yoruba

This is simply the word in the standard Roman alphabet.

  • It should be recorded as it is spoken in the Oyo dialect for consistency
  • Neither tone nor decomposition should be indicated, e.g. ati, jeun, loke, sugbon
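One way to derive the simplified form from a fully marked word is to strip its diacritics through Unicode decomposition. This is a sketch under that assumption, not necessarily how Yorùdí itself produces or matches simplified words; `Simplify.simplify` is a hypothetical name.

```scala
import java.text.Normalizer

// Hypothetical helper, not part of Yorùdí.
object Simplify {
  // NFD splits each accented or underdotted letter into a base letter
  // followed by combining marks; \p{M} then removes every such mark,
  // which drops both the tone accents and the underdots.
  def simplify(word: String): String =
    Normalizer.normalize(word, Normalizer.Form.NFD).replaceAll("\\p{M}", "")
}
```

With this, `Simplify.simplify("ṣùgbọ́n")` gives `sugbon`, matching the last example in the list above. Words that carry no marks pass through unchanged.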

2. Yoruba Decomposition

Here the word must be fully specified to include the following properties:

  • Tone marks
  • Component words (making sure to identify the root)
  • Linguistic properties, i.e. Assimilation and Elision

3. Glossary

The glossary is a list of synonymous words and phrases in the target language.

  • Each synonym must be preceded by a forward slash
  • Each glossary entry may optionally feature short annotations in parentheses
  • For readability, each slash in the glossary should be two (2) spaces away from the preceding entry

4. Attribute List

The attribute list may be used to indicate special properties such as indexes into other Yorùdí language dictionaries. In most cases a contributor need not concern themselves with these.

  • The attribute list must be denoted by angle brackets e.g. < attrib. list >
  • Each attribute consists of a key-value pair separated by a colon, and attributes must be separated from one another by a vertical bar
  • For readability, there should always be a space between vertical bars and attributes, as well as between the colon and the value in each key-value pair (as in the example entry above)
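The four parts of an entry can be pulled apart with a single regular expression. The sketch below is an illustration of the format as described in this section, not the parser Yorùdí actually uses; `Entry`, `EntryParser`, and `parseEntry` are hypothetical names, and the decomposition is kept as raw text because its internal notation is not fully specified here.

```scala
// Hypothetical model of one dictionary line, not taken from Yorùdí.
final case class Entry(
    simplified: String,             // (1) word without tone marks
    decomposition: String,          // (2) raw text between [ and ]
    glossary: List[String],         // (3) slash-separated definitions
    attributes: Map[String, String] // (4) optional key-value pairs
)

object EntryParser {
  // Groups: word, [decomposition], glossary text, optional <attribute list>.
  private val EntryLine =
    """^\s*(\S+)\s*\[([^\]]*)\]\s*(.*?)\s*(?:<([^>]*)>)?\s*$""".r

  def parseEntry(line: String): Option[Entry] = line match {
    case EntryLine(word, decomp, gloss, attrs) =>
      val glossary = gloss.split('/').map(_.trim).filter(_.nonEmpty).toList
      // attrs is null when the optional attribute list is absent.
      val attributes = Option(attrs).toList
        .flatMap(_.split('|').toList)
        .map(_.split(":", 2).map(_.trim))
        .collect { case Array(key, value) if key.nonEmpty => key -> value }
        .toMap
      Some(Entry(word, decomp.trim, glossary, attributes))
    case _ => None // line does not follow the entry format
  }
}
```

Applied to the example entry above, this yields the word `gbogbo`, the decomposition `gbó . gbó`, the glossary `all`, `many`, `every`, and two attributes keyed `first` and `second`. Returning `Option` lets a caller skip malformed lines instead of failing on them.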

ADDITIONAL NOTES

Writing some Yoruba characters requires that your keyboard is configured for writing accented and underdotted letters. The way to do this varies by operating system.

Mac Configuration

  1. Go to System Preferences -> Keyboard -> Input Sources
  2. Check the US Extended and US International Keyboards

Accenting a letter is best done with the US International Keyboard.

  • Acute accents are added by pressing ['] then the letter
  • Grave accents are added by pressing [`] then the letter

Underdotting a letter is best done with the US Extended Keyboard.

  • Press [Option] + [X] at the same time, then press the letter. OR
  • Press the letter, then press [Option] + [Shift] + [X] at the same time

About

(Scala) Multi-Lingual Yoruba Dictionary
