The Yorùdí project aims to compile a complete multilingual lexical database with Yoruba as the pivot language. The project is modelled on the CC-CEDICT project by Paul Andrew Denisowski, which was itself modelled on the highly successful EDICT project by Jim Breen. The former is a Chinese-English electronic dictionary and the latter a Japanese-English dictionary.
This dictionary, in addition to being standardized and downloadable by humans, also aims to be accessible to machines. As such, it consists of 3 parts:
- A command-line interface for programmers and programs to use directly.
- A REST API, hosted on Azure, for access over the internet.
- A simple front-end (modelled after Tangorin), hosted here on GitHub Pages.
The combination of these 3 access points provides a tool that gives access to the Yoruba language in a way that appeals to both programmers (like me), and the ordinary human user... at least I hope it does.
The command-line application is run by invoking the Yorudi class as the main class, i.e.
`sbt runMain net.mabogunje.yorudi.Yorudi [required arguments] [optional arguments] [yoruba word]`
- --dict (dictionary) This specifies which dictionary you want to query. Acceptable values are:
  - cms This refers to the Church Missionary Society Yoruba Dictionary (currently incomplete)
  - gpt This refers to the ChatGPT dictionary (a dictionary of 100 of the most popular Yoruba words in each alphabet according to ChatGPT)
  - names This refers to the Yoruba Personal Names dictionary based on the book by Adeboye Babalola & Olugboyega Alaba (currently incomplete)
  - sample This is a sample dictionary intended for testing
- -s (strict) Return only results with exact tone matches
- -g (glossary) Return a glossary of all words related to your query
- -d (derivative) Return all derivative words of your query
- --fmt (format) Return results in the specified format. Options are:
  - plain (standard command-line output)
  - xml (Extensible Markup Language)
  - json (JavaScript Object Notation)
- Find all words matching "aba" in the cms dictionary (tone-insensitive):
  `sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms aba'`
- Find all words matching "àbà" in the cms dictionary (tone-sensitive):
  `sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -s àbà'`
- Display a glossary of all words related to "àbà" in the cms dictionary:
  `sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -g àbà'`
- Find all words derived from "àbà" in the cms dictionary:
  `sbt 'runMain net.mabogunje.yorudi.Yorudi --dict cms -d àbà'`
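For programs that need to call the command-line application, the invocations above can be assembled from their parts. The following is a minimal Python sketch, not part of the project: `build_command` and the mode names `match` and `glossary` are hypothetical, and it assumes that `--fmt` takes its value as the next token, which the option list above does not state explicitly.

```python
import shlex

MAIN_CLASS = "net.mabogunje.yorudi.Yorudi"
DICTIONARIES = {"cms", "gpt", "names", "sample"}
FORMATS = {"plain", "xml", "json"}
# Hypothetical mode names mapped to the flags documented above.
MODE_FLAGS = {"match": None, "strict": "-s", "glossary": "-g", "derivative": "-d"}


def build_command(word, dictionary="cms", mode="match", fmt=None):
    """Return an argv list for sbt; the runMain call is a single argument."""
    if dictionary not in DICTIONARIES:
        raise ValueError(f"unknown dictionary: {dictionary!r}")
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")

    args = ["runMain", MAIN_CLASS, "--dict", dictionary]
    if MODE_FLAGS[mode]:
        args.append(MODE_FLAGS[mode])
    if fmt is not None:
        # Assumption: the format name follows --fmt as a separate token.
        args += ["--fmt", fmt]
    args.append(word)
    return ["sbt", " ".join(args)]


# shlex.join renders the list as it would be typed in a shell.
print(shlex.join(build_command("àbà", mode="strict")))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with accented characters.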
Run:
`sbt 'runMain YorubaRestService'`
The service will spring up at http://localhost:3330 with a basic webpage you can use to query the dictionaries. You can also check the RESTful responses directly by visiting: http://localhost:3330/word/YOURWORD. Parameters are the same as the Azure service detailed below.
The service is already hosted for free on Azure as mentioned earlier. Below are the details:
As you can see above, the service currently has only one endpoint which accepts 1 argument and up to 2 parameters. Supported endpoints are below:
When you query the word endpoint, you can supply up to 2 parameters:
- dictionary: Acceptable values for the dictionary parameter are as follows:
  - cms This refers to the Church Missionary Society Yoruba Dictionary (currently incomplete)
  - gpt This refers to the ChatGPT dictionary (a dictionary of 100 of the most popular Yoruba words in each alphabet according to ChatGPT)
  - names This refers to the Yoruba Personal Names dictionary based on the book by Adeboye Babalola & Olugboyega Alaba (currently incomplete)
  - sample This is a sample dictionary intended for testing
- mode: Acceptable values for the mode parameter are as follows:
  - match Returns any matching word in the dictionary (tone-insensitive)
  - strict Returns any matching word in the dictionary (tone-sensitive)
  - related Returns any word that contains the queried word in its decomposition
  - derivative Returns any word that has the queried word as its root
Tip
The default dictionary is cms, and the default mode is match. So hitting the endpoint without using any parameters will return matching results from the cms dictionary.
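A request URL for the word endpoint can be built as in the Python sketch below. It targets the local service at http://localhost:3330 and assumes that `dictionary` and `mode` are passed as query-string parameters; the helper name `word_url` is hypothetical. The word is percent-encoded because tone-marked letters are not valid in a raw URL path.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:3330"  # local service; swap in the hosted URL if needed
DICTIONARIES = {"cms", "gpt", "names", "sample"}
MODES = {"match", "strict", "related", "derivative"}


def word_url(word, dictionary="cms", mode="match", base_url=BASE_URL):
    """Build the URL for looking up one word (assumes query-string parameters)."""
    if dictionary not in DICTIONARIES:
        raise ValueError(f"unknown dictionary: {dictionary!r}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    path = quote(word, safe="")  # percent-encode accented letters as UTF-8
    query = urlencode({"dictionary": dictionary, "mode": mode})
    return f"{base_url}/word/{path}?{query}"


print(word_url("aba"))
print(word_url("àbà", mode="strict"))
```

The resulting string can be opened in a browser or fetched with any HTTP client, for example `urllib.request.urlopen`.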
If you are not looking to contribute to or self-host this project, and you just want to use the dictionary, you can access the Front-End at https://mabogunje.github.io/yorudi. It's also a good place to see if any changes committed to the dictionaries work as expected.
Yoruba is the native tongue of the Yoruba people of West Africa. It is tonal (like Chinese), with a romanized writing system for demarcating tone and pronunciation. That is to say, like Chinese Pinyin and Japanese Romaji, Yoruba can be written entirely within the extended Latin alphabet.
That notwithstanding, the construction of words in Yoruba is still fundamentally different from other languages, and it is my belief that because existing databases do not take this into account, they fail to provide an adequate level of detail in their definitions. In particular, the way most Yoruba words are made up of other Yoruba words is not taken advantage of.
At its core, Yoruba has very few self-contained words over 4 letters (if any at all). All other words are created through the combination and permutation of this core vocabulary, and as such the direct meaning of any word is little more than the sum of its parts.
Similarly, the spellings of words are always the result of merging their components. This merging may be done in any of 3 ways.
- Linking: This is a simple joining of words.
  bi + bọ = bibọ, i.e. "ask" + "to worship" = "that which is to be worshipped"
- Elision: This is the deletion of a vowel when joining words.
  ní + ilé = n'ílé, i.e. "in" + "house" = "in the house"
- Assimilation: This is the inheritance by a vowel of another vowel sound when joining words.
  kú + ilé = kúulé, i.e. "greet" + "house" = "greetings!"
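The three kinds of merging can be modelled on Unicode strings. The Python sketch below is a toy model that only aims to reproduce the three examples above; it is not a general account of Yoruba phonology and not the project's implementation. The function names are hypothetical, and the rule that an elided vowel passes its tone to the following vowel is an assumption inferred from the single example ní + ilé = n'ílé.

```python
import unicodedata

VOWELS = set("aeiou")
TONES = {"\u0300", "\u0301"}  # combining grave (low tone) and acute (high tone)


def _nfd(text):
    return unicodedata.normalize("NFD", text)


def _nfc(text):
    return unicodedata.normalize("NFC", text)


def _split_last_vowel(word):
    """Return (stem, vowel, marks) for a word that ends in a vowel."""
    chars = _nfd(word)
    end = len(chars)
    while end > 0 and unicodedata.combining(chars[end - 1]):
        end -= 1
    if end == 0 or chars[end - 1] not in VOWELS:
        raise ValueError(f"{word!r} does not end in a vowel")
    return chars[: end - 1], chars[end - 1], chars[end:]


def _split_first_vowel(word):
    """Return (vowel, marks, rest) for a word that starts with a vowel."""
    chars = _nfd(word)
    if not chars or chars[0] not in VOWELS:
        raise ValueError(f"{word!r} does not start with a vowel")
    end = 1
    while end < len(chars) and unicodedata.combining(chars[end]):
        end += 1
    return chars[0], chars[1:end], chars[end:]


def link(first, second):
    """Linking: plain concatenation."""
    return _nfc(first + second)


def elide(first, second):
    """Elision: drop the first word's final vowel, marking the gap with an apostrophe."""
    stem, _, marks = _split_last_vowel(first)
    vowel, own_marks, rest = _split_first_vowel(second)
    carried = "".join(m for m in marks if m in TONES)
    if carried:
        # Assumption: the dropped vowel's tone replaces the next vowel's tone.
        own_marks = "".join(m for m in own_marks if m not in TONES)
    return _nfc(stem + "'" + vowel + own_marks + carried + rest)


def assimilate(first, second):
    """Assimilation: the second word's initial vowel takes the preceding vowel's quality."""
    _, vowel, marks = _split_last_vowel(first)
    _, _, rest = _split_first_vowel(second)
    quality = "".join(m for m in marks if m not in TONES)  # keep an underdot, drop tone
    return _nfc(_nfd(first) + vowel + quality + rest)


print(link("bi", "bọ"), elide("ní", "ilé"), assimilate("kú", "ilé"))
```

Working on the decomposed (NFD) form keeps base letters, underdots and tone marks separate, so each can be moved or dropped independently before the result is recomposed.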
To learn more about the Yoruba people and their language, see http://yorupedia.com/
Check out this sample dictionary and others in the dicts folder for examples.
Such files may be easily created with any text editor able to save to .txt. Once created, you can change the extension to .yor so it will be recognized as a translation file.
Given the unique properties of the Yoruba language (as detailed above), a specialized input format is used to accurately record words. Details of this format are below:
    yoruba decomposition (2)                   optional attribute list (4)
                v                                           v
    gbogbo [gbó . gbó] /all /many /every <first: attribute | second: attribute>
       ^                       ^
    simplified yoruba (1)  glossary of definitions (3)
The simplified Yoruba (1) is simply the word in the standard Roman alphabet.
- It should be recorded as it is spoken in the Oyo dialect for consistency
- Neither tone nor decomposition should be indicated, e.g. ati, jeun, loke, sugbon
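Judging by the examples above (ati, jeun, loke, sugbon), the simplified form drops both tone marks and underdots. Under that assumption, it can be derived from a fully marked word by removing every combining mark, as in this Python sketch. The function name `simplify` is hypothetical, and the project may compute this differently.

```python
import unicodedata


def simplify(word):
    """Strip tone marks and underdots, leaving plain Roman letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def same_ignoring_tone(a, b):
    """Tone-insensitive comparison, in the spirit of the default match mode."""
    return simplify(a) == simplify(b)


print(simplify("ṣùgbọ́n"))               # sugbon
print(same_ignoring_tone("àbà", "aba"))  # True
```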
In the Yoruba decomposition (2), the word must be fully specified to include the following properties:
- Tone marks
- Component words (making sure to identify the root)
- Linguistic properties, i.e. Assimilation and Elision
The glossary (3) is a list of synonymous words and phrases in the target language.
- Each synonym must be separated by a forward slash
- Each glossary entry may optionally feature short annotations in parentheses
- For readability, each slash in the glossary should be two (2) spaces away from the last entry
The attribute list (4) may be used to indicate special properties such as indexes into other Yòrúdí language dictionaries. In most cases a contributor need not concern themselves with these.
- The attribute list must be denoted by angle brackets e.g. < attrib. list >
- Each attribute consists of a key-value pair separated by a colon, and attributes must be separated from one another by a vertical bar
- For readability, there should always be a space between vertical bars and attributes, as well as between the colon and the value in each key-value pair (as in the previous example)
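To make the four-part layout concrete, here is a minimal Python sketch of a parser for a single entry line. It is an illustration built only from the rules described above, not the project's actual parser: the names `Entry` and `parse_entry` are hypothetical, the decomposition is kept as raw text because its internal notation is not fully specified here, and glossary entries are assumed not to contain slashes or angle brackets.

```python
import re
from dataclasses import dataclass, field

ENTRY_PATTERN = re.compile(
    r"""
    ^\s*
    (?P<word>[^\[\]/<>]+?)\s*           # (1) simplified Yoruba
    \[(?P<decomposition>[^\]]*)\]\s*    # (2) decomposition, in square brackets
    (?P<glossary>/[^<]*?)\s*            # (3) glossary, from the first slash
    (?:<(?P<attributes>[^>]*)>)?        # (4) optional attribute list
    \s*$
    """,
    re.VERBOSE,
)


@dataclass
class Entry:
    word: str
    decomposition: str
    glossary: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)


def parse_entry(line):
    """Parse one dictionary line into its four parts."""
    match = ENTRY_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a valid entry: {line!r}")
    # Glossary entries are introduced by forward slashes.
    glossary = [g.strip() for g in match["glossary"].split("/") if g.strip()]
    # Attributes are key: value pairs separated by vertical bars.
    attributes = {}
    for pair in (match["attributes"] or "").split("|"):
        if pair.strip():
            key, _, value = pair.partition(":")
            attributes[key.strip()] = value.strip()
    return Entry(
        word=match["word"].strip(),
        decomposition=match["decomposition"].strip(),
        glossary=glossary,
        attributes=attributes,
    )


entry = parse_entry(
    "gbogbo [gbó . gbó] /all /many /every <first: attribute | second: attribute>"
)
print(entry.word, entry.glossary, entry.attributes)
```

Because the attribute list is optional, a line that ends after the glossary parses to an entry with an empty `attributes` dictionary.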
Writing some Yoruba characters requires that your keyboard is configured for writing accented and underdotted letters. The way to do this varies by operating system.
- On macOS, go to System Preferences -> Keyboard -> Input Sources
- Check the US Extended and US International Keyboards
Accenting a letter is best done with the US International Keyboard.
- Acute accents are added by pressing ['] then the letter
- Grave accents are added by pressing [`] then the letter
Underdotting a letter is best done with the US Extended Keyboard.
- Press [Option] + [X] at the same time, then press the letter. OR
- Press the letter, then press [Option] + [Shift] + [X] at the same time