So far, we have the following recurring structures (many more are to be added; this is just a first overview):
| name | data type | content |
|---|---|---|
| character | character list | character in Big5 version |
| simplified_character | character list | character in simplified version; can be produced automatically, or omitted if not present in the original data |
| pinyin | character list | character reading in pinyin |
| doculect | character list, word list | doculect in source |
| source | character list, word list, structure list | source (only to be used if a dataset has multiple sources; otherwise specified in the metadata) |
| reading | character list | original reading as given in source (maybe consider replacing with "value") |
| segments | character list, word list | segmented reading, following the CLPA specs |
| structure | character list, word list | the context description, that is, the phonetic/phonological structure of a given string (used for context determination) |
| concept | word list | the concept, which is then also linked to the concepticon |
| concepticon_id | word list | obligatory if there is a concept in the data |
| gloss | character list, structure list | not obligatory for character list, as the character is here the main gloss |
| value | word list, structure list | reading for a given word, that is, the main value, or the content of a structural feature in a structure list, as we retrieve it from the source |
It is important to regularize the treatment of these values. The values added in all datasets are the refined segmented readings in CLPA, built on top of the other readings. Beyond those, there are columns that may be useful for data checking, such as "sampa", as well as further ones such as the běnzì. All of this needs to be organized and structured.
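
To make the discussion more concrete, here is a minimal sketch of how the table above could be encoded and checked in Python. The names `COLUMNS` and `check_header` are hypothetical and not part of any existing code; the only rules applied are the ones stated in the table (which dataset types a column occurs in, and that `concepticon_id` is obligatory when `concept` is present).

```python
# Hypothetical encoding of the table above:
# column name -> dataset types in which the column is expected.
COLUMNS = {
    "character": {"character list"},
    "simplified_character": {"character list"},
    "pinyin": {"character list"},
    "doculect": {"character list", "word list"},
    "source": {"character list", "word list", "structure list"},
    "reading": {"character list"},
    "segments": {"character list", "word list"},
    "structure": {"character list", "word list"},
    "concept": {"word list"},
    "concepticon_id": {"word list"},
    "gloss": {"character list", "structure list"},
    "value": {"word list", "structure list"},
}


def check_header(header, dataset_type):
    """Return a list of problems found in a dataset's column header."""
    problems = []
    for name in header:
        if name not in COLUMNS:
            # Columns such as "sampa" are not (yet) in the overview.
            problems.append(f"unknown column: {name}")
        elif dataset_type not in COLUMNS[name]:
            problems.append(f"column {name} not expected in a {dataset_type}")
    # Rule from the table: concepticon_id is obligatory if there is a concept.
    if "concept" in header and "concepticon_id" not in header:
        problems.append("concepticon_id is obligatory when concept is present")
    return problems


if __name__ == "__main__":
    for problem in check_header(["doculect", "concept", "value"], "word list"):
        print(problem)
```

A check like this would flag columns such as "sampa" as unknown until we decide how to treat them, which is one way to keep the overview and the datasets in sync.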