decide about structure of data #7

@LinguList

So far, we have the following recurring structures (many more are to be added; this is just a first overview):

| name | data type | content |
| --- | --- | --- |
| character | character list | character in Big5 version |
| simplified_character | character list | character in simplified version, could be automatically produced or ignored if not present in original data |
| pinyin | character list | character reading in pinyin |
| doculect | character list, word list | doculect in source |
| source | character list, word list, structure list | source (only to be used if multiple sources per dataset, otherwise specified in metadata) |
| reading | character list | original reading as given in source (maybe consider replacing with "value") |
| segments | character list, word list | segmented reading, following CLPA specs |
| structure | character list, word list | the context description, that is, the phonetic/phonological structure of a given string (used for context determination) |
| concept | word list | the concept, which is then also linked to the Concepticon |
| concepticon_id | word list | obligatory if there is a concept in the data |
| gloss | character list, structure list | not obligatory for character list, as the character is here the main gloss |
| value | word list, structure list | reading for a given word, that is, the main value, or the content of a structural feature in a structure list, as we retrieve it from the source |
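
To make the overview easier to check against real data, here is a minimal sketch of how the table could be written down as a lookup and used to flag rows that do not fit. The names `FIELDS` and `check_row` are hypothetical and not part of any existing code; the only rule enforced beyond the table's "data type" column is the one stated for `concepticon_id` (obligatory if a concept is present). The sample values, including the ID, are placeholders.

```python
# Hypothetical lookup: field name -> list types in which the field occurs,
# transcribed from the overview table above.
FIELDS = {
    "character": {"character list"},
    "simplified_character": {"character list"},
    "pinyin": {"character list"},
    "doculect": {"character list", "word list"},
    "source": {"character list", "word list", "structure list"},
    "reading": {"character list"},
    "segments": {"character list", "word list"},
    "structure": {"character list", "word list"},
    "concept": {"word list"},
    "concepticon_id": {"word list"},
    "gloss": {"character list", "structure list"},
    "value": {"word list", "structure list"},
}


def check_row(row, list_type):
    """Return a list of problems found in one row (a dict) of the given list type."""
    problems = []
    for name in row:
        if name not in FIELDS:
            problems.append(f"unknown field: {name}")
        elif list_type not in FIELDS[name]:
            problems.append(f"field {name} not expected in {list_type}")
    # concepticon_id is obligatory as soon as a concept is given
    if row.get("concept") and not row.get("concepticon_id"):
        problems.append("concept given without concepticon_id")
    return problems


# Placeholder row; the ID is made up for illustration only.
row = {"doculect": "Beijing", "concept": "hand", "concepticon_id": "0000", "value": "ʂou"}
print(check_row(row, "word list"))
```

A plain dict keeps the overview editable while the structure is still under discussion; once the fields are settled, the same information could move into the dataset metadata.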

It is important to regularize the treatment of these values. The value we add in all datasets is the refined segmented reading in CLPA, built on top of the other readings. Beyond that, there are fields that may be useful for data checking, such as "sampa", and further fields such as the běnzì. All of this needs to be organized and structured.
