Statistical prediction of which common gloss is most appropriate in context #248

@arrocke

We aggregate the most common glosses for each word and show the most common one as the primary suggestion. A couple of ideas to improve this:

  • Feed the top glosses to an LLM, with the verse as context, to see whether it improves acceptance rates.
  • Cluster tuples of glosses across languages to predict which gloss is most likely to be appropriate. This would work by scoring pairs of glosses across languages, and then algorithmically picking the gloss that correlates most strongly with how the word has already been glossed in other languages. For example:
    Spanish Gloss A, English Gloss 1 - 0.9 // When a word is glossed with English Gloss 1, predict Spanish Gloss A
    Spanish Gloss A, English Gloss 2 - 0.1
    Spanish Gloss B, English Gloss 1 - 0.3
    Spanish Gloss B, English Gloss 2 - 0.7 // When a word is glossed with English Gloss 2, predict Spanish Gloss B
  • We could improve sense clustering by clustering at the lemma level rather than the lemma+morphology level, which should yield better sense clusters. Then, for each language, use the surrounding context and morphology information to predict the precise gloss. This might perform better for languages that have many grammatical forms for the same sense: it can predict the right form, and when it fails it is more likely to get the grammar wrong than the sense, which should be easier to correct.
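To make the pair-scoring idea concrete, here is a minimal Python sketch of the selection step only. It takes the example scores above as given (how those scores would be computed is left open), and it assumes that scores from several already-glossed languages are simply summed, which is one possible way to combine them, not something this issue specifies. The names `PAIR_SCORES` and `predict_gloss` are hypothetical.

```python
# Pair scores copied from the example above:
# (candidate gloss in the target language, gloss already chosen in another language) -> score
PAIR_SCORES = {
    ("Spanish Gloss A", "English Gloss 1"): 0.9,
    ("Spanish Gloss A", "English Gloss 2"): 0.1,
    ("Spanish Gloss B", "English Gloss 1"): 0.3,
    ("Spanish Gloss B", "English Gloss 2"): 0.7,
}


def predict_gloss(candidates, existing_glosses, pair_scores):
    """Pick the candidate with the highest total pair score.

    candidates: target-language glosses, ordered most common first.
    existing_glosses: glosses already chosen for this word in other languages.
    Summing the scores is an assumption; unknown pairs count as 0.
    """
    def total(candidate):
        return sum(pair_scores.get((candidate, g), 0.0) for g in existing_glosses)

    # max() returns the first maximal item, so ties (including the case where
    # no other language has a gloss yet) fall back to the most common gloss.
    return max(candidates, key=total)


candidates = ["Spanish Gloss A", "Spanish Gloss B"]
print(predict_gloss(candidates, ["English Gloss 1"], PAIR_SCORES))
print(predict_gloss(candidates, ["English Gloss 2"], PAIR_SCORES))
```

With no glosses available from other languages, every candidate scores 0 and the function returns the first candidate, which matches the current behavior of suggesting the most common gloss.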

    Status

    📋 Backlog
