Farsi  #41

@gilgameshjw

Farsi

Transliteration in Farsi

Together with Mahdi, we have identified a number of challenges specific to Farsi:

  1. Persian writers may use several different characters for the same letter, which requires "normalisation" work, probably with character maps.
  2. In practice, Persian writers are not strict about spacing: the same Farsi word can appear with or without spaces between the characters, or with a ZWNJ (zero-width non-joiner) character instead.
  3. Transliteration of single words:
    • Mahdi has found large dictionaries of Farsi words with their transliterations, organised by part of speech (N, V, ...).
    • These dictionaries are quite extensive and could be used.
    • Research suggests that transliteration is learned better by neural networks than captured by rules.
    • The resulting transliteration does NOT seem to be aligned with the interscript one (probably requiring maps).
  4. Transliteration of several words:
    • In Farsi, words take prefixes and suffixes depending on their position and role in a sentence.
    • As a consequence, we are thinking of using PoS tagging.
    • PoS tagging: algorithms exist for Farsi; we need to survey the available software and possibly compare tools or even train our own.
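
To make challenges 1 and 2 concrete, here is a minimal Python sketch of map-based normalisation. The two map entries (Arabic yeh and kaf appearing in place of the Persian forms) are only illustrative, and the names `CHAR_MAP`, `normalise_chars` and `spacing_key` are hypothetical; the real map would have to be built from the data.

```python
import re

# Illustrative map only (assumption): two Arabic code points that often
# appear in place of their Persian counterparts. A real map would be larger.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}
_TRANSLATE = str.maketrans(CHAR_MAP)


def normalise_chars(text: str) -> str:
    """Replace variant characters with one canonical form."""
    return text.translate(_TRANSLATE)


def spacing_key(word: str) -> str:
    """Build a lookup key that ignores spacing variants.

    Whitespace and ZWNJ (U+200C) are dropped after character
    normalisation, so the spaced, ZWNJ-joined and fully joined
    spellings of a word all map to the same key.
    """
    return re.sub(r"[\s\u200C]+", "", normalise_chars(word))
```

Such a key is only meant for dictionary lookup; the original spelling should probably be kept alongside it, since dropping spaces everywhere would also merge genuinely separate words.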

Ideas (good and bad)

  • Speech-to-text data?
  • Learn a Farsi $\Rightarrow$ interscript-like transliteration

Plan

  1. Look for mappings: Farsi $\Rightarrow$ approximate Latin
    Done
  2. Collision statistics and concept validation
    952 collisions for a 50k dictionary, 0.5% at word level.
    Done, validated
  3. Create a git branch so that Mahdi and Jair can collaborate
    Done
  4. Run the simplest possible transliteration:
    • Mahdi provides the dataset
    • Jair builds a naive map and transliterates (model 0)
    • Ronald, Mahdi, Jair: feedback
  5. Review NLP libraries, codebases and research on Farsi.
  6. Improve (character normalisation, preprocessing and PoS tagging)
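
As a rough illustration of steps 2 and 4, the sketch below shows what a naive character map ("model 0") and a collision count could look like in Python. The map is a hypothetical toy, not the one used for the numbers above, and a "collision" is assumed here to mean two distinct Farsi words that produce the same Latin string; the issue does not define the term precisely.

```python
from collections import defaultdict

# Hypothetical toy map (assumption); the real one would come from step 1.
NAIVE_MAP = {
    "\u0628": "b",  # beh
    "\u0627": "a",  # alef
    "\u062F": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u0635": "s",  # sad: same Latin letter as seen, so collisions can occur
}


def transliterate(word, char_map=NAIVE_MAP):
    """Model 0: map each character independently; keep unknown characters."""
    return "".join(char_map.get(ch, ch) for ch in word)


def find_collisions(words, char_map=NAIVE_MAP):
    """Group distinct source words by transliteration and return
    only the groups that contain more than one word."""
    groups = defaultdict(set)
    for word in words:
        groups[transliterate(word, char_map)].add(word)
    return {latin: ws for latin, ws in groups.items() if len(ws) > 1}
```

With this toy map, the two words written seen+dal and sad+dal both come out as `sd` and are reported as one collision group. The missing vowel also shows why a purely character-level map is only a baseline: short vowels are usually not written in Persian script.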

Metadata

Status: High priority