Farsi
Transliteration in Farsi
With Mahdi, we have identified a number of challenges peculiar to Farsi:
- Persian writers may use several different characters for the same letter, which requires "normalisation" work, probably with character maps.
- In practice, Persian writers are not strict about spacing: the same Farsi word can appear with or without spaces between its characters, or with a ZWNJ (zero-width non-joiner) character instead.
- Transliteration of single words:
  - Mahdi has found large dictionaries of Farsi words with their transliterations, listed by part of speech (N, V, ...).
  - The above table is quite extensive and could be used.
  - Research shows that transliteration can be learned better with neural networks than with rules.
  - The resulting transliteration does NOT seem to be aligned with the Interscript one (probably requiring maps).
- Transliteration of several words:
  - In Farsi, words take prefixes and suffixes depending on their position and role in a sentence.
  - As a consequence, we are thinking of using a PoS (part-of-speech) tagging technology.
  - PoS tagging: there are algorithms that do this for Farsi; we need to research the available software and possibly compare tools or even train our own.
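The first two challenges (character variants and loose spacing) could be handled with a small normalisation step before any transliteration. The following is only a minimal sketch: the entries in `CHAR_MAP`, the function names `normalise` and `spacing_key`, and the choice of canonical forms are assumptions made for illustration, not an agreed design. A real map would need to be much larger and reviewed by a Farsi speaker.

```python
import re

# Hypothetical map from Arabic-script variants to one canonical Persian form.
# Only three illustrative entries; a real map would be much larger.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH          -> ARABIC LETTER FARSI YEH
    "\u0649": "\u06CC",  # ARABIC LETTER ALEF MAKSURA -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF          -> ARABIC LETTER KEHEH
}
ZWNJ = "\u200C"  # zero-width non-joiner

_TRANSLATION = str.maketrans(CHAR_MAP)


def normalise(text: str) -> str:
    """Map variant characters to one form and tidy spacing around ZWNJ."""
    text = text.translate(_TRANSLATION)
    # A ZWNJ surrounded by spaces or repeated ZWNJs becomes a single ZWNJ.
    text = re.sub(r"\s*\u200C[\s\u200C]*", ZWNJ, text)
    # Runs of whitespace become a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def spacing_key(text: str) -> str:
    """Lookup key that ignores spaces and ZWNJ entirely.

    Assumption: for dictionary lookup of a single word, variants written
    with a space, with a ZWNJ, or with nothing should compare equal.
    """
    return re.sub(r"[\s\u200C]+", "", normalise(text))
```

With this sketch, the same word written with a ZWNJ, with a plain space, or with no separator yields the same `spacing_key`, so all three spellings could be looked up under one dictionary entry. Dropping the separator completely is a lossy choice and is only meant for lookup keys, not for the text that is finally transliterated.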
Ideas (good and bad)
- speech to text data?
- learn Farsi $\Rightarrow$ Interscript-like transliteration
Plan
- Look for mappings: Farsi $\Rightarrow$ approximately Latin. Done.
- Stats of collisions and concept validation: 952 collisions for a 50k dictionary, 0.5% at word level. Done, validated.
- Create a git branch so that Mahdi and Jair can collaborate. Done.
- Run the simplest possible transliteration:
  - Mahdi provides the dataset.
  - Jair builds a naive map and transliterates (model 0).
  - Ronald, Mahdi, Jair: feedback.
- Review NLP libraries, codebases and research for Farsi.
- Improve the model (character normalisation, preprocessing and PoS tagging).
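The "naive map" and "stats of collisions" steps of the plan can be sketched roughly as below. This is not the actual model 0: `NAIVE_MAP`, `transliterate`, `collision_stats` and `TOY_PAIRS` are hypothetical names and data made up for illustration. The sketch also assumes that a "collision" means one Farsi spelling that the dictionary lists with more than one transliteration; the issue does not define the term, so the real statistic may have been computed differently.

```python
from collections import defaultdict

# Hypothetical character map in the spirit of a "model 0".
# Short vowels are usually not written in Farsi, so a pure
# character-by-character map cannot recover them.
NAIVE_MAP = {
    "\u0627": "a",  # ا
    "\u0628": "b",  # ب
    "\u062A": "t",  # ت
    "\u062F": "d",  # د
    "\u0631": "r",  # ر
    "\u06A9": "k",  # ک
    "\u0645": "m",  # م
    "\u0646": "n",  # ن
}


def transliterate(word, char_map=NAIVE_MAP, unknown="?"):
    """Replace each character using the map; unmapped characters become `unknown`."""
    return "".join(char_map.get(ch, unknown) for ch in word)


def collision_stats(pairs):
    """Return (collisions, rate) for an iterable of (farsi, latin) pairs.

    Assumed definition: a collision is a Farsi spelling that appears with
    more than one distinct transliteration. `rate` is the share of distinct
    Farsi spellings that collide.
    """
    by_word = defaultdict(set)
    for farsi, latin in pairs:
        by_word[farsi].add(latin)
    collisions = {w: sorted(t) for w, t in by_word.items() if len(t) > 1}
    rate = len(collisions) / len(by_word) if by_word else 0.0
    return collisions, rate


# Toy entries for illustration only; these are not taken from Mahdi's dataset.
TOY_PAIRS = [
    ("\u0645\u0631\u062F", "mard"),  # مرد
    ("\u0645\u0631\u062F", "mord"),  # same spelling, different reading
    ("\u0646\u0627\u0645", "nam"),   # نام
]

collisions, rate = collision_stats(TOY_PAIRS)
```

On the toy data, one of the two distinct spellings collides, and `transliterate` returns `mrd` for it, which matches neither reading. This is the kind of gap that the normalisation, preprocessing and PoS steps in the plan are meant to narrow.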
Status: High priority