Farsi 

# Farsi

### Transliteration in Farsi

With Mahdi, we have identified a number of challenges peculiar to Farsi:

1. Persian writers may use several different characters for the same letter, which requires "normalisation" work, probably with maps.
2. In practice, Persian writers are not strict about spacing: the same Farsi word can appear with or without spaces between its parts, or with a ZWNJ character (zero-width non-joiner) instead.
3. **Transliteration of single words**
	* Mahdi has found large dictionaries of Farsi words with transliterations for their various parts of speech (N, V, ...).
	* These dictionaries are quite extensive and could be used.
	* Research indicates that transliteration is learned better by neural networks than captured by rules.
	* The resulting transliteration does NOT seem to be aligned with the Interscript one (probably requiring maps).
4. **Transliteration of several words**
	* In Farsi, words take prefixes and suffixes depending on their position and role in a sentence.
	* As a consequence, we are considering a PoS tagging technology.
	* PoS tagging: algorithms exist for Farsi; we need to survey the available software and possibly compare tools or even train our own.
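
Points 1 and 2 above can be sketched as a small normalisation step. This is only an illustration under assumptions: the two character pairs in `CHAR_MAP` (Arabic yeh and kaf typed in place of their Persian counterparts) are common examples rather than a complete list, and the names `CHAR_MAP`, `normalise` and `spacing_key` are hypothetical.

```python
import re

# Assumed example map: Arabic code points often typed instead of Persian ones.
# A real map would need to be longer and validated against the data.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

ZWNJ = "\u200C"  # zero-width non-joiner


def normalise(text: str) -> str:
    """Map variant characters to one canonical form and tidy whitespace."""
    text = text.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def spacing_key(word: str) -> str:
    """Comparison key that ignores spaces and ZWNJ inside a word."""
    return re.sub(r"[\s\u200C]+", "", normalise(word))


# Three spellings of the same word: with ZWNJ, with a space (and an
# Arabic yeh), and with nothing between the two parts.
with_zwnj = "\u0645\u06CC" + ZWNJ + "\u0631\u0648\u0645"
with_space = "\u0645\u064A \u0631\u0648\u0645"
joined = "\u0645\u06CC\u0631\u0648\u0645"
same = spacing_key(with_zwnj) == spacing_key(with_space) == spacing_key(joined)
```

Using a key for comparison, rather than rewriting the text itself, keeps the original spacing available for later steps such as PoS tagging.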


### Ideas (good and bad)
* Speech-to-text data?
* Learn Farsi $\Rightarrow$ Interscript-like transliteration

### Plan
1. Look for mappings: Farsi $\Rightarrow$ approximately Latin.
	**Done**
2. Compute collision statistics and validate the concept.
	952 collisions for a 50k-word dictionary, 0.5% at word level.
	**Done, validated**
3. Create a git branch so that Mahdi and Jair can collaborate.
	**Done**
4. Run the simplest possible transliteration:
	* Mahdi provides the dataset.
	* Jair builds a naive map and transliterates (model 0).
	* Ronald, Mahdi, Jair: give feedback.
5. Review NLP libraries, codebases and research for Farsi.
6. Improve (character normalisation, preprocessing and PoS tagging).
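
As a rough illustration of steps 2 and 4, a naive character map ("model 0") and a collision count could look like the sketch below. It assumes that a "collision" means two distinct Farsi words producing the same Latin string, which the plan does not define explicitly. The map `NAIVE_MAP` is a hypothetical toy covering six letters, not the mapping found in step 1, and the function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy map; the real one would cover the whole alphabet.
NAIVE_MAP = {
    "\u0628": "b",  # beh
    "\u0627": "a",  # alef
    "\u062F": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u0635": "s",  # sad: same Latin letter as seen, so collisions can occur
}


def transliterate(word, char_map=NAIVE_MAP):
    """Replace each character through the map; unknown characters pass through."""
    return "".join(char_map.get(ch, ch) for ch in word)


def find_collisions(words, char_map=NAIVE_MAP):
    """Group distinct words by transliteration and keep groups larger than one."""
    groups = defaultdict(set)
    for word in words:
        groups[transliterate(word, char_map)].add(word)
    return {latin: ws for latin, ws in groups.items() if len(ws) > 1}


# Toy dictionary: seen+dal, sad+dal, beh+alef+dal.
words = ["\u0633\u062F", "\u0635\u062F", "\u0628\u0627\u062F"]
collisions = find_collisions(words)
```

On this toy list, the first two words both map to `sd` and form the only collision group. Running the same count over the full dictionary would be one way to reproduce a statistic like the one quoted in step 2.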

