Adichol

Rule-based Tamil Adichol extractor

This is an attempt at creating the first part of a rule-based Tamil stemmer. For now it handles only nouns (பெயர்ச்சொல்), verbs (வினைச்சொல்) and pronouns (பதிலிடுபெயர்). I am using the following resources and thank their authors:

Python 3.12
The list of Tamil nouns, unique_sorted_noun_master.txt, from Kaniyam Foundation.
pybloom_live, a Bloom filter used to check whether a given word is in the lexicon.
The Flowchart of Tamil Noun and Verb Forms, from the AU-KBC Computational Linguistics Research Group.
The list of Tamil verbs and pronouns, extracted from the crea.babylon file shared by the stardict-tamil site.
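
To illustrate why a Bloom filter suits lexicon lookup, here is a minimal standard-library sketch of the idea. It is not the pybloom_live API and not the code in bloom_filter.py. The class name `TinyBloomFilter`, the bit-array size, and the three sample nouns are assumptions chosen for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter (hypothetical; not pybloom_live)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word: str) -> None:
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word: str) -> bool:
        # True may be a false positive; False is always definite.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


# Sample lexicon entries (assumed for the example).
lexicon = TinyBloomFilter()
for noun in ["மின்னூல்", "மரம்", "வீடு"]:
    lexicon.add(noun)

print("மின்னூல்" in lexicon)  # an added word is always reported as present
```

The trade-off is memory against certainty: the filter never misses a word that was added, but it can occasionally report a word that was never added, so a hit is best treated as "probably in the lexicon".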

Why the root word (அடிச்சொல்)?

When we search for a word, we want relevant results not only for exactly what we typed into the search box but also for its other possible forms. For example, suppose we enter “மின்னூல்” (e-book) in the search box. We also want the pages that contain ‘மின்னூல்கள்’, ‘மின்னூலை’ and ‘மின்னூலின்’, don't we? To make this work, we have to strip away the variations and reduce words to their base form.

Stemmer (தண்டுச்சொல் பிரிப்பி) vs Lemmatizer (அடிச்சொல் பிரிப்பி)

Stemming and lemmatization are both techniques used in Natural Language Processing (NLP) to reduce words to their base form, but they differ in approach and output. Stemming is the faster method: it simply removes suffixes from words, which can produce non-dictionary words. For example, the Porter stemmer reduces both "apple" and "apples" to the stem "appl". Lemmatization, on the other hand, considers the context and the part of speech of a word to produce a meaningful dictionary word, or lemma. What we are attempting here is closer to a lemmatizer than a stemmer:

We are writing rules that are specific to the part of speech of the word.
We are matching candidates against a lexicon so that the output is a valid dictionary word.
However, the limitation in our approach is that we are not considering the context in which the word has been used.
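
The points above can be sketched as a small suffix-rule lookup that accepts a candidate only when it appears in a lexicon. This is a simplified illustration, not the logic in noun_stem.py: the function name `find_root`, the three suffix rules, and the two-word lexicon are assumptions. A plain `set` stands in for the Bloom filter.

```python
from typing import Optional

# Hypothetical noun rules: (suffix to remove, text to put back).
# Tamil case markers fuse with the final consonant, so removing one
# usually means restoring the pulli (U+0BCD) on that consonant.
NOUN_SUFFIX_RULES = [
    ("கள்", ""),                            # plural marker
    ("\u0bbf\u0ba9\u0bcd", "\u0bcd"),        # "-in" (genitive) on a consonant
    ("\u0bc8", "\u0bcd"),                    # "-ai" (accusative) on a consonant
]

# Tiny stand-in lexicon (assumed for the example).
NOUN_LEXICON = {"மின்னூல்", "மரம்"}


def find_root(word: str, lexicon=NOUN_LEXICON) -> Optional[str]:
    """Return a lexicon word for `word`, or None if no rule yields one."""
    if word in lexicon:
        return word
    for suffix, replacement in NOUN_SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Only accept candidates that are valid dictionary words.
            if candidate in lexicon:
                return candidate
    return None


for form in ["மின்னூல்கள்", "மின்னூலை", "மின்னூலின்"]:
    print(form, "->", find_root(form))
```

Because the function sees only the word itself, it cannot use the surrounding sentence to choose between competing analyses, which is the limitation noted above.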

