Part of Speech-Tag

There are two “tagsets” that are more commonly used to classify the PoS of a word: the Universal Dependencies Tagset (simpler, used by spaCy) and the Penn Treebank Tagset (more detailed, used by nltk).

Ultimately, what PoS Tagging means is assigning the correct PoS tag to each word in a sentence. Today, it is more commonly done using automated methods. Let us first understand how useful is it, then we can discuss how it can be done. Main ways to do so:

Manual Tagging: This means having people versed in syntax rules applying a tag to every and each word in a phrase. This is the time consuming, old school non automated method. Reminds you of homeworks? Yeah… But it is also the basis for the third and fourth way.
Rule-Based Tagging: The first automated way to do tagging. Consists of a series of rules (if the preceding word is an article and the succeeding word is a noun, then it is an adjective…). Has to be done by a specialist and can easily get complicated (far more complicated than the Stemmer we built). The position of “Most famous and widely used Rule Based Tagger” is usually attributed to E. Brill’s Tagger.
Stochastic/Probabilistic Methods: Automated ways to assign a PoS to a word based on the probability that a word belongs to a particular tag or based on the probability of a word being a tag based on a sequence of preceding/succeeding words. These are the preferred, most used and most successful methods so far. They are also the simpler ones to implement (given that you already have pre annotated samples — a corpus). Among these methods, there could be defined two types of automated Probabilistic methods: the Discriminative Probabilistic Classifiers (examples are Logistic Regression, SVM’s and Conditional Random Fields — CRF’s) and the Generative Probabilistic Classifiers (examples are Naive Bayes and Hidden Markov Models — HMM).
Deep Learning Methods: Methods that use deep learning techniques to infer PoS tags. So far, these methods have not shown to be superior to Stochastic/Probabilistic methods in PoS tagging — they are, at most, at the same level of accuracy — at the cost of more complexity/training time. Today, some consider PoS Tagging a solved problem. Some closed context cases achieve 99% accuracy for the tags, and the gold-standard for Penn Treebank is kept at above 97.6 f1-score since 2002 in the ACL (Association for Computer Linguistics) gold-standard records.

But I’ll make a short summary of the things that we’ll do here. First, download a corpus. A corpus is how we call a Dataset in NLP. We’ll use Penn Treebank sample from NLTK and Universal Dependencies (UD) corpus. We’ll also see how to use the CoNLL-u format, the most common format for linguistic annotated corpora (the plural of corpus). Second step is to extract features from the words. We do that to by getting word termination, preceding word, checking for hyphens, etc. This will compose the feature set used to predict the POS tag. Third, we load and train a Machine Learning Algorithm. We’ll use a Conditional Random Field (CRF) suite that is compatible with sklearn, the most used Machine Learning Module in Python. We test the trained models, checking f1 scores (explained there) for each. We also provide a way to test the models in a more “practical” manner. We save the models to be able to use them in our algorithm.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
POS_Tagging.ipynb		POS_Tagging.ipynb
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Part of Speech-Tag

About

Uh oh!

Releases

Packages

Languages

smritisridhar41/POS-Tag

Folders and files

Latest commit

History

Repository files navigation

Part of Speech-Tag

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages