
NER-NLPproj

Objective

This repository contains code in which a single transformer model, XLM-RoBERTa, is fine-tuned to perform named entity recognition (NER) across 4 languages: German, French, Italian, and English. NER is a common NLP task that identifies entities such as people, organizations, or locations in text. These entities can be used for various applications, such as gaining insights from company documents, improving the quality of search engines, or simply building a structured database from a corpus.

To simulate a real-world use case, I assume we want to perform NER for a customer based in Switzerland, where there are 4 national languages, with English often serving as a bridge between them.

Dataset

I have used a subset of the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark called WikiANN or PAN-X. This dataset consists of Wikipedia articles in many languages, including the 4 most commonly spoken languages in Switzerland: German (62.9%), French (22.9%), Italian (8.4%), and English (5.9%). Each article is annotated with LOC (location), PER (person), and ORG (organization) tags in the “inside-outside-beginning” (IOB2) format.

To make a realistic Swiss corpus, I sampled the German (de), French (fr), Italian (it), and English (en) corpora from PAN-X according to their spoken proportions.
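
A minimal sketch of how this proportional sampling could look with the Hugging Face datasets library; the fractions mirror the percentages above, while the seed and variable names are illustrative (and the exact load_dataset call may vary with the library version):

```python
from collections import defaultdict
from datasets import load_dataset, DatasetDict

# Spoken-language shares from the percentages above (de, fr, it, en).
langs = ["de", "fr", "it", "en"]
fracs = [0.629, 0.229, 0.084, 0.059]

panx_ch = defaultdict(DatasetDict)
for lang, frac in zip(langs, fracs):
    # Each language is a separate XTREME config, e.g. "PAN-X.de".
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    for split in ds:
        # Shuffle, then keep only the fraction matching the spoken share.
        panx_ch[lang][split] = (
            ds[split]
            .shuffle(seed=0)
            .select(range(int(frac * ds[split].num_rows)))
        )

# Each example carries `tokens` plus integer `ner_tags` in the IOB2 scheme
# (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC).
print(panx_ch["de"]["train"][0])
```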

Choice of Transformer

Multilingual transformers involve similar architectures and training procedures as their monolingual counterparts, except that the corpus used for pretraining consists of documents in many languages. Despite receiving no explicit information to differentiate among the languages, the resulting linguistic representations are able to generalize well across languages for a variety of downstream tasks.

For our task, we consider the XLM-RoBERTa model (XLM-R). XLM-R uses only masked language modeling (MLM) as its pretraining objective across 100 languages, and its pretraining corpus is several orders of magnitude larger than those used in earlier multilingual models.
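
As a quick, illustrative way to see the shared MLM objective at work, the fill-mask pipeline can be pointed at the public xlm-roberta-base checkpoint (the example sentences here are my own, not from the repository):

```python
from transformers import pipeline

# xlm-roberta-base was pretrained with masked language modeling only,
# so a single checkpoint can fill in masks across its 100 languages.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# <mask> is XLM-R's mask token; the sentences are illustrative.
print(fill_mask("Die Hauptstadt der Schweiz ist <mask>.")[0]["token_str"])
print(fill_mask("La capitale de la Suisse est <mask>.")[0]["token_str"])
```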

Tokenizer

XLM-R uses the SentencePiece tokenizer, which is based on a type of subword segmentation called Unigram and encodes each input text as a sequence of Unicode characters. This last feature is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters.
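
A small sketch of what this tokenizer produces, assuming the public xlm-roberta-base checkpoint (the sentence is just an example):

```python
from transformers import AutoTokenizer

# Resolves to XLM-R's SentencePiece (Unigram) tokenizer.
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# SentencePiece marks the start of each word with "▁" rather than relying
# on whitespace, which keeps the scheme language-agnostic.
print(xlmr_tokenizer.tokenize("Zürich est une ville suisse."))
```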

Procedure for Getting to the Final Model

  1. Built a multilingual corpus from the PAN-X dataset
  2. Tokenized the input corpus
  3. Imported the base pre-trained XLM-R model from Hugging Face
  4. Fine-tuned the model on the multilingual corpus (a sketch of steps 2-4 follows this list)
  5. Deployed the model as a Gradio app on Hugging Face Spaces
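
The rough shape of steps 2-4 with the transformers Trainer API is sketched below. It reuses `panx_ch` from the sampling sketch above and, for brevity, shows only the German split, whereas the repository fine-tunes on the combined multilingual corpus; hyperparameters and helper names are illustrative, not the exact values used here.

```python
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

# Label names ship with the dataset (O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC).
tags = panx_ch["de"]["train"].features["ner_tags"].feature
index2tag = dict(enumerate(tags.names))
tag2index = {t: i for i, t in index2tag.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(batch):
    # Words are split into subwords, so labels must be re-aligned: the first
    # subword of each word keeps its tag, all other subwords are masked (-100).
    tokenized = tokenizer(batch["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, word_tags in enumerate(batch["ner_tags"]):
        previous, label_ids = None, []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                label_ids.append(-100)
            else:
                label_ids.append(word_tags[word_id])
            previous = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

encoded = panx_ch["de"].map(tokenize_and_align_labels, batched=True,
                            remove_columns=["tokens", "ner_tags", "langs"])

model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=tags.num_classes,
    id2label=index2tag, label2id=tag2index)

args = TrainingArguments(output_dir="xlmr-finetuned-panx",  # illustrative
                         num_train_epochs=3,
                         per_device_train_batch_size=24)

trainer = Trainer(model=model, args=args,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```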

Results and Findings Along the Way

  1. For a small corpus, zero-shot cross-lingual transfer outperforms fine-tuning. As the corpus size used for fine-tuning increases, fine-tuning begins to outperform zero-shot transfer.
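
A rough sketch of how such a comparison can be run, building on the fine-tuning sketch above: evaluate the German-fine-tuned model zero-shot on French, then fine-tune fresh models on growing French subsets. The subset sizes and the `encoded_fr` name are illustrative, and in practice a seqeval-based compute_metrics function would be passed to the Trainer so the comparison reports F1 rather than just the loss.

```python
# `encoded_fr` is assumed to be the French split, preprocessed with the same
# tokenize_and_align_labels helper used above (illustrative name).

# Zero-shot: the German-fine-tuned trainer, evaluated directly on French.
print("zero-shot:", trainer.evaluate(eval_dataset=encoded_fr["validation"]))

# Fine-tune a fresh model on increasingly large French subsets and compare.
for num_samples in [250, 500, 1000, 2000, 4000]:
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base", num_labels=tags.num_classes,
        id2label=index2tag, label2id=tag2index)
    subset_trainer = Trainer(
        model=model, args=args,
        data_collator=DataCollatorForTokenClassification(tokenizer),
        train_dataset=encoded_fr["train"].shuffle(seed=0)
                                         .select(range(num_samples)),
        eval_dataset=encoded_fr["validation"])
    subset_trainer.train()
    print(num_samples, subset_trainer.evaluate())
```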
