SemEval 2017 Task 10: Extracting Keyphrases from Scientific Publications

Introduction

SemEval 2017 Task 10 (ScienceIE) covers the automatic extraction of keyphrases from Computer Science, Material Sciences and Physics publications, along with classifying the extracted keyphrases by type and extracting relations between them.

In this project, I have implemented a Bi-LSTM + CRF model that identifies keyphrases and classifies their types end to end.

Data

The SemEval 2017 ScienceIE corpus consists of 500 journal articles evenly distributed among the domains Computer Science, Material Sciences and Physics. It is split into 350 documents for training, 50 for development and 100 for testing. Keyphrases are classified into three categories: TASK, PROCESS and MATERIAL.

Approach

Data Preprocessing

The text of each article in the corpus is processed with the spaCy Python library to obtain a list of tokens and their corresponding POS tags.
Before being passed to the model, the token and POS tag lists are converted into integer sequences, which are padded so that all sequences have the same length.
The IOB labelling scheme is used to generate the output label sequence. To distinguish the different types of keyphrases, the output label for each token is suffixed with the corresponding keyphrase type. For example, the tokens of a TASK keyphrase are labelled B_Task and I_Task.
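
As an illustration of the labelling scheme, here is a minimal sketch that builds IOB labels from token-level keyphrase spans. The span format `(start, end, type)`, the helper name `iob_labels` and the example sentence are assumptions made for this sketch and are not taken from Data_Processing.ipynb.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_token, end_token_exclusive, keyphrase_type).
Span = Tuple[int, int, str]


def iob_labels(num_tokens: int, spans: List[Span]) -> List[str]:
    """Build one IOB label per token from token-level keyphrase spans."""
    labels = ["O"] * num_tokens  # every token starts outside any keyphrase
    for start, end, kp_type in spans:
        labels[start] = f"B_{kp_type}"  # first token of the keyphrase
        for i in range(start + 1, end):
            labels[i] = f"I_{kp_type}"  # remaining tokens of the keyphrase
    return labels


# Illustrative sentence and spans (not from the corpus).
tokens = ["We", "study", "selective", "drug", "delivery", "using", "gold", "nanoparticles"]
spans = [(2, 5, "Task"), (6, 8, "Material")]
labels = iob_labels(len(tokens), spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Tokens outside any keyphrase receive the label O, so every token in the article gets exactly one label.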

Code for data preprocessing can be found in the file Data_Processing.ipynb.
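
The integer encoding and padding step can be sketched in plain Python as follows. The reserved ids (0 for padding, 1 for unseen items), the helper names and the right-padding choice are assumptions for illustration; the notebook may do this differently, for example with a library utility.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # assumed reserved ids for padding and unseen items


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct item an integer id, in order of first appearance."""
    vocab: Dict[str, int] = {}
    for seq in sequences:
        for item in seq:
            vocab.setdefault(item, len(vocab) + 2)  # ids 0 and 1 are reserved
    return vocab


def encode_and_pad(sequences: List[List[str]], vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Map items to ids, then truncate or right-pad every sequence to max_len."""
    encoded = []
    for seq in sequences:
        ids = [vocab.get(item, UNK) for item in seq][:max_len]  # truncate long sequences
        ids += [PAD] * (max_len - len(ids))                      # right-pad short ones
        encoded.append(ids)
    return encoded


# Illustrative POS tag sequences of different lengths.
pos_sequences = [["PRON", "VERB", "NOUN"], ["NOUN", "VERB"]]
pos_vocab = build_vocab(pos_sequences)
padded = encode_and_pad(pos_sequences, pos_vocab, max_len=4)
print(padded)  # [[2, 3, 4, 0], [4, 3, 0, 0]]
```

The same two helpers would apply to the token sequences and the label sequences, each with its own vocabulary.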

Modelling

To train the model, the token and POS tag sequences obtained from data preprocessing are passed through separate embedding layers. A weighted average of the two embedding outputs, with a higher weight on the POS tag embedding, is passed through a bidirectional LSTM layer with 20 units. The outputs of the forward and backward LSTMs are concatenated and fed into a time-distributed dense layer with a softmax activation function. The output of the dense layer is then passed through a CRF layer to predict the output label sequence.
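
The weighted combination of the two embeddings can be illustrated for a single token as below. This is only a sketch: the weight value 0.75 and the function name `weighted_average` are hypothetical, since the actual weights are not stated here, and the real model applies the operation inside the network rather than on Python lists.

```python
from typing import List

Vector = List[float]


def weighted_average(token_vec: Vector, pos_vec: Vector, pos_weight: float = 0.75) -> Vector:
    """Combine two same-sized embeddings; pos_weight is a hypothetical value."""
    if len(token_vec) != len(pos_vec):
        raise ValueError("both embeddings must have the same dimension")
    token_weight = 1.0 - pos_weight  # the two weights sum to 1
    return [token_weight * t + pos_weight * p for t, p in zip(token_vec, pos_vec)]


# Toy 2-dimensional embeddings for one token and its POS tag.
token_embedding = [1.0, 0.0]
pos_embedding = [0.0, 1.0]
combined = weighted_average(token_embedding, pos_embedding)
print(combined)  # [0.25, 0.75]
```

Averaging requires both embedding layers to produce vectors of the same dimension, which is why the sketch checks the lengths first.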

Code for model training and testing can be found in the file Bi_LSTM_CRF.ipynb.
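
At prediction time, a linear-chain CRF layer typically selects the highest-scoring label sequence with Viterbi decoding, which lets transition scores override a locally attractive but inconsistent label. The sketch below is a plain-Python version of that decoding step, not the layer used in the notebook; the emission and transition scores are made up for illustration.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    num_labels = len(transitions)
    scores = list(emissions[0])  # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(num_labels):
            # Best previous label for arriving at label j.
            best_prev = max(range(num_labels), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Follow the backpointers from the best final label.
    best = max(range(num_labels), key=lambda j: scores[j])
    path = [best]
    for step_back in reversed(backpointers):
        best = step_back[best]
        path.append(best)
    return path[::-1]


label_names = ["O", "B_Task", "I_Task"]
# Made-up scores: position 1 locally prefers I_Task over B_Task.
emissions = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.5], [0.0, 0.0, 2.0]]
# Made-up transitions: moving from O directly to I_Task is heavily penalised.
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

path = viterbi_decode(emissions, transitions)
decoded = [label_names[i] for i in path]
print(decoded)  # ['O', 'B_Task', 'I_Task']
```

Picking the best label independently at each position would give O, I_Task, I_Task here, which is not a valid IOB sequence; the transition penalty steers the decoder to O, B_Task, I_Task instead.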

Results

The image below shows model performance on a randomly selected paper from the material science domain. The model appears to detect most of the main keyphrases, such as material science, novel therapeutic treatment, and selective drug delivery.

The selected paper can be found here
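
To read keyphrases off a predicted label sequence, consecutive B_ and I_ labels of the same type can be grouped back into phrases. The sketch below shows one way to do this; the function name `extract_keyphrases`, the example tokens and the labels are all illustrative assumptions rather than actual model output.

```python
from typing import List, Optional, Tuple


def extract_keyphrases(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group B_/I_ labelled tokens into (phrase, type) pairs."""
    phrases: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None
    for token, label in zip(tokens, labels):
        prefix, _, kp_type = label.partition("_")
        if prefix == "B":
            if current:  # close the phrase that was open
                phrases.append((" ".join(current), current_type))
            current, current_type = [token], kp_type
        elif prefix == "I" and current and kp_type == current_type:
            current.append(token)  # continue the open phrase
        else:
            # "O", or an I_ label with no matching open phrase (dropped here).
            if current:
                phrases.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        phrases.append((" ".join(current), current_type))
    return phrases


# Illustrative tokens and labels (assumed, not real predictions).
example_tokens = ["a", "novel", "therapeutic", "treatment", "for", "selective", "drug", "delivery"]
example_labels = ["O", "B_Task", "I_Task", "I_Task", "O", "B_Task", "I_Task", "I_Task"]
keyphrases = extract_keyphrases(example_tokens, example_labels)
print(keyphrases)
```

An I_ label that does not continue an open phrase of the same type is dropped in this sketch; treating it as the start of a new phrase would be an equally reasonable choice.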


ankitagarg26/Keyphrase-Extraction
