Bebock/Jedha-DeepLearning

1. Project presentation: Natural Language Processing with Disaster Tweets

This project is one of the "Getting Started" challenges offered by Kaggle. As an instant means of communication, Twitter has become an important channel in times of emergency. It enables people to announce an emergency in real time. Because of this immediacy, a growing number of agencies are interested in automatically monitoring Twitter.


2. Objective

The objective is to build a machine learning model that predicts which tweets are about real disasters and which ones aren't.


3. How to proceed?

Requirements

The following libraries are required:

re
unidecode
spellchecker
contractions
string
math
pandas
numpy
plotly
matplotlib
seaborn
spacy
nltk
gensim
pyLDAvis
wordcloud
sklearn
xgboost
tensorflow

Dataset

Kaggle provides a dataset of 10,000 tweets that were hand-classified as disaster-related or not. In the training set, around 43% of the tweets are disaster-related.


4. Overview of the main results

Exploratory Descriptive Analysis

First, we extracted quantitative features that are not directly linked to the content, in order to capture information that text analysis alone would not provide:

  • Length of the tweet, measured by the number of characters and by the number of words
  • Average word length
  • Number of exclamation marks
  • Number of uppercase letters
  • Number and presence of #
  • Number and presence of @
  • Number and presence of URLs
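
The features listed above could be computed along the following lines. This is a minimal sketch, not the notebook's actual code: the function name `extract_features`, the feature names, and the URL regular expression are assumptions made for illustration.

```python
import re

# Assumption: a URL is anything starting with http:// or https://
URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(text: str) -> dict:
    """Compute content-independent features for one raw tweet (hypothetical helper)."""
    words = text.split()
    n_hashtags = text.count("#")
    n_mentions = text.count("@")
    n_urls = len(URL_PATTERN.findall(text))
    return {
        "n_chars": len(text),
        "n_words": len(words),
        # Guard against empty tweets to avoid a division by zero
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "n_exclamations": text.count("!"),
        "n_uppercase": sum(ch.isupper() for ch in text),
        "n_hashtags": n_hashtags,
        "has_hashtag": n_hashtags > 0,
        "n_mentions": n_mentions,
        "has_mention": n_mentions > 0,
        "n_urls": n_urls,
        "has_url": n_urls > 0,
    }


# Made-up example tweet, not taken from the dataset
example = extract_features("FIRE near #Paris! @mayor see http://t.co/abc")
```

With pandas, the same function could be applied to a whole column, for example `df["text"].apply(extract_features).apply(pd.Series)`, assuming the tweets are stored in a column named `text`.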

Statistically speaking, all these characteristics were related to the target variable (disaster tweet or not). However, with a sample of this size, even very small differences reach statistical significance, so we focused on graphical exploration. Disaster-related tweets contain longer words on average, fewer uppercase letters, more #, far fewer @, and fewer URLs on average, although a larger proportion of them contain at least one URL.

image

Content description

After pre-processing and lemmatization, the tweet content was described with:

  • Top bigrams according to the target variable (disaster-related or not)

image

  • Wordclouds according to the target variable (disaster-related or not)

image
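
As an illustration of the bigram description above, the sketch below counts the most frequent bigrams in a list of tokenized tweets. It assumes the tweets have already been pre-processed and lemmatized into lists of tokens; the helper name `top_bigrams` and the toy tokens are invented for the example and do not come from the dataset.

```python
from collections import Counter


def top_bigrams(token_lists, k=10):
    """Return the k most frequent adjacent token pairs (hypothetical helper)."""
    counts = Counter()
    for tokens in token_lists:
        # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(k)


# Toy, made-up tokens standing in for lemmatized disaster-related tweets
disaster_tokens = [
    ["forest", "fire", "near", "town"],
    ["forest", "fire", "evacuation", "order"],
]
top = top_bigrams(disaster_tokens, k=1)
```

Running the function once per class (disaster-related or not) gives the two rankings that can then be compared.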

Sentiment analysis

Disaster tweets were less often associated with neutral or positive sentiment.

image

However, the sentiment analysis allowed us to detect some questionable labels in the dataset. Indeed, among the tweets labelled as disaster-related and categorized as positive by the sentiment analysis, we find for example the following tweets:

  • "my favorite lady came to our volunteer meeting hopefully joining her youth collision and i am excite"
  • "ok peace I hope I fall off a cliff along with my dignity"
  • ":) well I think that sounds like a fine plan where little derailment is possible so I applaud you :)"

These tweets do not refer to disasters, which might indicate some confusing labels in the training dataset.

Prediction models: how to automatically detect a disaster-related tweet?

We tried three main approaches:

  • Using classical machine learning models on the previously extracted features
  • Using classical machine learning models on text recoded through TF-IDF or Bag of Words (BoW)
  • Using neural networks
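
To make the second approach concrete, here is a minimal sketch of a TF-IDF representation feeding a linear classifier trained by stochastic gradient descent. It assumes scikit-learn's `TfidfVectorizer` and `SGDClassifier`; the toy texts, the labels, and the settings are invented for illustration and are not the ones used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy, made-up examples (1 = disaster-related, 0 = not); not from the Kaggle data
texts = [
    "forest fire spreading near the village",
    "earthquake destroys several buildings downtown",
    "flood warning issued for the river area",
    "i love this new song so much",
    "what a great day at the beach",
    "my exam was a total disaster lol",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF turns each tweet into a sparse weighted word-count vector,
# then a linear model is fitted on those vectors.
model = make_pipeline(
    TfidfVectorizer(),
    SGDClassifier(random_state=0),  # default settings, chosen only for the sketch
)
model.fit(texts, labels)

preds = model.predict(["wildfire fire near the town"])
```

Replacing `TfidfVectorizer` with `CountVectorizer` gives the Bag of Words variant of the same pipeline.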

The results were the following:

image

A neural network built on a pre-trained embedding (Universal Sentence Encoder) had the best performance in terms of ROC score, F1 score and accuracy. However, classical machine learning models trained on Bag of Words (Neural Net and SGDC) performed at a similar level.

Perspectives

The best models could now be improved further by fine-tuning their hyperparameters.


5. Information

Tools

The notebook was developed with Visual Studio Code.

The Universal Sentence Encoder (USE) is available here

Authors & contributors

Author:

Helene alias @Bebock

The dream team:

Henri alias @HenriPuntous
Jean alias @Chedeta
Nicolas alias @NBridelance
