This project is one of the Getting Started challenges offered by Kaggle. As an instant means of communication, Twitter has become an important channel in times of emergency: it enables people to announce an emergency in real time. Because of this immediacy, a growing number of agencies are interested in automatically monitoring Twitter.
The objective is to build a machine learning model that predicts which tweets are about real disasters and which ones aren't.
The following libraries are required:
re
unidecode
spellchecker
contractions
string
math
pandas
numpy
plotly
matplotlib
seaborn
spacy
nltk
gensim
pyLDAvis
wordcloud
sklearn
xgboost
tensorflow
Kaggle provides a dataset of 10,000 tweets that were hand-classified as disaster-related or not. In the training set, around 43% of the tweets are disaster-related.
First of all, we extracted some quantitative features not directly linked to the content, in order to capture information that text analysis alone would miss (a sketch of this extraction follows the list):
- Length of the tweet, measured in number of characters and in number of words
- Average word length
- Number of exclamation marks
- Number of uppercase letters
- Number and presence of hashtags (#)
- Number and presence of mentions (@)
- Number and presence of URLs
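A minimal sketch of how such features could be derived with pandas (the column names are illustrative, not necessarily the notebook's actual ones):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # Kaggle train set with a "text" column

# Length-based features
train["n_chars"] = train["text"].str.len()
train["n_words"] = train["text"].str.split().str.len()
train["avg_word_len"] = train["n_chars"] / train["n_words"]

# Punctuation and case features
train["n_exclam"] = train["text"].str.count("!")
train["n_upper"] = train["text"].str.count(r"[A-Z]")

# Twitter-specific features: hashtags, mentions, URLs
train["n_hashtags"] = train["text"].str.count(r"#\w+")
train["n_mentions"] = train["text"].str.count(r"@\w+")
train["n_urls"] = train["text"].str.count(r"https?://\S+")
train["has_url"] = (train["n_urls"] > 0).astype(int)
```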
Statistically speaking, all of these characteristics were associated with the target variable (disaster tweet or not). However, the dataset is large enough that even tiny effects come out as statistically significant, so we focused on graphical exploration instead. It showed that disaster-related tweets contain longer words on average, fewer uppercase letters, more hashtags, far fewer mentions, and fewer URLs on average, although a higher proportion of disaster-related tweets contain at least one URL.
After pre-processing and lemmatization (a sketch of this step follows the list), the tweet contents were described with:
- Top bigrams according to the target variable (disaster-related or not)
- Wordclouds according to the target variable (disaster-related or not)
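As an illustration, a lemmatization pass with spaCy followed by bigram counting with scikit-learn might look like the sketch below; the notebook's actual cleaning pipeline also handles contractions, accents and spelling, which is omitted here:

```python
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer

train = pd.read_csv("train.csv")
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(text):
    # Lowercase lemmas, with stop words and non-alphabetic tokens dropped
    doc = nlp(text.lower())
    return " ".join(tok.lemma_ for tok in doc if not tok.is_stop and tok.is_alpha)

clean = train["text"].apply(lemmatize)

# Top 10 bigrams among disaster-related tweets (target == 1)
vec = CountVectorizer(ngram_range=(2, 2))
counts = vec.fit_transform(clean[train["target"] == 1])
freqs = counts.sum(axis=0).A1
top10 = sorted(zip(vec.get_feature_names_out(), freqs), key=lambda x: -x[1])[:10]
print(top10)
```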
Disaster tweets appeared less often associated with neutral or positive sentiment.
However, the sentiment analysis also revealed some questionable labels in the dataset. Displaying tweets coded as disaster-related but categorized as positive by the sentiment analysis, we can read, for example, the following tweets:
- "my favorite lady came to our volunteer meeting hopefully joining her youth collision and i am excite"
- "ok peace I hope I fall off a cliff along with my dignity"
- ":) well I think that sounds like a fine plan where little derailment is possible so I applaud you :)"

These tweets do not refer to disasters, which suggests that some examples in the training dataset are mislabeled (a sketch of how such tweets can be surfaced follows).
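A simple sentiment pass can flag these candidates; a sketch using NLTK's VADER analyzer (the notebook may rely on a different sentiment tool, and the 0.5 threshold is arbitrary):

```python
import pandas as pd
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

train = pd.read_csv("train.csv")
train["compound"] = train["text"].apply(lambda t: sia.polarity_scores(t)["compound"])

# Tweets labelled as disaster-related yet scored as clearly positive
suspicious = train[(train["target"] == 1) & (train["compound"] > 0.5)]
print(suspicious["text"].head(10).tolist())
```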
We tried three main approaches:
- Classical machine learning models on the previously extracted features
- Classical machine learning models on text re-encoded through TF-IDF or Bag of Words (see the pipeline sketch after this list)
- Neural networks
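A sketch of the second approach, pairing a TF-IDF representation with a linear classifier; the exact models, features and hyperparameters in the notebook differ:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.csv")  # Kaggle train set: "text" and "target" columns

# TF-IDF unigrams and bigrams feeding a linear classifier
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    SGDClassifier(random_state=42),
)

scores = cross_val_score(pipe, train["text"], train["target"], cv=5, scoring="f1")
print(scores.mean())
```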
The results were the following:

It appeared that a neural network built on a pre-trained embedding (Universal Sentence Encoder) had the best performance in terms of ROC score, F1 score and accuracy. However, classical machine learning models trained on Bag of Words representations (a neural net and an SGDClassifier) performed at a similar level.
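For reference, loading the Universal Sentence Encoder and stacking a small classifier on top might look like this; it assumes `tensorflow_hub` is installed, and the notebook's actual architecture and training settings may differ:

```python
import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub

train = pd.read_csv("train.csv")

# Frozen pre-trained USE embedding layer from TensorFlow Hub
use_layer = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder/4",
    input_shape=[], dtype=tf.string, trainable=False,
)

model = tf.keras.Sequential([
    use_layer,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
model.fit(train["text"].values, train["target"].values,
          epochs=3, validation_split=0.2)
```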
The best models could now be further improved by fine-tuning their hyperparameters.
The notebook has been developed with Visual Studio Code.
The USE is available here
Author:
Helene alias @Bebock
The dream team:
Henri alias @HenriPuntous
Jean alias @Chedeta
Nicolas alias @NBridelance



