Big-Data---Sentiment-Analysis

Design details:

Linear and nonlinear classification models from the sklearn module were used to make predictions, and the difference in their performance was analysed.
The preprocessing pipeline consists of Tokenizer, StopWordsRemover, HashingTF and Inverse Document Frequency (IDF), all from the PySpark module.

Surface-level implementation of each unit:

For linear classification, we used a Stochastic Gradient Descent classifier, Multinomial Naive Bayes and Perceptron; for non-linear classification, we used a Multilayer Perceptron with ReLU as the activation function.
The dataset was sent in batches to a pipeline that preprocessed the training data, transforming the tweet features, and then passed it to the models for training.
The models were trained batch by batch using the partial_fit function, which updates an existing model with each new batch (incremental learning).
The trained models were stored in pickle files, which were later used to predict the classes for the test dataset.
The results were visualised by plotting the accuracy of the predictions for each batch streamed to the model.
After training, the test dataset was streamed and, after applying the same preprocessing as in the training phase, predictions were made for its tweets.
F1-score, Precision, Recall and Confusion matrices were calculated on the predictions made for the test dataset.
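The training and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the synthetic `make_batch` helper, the batch count and size, the hidden layer size, the temporary file name and the choice to score each batch right after fitting on it are all assumptions made for the example. Only the sklearn calls themselves are real.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Perceptron, SGDClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
NUM_FEATURES = 128            # HashingTF size reported in the observations
CLASSES = np.array([0, 1])    # assumed binary sentiment labels


def make_batch(n):
    """Hypothetical stand-in for one streamed, preprocessed batch."""
    y = rng.integers(0, 2, size=n)
    # Non-negative features (MultinomialNB requires this), with a class-dependent shift.
    X = rng.random((n, NUM_FEATURES))
    X[:, :8] += y[:, None] * 0.5
    return X, y


models = {
    "sgd": SGDClassifier(random_state=0),
    "mnb": MultinomialNB(),
    "perceptron": Perceptron(random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), activation="relu",
                         random_state=0),
}

# Incremental learning: every batch updates each model via partial_fit.
batch_accuracy = {name: [] for name in models}
for _ in range(5):
    X, y = make_batch(200)
    for name, model in models.items():
        model.partial_fit(X, y, classes=CLASSES)
        # Assumption: accuracy is measured on the batch just seen.
        batch_accuracy[name].append(accuracy_score(y, model.predict(X)))

# Store one trained model in a pickle file and load it back for testing.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mlp_model.pkl")  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(models["mlp"], fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# Evaluate the restored model on a held-out "test" batch.
X_test, y_test = make_batch(300)
y_pred = restored.predict(X_test)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary", zero_division=0)
cm = confusion_matrix(y_test, y_pred, labels=CLASSES)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```

The per-batch values collected in `batch_accuracy` are what one would plot to visualise accuracy against the batches streamed.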
MiniBatch K-Means clustering was performed after training and obtaining the accuracy, with the number of clusters set to 2. The difference in performance with and without significant preprocessing steps, such as the removal of stop words, was also observed.
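A rough sketch of the clustering step, assuming it was also fitted batch by batch: sklearn's `MiniBatchKMeans` supports `partial_fit`, but whether the project used it that way is not stated here. The synthetic `make_batch` helper and all sizes are hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)


def make_batch(n, num_features=128):
    """Hypothetical batch: two loosely separated groups of feature vectors."""
    labels = rng.integers(0, 2, size=n)
    X = rng.random((n, num_features))
    X[:, :8] += labels[:, None] * 2.0
    return X, labels


# Two clusters, matching the two sentiment classes.
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, n_init=3)
for _ in range(5):
    X_batch, _unused = make_batch(200)
    kmeans.partial_fit(X_batch)  # update the centroids with each batch

X_test, y_test = make_batch(300)
clusters = kmeans.predict(X_test)

# Cluster ids are arbitrary, so check agreement under both labelings.
agreement = max(np.mean(clusters == y_test), np.mean(clusters != y_test))
print(f"agreement with true classes: {agreement:.3f}")
```

Because cluster ids carry no meaning of their own, any comparison of clusters with true or predicted classes has to allow for the labels being swapped, as the last step does.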

Reasons behind design decisions:

The ReLU activation function was used for non-linear classification because it has generally given good results compared with the other available activation functions.
Tokenizer was used to break the tweet strings into tokens, which were then passed to the later preprocessing stages.
StopWordsRemover was used as part of the preprocessing to filter out the nltk stop words.
HashingTF was used to convert the tokenized tweets into vectors of fixed size.
Inverse Document Frequency (IDF) was used to weight each word in the tweets by its relevance: it assigns each word a number that is lower the more tweets the word appears in.

Takeaways from the project:

Usage of different Spark built-in functions
Streaming data
Operations that can be performed on data frames
Effects of parameters on predictions in terms of accuracy and time taken to execute
Incremental learning

Observations and conclusions:

After varying the hyperparameters, it was observed that using 128 features for HashingTF gives a good result in terms of accuracy and execution time.
A batch size of 15200 was more feasible than a higher batch size of 152000 or a lower batch size of 1520.
ReLU as the activation function for the non-linear Multilayer Perceptron model gave good results. Changing it to tanh did not show any significant improvement in accuracy. The difference between the results from the two activation functions for the Multilayer Perceptron is visualised below.
Non-linear classification gave predictions with better accuracy than linear classification.
Decreasing order of accuracy: MLP > MNB > SGD > Perceptron.
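A comparison like the ReLU versus tanh experiment could be set up as sketched below. Everything about the data is hypothetical (synthetic features, 5 batches of 200 samples, one hidden layer of 32 units), so the printed accuracies say nothing about the project's actual results; the point is only that both activations see the same stream of batches.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

CLASSES = np.array([0, 1])  # assumed binary sentiment labels


def make_batch(rng, n, num_features=128):
    """Hypothetical stand-in for one preprocessed batch of tweets."""
    y = rng.integers(0, 2, size=n)
    X = rng.random((n, num_features))
    X[:, :8] += y[:, None] * 0.5
    return X, y


def stream_accuracy(activation, n_batches=5, batch_size=200):
    """Train an MLP incrementally and return its accuracy on a test batch."""
    rng = np.random.default_rng(0)  # identical data stream for each activation
    model = MLPClassifier(hidden_layer_sizes=(32,), activation=activation,
                          random_state=0)
    for _ in range(n_batches):
        X, y = make_batch(rng, batch_size)
        model.partial_fit(X, y, classes=CLASSES)
    X_test, y_test = make_batch(rng, 300)
    return accuracy_score(y_test, model.predict(X_test))


results = {act: stream_accuracy(act) for act in ("relu", "tanh")}
print(results)
```

Re-seeding the generator inside `stream_accuracy` keeps the comparison fair: any difference in the result comes from the activation function, not from different batches.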

Folders and files:

data
README.md
classifier_test.py
classifier_train.py
cluster_plot.py
cluster_test.py
cluster_train.py
mlb_model_15200.pkl
mlp_model_15200.pkl
perceptron_model_15200.pkl
pred.npy
sgd_model_15200.pkl
true.npy

Performance of the two activation functions in terms of accuracy:

Plots without stop words removal:

Plots with stop words removal:

Clustering for predictions made for the classes the tweets belong to:

Clustering for true class values that the tweets belong to:
