davejeon/PolAna

PolAna


DSR Portfolio Project Political Analyser

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact
  8. Acknowledgments

About The Project

The Data Science Retreat intensive programming boot camp culminates in a portfolio project. We chose to address the growing manipulation and influencing of voters through online media.

Our main question was:

Is there a way for a user to monitor whether they are being influenced by media articles?

Media bias has contributed to people becoming more politically polarized. When we can’t identify, understand and appreciate diverse perspectives, we are more likely to be manipulated into thinking or voting a certain way.

The online domain is a slippery slope: within a matter of clicks, a person can drift far from their original political disposition, going "deeper down the rabbit hole".

This project aims to create a personal media tracker based on a person's online reading history, notifying the user when they are viewing unusual material and risk being influenced or manipulated.

(back to top)

Built With

...love... no, not really.

The project was implemented using the following packages:

  • NumPy
  • pandas
  • Matplotlib
  • category_encoders (for binary encoding)
  • feature_engine (for frequency encoding)
  • scikit-learn
  • LightGBM
  • CatBoost
  • XGBoost

as well as the standard library modules os, pickle, time, datetime and glob.

(back to top)

Getting Started

To obtain the required data, you will need to create an account on Driven Data (https://www.drivendata.org/). Once registered, the data can be downloaded from the following page:

https://www.drivendata.org/competitions/57/nepal-earthquake/

Prerequisites

Ensure that the required packages listed under "Built with" have been installed and are up to date.

(back to top)

Usage



The training data labels and values are imported and merged into a single dataset, which is then used for data visualisation.

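A minimal sketch of the merge step, assuming (as in the competition data) that values and labels arrive as two tables keyed on building_id; the tiny frames here are stand-ins for the real CSVs:

```python
import pandas as pd

# Stand-ins for pd.read_csv("train_values.csv") / pd.read_csv("train_labels.csv")
values = pd.DataFrame({"building_id": [1, 2, 3], "age": [10, 25, 5]})
labels = pd.DataFrame({"building_id": [1, 2, 3], "damage_grade": [3, 2, 1]})

# Merge on the shared key into one training frame
train = values.merge(labels, on="building_id", how="inner")
```

An inner merge keeps only buildings present in both tables, so stray ids cannot introduce rows with missing labels.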

Subsequently, the features for the model need to be built and selected. The code specifies two iterations: a vanilla build, which makes no changes to the data other than dropping all categorical and low-correlation columns, and a routine build, in which categorical data are encoded with binary and frequency encoders and the data is further modified, for example normalised and stripped of outliers.

To use this routine on a different data set, you will have to edit build_features and value_column_string in split_train_dataset, as these are specific to Richter's Predictor: Modeling Earthquake Damage (https://www.drivendata.org/competitions/57/nepal-earthquake).

use_vanilla_data is a global variable because it must be consistent between the train and test datasets.
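The two feature-building paths can be sketched as below. This is a simplified stand-in for the project's build_features: the frequency encoding is done with plain pandas rather than feature_engine, and the sample frame is hypothetical:

```python
import pandas as pd

# Global flag so train and test receive identical treatment (as described above)
use_vanilla_data = False

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    cat_cols = df.select_dtypes(include="object").columns
    if use_vanilla_data:
        # Vanilla path: simply drop categorical columns
        return df.drop(columns=cat_cols)
    # Routine path: frequency-encode categoricals (stand-in for the
    # category_encoders / feature_engine encoders used in the project)
    for col in cat_cols:
        df[col] = df[col].map(df[col].value_counts(normalize=True))
    return df

sample = pd.DataFrame({"roof_type": ["n", "n", "q"], "age": [10, 25, 5]})
encoded = build_features(sample)
```

Frequency encoding replaces each category with its relative frequency, so common and rare roof types become numerically distinguishable without creating one column per category.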

(back to top)

Split the train dataframe according to value_column_string and the train_test_split parameters.
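A sketch of such a split helper, with the function name and value_column_string parameter taken from the description above and a synthetic frame standing in for the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_dataset(df, value_column_string, **tts_params):
    # value_column_string names the label column; everything else is a feature
    X = df.drop(columns=[value_column_string])
    y = df[value_column_string]
    return train_test_split(X, y, **tts_params)

df = pd.DataFrame({"age": range(10), "damage_grade": [1, 2] * 5})
X_tr, X_te, y_tr, y_te = split_train_dataset(
    df, "damage_grade", test_size=0.2, random_state=0
)
```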

The routine will go through all sklearn classifiers from the classifier comparison example: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

plus, additionally: sklearn.tree.DecisionTreeClassifier, sklearn.linear_model.SGDClassifier, XGBoost, CatBoost and LightGBM.

Use only the baseline model set with use_baseline_models=True (default: False).
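The model loop can be sketched as follows. The model names and the two-model "baseline" set are illustrative only; the real routine builds the full sklearn comparison set plus XGBoost, CatBoost and LightGBM:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

use_baseline_models = True  # as described: default is False

# Hypothetical baseline set; the full set would extend this dict
baseline = {
    "tree": DecisionTreeClassifier(random_state=0),
    "sgd": SGDClassifier(random_state=0),
}
full = {**baseline}  # placeholder; the extended model set is omitted here

X, y = make_classification(n_samples=100, random_state=0)
models = baseline if use_baseline_models else full
fitted = {name: est.fit(X, y) for name, est in models.items()}
```

Keeping the estimators in a name-keyed dict makes the later per-model scoring and result files straightforward to generate.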

Two mutually exclusive options control hyperparameter optimization: enable GridSearchCV for all models with grid_cv=True (default: False), or enable RandomizedSearchCV for all models with random_cv=True (default: False).

GridSearchCV parameter ranges and RandomizedSearchCV starting points were chosen by rule of thumb for each classifier class. Hyperparameter optimization takes a considerable amount of time, so use it with caution.
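The two mutually exclusive search modes can be sketched like this; the parameter grid below is an illustrative rule-of-thumb range, not the project's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

grid_cv, random_cv = True, False  # mutually exclusive, both default to False
param_grid = {"max_depth": [3, 5, None]}  # illustrative range only

X, y = make_classification(n_samples=120, random_state=0)
est = DecisionTreeClassifier(random_state=0)

if grid_cv:
    # Exhaustive search over every grid combination
    search = GridSearchCV(est, param_grid, cv=3)
elif random_cv:
    # Sampled search: n_iter draws from the same parameter space
    search = RandomizedSearchCV(est, param_grid, n_iter=3, cv=3, random_state=0)

search.fit(X, y)
```

GridSearchCV cost grows multiplicatively with each added parameter range, which is why the README warns that optimization takes considerable time.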

The method tests and scores each model with micro- and macro-averaged F1 scores. Additionally, a cross-validation score is generated for the train dataset.
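The scoring step described above can be sketched as follows, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=120, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Micro averaging weights every sample equally; macro averages the
# per-class F1 scores, so rare classes count as much as common ones
f1_micro = f1_score(y_te, pred, average="micro")
f1_macro = f1_score(y_te, pred, average="macro")

# Cross-validation score on the training split, as described above
cv_scores = cross_val_score(model, X_tr, y_tr, cv=3)
```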

Create the test dataset to generate results for upload

Apply the test dataset to all trained models and generate results. Results are written to a separate file per model under ../data/results.
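A sketch of the results step, writing one submission file per model. The prediction values, file-naming scheme and building ids are stand-ins, and a temporary directory replaces the project's ../data/results:

```python
import os
import tempfile

import pandas as pd

# Stand-in per-model predictions on the test set
predictions = {"tree": [3, 2, 1], "sgd": [2, 2, 1]}

results_dir = tempfile.mkdtemp()  # ../data/results in the project
for name, preds in predictions.items():
    out = pd.DataFrame({"building_id": [101, 102, 103], "damage_grade": preds})
    # One file per model, named after the model key
    out.to_csv(os.path.join(results_dir, f"{name}_results.csv"), index=False)

written = sorted(os.listdir(results_dir))
```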

Execute, run away, and pray for the best and pray to God you haven't ruined Shishtoff's code.

Roadmap

  • Download data
  • Review data for patterns and/or discrepancies
  • Clean data
  • Build and select features for use in model
  • Select model and fit on training data
  • Use the model to make predictions for test data
  • Call it a day and go grab a beer

(back to top)

Contributing

A special shout out to my boy Shishtoff, to Paul, and, last but not least, to the sweet flower of the office: me, David.

(back to top)

License

Used for learning purposes only. Not to be distributed.

(back to top)

Acknowledgments

Special shout out to the DSR! Go Quesadas!

(back to top)
