CERProject23MC

University project for the seminar "Computational Empirical Research" by

  • Marius B.
  • Cuong V. T.

Data Collection

  • The original Twitter dataset of the German federal election 2021 is in EPINetz_TwitterPoliticians_2021.csv
  • The scraping folder contains the Python script getTwitterData.py, which uses the Rettiwt-API to fetch the profile descriptions of Twitter users
  • The complete and cleaned dataset with 2048 Twitter users and their profile descriptions can be found in username_with_description_cleaned.csv
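
As an illustration of the kind of cleaning step that leads to a file like username_with_description_cleaned.csv, here is a minimal sketch. It is not the project's actual cleaning code: the column names `username` and `description` and the sample rows are assumptions made up for this example.

```python
import csv
import io


def load_profiles(csv_file, text_column="description"):
    """Return the rows whose description is non-empty after whitespace cleanup.

    `text_column` is a hypothetical column name; adjust it to the real header.
    """
    profiles = []
    for row in csv.DictReader(csv_file):
        # Collapse runs of whitespace and trim the ends.
        text = " ".join((row.get(text_column) or "").split())
        if text:
            row[text_column] = text
            profiles.append(row)
    return profiles


# Made-up sample rows standing in for the real file.
sample = io.StringIO(
    "username,description\n"
    "user_a,  Member of   parliament \n"
    "user_b,\n"
    "user_c,Candidate for city council\n"
)
profiles = load_profiles(sample)
print(len(profiles), "profiles with a description")
```

To run the same function on the real file, pass it a handle opened with `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.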

FileHandling & Classification

  • Used to prepare and handle the different files with labeled and unlabeled data
  • Predicted Labels contains the CSVs with the 350 labels predicted during the small-text active learning process
  • True Labels contains an XLSX with the 350 predicted and the manually coded labels
  • initial_train_bert.csv & initial_train_bert.csv are the training files for the different machine learning methods
  • abeled_profiles_candm.xlsx contains the 200 examples labeled manually by both coders; labeled_testdata_bert.csv & labeled_testdata_smalltext.csv are the corresponding datasets for the test run
  • LabeledandUnlabeled_Profiles.xlsx contains the 70 prelabeled texts plus all other unlabeled texts
  • A simple list of the categories can be found in Labels.txt
  • profiledescriptions_withpartyanduserid.csv lists all profiles that have a profile description, together with their twitter_handle, user_id and party
  • The file trainingdata_classifier_cer.csv holds the training data for the small-text active classification process
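
Since the files above hold both predicted and manually coded labels, as well as labels from two coders, one common way to compare two label lists is the raw agreement rate together with Cohen's kappa. The sketch below shows that calculation in plain Python. Whether the project computed these measures is not stated here, and the category names are invented placeholders rather than the contents of Labels.txt.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Share of items on which both label lists agree."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement_rate(labels_a, labels_b)
    n = len(labels_a)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both lists use a single identical category; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Placeholder labels, e.g. manual coding vs. model prediction.
manual = ["politician", "journalist", "politician", "other"]
predicted = ["politician", "journalist", "other", "other"]

rate = agreement_rate(manual, predicted)
kappa = cohens_kappa(manual, predicted)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

Kappa is lower than the raw agreement rate because part of the agreement would be expected even if labels were assigned at random according to each list's category frequencies.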

small-text

  • small-text/FirstPrediction_Classifier_CER.ipynb produces the first prediction, small-text/Classifier_Main_CER.ipynb the following predictions, and small-text/ClassificationTest_CER.ipynb runs the evaluation test
  • small-text/class_balancer is from the small-text library
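
The notebooks above follow a pool-based active learning workflow: a classifier predicts on the unlabeled pool, the texts it is least sure about are handed to a human coder, and the classifier is retrained on the enlarged labeled set. The sketch below illustrates one such round with a least-confidence query in plain Python. It deliberately does not use the small-text API, the query strategy used in the notebooks is not stated here, and all names (`predict_proba`, `ask_coder`, the toy probabilities) are hypothetical.

```python
def least_confidence_query(probabilities, batch_size):
    """Return the ids whose most likely class has the lowest probability.

    `probabilities` maps an item id to a list of class probabilities.
    """
    ranked = sorted(probabilities, key=lambda i: max(probabilities[i]))
    return ranked[:batch_size]


def active_learning_round(labeled, unlabeled, predict_proba, ask_coder, batch_size):
    """Move the most uncertain texts from the pool to the labeled set.

    The caller is expected to retrain the classifier on `labeled` afterwards.
    """
    probabilities = {i: predict_proba(text) for i, text in unlabeled.items()}
    for i in least_confidence_query(probabilities, batch_size):
        text = unlabeled.pop(i)
        labeled[i] = (text, ask_coder(text))
    return labeled, unlabeled


# Toy stand-ins for a trained classifier and a human coder.
toy_probabilities = {"a": [0.9, 0.1], "b": [0.55, 0.45], "c": [0.7, 0.3]}
labeled, unlabeled = active_learning_round(
    labeled={},
    unlabeled={0: "a", 1: "b", 2: "c"},
    predict_proba=toy_probabilities.__getitem__,
    ask_coder=lambda text: "label_x",
    batch_size=1,
)
print(labeled)
```

In the toy run, text "b" is selected because its top class probability (0.55) is the lowest in the pool.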

BERT

  • The Jupyter notebooks for fine-tuning the different BERT models are in the jupyter-notebooks directory
  • bert_1860.ipynb uses the big dataset of 1860 profile descriptions
  • bert_420.ipynb uses the small dataset of 420 profile descriptions
  • The variable pretrained_LM selects which BERT model or other language model is fine-tuned
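
A minimal sketch of how a variable like pretrained_LM can switch between language models, assuming the notebooks load models through the Hugging Face transformers library (this is not confirmed here). The short names and the three German models listed are illustrative choices, not necessarily the ones used in the notebooks, and `num_labels` must match the number of categories in Labels.txt.

```python
# Hypothetical short names mapped to Hugging Face model identifiers.
CANDIDATE_MODELS = {
    "bert-german": "bert-base-german-cased",
    "gbert": "deepset/gbert-base",
    "distilbert-german": "distilbert-base-german-cased",
}

# Change this one line to fine-tune a different model.
pretrained_LM = CANDIDATE_MODELS["gbert"]


def load_for_classification(model_name, num_labels):
    """Load a tokenizer and a sequence classification model by name.

    The import is local so that this sketch can be read and imported
    without transformers installed; calling it downloads the model.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels
    )
    return tokenizer, model
```

Calling `load_for_classification(pretrained_LM, num_labels=...)` with the actual number of categories then returns the tokenizer and model to fine-tune.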
