University project for the seminar "Computational Empirical Research" by
- Marius B.
- Cuong V. T.
- The original Twitter dataset of the German Federal Election 2021 is in `EPINetz_TwitterPoliticians_2021.csv`
- The `scraping` folder contains the Python script `getTwitterData.py`, which uses the Rettiwt-API to fetch the profile descriptions of Twitter users
- The complete and cleaned dataset with 2048 Twitter users and their profile descriptions can be found in `username_with_description_cleaned.csv`
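Conceptually, the cleaning step amounts to keeping only the rows that actually carry a profile description. A minimal sketch of that filter (the column names `username` and `description` are assumptions for illustration, not taken from the project files):

```python
import csv
import io

def load_profiles(csv_file):
    """Read profile rows, dropping users without a profile description.

    Column names are hypothetical; the real CSV may use different headers.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["description"].strip()]

# Tiny inline sample standing in for username_with_description_cleaned.csv
sample = io.StringIO(
    "username,description\n"
    "user_a,Member of Parliament\n"
    "user_b,\n"
)
profiles = load_profiles(sample)
print(len(profiles))  # 1 -- only user_a has a non-empty description
```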
- Used to prepare and handle the different files with labeled and unlabeled data
- `Predicted Labels` contains the CSVs with the 350 predicted labels from the small-text active learning process
- `True Labels` contains an XLSX with the 350 predicted and the manually coded labels
- `initial_train_bert.csv` & `initial_train_smalltext.csv` are the training files for the different machine learning methods
- `labeled_profiles_candm.xlsx` contains the 200 manually labeled examples from both coders; `labeled_testdata_bert.csv` & `labeled_testdata_smalltext.csv` are the corresponding datasets for the test run
- `LabeledandUnlabeled_Profiles.xlsx` contains the 70 prelabeled texts plus all other unlabeled texts
- A simple list of the categories can be found in `Labels.txt`
- `profiledescriptions_withpartyanduserid.csv` contains all profiles that have a profile description, with their `twitter_handle`, `user_id`, and party
- `trainingdata_classifier_cer.csv` holds the training data for the small-text active learning classification process
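Since the 200 examples in the labeling file come from two coders, a quick sanity check on the manual coding is plain percent agreement between the two label columns. A hedged sketch (the category names and coder labels below are made up for illustration):

```python
def percent_agreement(labels_a, labels_b):
    """Share of items on which both coders assigned the same category."""
    assert len(labels_a) == len(labels_b), "coders must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels from the two coders for four profiles
coder_c = ["politics", "private", "politics", "campaign"]
coder_m = ["politics", "private", "campaign", "campaign"]
print(percent_agreement(coder_c, coder_m))  # 0.75
```

Chance-corrected measures such as Cohen's kappa would be the next step, but raw agreement is enough to spot obvious coding drift.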
- `small-text/FirstPrediction_Classifier_CER.ipynb` is for the first prediction, `small-text/Classifier_Main_CER.ipynb` for the following ones, and `small-text/ClassificationTest_CER.ipynb` for the evaluation test
- `small-text/class_balancer` is from the small-text library
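The notebooks drive a pool-based active learning loop with small-text. The core of such a loop is the query step: rank the unlabeled pool by model uncertainty and hand the most uncertain texts to the coders. The "breaking ties" strategy, for example, picks the items with the smallest margin between the two most probable classes. A from-scratch sketch of that idea, not the project's actual small-text code:

```python
def breaking_ties_query(probas, k):
    """Return the k pool indices with the smallest margin between the two
    most confident classes (the 'breaking ties' uncertainty strategy)."""
    margins = []
    for i, p in enumerate(probas):
        top_two = sorted(p, reverse=True)[:2]
        margins.append((top_two[0] - top_two[1], i))
    # Smallest margin = model is least sure between its top two classes
    return [i for _, i in sorted(margins)[:k]]

# Hypothetical class probabilities for four unlabeled profile descriptions
probas = [
    [0.90, 0.05, 0.05],  # confident -> large margin, queried last
    [0.40, 0.35, 0.25],  # uncertain -> small margin
    [0.50, 0.48, 0.02],  # most uncertain -> queried first
    [0.70, 0.20, 0.10],
]
print(breaking_ties_query(probas, 2))  # [2, 1]
```

In small-text itself this step is encapsulated by a query strategy object passed to the pool-based active learner, so the notebooks only need to call the library's query/update cycle.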
- The Jupyter notebooks for finetuning the different BERT models are in the `jupyter-notebooks` directory
- `bert_1860.ipynb` uses the big dataset of 1860 profile descriptions; `bert_420.ipynb` uses the small dataset of 420 profile descriptions
- With the variable `pretrained_LM` one can choose between the BERT models and other language models
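Switching models via a single `pretrained_LM` variable typically means holding a checkpoint name that the rest of the notebook consumes. A minimal sketch of that pattern; the candidate names below are assumptions, not the notebooks' actual options, and the commented lines assume Hugging Face transformers is the finetuning backend:

```python
# Hypothetical candidate checkpoints; the notebooks' real choices may differ.
CANDIDATE_LMS = {
    "bert-german": "bert-base-german-cased",
    "bert-multilingual": "bert-base-multilingual-cased",
}

# Change this one assignment to switch the whole finetuning run to another model
pretrained_LM = CANDIDATE_LMS["bert-german"]

# Downstream, the name would be handed to the model/tokenizer setup, e.g.:
#   tokenizer = AutoTokenizer.from_pretrained(pretrained_LM)
#   model = AutoModelForSequenceClassification.from_pretrained(
#       pretrained_LM, num_labels=num_categories)
print(pretrained_LM)
```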