This repository contains code for two publications. One relates to demographic embeddings and the other is about personalizing embeddings and language models for specific users.
Note: We were not able to share data directly due to licensing issues. However, the data we downloaded is available here, and we provide scripts to perform the extraction in the same way as described in our paper. The 'data' folder of this repository contains generated JSON files with fake data on which the scripts can be tested; these files contain fake authors with single-letter names in [a-z].
More details of experiments run with this code can be found in our papers on demographic and personalized embeddings.
If you use this code for demographic embeddings please cite:
@InProceedings{emnlp20compositional,
title = {Compositional Demographic Word Embeddings},
author = {Charles Welch and Jonathan K. Kummerfeld and Ver{\'o}nica P{\'e}rez-Rosas and Rada Mihalcea},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing},
month = {November},
year = {2020},
location = {Online}
}
If you use this code for personalized word embeddings for individual users please cite:
@InProceedings{coling20personal,
title = {Exploring the Value of Personalized Word Embeddings},
author = {Charles Welch and Jonathan K. Kummerfeld and Ver{\'o}nica P{\'e}rez-Rosas and Rada Mihalcea},
booktitle = {Proceedings of the 28th International Conference on Computational Linguistics},
month = {December},
year = {2020},
location = {Online}
}
If you want to sample and annotate new 'i am' posts you can use compose/sample_i_posts.py. The functions for extracting attributes from users are in compose/find_self_statements.py. Run this script to produce lists of users for each attribute. Replace DIR_LOC with the location of your Reddit data.
python find_self_statements.py --type gender
python find_self_statements.py --type location
python find_self_statements.py --type religion
python find_self_statements.py --type age
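The exact extraction logic lives in compose/find_self_statements.py; as a rough illustration only, a self-statement extractor for two of the attributes might look like the following (the regex patterns and function names here are hypothetical, not the repository's):

```python
# Hypothetical sketch of extracting attributes from "i am" statements; the real
# logic in compose/find_self_statements.py may differ substantially.
import re

AGE_PATTERN = re.compile(r"\bi am (\d{1,2})\b", re.IGNORECASE)
GENDER_PATTERN = re.compile(r"\bi am a (man|woman)\b", re.IGNORECASE)

def extract_age(post_text):
    """Return a stated age if the post contains an 'i am <number>' statement."""
    match = AGE_PATTERN.search(post_text)
    return int(match.group(1)) if match else None

def extract_gender(post_text):
    """Return a stated gender if the post contains an 'i am a <gender>' statement."""
    match = GENDER_PATTERN.search(post_text)
    return match.group(1).lower() if match else None

if __name__ == "__main__":
    print(extract_age("Hi, I am 25 and new here."))    # 25
    print(extract_gender("i am a woman from Canada"))  # woman
```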
After running this for location, you will have files for locations in the demographic folder. You can then run resolve_locations.py from the compose folder to resolve locations to the following set, which was based on the amount of available data for each region (a mapping sketch follows the list):
- USA
- Asia
- Oceania
- United Kingdom
- Europe
- Africa
- Mexico
- South America
- Canada
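For illustration, here is a minimal sketch of how raw stated locations could be resolved to the nine regions above. The REGION_MAP entries are made-up examples; the real mapping in resolve_locations.py is far more complete.

```python
# Illustrative only: map a raw stated location string to one of the nine regions.
REGION_MAP = {
    "new york": "USA",
    "london": "United Kingdom",
    "sydney": "Oceania",
    "toronto": "Canada",
    "berlin": "Europe",
    "tokyo": "Asia",
    "lagos": "Africa",
    "mexico city": "Mexico",
    "buenos aires": "South America",
}

def resolve_location(raw_location):
    """Return the region for a stated location, or None if it is not recognized."""
    return REGION_MAP.get(raw_location.strip().lower())

print(resolve_location("London"))  # United Kingdom
print(resolve_location("Narnia"))  # None
```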
Next, run complete_authors.py --find, which will create a complete_authors file in the demographic folder; this file is used later when creating the matrix embeddings. There are a few important constants at the top of complete_authors.py that can be changed:
- MAX_AGE and MIN_AGE are the range of accepted ages. If a user states an age outside of this range, they will be excluded from the complete_authors list.
- MAX_RANGE is the largest accepted difference in stated ages. This acts as a simple heuristic to exclude users with inconsistent ages; for instance, a user who states they are 20 in 2002 and 30 in 2003 cannot be correct both times. Our data spans 9 years, so we default this value to 9.
- N_UNKS is the number of unknown demographic values that causes a user to be excluded. If you would like no users to be excluded, set this to the number of demographic variables plus one (4+1 for us). If you would like to exclude only users who have no known demographic values, set this to the number of demographic variables (4 for us).
- MIN_PERCENT is the exclusion criterion for users who state both gender identities recognized by this study: if the less often expressed value exceeds MIN_PERCENT, the user is excluded.
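As a rough, illustrative sketch of how these constants interact (the numeric values for MAX_AGE, MIN_AGE, and MIN_PERCENT below are assumed examples, and the filtering function is not the repository's actual code):

```python
# Illustrative values only; set these to match your data.
MAX_AGE = 80       # assumed example value
MIN_AGE = 13       # assumed example value
MAX_RANGE = 9      # our data spans 9 years, per the default above
N_UNKS = 4         # exclude users with no known demographic values (4 variables)
MIN_PERCENT = 0.1  # assumed example threshold for conflicting gender statements

def keep_author(stated_ages, demographics, gender_counts):
    """Return True if a user passes the exclusion criteria described above."""
    if stated_ages:
        if min(stated_ages) < MIN_AGE or max(stated_ages) > MAX_AGE:
            return False
        if max(stated_ages) - min(stated_ages) > MAX_RANGE:
            return False
    # demographics: {"age": ..., "gender": ..., "location": ..., "religion": ...}
    unknowns = sum(1 for value in demographics.values() if value is None)
    if unknowns >= N_UNKS:
        return False
    # gender_counts: how often each gender identity was stated by this user.
    if len(gender_counts) == 2:
        minority_share = min(gender_counts.values()) / sum(gender_counts.values())
        if minority_share > MIN_PERCENT:
            return False
    return True
```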
1. Put the speaker names for speakers of interest in top_speakers. If you would like to run find_bots_in_list.py at this point, you can; it will output the names of speakers in your file that are known bots, which you can then remove manually if you'd like. A script to remove them can instead be run at step 6.
2. Run get_ts_posts.py -p -d all to get posts from this set of speakers for all years.
3. Run merge_ts_posts.py to combine these files and output author_json files in all_posts.
4. Run preprocess_all_posts.py to preprocess the all_posts/author_json files.
5. Run sentence_tokenize.py to run the CoreNLP tokenizer on all posts.
6. Run rm_known_bots.py to remove files in all_posts belonging to known bots.
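For reference, steps 2 through 6 correspond to the following command sequence (assuming each script is invoked with python and needs no arguments beyond those listed above):

python get_ts_posts.py -p -d all
python merge_ts_posts.py
python preprocess_all_posts.py
python sentence_tokenize.py
python rm_known_bots.py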
The highest-performing embeddings described in our papers use separate matrices for each demographic value or user and are learned using Bamman et al.'s 2014 code. I have compiled separate JAR files and included separate config files for each demographic scenario. Note: If you also plan to run the language model, make sure to first separate out a sample of data for training the model.
- First run prepare_demographic_embed_data.py to create the java_ctx_embeds_reddit_demographic file containing relevant data and the reddit_vocab file.
- In the embeddings folder, run the shell script for each demographic (./run_VARIABLE.sh) to generate embeddings.
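Assuming VARIABLE stands for the four demographic names used elsewhere in this README (the exact shell script filenames are an assumption), the runs for the demographic embeddings would look like:

./run_age.sh
./run_gender.sh
./run_location.sh
./run_religion.sh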
- First run prepare_author_embed_data.py to create the sts_combined_users file containing text with user labels for training embeddings and a reddit_user_vocab file.
- In the embeddings folder, run the shell script ./run_users.sh to generate embeddings.
The language model code we use is modified from Merity et al.'s 2018 code and is in the awd_lstm directory. The modifications allow loading multiple pretrained embedding matrices with which to initialize the model, and allow freezing and untying the embedding weights, as we found this works well when there is enough in-domain data to pretrain on.
To run with all demographic embeddings, use:

python main.py --batch_size 20 --data /path/to/data --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 50 --save reddit_demo.pt --usepre --pre /path/to/demo/embeds/embeddings.without.suffix --emsize 100 --burnin 99 --demouse cat --usedemo
- data is a path to a folder with a train.txt, valid.txt, and test.txt in the format used by the JARs above to create embeddings.
- pre is the folder containing pretrained embeddings. The filenames must end in age, gender, location, or religion, and the parameter passed at run time should omit this ending (e.g. our scripts in this repo create embeddings with the prefix embed_test_, so this is what would be passed).
- burnin determines for how many epochs the embeddings will not be updated. Here, since it is larger than the number of epochs, they will be frozen the whole time.
- usedemo is a flag that tells the model to use demographic embeddings.
- demouse can be cat or sum and determines whether to concatenate or add the demographic embeddings (see the sketch after this list).
- If you would like to use one demographic instead of all of them, use useone and specify which demographic to use.
- If you have a second test set to evaluate on, use test to specify the name of the second file relative to the folder specified by data.
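As a conceptual illustration of what --demouse cat versus sum and the --burnin freezing do, here is a minimal PyTorch-style sketch; it is not the repository's main.py, and the module structure, shapes, and combination order are assumptions for illustration only.

```python
import torch
import torch.nn as nn

emsize, vocab_size = 100, 10000

# One pretrained embedding matrix per demographic variable (loaded via --pre).
demo_embeds = nn.ModuleDict({
    demo: nn.Embedding(vocab_size, emsize)
    for demo in ("age", "gender", "location", "religion")
})

def embed(word_ids, demouse="cat"):
    """Look up each word in every demographic space and combine the results."""
    vectors = [emb(word_ids) for emb in demo_embeds.values()]  # each (batch, seq, emsize)
    if demouse == "cat":
        return torch.cat(vectors, dim=-1)                      # (batch, seq, 4 * emsize)
    return torch.stack(vectors, dim=0).sum(dim=0)              # (batch, seq, emsize)

def set_burnin(epoch, burnin):
    """Embeddings receive no gradient updates until epoch >= burnin; with
    --burnin 99 and --epoch 50 they stay frozen for the entire run."""
    for emb in demo_embeds.values():
        emb.weight.requires_grad_(epoch >= burnin)
```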
- First prepare the data by running prepare_aa_data.py, which will create a separate aa.txt evaluation file for each user, in addition to the train/validation/test split for training language models. The code for user models is similar to the code for the demographic language model.
- To create a model for a single user, use the code in the single_user folder and run: python main.py --batch_size 20 --data /path/to/data/aa/ --dropouti 0.4 --dropouth 0.25 --seed 141 --epoch 50 --save /path/to/output/models/ --nhid 550 --pre /path/to/user/embedding/matrices/ --emsize 100 --usepre --burnin 99 --induse cat --name $uname --aaeval --useind
- name is the name of the user for which the model is being created.
- aaeval will run the user's model on each other user's data and produce output files with perplexities that can be used later to determine which model had the best predictions (see the sketch after this list for one way to compare these files).
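The README does not specify the format of the --aaeval output files, so the following is only a hypothetical sketch of how one might pick the best model per user, assuming one file per (model user, evaluated user) pair containing a single perplexity value.

```python
import glob
import os

def best_model_for(eval_user, output_dir):
    """Return the model user whose perplexity on eval_user's aa.txt is lowest."""
    scores = {}
    # Assumed filename pattern: <model_user>_<eval_user>.ppl, one float per file.
    for path in glob.glob(os.path.join(output_dir, "*_" + eval_user + ".ppl")):
        model_user = os.path.basename(path).rsplit("_", 1)[0]
        with open(path) as f:
            scores[model_user] = float(f.read().strip())
    return min(scores, key=scores.get) if scores else None
```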
- Follow the steps for creating embeddings for users above. The scripts for plotting category distributions are not currently available for demographic embeddings.
- Run vocab_counts.py to get word frequencies per user.
- Run compare_spaces.py to get word distances to the generic embedding space per user (a sketch of this comparison follows the list).
- Run plot_cats_dist.py to generate the graphs. Options are available for POS tags, LIWC, and ROGET categories. Figures can be saved or displayed with additional options; see --help for more details.
- token_counter.py outputs a file called token_counts that contains counts of tokens from all speakers in your top_speakers file.
- lexicon_map.py is used by plot_cat_dist.py to graph LIWC and ROGET word class distributions and requires LIWC_PATH and ROGET_PATH to be set in lexicon_map.py lines 9-10.
- Running complete_authors.py --plot provides some useful plots of the demographic and post distributions of the dataset.
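As a rough illustration of the kind of per-word comparison compare_spaces.py performs (the embedding loading format and exact distance metric used by the repository may differ), here is a cosine-distance sketch:

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def word_distances(user_vectors, generic_vectors):
    """Return {word: distance} for words present in both embedding spaces."""
    shared = set(user_vectors) & set(generic_vectors)
    return {w: cosine_distance(user_vectors[w], generic_vectors[w]) for w in shared}

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
user = {"coffee": rng.normal(size=100), "code": rng.normal(size=100)}
generic = {"coffee": rng.normal(size=100), "tea": rng.normal(size=100)}
print(word_distances(user, generic))
```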