The full goodreads_reviews_young_adult.json dataset used in this project can be found here: https://mengtingwan.github.io/data/goodreads.html
Initial data labelling and preprocessing can be found in the /data folder. The folder needs to include goodreads_reviews_young_adult.json.
Most processing and training required large amounts of RAM and processing power, so they were run on an HPC (High Performance Computing) cluster. The HPC scripts cover all high-volume data processing and the training of embeddings using word2vec.
Cleans and preprocesses the entire dataset and compiles it into a single file.
Trains the classifier model on the split labelled dataset provided.
Computes additional statistics for the classifier at a given confidence.
Uses the trained classifier model to create a male and a female dataset from the entire dataset, keeping only high-confidence samples.
Merges the human-labelled and classified datasets into one using sets.
Skip-gram model. Trains the embeddings on a given dataset.
HPC script specifying the Python script and resources for the HPC.
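The labelling steps described above (keeping only high-confidence classifier predictions, then merging them with the human-labelled data using sets) can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: the function names, the `(review_id, label, confidence)` tuple layout, the 0.9 threshold, and the rule that human labels win on conflict are all assumptions made for the example.

```python
def high_confidence(predictions, threshold=0.9):
    """Split (review_id, label, confidence) tuples into male/female id sets.

    The tuple layout and the default threshold are hypothetical.
    """
    male, female = set(), set()
    for review_id, label, confidence in predictions:
        if confidence < threshold:
            continue  # discard low-confidence predictions
        if label == "male":
            male.add(review_id)
        elif label == "female":
            female.add(review_id)
    return male, female


def merge_with_human(human_male, human_female, clf_male, clf_female):
    """Merge human-labelled and classified id sets with set operations.

    Assumption: when a human label and a classifier label disagree,
    the human label is kept.
    """
    male = human_male | (clf_male - human_female)
    female = human_female | (clf_female - human_male)
    return male, female


# Toy data, invented for illustration only.
predictions = [
    ("r1", "male", 0.97),
    ("r2", "female", 0.55),
    ("r3", "female", 0.93),
    ("r4", "male", 0.91),
]
clf_male, clf_female = high_confidence(predictions)
male, female = merge_with_human({"r5"}, {"r4"}, clf_male, clf_female)
```

Using sets removes duplicate review ids automatically, and the set difference makes the conflict rule explicit in a single expression.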
SemAxis contains the SemAxis framework functions.
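For readers unfamiliar with SemAxis, the core idea can be sketched in a few lines: a semantic axis is the difference between the mean vectors of two sets of pole words, and a word is scored by its cosine similarity to that axis. The sketch below uses plain Python lists and an invented two-dimensional toy embedding; the function names are hypothetical and the functions in this repository may be organised differently.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_axis(embeddings, positive_words, negative_words):
    """Axis vector = mean(positive pole vectors) - mean(negative pole vectors)."""
    pos = mean_vector([embeddings[w] for w in positive_words])
    neg = mean_vector([embeddings[w] for w in negative_words])
    return [p - q for p, q in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def axis_score(embeddings, word, axis):
    """Score of a word on the axis: cosine similarity to the axis vector."""
    return cosine(embeddings[word], axis)


# Toy 2-D embeddings, invented for illustration only.
toy = {
    "good": [1.0, 0.0],
    "great": [0.8, 0.2],
    "bad": [-1.0, 0.0],
    "awful": [-0.8, -0.2],
    "lovely": [0.9, 0.1],
    "boring": [-0.7, 0.1],
}
axis = build_axis(toy, ["good", "great"], ["bad", "awful"])
score_lovely = axis_score(toy, "lovely", axis)  # points towards the positive pole
score_boring = axis_score(toy, "boring", axis)  # points towards the negative pole
```

Scores lie in the range [-1, 1]; a positive score means the word leans towards the first set of pole words and a negative score towards the second.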
Plots and analysis loads the embedding models that were trained on the HPC and uses the SemAxis framework to define analysis functions and create plots using the axis words found in the /axes folder.
The pretrained models from our analysis can be found in the /models folder, except the model for the entire dataset, which requires an additional file, "full_young.wordvectors.vectors.npy", which can be downloaded here: https://drive.google.com/file/d/1sTgbDlwCm7_DjXyJ2nmaqDpg65p457f8/view?usp=drive_link