COMP550_project

The full goodreads_reviews_young_adult.json dataset used for this project can be found here: https://mengtingwan.github.io/data/goodreads.html

Initial labelling, stats and loading of data

The initial data labelling and preprocessing can be found in the /data folder. The folder needs to include goodreads_reviews_young_adult.json.

HPC Scripts

Most processing and training required large amounts of RAM and processing power, so they were run on an HPC (High Performance Computing) cluster. The HPC scripts cover all high-volume data processing and the training of embeddings using word2vec.

clean_to_pickle.py

Cleans and preprocesses the entire dataset and compiles it into a single file.
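
As a rough illustration of the clean-and-compile step, the sketch below reads a JSON-lines file, normalises each review's text, and pickles the result into a single file. The `review_text` and `review_id` field names, the cleaning rules, and the function names are assumptions made for this example; the actual script may work differently.

```python
import json
import pickle
import re


def clean_text(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_to_pickle(json_path, pickle_path):
    """Clean every review in a JSON-lines file and store them in one pickle."""
    cleaned = []
    with open(json_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # "review_text" and "review_id" are assumed field names.
            text = clean_text(record.get("review_text", ""))
            if text:  # skip reviews that are empty after cleaning
                cleaned.append({"review_id": record.get("review_id"), "text": text})
    with open(pickle_path, "wb") as out:
        pickle.dump(cleaned, out)
    return len(cleaned)
```

Reading the input line by line keeps only the cleaned output in memory, which matters for a file of this size.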

Cluster_review.py

Trains the classifier model on the split labelled dataset provided.

evalute classifier.py

Computes additional statistics for the classifier at a given confidence level.
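
One way to read "with a given confidence" is to evaluate only the predictions whose top class probability reaches a threshold. The sketch below reports coverage and accuracy under that assumption. The function name, the input format, and the threshold semantics are hypothetical and not taken from the script.

```python
def accuracy_at_confidence(true_labels, probabilities, threshold):
    """Return (coverage, accuracy) over predictions with confidence >= threshold.

    probabilities: one dict per sample mapping label -> predicted probability.
    """
    kept = correct = 0
    for truth, probs in zip(true_labels, probabilities):
        # The predicted label is the one with the highest probability.
        label, confidence = max(probs.items(), key=lambda item: item[1])
        if confidence >= threshold:
            kept += 1
            correct += label == truth
    coverage = kept / len(true_labels) if true_labels else 0.0
    accuracy = correct / kept if kept else 0.0
    return coverage, accuracy
```

Raising the threshold typically trades coverage for accuracy, which is the trade-off a confidence-based evaluation is meant to expose.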

gender_split_data.py

Uses the trained classifier model to create a male and a female dataset from the entire dataset, keeping only high-confidence samples.
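
A minimal sketch of a high-confidence split, assuming the classifier outputs a probability of the "male" class for each review. The 0.9 default threshold and the function name are illustrative assumptions, not values from the project.

```python
def split_by_confidence(reviews, probabilities, threshold=0.9):
    """Assign each review to the male or female set if the classifier is confident.

    probabilities: P(male) for each review; P(female) is taken as 1 - P(male).
    Reviews whose confidence falls below the threshold are discarded.
    """
    male, female = [], []
    for review, p_male in zip(reviews, probabilities):
        if p_male >= threshold:
            male.append(review)
        elif 1.0 - p_male >= threshold:
            female.append(review)
    return male, female
```

Discarding the uncertain middle keeps label noise out of the two datasets at the cost of fewer samples.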

merge_dataset.ipynb

Merges the human-labelled and classifier-labelled datasets into one using sets.
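
The set-based merge can be pictured as a union that drops reviews present in both sources. The sketch below assumes each dataset is a collection of review texts; the function name and data shape are hypothetical.

```python
def merge_datasets(human_labelled, classified):
    """Union two collections of review texts, dropping duplicates."""
    merged = set(human_labelled) | set(classified)
    return sorted(merged)  # sorted only to make the output order reproducible
```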

word2vec.py

Skip-gram model. Trains the embeddings on a given dataset.
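
Skip-gram learns embeddings by predicting the context words around each center word. The sketch below shows only how (center, context) training pairs are formed for a given window size. It does not train anything and is not the repository's word2vec.py. The function name and the default window are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within the window."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, tokens[j]
```

For example, `list(skipgram_pairs(["i", "love", "books"], window=1))` pairs each word with its immediate neighbours only.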

submit.sh

HPC job script specifying the Python script to run and the resources to request from the HPC cluster.

Analysis and SemAxis

semaxis.py contains the SemAxis framework functions.

plots_and_analysis.ipynb loads the embedding models that were trained on the HPC cluster and uses the SemAxis framework to build analysis functions and create plots, using the axis words found in the /axes folder.
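
For orientation, SemAxis defines a semantic axis as the difference between the mean vectors of two opposing sets of pole words, and scores a word by the cosine similarity between its vector and that axis. The sketch below implements that idea on plain Python lists. The function names and the dictionary-based embedding format are assumptions for illustration and do not reflect the interface of semaxis.py.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def semantic_axis(embeddings, positive_words, negative_words):
    """Axis vector: mean of the positive pole minus mean of the negative pole."""
    pos = mean_vector([embeddings[w] for w in positive_words])
    neg = mean_vector([embeddings[w] for w in negative_words])
    return [p - n for p, n in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def axis_score(embeddings, word, axis):
    """Score in [-1, 1]; positive values lean towards the positive pole."""
    return cosine(embeddings[word], axis)
```

Comparing the score of the same word on the same axis across two embedding models is one way such a framework can be used to contrast datasets.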

The pretrained models from our analysis can be found in the /models folder, except for the model trained on the entire dataset, which requires an additional file, "full_young.wordvectors.vectors.npy", that can be downloaded here: https://drive.google.com/file/d/1sTgbDlwCm7_DjXyJ2nmaqDpg65p457f8/view?usp=drive_link

