To reproduce the figures and analysis in this paper:

- Run `query_semanticscholar.py` on the domain {CS, Chemistry, Economics, Medicine, Physics}, or on your own list of scientists
- Run `filter_by_year` using the birth/death dates file, with output to `abstracts_filtered_year`
- Clean the abstracts with `get_vectors.py`, with output to the `abstracts-cleaned` directory
- Encode the abstracts with SBERT via `sbert.py`, with output to `sbert-abstracts`
- Order the abstracts and convert their dates to timestamps via `emergence_order.py`, with output to the `abstracts-ordered` directory
- Run the models on the `abstracts-ordered` directory (see the end-to-end sketch after this list)
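As an illustration, the whole pipeline might be chained as below. This is a sketch under assumptions: the flag names, the `.py` extension on `filter_by_year`, and the dates filename are guesses, not the scripts' documented interfaces.

```bash
# Hypothetical end-to-end run for one field. All flag names, the dates
# filename, and any path not named in this README are assumptions;
# check each script's own argument parsing for the real interface.
python query_semanticscholar.py --field Physics --out abstracts/
python filter_by_year.py --dates birth_death_dates.csv \
    --input abstracts/ --output abstracts_filtered_year/
python get_vectors.py --input abstracts_filtered_year/ --output abstracts-cleaned/
python sbert.py --input abstracts-cleaned/ --output sbert-abstracts/
python emergence_order.py --input sbert-abstracts/ --output abstracts-ordered/
```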
Running models:

- Hyperparameters must be tuned for each model on each scientist: run `opt_hyperparam_exemplar.py` for each model/field combination, which outputs the individual parameter values to `/individual-s-vals/` (see the sketch after this list)
- Run the comparison between models: `src/models/predict.py --type <nobel/turing> --field <field> --measure ll -i`
- Run the shuffle tests between models: `src/models/predict.py --type <nobel/turing> --field <field> --measure ll -i -s --sy`
- Run the authorship analysis: `src/models/predict_k_author_papers.py --type <nobel/turing> --field <field> -k <max authors, or -1 for first author>`
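The tuning step can be scripted over every model/field combination. A minimal sketch, assuming hypothetical `--model`/`--field`/`--out` flags; only the `exemplar` model name is implied by the script's filename, so extend `MODELS` with the paper's other model names:

```bash
# Hypothetical sweep over model/field combinations; the flag names are
# assumptions, and MODELS should list the paper's actual model names.
MODELS="exemplar"
FIELDS="cs chemistry economics medicine physics"
for model in $MODELS; do
  for field in $FIELDS; do
    python opt_hyperparam_exemplar.py --model "$model" --field "$field" \
        --out individual-s-vals/
  done
done
```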
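For a concrete run, substitute values into the placeholders above (the capitalization of field names is an assumption):

```bash
# Compare models on Nobel-laureate physicists, scored by log-likelihood
python src/models/predict.py --type nobel --field physics --measure ll -i

# The same comparison with the shuffle tests enabled
python src/models/predict.py --type nobel --field physics --measure ll -i -s --sy
```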
Most figures are generated through functions in `rain_plots.py`, based on the simulation outputs produced by the "Running models" steps above.
The stacked authorship charts can be generated with `stacked_bar_authorship.py`.
t-SNE visualizations can be generated with `make_tsne_figure.py` (example invocations below).
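A sketch of the figure-generation calls, assuming the two scripts run standalone and read the simulation outputs from their default locations:

```bash
# Assumed standalone invocations; both scripts may instead take arguments
# pointing at the simulation outputs.
python stacked_bar_authorship.py   # stacked authorship charts
python make_tsne_figure.py         # t-SNE visualizations
```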