This is the code repository for the paper titled "A dataset of mentorship in science with semantic and demographic estimations".
Linking Academic Family Tree (AFT) to Microsoft Academic Graph (MAG)
Code:
prepare_people.ipynb: Normalizing researcher profiles
prepare_connect.ipynb: Extracting mentor-mentee pairs
prepare_aft_to_mag_affil.ipynb: Matching institutions between AFT and MAG
prepare_aft_to_mag_author.ipynb: Linking AFT researchers to MAG authors
prepare_mag_authorship.ipynb: Exporting authorship and paper IDs
Validations:
prepare_validation_paper.ipynb: Matching validation papers with MAG
validate_aft_to_mag_author.ipynb: Validating AFT-to-MAG author matching
Vector:
prepare_paper_author_vector_tfidf.ipynb: TF-IDF vectors of papers and authors
To load TF-IDF vectors of papers:
from scipy.sparse import load_npz
paper_tfidf = load_npz('dataset/paper_tfidf.npz')
paper_tfidf.shape
paper_tfidf is a sparse matrix. Each row corresponds to a paper, with its ID given in the file paper_tfidf_MAGPaperID.txt.
Researchers' TF-IDF vectors can be loaded similarly.
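Because the matrix rows and the ID file have to be used together, a small helper can load both and check that they line up. The following is a minimal sketch, not part of the repository: the function names load_tfidf_with_ids and row_for_id are hypothetical, and it assumes the ID file holds one ID per line in the same order as the matrix rows.

```python
from scipy.sparse import load_npz


def load_tfidf_with_ids(npz_path, ids_path):
    """Load a sparse TF-IDF matrix and the ID list that labels its rows."""
    # Convert to CSR so that individual rows can be sliced out.
    matrix = load_npz(npz_path).tocsr()
    # Assumption: one ID per line, in the same order as the matrix rows.
    with open(ids_path) as fin:
        ids = [line.strip() for line in fin if line.strip()]
    if len(ids) != matrix.shape[0]:
        raise ValueError(
            f"{len(ids)} IDs but {matrix.shape[0]} rows; check the file pairing"
        )
    return matrix, ids


def row_for_id(matrix, ids, target_id):
    """Return the 1 x n_terms sparse row for a given ID."""
    # list.index is a linear scan; build a dict first for repeated lookups.
    return matrix[ids.index(str(target_id))]
```

For the paper vectors this could be called as load_tfidf_with_ids('dataset/paper_tfidf.npz', 'dataset/paper_tfidf_MAGPaperID.txt'), assuming the ID file sits in the same dataset directory as the matrix.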
SPECTER vectors of papers and researchers are stored as consecutive pickled (ID, vector) records and can be read one record at a time:
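import pickle

# Each record in the file is a separately pickled (ID, vector) pair.
with open('dataset/paper_specter_0.pkl', 'rb') as fin:
    unpickler = pickle.Unpickler(fin)
    while True:
        try:
            pubid, vec = unpickler.load()
            # process pubid and vec here
        except EOFError:
            # no more records in the file
            break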
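The read loop above can be wrapped in a generator so that callers can iterate over records or collect them into a dictionary. This is a minimal sketch under the same assumption as the loop, namely that the file contains consecutive pickled (ID, vector) pairs; the names iter_pickled_records and load_specter_vectors are hypothetical and do not come from the repository.

```python
import pickle


def iter_pickled_records(path):
    """Yield records from a file holding several consecutive pickles."""
    with open(path, "rb") as fin:
        unpickler = pickle.Unpickler(fin)
        while True:
            try:
                yield unpickler.load()
            except EOFError:
                # Raised when no further pickle remains in the file.
                return


def load_specter_vectors(path):
    """Collect (ID, vector) records into a dict keyed by ID."""
    return {pubid: vec for pubid, vec in iter_pickled_records(path)}
```

Iterating with iter_pickled_records('dataset/paper_specter_0.pkl') keeps only one record in memory at a time, while load_specter_vectors holds every vector at once and is only suitable when the file fits in memory.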