- Grace Connors
- Govind Singhal
- Damian Calabresi
The topic we’ve selected is how best to measure “closeness” between two descriptive data objects. Specifically, we will explore how to organize and compare the research profiles of the University of Maryland’s (UMD) Data Science and Computer Science professors. Our interest in this investigation came from our review of Dijkstra's Algorithm. This foundational shortest-path algorithm was originally concerned with the physical distance between nodes, but more broadly it concerns how best to represent the user's desired association. Within this looser frame of “closeness,” we hope to discover and apply the most efficient method for students to list the professors most relevant to their research interests and the shortest path to collaborate with them.
The relationship between researchers, as well as the relationship between research interests, is represented as an undirected weighted graph.
The strength of the relationship between two researchers, or between two research interests, is based on the number of publications the two entities have in common: the more publications shared, the shorter the distance between the two entities.
The distance between two vertices is represented by 1 / (number of publications shared); the precise formula is defined in the implementation section.
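As a minimal sketch, this edge weight can be expressed as a one-line function (the name `distance` is ours, not from the codebase):

```python
def distance(shared_publications: int) -> float:
    """Edge weight between two entities: more shared publications, shorter distance."""
    return 1 / shared_publications
```

For example, two researchers with four shared publications sit at distance 0.25, while a single shared publication gives distance 1.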
Dijkstra's Algorithm is a graph search algorithm that finds the shortest path from a starting node to all other nodes in the graph. Given that it runs in O((V + E) log V) time with a binary-heap priority queue, it is a good fit for our problem.
The algorithm is powerful enough to provide the shortest path between two researchers; given a set of researchers already connected to a student, this path represents the most promising way to connect with a new researcher. Similarly, the algorithm will be leveraged to find the shortest path, and hence the minimal distance, between two research interests.
During the data collection phase, BFS will be used to build the graph from a starting set of vertices: the University of Maryland professors with the highest number of publications.
The OpenAlex API is a RESTful API that provides access to a database of scholarly articles, books, and other scholarly works. This is the easiest way to programmatically access a dataset of academic publications and research. Starting from the previously mentioned faculty index, we can search for the publications of the professors, extract co-authors and keywords, and continue building the graph.
The API is limited to 100,000 requests per day and 10 requests per second, which could be exceeded by the number of requests needed to build the graph. For this reason, the graph will be limited to authors and works affiliated with the University of Maryland, College Park.
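A sketch of how such a request could be assembled. The institution ID here is a placeholder, and the exact filter key should be checked against the OpenAlex documentation:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org/authors"

# Placeholder institution ID; the real OpenAlex ID for UMD College Park
# must be looked up first. The filter key name may vary by API version.
params = {
    "filter": "last_known_institutions.id:I00000000",
    "sort": "works_count:desc",
    "per-page": 100,
}
url = f"{BASE_URL}?{urlencode(params)}"
```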
- Python 3.10
- Poetry
The Python code will make use of the Poetry package manager to manage the dependencies properly.
In order to run the code, the Authors, Works, and Topics files need to be downloaded from the OpenAlex API.
The API is limited to 100,000 requests per day and 10 requests per second; for this reason we will download the lists of Authors, Works, and Topics affiliated with the University of Maryland at once, and then correlate the data to build the graph.
Download the Authors associated with the University of Maryland from the OpenAlex API, using the last_known_institution_id field to filter them. The affiliations field would give us many Authors who have no relation to the university but were credited on a Work associated with it.
The top 100 authors with the highest number of publications will be used as an initial set of vertices to build the graph.
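If the full author list is already downloaded, selecting this seed set is a simple sort (a sketch; `works_count` matches the field kept below):

```python
def top_authors(authors: list[dict], n: int = 100) -> list[dict]:
    """Return the n authors with the most publications as initial BFS vertices."""
    return sorted(authors, key=lambda a: a["works_count"], reverse=True)[:n]
```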
For each Author, we will only keep the following fields:
- id
- orcid
- display_name
- display_name_alternatives (Comma separated list of alternative names)
- works_count (Number of works associated with the Author)
- topics.id (Comma separated list of topics associated with the Author)
Store all these Authors in a CSV file.
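A sketch of the CSV export, assuming each Author is a dict carrying the fields above (extra fields returned by the API are dropped; the flattened column names are our assumption):

```python
import csv

# The kept fields listed above, flattened for CSV storage.
AUTHOR_FIELDS = ["id", "orcid", "display_name", "display_name_alternatives",
                 "works_count", "topics_id"]

def save_authors(authors, csv_file):
    """Write only the kept Author fields to an open CSV file object."""
    writer = csv.DictWriter(csv_file, fieldnames=AUTHOR_FIELDS,
                            extrasaction="ignore")
    writer.writeheader()
    writer.writerows(authors)
```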
For each Author, we will download the Works associated with the Author. The Work (or publication) contains a list of Authors with their affiliations. If one of these Authors is also affiliated with the University of Maryland, the two are considered co-authors and a new edge is created in the graph.
Starting from the top 100 authors, we will use BFS to build the graph. The graph is built based on the following rules:
- An Author is related to a series of Works
- Get the list of Works for the Author
- Filtering by Affiliations, get the list of co-authors
- If a co-author is repeated, the strength of the edge is the number of publications shared between the two Authors
- If a co-author is not repeated, the strength of the edge is 1
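The edge-strength rules above can be sketched as a counter over shared Works (the input shape is our assumption: each Work carries its UMD-affiliated co-author IDs):

```python
from collections import Counter

def edge_strengths(author_id, works):
    """Count how many Works author_id shares with each co-author.

    `works` is assumed to look like [{"id": "W1", "coauthors": ["A2", "A3"]}, ...],
    already filtered to UMD-affiliated authors.
    """
    strengths = Counter()
    for work in works:
        for coauthor in work["coauthors"]:
            if coauthor != author_id:
                strengths[coauthor] += 1   # repeated co-author -> stronger edge
    return strengths
```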
- Store the edges in a CSV file.
- Store the new Authors (If not discovered) in a CSV file.
- By storing the initial set of Authors in a queue, we can start the BFS algorithm.
- Get an Author from the Queue
- Get the list of co-authors (Through the Works entity)
- If the co-author is in the CSV file, it's already discovered; otherwise add it to the CSV file and to the queue
- Repeat the process until the queue is empty
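The loop above can be sketched as follows; `fetch_coauthors` stands in for the OpenAlex calls, and the CSV-backed HashMap is mirrored by an in-memory set:

```python
from collections import deque

def bfs_build(seed_ids, fetch_coauthors, max_authors=200_000):
    """BFS over the co-authorship graph.

    fetch_coauthors maps an author ID to the IDs of UMD-affiliated
    co-authors found on that author's Works.
    """
    discovered = set(seed_ids)      # mirrors the CSV/HashMap lookup
    queue = deque(seed_ids)
    edges = []
    while queue and len(discovered) < max_authors:
        author = queue.popleft()
        for coauthor in fetch_coauthors(author):
            edges.append((author, coauthor))    # stored whether new or not
            if coauthor not in discovered:
                discovered.add(coauthor)
                queue.append(coauthor)
    return discovered, edges
```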
The vertices of the graph are the Authors, and the edges are the co-authorship.
- Create the Graph from the CSV files (Edges and Vertices)
- Create HashMap for lookup of Authors
- Run Dijkstra's Algorithm to find the shortest path between two Authors
All the code is located in the Poetry package researchConnector. In order to execute the multiple scripts within this package, one of the following commands can be used:
poetry run python -m researchconnector
or:
poetry run python -c "import researchconnector; researchconnector.download_authors()"
We use a HashMap to store the Authors discovered and already stored in the CSV file. The key is the Author's id, and the value is the Author's object.
The HashMap is created at the beginning of the BFS process from the CSV file and maintained throughout the BFS process.
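A minimal sketch of that lookup, built from the rows of the authors CSV (any iterable of CSV lines works):

```python
import csv

def load_author_index(csv_lines):
    """id -> Author row lookup, giving O(1) 'already discovered?' checks."""
    return {row["id"]: row for row in csv.DictReader(csv_lines)}
```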
We use a Queue to store the Authors to be discovered. The queue is implemented as a linked list. Only the Author ID is stored in the queue.
The queue is created in-memory during the BFS process.
We use a Graph to store the co-authorship relationships between the Authors. The graph is implemented as an adjacency list.
The Graph is created in-memory from the CSV files before running Dijkstra's Algorithm.
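A sketch of that adjacency list, assuming the weight CSV yields (author, co-author, distance) triples:

```python
from collections import defaultdict

def build_graph(weighted_edges):
    """node -> list of (neighbor, distance); both directions, graph is undirected."""
    graph = defaultdict(list)
    for a, b, dist in weighted_edges:
        graph[a].append((b, dist))
        graph[b].append((a, dist))
    return graph
```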
The BFS is done in a hybrid implementation that calls the OpenAlex API to get the list of Works, then identifies the co-authors from the University of Maryland, and finally enqueues them.
To see if a node has been "discovered", we use the HashMap to check if the Author ID is in the CSV. If it is, the node has been discovered.
Dijkstra's Algorithm is implemented with a priority queue; the priority queue is implemented as a binary heap.
Dijkstra's Algorithm is run on the Graph to find the shortest path between two Authors.
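A self-contained sketch of that setup, using Python's heapq as the binary heap; the graph is assumed to be a dict mapping each node to (neighbor, distance) pairs:

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest path using a binary-heap priority queue.

    graph: node -> list of (neighbor, distance) pairs.
    Returns (total distance, path), or (float('inf'), []) if unreachable.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue        # stale heap entry
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, []):
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]
```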
This module calls the OpenAlex API, filtering by the last known institution ID. It orders the authors by number of publications to retrieve the top 100, then stores them in a CSV file.
This module loads the initial 100 authors into a queue and a hashmap. The queue is used for the BFS algorithm, the hashmap is used to verify if an author has been discovered.
When a new author is discovered, the author is added to the queue and the hashmap.
Whether or not the author was already discovered, the co-authorship relationship is stored in a CSV file. The strength of the relationship between two authors will be calculated later.
The BFS process is stopped when 200,000 authors have been discovered.
This module loads the file containing the co-authorship relationships between authors. It then calculates the distance between two authors based on the number of publications they have in common, counting distinct Work IDs and taking into account that the relationship is bidirectional.
To implement this calculation, a nesting of HashMaps is used. The first HashMap stores the first author and ensures it is unique; the second HashMap stores the second author and verifies it is unique too; a third HashMap stores the Work ID. At the end, the algorithm navigates through the HashMaps and calculates the number of unique works between the two authors.
As the graph is bidirectional, we take the two authors in the edges CSV file and order them by ID, to ensure that the same relationship is not counted twice.
Finally, it calculates the distance as 1 / (number of publications shared) and stores the results in a CSV file.
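The same deduplication can be sketched with a map keyed by the ordered author pair, holding a set of Work IDs (equivalent to the nested HashMaps described above; the names are ours):

```python
from collections import defaultdict

def compute_weights(edge_rows):
    """(author, co-author, work_id) rows -> distance per unique author pair."""
    shared = defaultdict(set)
    for a, b, work_id in edge_rows:
        pair = tuple(sorted((a, b)))   # canonical order: count each pair once
        shared[pair].add(work_id)      # a set keeps Work IDs unique
    return {pair: 1 / len(works) for pair, works in shared.items()}
```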
This module loads the files containing all the authors and relationships and calculates the shortest path between two authors.
This module contains a simple version and one "enriched" version.
The enriched version uses Dijkstra's Algorithm to find the shortest path between two authors and then processes the list of Works/Publications shared between each pair of authors on the path. For each Work, it retrieves the name from OpenAlex and prints it to the console.
This module loads the author and weight files and generates a graph visualization with the Pyvis library. Then it saves the HTML file to the data directory.
- data/initial_authors.csv: Contains the initial 100 authors.
- data/all_authors.csv: Contains all the authors discovered during the BFS process.
- data/all_edges.csv: Contains all relationships between two authors and the work they are related by.
- data/all_weights.csv: Contains the calculated edges and distances between the authors.
- data/graph_visualization.html: Contains the HTML visualization of the graph.