Skip to content

govinds108/Research-Connector

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 

Repository files navigation

MSML612 - Group Project - Connecting UMD Researchers: Djikstra's Algorithm in Action

About

Team Members

  • Grace Connors
  • Govind Singhal
  • Damian Calabresi

Goals

The topic we’ve selected is how to best find “closeness” between two descriptive data objects. Specifically, we will explore how to best organize and compare the research profiles of University of Maryland’s (UMD) Data Science and Computer Science professors. Our interest in this investigative aim came from our review of Dijkstra's Algorithm. This foundational shortest path between nodes algorithm was concerned with the physical space between nodes, but more broadly, it concerns how to best represent the users desired association. Within this looser frame of “closeness” we will hope to discover and apply the most efficient method for students to list a series of professors that are most relevant to their research interests and the shortest path to collaborate with them.

Overview

Algorithms and Data Structures

Graphs

The relationship between researchers, as well as the relationship between research interests is represented as an undirected weighted graph.

The strength of the relationship between two researchers or between two research interests is based on the number of publications that the two entities have in common. This means, the higher number of publications shared, the shortest is the distance between the two entities.

The distance between two vertices is represented by 1 / (number of publications shared) (The precise formula is defined in the implementation section).

Djikstra's Algorithm

Dijkstra's Algorithm is a graph search algorithm that finds the shortest path from a starting node to all other nodes in the graph. Given it has a time complexity of O(V log V), it is a good fit for our problem.

The algorithm is complex enough to provide the shortest path between two researchers, being this path the most probably successful way to connect with a researcher given a series of researchers already connected to a student. Similarly, the algorithm will be leveraged to find the shortest path and hence minimal distance between two research interests.

BFS

During the data collection phase, BFS will be used to build the graph from a starting set of vertices, being these vertices the professors of the University of Maryland with the highest number of publications.

Data Collection

OpenAlex API

OpenAlex

The OpenAlex API is a RESTful API that provides access to a database of scholarly articles, books, and other scholarly works. This is the easiest way to programatically access a dataset of academic publications and research. Starting from the previously mentioned faculty index, we can search for the publications of the professors, extract co-authors and keywords, and continue building the graph.

OpenAlex API Documentation

The API is limited to 100,000 requests per day and 10 requests per second, which could be exceeded by the number of requests needed to build the graph. For this reason the graph will be limited to authors and works affiliated with the University of Maryland College Park.

Technical Requirements

Python

  • Python 3.10
  • Poetry

The Python code will make use of the Poetry package manager to manage the dependencies properly.

Data

In order to run the code, the Authors, Works, and Topics files need to be downloaded from the OpenAlex API.

OpenAlex API Documentation

The API is limited to 1000 requests per day and 10 requests per second, for this reason we will download the list of Authors, as well as the list of Works and Topics affiliated with the University of Maryland at once, and then correlate the data to build the graph.

Design

Data Collection

1. Download the Authors

Download the Authors associated with the University of Maryland from the OpenAlex API. Use the last_known_institution_id to filter the Authors. The affiliations field will give us many Authors which have no relation to the university but had Authorship on a Work associated with the university.

The top 100 authors with the highest number of publications will be used as an initial set of vertices to build the graph.

For each Author, we will only keep the following fields:

  • id
  • orcid
  • display_name
  • display_name_alternatives (Comma separated list of alternative names)
  • works_count (Number of works associated with the Author)
  • topics.id (Comma separated list of topics associated with the Author)

Store all these Authors in a CSV file.

2. Download the Works as Edges

For each Author, we will download the Works associated with the Author. The Work (Or publication) contains a list of Authors, with their affiliations. If one of the Authors is also affiliated with the University of Maryland, they're both considered co-authors and a new edge is created in the graph.

Starting from the top 100 authors, we will use BFS to build the graph. The graph is built based on the following rules:

  • An Author is related to a series of Works
  • Get the list of Works for the Author
  • Filtering by Affiliations, get the list of co-authors
  • If a co-author is repeated, the strength of the edge is the number of publications shared between the two Authors
  • If a co-author is not repeated, the strength of the edge is 1
  • Store the edges in a CSV file.
  • Store the new Authors (If not discovered) in a CSV file.

3. BFS

  • By storing the initial set of Authors in a queue, we can start the BFS algorithm.
  • Get an Author from the Queue
  • Get the list of co-authors (Through the Works entity)
  • If the co-author is in the CSV file, it's discovered, otherwise add it to the CSV file and to the queue
  • Repeat the process until the queue is empty

Graph Building

The vertices of the graph are the Authors, and the edges are the co-authorship.

  • Create the Graph from the CSV files (Edges and Vertices)
  • Create HashMap for lookup of Authors
  • Run Djikstra's Algorithm to find the shortest path between two Authors

Implementation

Package

All the code is located in the Poetry package researchConnector. In order to execute the multiple scripts within this package, the following command can be used:

poetry run python -m researchconnector

Or

poetry run python -c "import researchconnector; researchconnector.download_authors()"

ADTs

Hash Map

We use a HashMap to store the Authors discovered and already stored in the CSV file. The key is the Author's id, and the value is the Author's object.

The HashMap is created at the beginning of the BFS process based on the CSV file. and maintained during the BFS process.

Queue

We use a Queue to store the Authors to be discovered. The queue is implemented as a linked list. Only the Author ID is stored in the queue.

The queue is created in-memory during the BFS process.

Graph

We use a Graph to store the co-authorship relationships between the Authors. The graph is implemented as an adjacency list.

The Graph is created in-memory from the CSV files before running the Djikstra's Algorithm.

Algorithms

BFS

The BFS is done in a hybrid implementation with calls the OpenAlex API to get the list of Works, the identifies the co-authors from the University of Maryland, and then enqueues the co-authors.

To see if a node has been "discovered", we use the HashMap to check if the Author ID is in the CSV. If it is, the node has been discovered.

Djikstra's Algorithm

The Djikstra's Algorithm is implemented as a priority queue. The priority queue is implemented as a binary heap.

The Djikstra's Algorithm is run on the Graph to find the shortest path between two Authors.

Modules

download_initial_authors

This module calls the OpenAlex API filtering by the last known institution id. It orders the authors by number of publications, to retrieve the top 100 authors. It then stores the Authors in a CSV file.

bfs

This module loads the initial 100 authors into a queue and a hashmap. The queue is used for the BFS algorithm, the hashmap is used to verify if an author has been discovered.

When a new author is discovered, the author is added to the queue, and the hashmap.

Either if the author was already discovered or not, the co-authorship relationship is stored in a CSV file. The strength of the relationship between two authors will be calculated later.

The BFS process is stopped when 200,000 authors have been discovered.

calculate_weights

This module loads the file containing the co-authorship relationships between authors. It then calculates the distance between two authors based on the number of publications they have in common, considering different Work IDs and taking into account that the relationship is bidirectional.

To implement this calculation an anidation of HashMaps is used. The first HashMap is used to store the first author and ensure its unique, the second HashMap is used the second author and verify it's unique too. The same is done in a third HashMap to store the Work ID. At the end the algorithm navigates through the HashMaps and calculate the number of unique works between the two authors.

As the graph is bidirectional, I take the two authors in the edges CSV file and order them by ID, to ensure that the same relationship is not calculated twice.

At the end it calculates the distance as 1 / (number of publications shared) and stores the results in a CSV file.

shortest_path

Loads the file from all the authors and relationships and calculates the shortest path between two authors.

This module contains a simple version and one "enriched" version.

The enriched version uses the Djikstra's Algorithm to find the shortest path between two authors and then process the list of Works/Publications shared between each pair of authors. For each Work it retrieves the name from OpenAlex and prints it to the console.

visualize_graph

This module loads the author and weight files and generates a graph visualization with the Pyvis library. Then it saves the HTML file to the data directory.

Files

  • data/initial_authors.csv: Contains the initial 100 authors.
  • data/all_authors.csv: Contains all the authors discovered during the BFS process.
  • data/all_edges.csv: Contains all relationships between two authors and the work they are related by.
  • data/all_weights.csv: Contains the calculated edges and distances between the authors.
  • data/graph_visualization.html: Contains the HTML visualization of the graph.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors