Heterogeneous Graph Transformer for learning Compound-Ortholog links from the Kyoto Encyclopedia of Genes and Genomes (KEGG)
This repository contains code for building a knowledge graph from the KEGG database, with features for KEGG Orthologs (KOs) and compounds, and for training a Heterogeneous Graph Transformer (HGT) model on it.
Create your conda environment:
conda env create --name YOUR_ENV_NAME --file environment.yml
conda activate YOUR_ENV_NAME
Install the package:
pip install -e .
To run the code as is:
cd scripts
bash run
scripts/run shows how to invoke run.py with your own parameters.
Note: Make sure git-lfs (Git Large File Storage) is installed. The file data/embeddings/prok_esm650_ko_emb.csv is larger than 100 MB and is required to build the KG.
The idea of this project is to build a knowledge graph (KG) from a subset of the data types in the KEGG database and to evaluate a heterogeneous GNN on a link prediction task between two node types. Specifically, we learn to predict links by optimizing the node embeddings of each type so that they accurately capture compound-ortholog relationships.
The graph has two types of nodes: compounds, which are represented by MACCS structural keys from the PubChem database, and KEGG Orthologs (KOs), which are represented as dense vectors obtained by mean-pooling protein embeddings from the ESM-2 transformer-based protein language model. A KO represents a family (or set) of proteins that are related by function. To represent each KO compactly, we average the embeddings of the proteins within the KO, producing a single vector per KO. The subgraph from our graph shown below illustrates this idea.
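As a complementary illustration, here is a minimal sketch of how these two feature sets could be assembled into a PyTorch Geometric HeteroData object. The compound feature file name, the edge type name, and the toy edge list are assumptions for illustration; only the KO embedding CSV path comes from the repository.

```python
import pandas as pd
import torch
from torch_geometric.data import HeteroData

# KO features: mean-pooled ESM-2 embeddings, one row per KO
# (this is the file mentioned in the git-lfs note above).
ko_emb = pd.read_csv("data/embeddings/prok_esm650_ko_emb.csv", index_col=0)

# Compound features: MACCS structural keys, one row per compound
# (this file name is a hypothetical placeholder).
maccs = pd.read_csv("data/embeddings/compound_maccs_keys.csv", index_col=0)

data = HeteroData()
data["ko"].x = torch.tensor(ko_emb.values, dtype=torch.float)        # [num_kos, esm_dim]
data["compound"].x = torch.tensor(maccs.values, dtype=torch.float)   # [num_compounds, maccs_dim]

# Compound-KO edges as index pairs into the two feature matrices;
# the real edge list comes from KEGG (construction omitted here).
edge_index = torch.tensor([[0, 1, 2], [5, 3, 0]], dtype=torch.long)  # toy example
data["compound", "interacts_with", "ko"].edge_index = edge_index
data["ko", "rev_interacts_with", "compound"].edge_index = edge_index.flip(0)
```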
The model is based on the Heterogeneous Graph Transformer (HGT). We use the algorithm as defined in the original paper, via its implementation in PyTorch Geometric, and add a link prediction head on top of the GNN to predict whether a compound and a KO are linked.
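A minimal sketch of such an architecture, assuming two HGTConv layers and a dot-product scoring head (the layer count, hidden size, and scoring function are assumptions, not necessarily the configuration used in this repository):

```python
import torch
from torch_geometric.nn import HGTConv, Linear

class HGTLinkPredictor(torch.nn.Module):
    # Sketch only: hyperparameters and the scoring head are assumptions.
    def __init__(self, metadata, hidden_channels=64, heads=2, num_layers=2):
        super().__init__()
        # Project each node type's raw features to a common hidden size
        # (lazy input dimension, inferred on the first forward pass).
        self.lin_in = torch.nn.ModuleDict({
            node_type: Linear(-1, hidden_channels) for node_type in metadata[0]
        })
        self.convs = torch.nn.ModuleList([
            HGTConv(hidden_channels, hidden_channels, metadata, heads=heads)
            for _ in range(num_layers)
        ])

    def encode(self, x_dict, edge_index_dict):
        x_dict = {nt: self.lin_in[nt](x).relu() for nt, x in x_dict.items()}
        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)
        return x_dict

    def decode(self, x_dict, edge_label_index):
        # Dot-product score for each candidate (compound, ko) pair.
        src = x_dict["compound"][edge_label_index[0]]
        dst = x_dict["ko"][edge_label_index[1]]
        return (src * dst).sum(dim=-1)

    def forward(self, x_dict, edge_index_dict, edge_label_index):
        return self.decode(self.encode(x_dict, edge_index_dict), edge_label_index)
```

With a HeteroData object `data` built as above, the model could be instantiated as `model = HGTLinkPredictor(data.metadata())` and trained with a binary cross-entropy loss over positive and sampled negative compound-KO pairs.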
We ran a series of experiments to evaluate the HGT link prediction model, sweeping over the number of attention heads and the hidden layer size. Each experiment was run for 10 epochs due to time and resource constraints.
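For illustration, a sweep of this kind could look like the sketch below, reusing the HGTLinkPredictor and data objects from the earlier sketches. The grid values, learning rate, and negative-sampling details are placeholders; the actual settings live in scripts/run and run.py.

```python
import itertools
import torch
import torch.nn.functional as F

# Hypothetical grid; the values actually swept are set in scripts/run.
heads_grid = [1, 2, 4]
hidden_grid = [32, 64, 128]

for heads, hidden in itertools.product(heads_grid, hidden_grid):
    model = HGTLinkPredictor(data.metadata(), hidden_channels=hidden, heads=heads)
    with torch.no_grad():  # initialize lazy input layers before building the optimizer
        model.encode(data.x_dict, data.edge_index_dict)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(10):  # 10 epochs per configuration, as in the experiments
        model.train()
        optimizer.zero_grad()
        # edge_label_index / edge_label hold positive and sampled negative
        # compound-KO pairs (their construction is omitted in this sketch).
        scores = model(data.x_dict, data.edge_index_dict, edge_label_index)
        loss = F.binary_cross_entropy_with_logits(scores, edge_label)
        loss.backward()
        optimizer.step()
```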
