Note: This documentation was reviewed and edited with assistance from an LLM. Please verify critical details against the source code.
To make interpretable and convenient drug-target interaction predictions from a knowledge graph, we implemented a graph neural network (GNN). We use an unconventional data encoding, however, that avoids the need to compare drug and protein embeddings.
We reason that, in most biological knowledge graphs, rational and explainable drug-target predictions should be based on the specific path(s) from drug to target, which may pass through many intermediary relation types. This differs from traditional GNN prediction tasks, which typically recognize local structure or patterns predictive of the outcome. For example, a plausible path that is predictive of a drug-target interaction (DTI) is:
drug -associated-with-> disease -associated-with-> protein or drug -targets-> protein -is-similar-> protein
One goal of this path-based GNN modeling is to enable explainable predictions that users can interpret rationally. Notably, a DTI is unlikely to be predictable from a single path, since multiple lines of corroborating evidence lend more confidence. We therefore do not limit the GNN to a single path; instead, we allow many paths to contribute to the prediction of a DTI, which we model as a diffusion process from drug to proteins.
All nodes in the input graph are encoded with a zero input value, except for the drug whose targets we wish to predict (designated src). We then use graph convolutions to create node embeddings, which we reason is analogous to a diffusion process from the nonzero src node to all potential protein targets. To further encourage this diffusion-like process, we use node-level normalization schemes and omit bias terms from the convolutions.
The result of a forward pass is a set of drug-specific protein embeddings, generated by a diffusion-like process (modeled by graph convolutions) from drug->protein. This approach removes the need to make link predictions using both source and target node embeddings and instead formulates the task as a node-prediction problem. The notable limitation is that we can only predict DTIs for a single drug per forward pass.
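The encoding described above can be illustrated with a small, dependency-free sketch. It is not the project's model: the function name `diffuse`, the toy graph, and the use of mean aggregation with self-loops in place of learned graph convolutions are all assumptions made for illustration. It shows why a one-hot source with bias-free layers behaves like diffusion: signal reaches only nodes connected to src, and nodes with no path from src stay at exactly zero.

```python
def diffuse(num_nodes, edges, src, num_layers):
    """Propagate a one-hot signal from `src` along directed edges.

    Each layer is a bias-free, mean-normalized aggregation over incoming
    edges plus a self-loop (a stand-in for a learned graph convolution).
    """
    # In-neighbour lists, with a self-loop so each node keeps its own signal.
    incoming = {n: [n] for n in range(num_nodes)}
    for u, v in edges:
        incoming[v].append(u)

    # Zero input everywhere except the source drug.
    x = [0.0] * num_nodes
    x[src] = 1.0

    for _ in range(num_layers):
        x = [
            sum(x[u] for u in incoming[v]) / len(incoming[v])
            for v in range(num_nodes)
        ]
    return x


# Toy graph: 0 = drug (src), 1 = disease, 2 = protein A, 3 = protein B (no path from src)
edges = [(0, 1), (1, 2)]
one_layer = diffuse(4, edges, src=0, num_layers=1)
two_layers = diffuse(4, edges, src=0, num_layers=2)
```

In this toy example protein A only receives signal once the number of layers matches its distance from the drug, and protein B never does. A nonzero bias term would break that property by giving every node a nonzero value regardless of its connection to src.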
Importantly, the knowledge graph includes all training DTIs, so every training observation has a direct and simple route from drug->protein, which is likely to make prediction trivial and non-generalizable to unseen DTIs. To ensure that the GNN learns to predict from alternative routes in the KG, for a given DTI training observation (drug_i -> protein_j) we remove that DTI from the knowledge graph for the forward pass. This works well for generalization to unseen DTIs; however, at evaluation time, predictions for training DTIs made without edge removal are untrustworthy. This caveat does not apply to novel or held-out DTIs, since they are not in the KG to begin with.
In this framework, predicting all new DTIs requires N_drugs forward passes (assuming no batching), where each forward pass computes all predicted DTIs for a single drug. This is tractable and convenient. However, as noted in "Graph augmentation during training", training edges that are present in the knowledge graph receive near-zero predicted probabilities, so all training edges appear negative. Accurately inferring training-partition edges requires removing each training edge from the knowledge graph before predicting it. In practice this is rarely necessary, as we usually do not care about characterizing the probabilities of known edges. This framework enables tractable prediction of novel DTIs because it scales with the number of drugs in the knowledge graph rather than the number of possible drug -> protein edges.
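A minimal sketch of the per-observation edge removal described above, assuming the knowledge graph is held as a list of (head, relation, tail) triples. The helper name `remove_target_edge`, the triple layout, and the relation label "targets" are illustrative assumptions, not the project's actual data structures.

```python
def remove_target_edge(edges, drug, protein, relation="targets"):
    """Return a copy of the KG edge list without the (drug, relation, protein) triple."""
    return [e for e in edges if e != (drug, relation, protein)]


kg = [
    ("drug_i", "targets", "protein_j"),
    ("drug_i", "associated-with", "disease_a"),
    ("disease_a", "associated-with", "protein_j"),
]

# Graph used for the forward pass of the training observation drug_i -> protein_j.
train_graph = remove_target_edge(kg, "drug_i", "protein_j")
```

The original list is left unchanged, so each observation in a batch can derive its own graph from the same knowledge graph. The indirect route through disease_a remains available for the model to learn from.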
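The inference procedure above might look roughly like the following loop. This is a sketch under stated assumptions: `score_fn` is a hypothetical stand-in for a model forward pass that returns one score per protein, and `predict_novel_dtis` and the dummy scorer are invented for illustration.

```python
def predict_novel_dtis(drugs, proteins, known_dtis, score_fn):
    """Run one scoring pass per drug and keep only pairs absent from the KG.

    `score_fn(drug)` is a hypothetical stand-in for a forward pass; it must
    return one score per protein, in the order given by `proteins`.
    """
    predictions = {}
    for drug in drugs:                       # N_drugs passes in total
        scores = score_fn(drug)              # scores for every protein at once
        for protein, score in zip(proteins, scores):
            if (drug, protein) in known_dtis:
                continue                     # unreliable without edge removal
            predictions[(drug, protein)] = score
    return predictions


# Dummy scorer for illustration only; it also records how often it is called.
calls = []

def dummy_score(drug):
    calls.append(drug)
    return [0.1, 0.9, 0.4]


drugs = ["d1", "d2"]
proteins = ["p1", "p2", "p3"]
known = {("d1", "p2")}
preds = predict_novel_dtis(drugs, proteins, known, dummy_score)
```

Here two passes cover all six drug-protein pairs, and the one known training edge is skipped rather than reported with a misleading score.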
Using previously established methods such as GNNExplainer, we can identify the links that are critically involved in the prediction of a given observation. In practice, we found that a reinforcement learning implementation of the GNNExplainer premise works tractably and repeatably with our methods. This is also convenient because it enables binary selection of edges rather than continuous edge weights, which yields more rational subgraph explanations. Additionally, the policy represents per-edge probabilities, which we can use as importance scores, giving a convenient edge-specific explanation. In some cases it is useful to compute path weights by summing edge scores over a given path, which can give more intuitive insight into which paths are integral to a DTI prediction.
NOTE: GATv2 is not compatible with GNNExplainer.
While we can conveniently batch observations, each observation requires its own copy of the knowledge graph, so training consumes significant memory. In practice, this limits the batch size to fewer than 10. Additionally, large models (e.g., GAT with channels > 64) are usually prohibitively expensive (>24GB VRAM). Future work could mitigate this issue by training on subgraphs of the full knowledge graph.
- Currently, every triple (drug-target link) is treated as a unique observation. This could be improved by aggregating all DTIs that involve a given drug and setting multiple nonzero targets (y values). It is unclear how much of an impact this would have on training, as there are only ~1000 observations and ~700 drugs, suggesting that most drugs have only 1-2 known targets.
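The path-weight idea can be sketched as follows, assuming edge importance scores are available as a dictionary keyed by (source, target). The function `score_paths`, the `max_len` cutoff, and the example scores are illustrative assumptions rather than output from the explainer.

```python
def score_paths(edge_scores, src, dst, max_len=3):
    """Enumerate simple paths from src to dst and sum edge scores along each.

    Returns (path, total) pairs sorted from highest to lowest total.
    `max_len` caps the number of edges in a path.
    """
    adjacency = {}
    for (u, v), score in edge_scores.items():
        adjacency.setdefault(u, []).append((v, score))

    results = []

    def walk(node, path, total):
        if node == dst:
            results.append((path, total))
            return
        if len(path) - 1 >= max_len:         # path already uses max_len edges
            return
        for nxt, score in adjacency.get(node, []):
            if nxt not in path:              # keep paths simple (no revisits)
                walk(nxt, path + [nxt], total + score)

    walk(src, [src], 0.0)
    return sorted(results, key=lambda r: r[1], reverse=True)


# Made-up importance scores for illustration only.
edge_scores = {
    ("drug", "disease"): 0.9,
    ("disease", "protein"): 0.8,
    ("drug", "protein_b"): 0.3,
    ("protein_b", "protein"): 0.2,
}
ranked = score_paths(edge_scores, "drug", "protein")
```

The first entry of `ranked` is the path with the largest summed score, which in this toy example is the route through the disease node.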
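The aggregation suggested above could be prototyped as below: all known targets of a drug are collected into a single multi-hot target vector, giving one observation per drug instead of one per triple. The function name `multi_hot_targets` and the toy identifiers are assumptions for illustration.

```python
from collections import defaultdict


def multi_hot_targets(dtis, proteins):
    """Group (drug, protein) pairs into one multi-hot target vector per drug."""
    index = {p: i for i, p in enumerate(proteins)}
    y = defaultdict(lambda: [0.0] * len(proteins))
    for drug, protein in dtis:
        y[drug][index[protein]] = 1.0
    return dict(y)


dtis = [("d1", "p1"), ("d1", "p3"), ("d2", "p2")]
proteins = ["p1", "p2", "p3"]
y = multi_hot_targets(dtis, proteins)
```

With this grouping the three example triples collapse into two observations, one per drug.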