62 commits
d3c2643
Adding Bayesian to models, predictors and train.
EmaDulj Dec 10, 2023
56798e7
Adding Bayesian to models, predictors and train.
EmaDulj Dec 10, 2023
6a15c71
Adding Bayesian to active learning
EmaDulj Dec 10, 2023
3778655
Adding Bayesian to default
EmaDulj Dec 10, 2023
ce8f692
changes done during setup
DhanushkiMapitigama Dec 11, 2023
e78226e
Merge pull request #1 from DhanushkiMapitigama/setup-changes
DhanushkiMapitigama Dec 11, 2023
e194f8d
Merge branch 'bayesian_before_and_after_merge' into bayesian_before_a…
EmaDulj Dec 11, 2023
1fa57e2
Merge pull request #2 from DhanushkiMapitigama/bayesian_before_and_af…
EmaDulj Dec 11, 2023
442f24d
Adding Bayesain to one_hidden_drug.py
EmaDulj Dec 12, 2023
d118c63
Merge branch 'bayesian_before_and_after_merge' of github.com:EmaDulj/…
EmaDulj Dec 12, 2023
4bb0e68
Adding Bayesian shuffled predictor
EmaDulj Dec 12, 2023
fb75d6c
Adding Bayesian to default shuffled
EmaDulj Dec 12, 2023
9a0df5e
Adding Bayesian to one_hidden_shuffled.py
EmaDulj Dec 12, 2023
6bad5c9
Added Bayesian no permutation invariance MLP predictor
EmaDulj Dec 12, 2023
76b898c
Added Bayesian to model_evaluation_no_permut_invariance.py
EmaDulj Dec 12, 2023
ec2115d
Added Bayesian to one_hidden_drug_split_no_permut_invariance.py
EmaDulj Dec 12, 2023
d60069c
Added Bayesian to BilinearFilmMLPPredictor
EmaDulj Dec 13, 2023
f26befa
Added Bayesian to model_evaluation_multi_cell_line.py
EmaDulj Dec 13, 2023
1a9a3c0
Added Bayesian to model_evaluation_multi_cell_line_shuffled.py
EmaDulj Dec 13, 2023
21f3696
Added Bayesian to model_evaluation_multi_cell_line_no_permut_invarian…
EmaDulj Dec 13, 2023
144dc5e
Added Bayesian to cell_line_transfer.py
EmaDulj Dec 13, 2023
f4e69ea
Added Bayesian to cell_line_transfer_shuffled.py
EmaDulj Dec 13, 2023
355a60c
Added Bayesian to cell_line_transfer_no_permut_invariance.py
EmaDulj Dec 13, 2023
93d399e
Added Bayesian to more predictors
EmaDulj Dec 13, 2023
f5373c5
Merge pull request #5 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 13, 2023
3b15f12
Added optimal kl weight
EmaDulj Dec 14, 2023
d9a873d
Merge branch 'bayesian_before_and_after_merge' of github.com:EmaDulj/…
EmaDulj Dec 14, 2023
42fa05f
Merge pull request #9 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 14, 2023
ddaa0dd
Added Expected Improvement
EmaDulj Dec 14, 2023
7b605bd
Updated acquisition functions
EmaDulj Dec 14, 2023
afb2105
Added Project Infographics
EmaDulj Dec 14, 2023
f9699f0
Rename Project Infographics.png to ProjectInfographics.png
EmaDulj Dec 14, 2023
24faaf5
Update README.md
EmaDulj Dec 14, 2023
13554df
Merge pull request #12 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 14, 2023
71302c3
Added probability of improvement
EmaDulj Dec 15, 2023
020e1e6
Added realizations
EmaDulj Dec 15, 2023
4249a31
Added realizations to config
EmaDulj Dec 15, 2023
61e3178
Added realizations to config
EmaDulj Dec 15, 2023
b622c86
Added realizations to config
EmaDulj Dec 15, 2023
5a2c84c
Added realizations to config
EmaDulj Dec 15, 2023
3eb83b6
Added realizations to config
EmaDulj Dec 15, 2023
4322b32
Added realizations to config
EmaDulj Dec 15, 2023
a22b67f
Added realizations to config
EmaDulj Dec 15, 2023
664c054
Added realizations to config
EmaDulj Dec 15, 2023
9cefae4
Added realizations to config
EmaDulj Dec 15, 2023
d56a934
Added realizations to config
EmaDulj Dec 15, 2023
18abf16
Added realizations to config
EmaDulj Dec 15, 2023
d72149e
Added realizations to config
EmaDulj Dec 15, 2023
8b2a927
Added realizations to config
EmaDulj Dec 15, 2023
0f3c95e
Merge pull request #18 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 15, 2023
c6589f5
Update README.md
EmaDulj Dec 15, 2023
a551f15
Merge pull request #3 from DhanushkiMapitigama/bayesian_before_and_af…
EmaDulj Dec 15, 2023
02dbf57
Merge pull request #20 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 15, 2023
567bd1e
Fixed typos
EmaDulj Dec 27, 2023
f963fab
Merge pull request #27 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Dec 29, 2023
443cd46
Create data
EmaDulj Jan 8, 2024
9e7a666
Delete experiments/data
EmaDulj Jan 8, 2024
920bd77
File for getting stats from json files
EmaDulj Jan 8, 2024
21ea608
Create he
EmaDulj Jan 8, 2024
3fe2deb
Adding results for different configs
EmaDulj Jan 8, 2024
99b207a
Delete experiments/data/he
EmaDulj Jan 8, 2024
bbde401
Merge pull request #31 from EmaDulj/bayesian_before_and_after_merge
EmaDulj Jan 8, 2024
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
Reservoir
RayLogs
__pycache__
*pyc
*egg-info
.DS_Store
.DS_Store
31 changes: 11 additions & 20 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,23 +1,20 @@
# RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds *in vitro*
# Machine Learning Driven Candidate Compound Generation for Drug Repurposing
Based on RECOVER: sequential model optimization platform for combination drug repurposing identifies novel synergistic compounds *in vitro*
[![DOI](https://zenodo.org/badge/320327566.svg)](https://zenodo.org/badge/latestdoi/320327566)

RECOVER is a platform that can guide wet lab experiments to quickly discover synergistic drug combinations active
against a cancer cell line, requiring substantially less screening than an exhaustive evaluation
([preprint](https://arxiv.org/abs/2202.04202)).
This repository is an implementation of RECOVER, a platform that can guide wet lab experiments to quickly discover synergistic drug combinations
([preprint](https://arxiv.org/abs/2202.04202)). However, instead of using an ensemble model to obtain synergy predictions with uncertainty, we use multiple realizations of a Bayesian Neural Network model.
Since the weights are drawn from a distribution, they differ for every forward pass of the trained model and hence give different results. The goal was to obtain a more precise uncertainty estimate, and to obtain it more quickly, since the model does not have to be trained multiple times.


![Overview](docs/images/overview.png "Overview")
## Bayesian Before and After Merge
This branch refers to a model that uses Bayesian modeling in both the single-drug MLP and the combination MLP. The predictors with all-Bayesian layers are added in `Recover/recover/models/predictors.py`. `train.py` was updated with a `train_epoch_bayesian` function that trains the model using a KL loss, and a `test_epoch` function for testing the model. In the Bayesian Basic Trainer, `test_epoch` is used to test the trained model and easily obtain the mean and standard deviation of the synergy predictions.
In the Bayesian Active Trainer, realizations of the trained model are used instead of the Ensemble Model to compute the acquisition function scores and rank the drug combinations. Probability of Improvement and Expected Improvement acquisition functions were added to `Recover/recover/acquisition/acquisition.py`, since we are now working with Bayesian optimization.
Config files were also updated to use BNNs. In this branch there are separate Bayesian config files, while in **master** the option to use Bayesian layers was added to the existing config files.
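The core idea behind the realizations approach, sampling weights from a learned posterior so that repeated forward passes of one trained model yield different predictions, can be sketched as follows. This is an illustrative toy layer, not the code from `Recover/recover/models/predictors.py`; the name `BayesianLinear` and its parameterization are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BayesianLinear(nn.Module):
    """Linear layer with a Gaussian posterior over weights (reparameterization trick)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        # softplus(rho) gives the posterior std, kept small at initialization
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        std = F.softplus(self.w_rho)
        # A fresh weight sample is drawn on every forward pass
        w = self.w_mu + std * torch.randn_like(std)
        return F.linear(x, w, self.bias)


# Multiple realizations of a single trained model give an uncertainty estimate
layer = BayesianLinear(4, 1)
x = torch.randn(8, 4)
preds = torch.stack([layer(x) for _ in range(10)], dim=1)  # (batch, realizations, 1)
mean, std = preds.mean(dim=1), preds.std(dim=1)
```

An ensemble would need several independently trained networks to produce the same kind of mean/std pair; here one network is trained once and simply evaluated repeatedly.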

## Environment setup

**Requirements**: Anaconda (https://www.anaconda.com/) and Git LFS (https://git-lfs.github.com/). Please make sure
both are installed on the system prior to running installation.

**Installation**: enter the command `source install.sh` and follow the instructions. This will create a conda
environment named **recover** and install all the required packages including the
[reservoir](https://github.com/RECOVERcoalition/Reservoir) package that stores the primary data acquisition scripts.

In case you have any issue with the installation procedure of the *reservoir*, you can access and download all files directly from this [google drive](https://drive.google.com/drive/folders/1MYeDoAi0-qnhSJTvs68r861iMOdoqYki?usp=share_link).
**Requirements and Installation**:
For all requirements and installation steps, check the original RECOVER repository (https://github.com/RECOVERcoalition/Recover.git).

## Running the pipeline

@@ -31,9 +28,3 @@ For example, to run the pipeline with configuration from
the file `model_evaluation.py`, run `python train.py --config model_evaluation`.

Log files will automatically be created to save the results of the experiments.

## Note

This Recover repository is based on research funded by (or in part by) the Bill & Melinda Gates Foundation. The
findings and conclusions contained within are those of the authors and do not necessarily reflect positions or policies
of the Bill & Melinda Gates Foundation.
Binary file added docs/images/ProjectInfographics.png
46 changes: 46 additions & 0 deletions experiments/data/cell_line_transfer_bayesian-result.json

Large diffs are not rendered by default.


36 changes: 36 additions & 0 deletions experiments/data/cell_line_transfer_shuffled_bayesian-result.json


49 changes: 49 additions & 0 deletions experiments/data/model_evaluation_bayesian-result.json


72 changes: 72 additions & 0 deletions experiments/data/model_evaluation_shuffled_bayesian-result.json


25 changes: 25 additions & 0 deletions experiments/data/one_hidden_drug_split_bayesian-result.json


39 changes: 39 additions & 0 deletions experiments/stats.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
import json
import numpy as np

# Specify the path to your JSON file
json_file_path = "result.json"

# Load data from the JSON file
with open(json_file_path, "r") as json_file:
    # The file holds one JSON object per line; add commas and wrap the content
    # in square brackets to form a single JSON array
    json_data = "[" + json_file.read().replace("}\n{", "},\n{") + "]"

# Parse the JSON array
data = json.loads(json_data)

# Extract relevant values
spearman_values = [entry["eval/spearman"] for entry in data]
rsquared_values = [entry["eval/comb_r_squared"] for entry in data]

# Calculate mean and standard deviation
mean_spearman = np.mean(spearman_values)
std_dev_spearman = np.std(spearman_values)

mean_rsquared = np.mean(rsquared_values)
std_dev_rsquared = np.std(rsquared_values)

# Print the results
print(f"Mean eval/rsquared: {round(mean_rsquared, 3)}, Standard Deviation: {round(std_dev_rsquared, 3)}")
print(f"Mean eval/spearman: {round(mean_spearman, 3)}, Standard Deviation: {round(std_dev_spearman, 3)}")

last_entry = data[-1]
mean_r_squared_last = last_entry["mean_r_squared"]
std_r_squared_last = last_entry["std_r_squared"]

mean_spearman_last = last_entry["mean_spearman"]
std_spearman_last = last_entry["std_spearman"]

# Print values for the last entry
print(f"Mean test/rsquared: {round(mean_r_squared_last, 3)}, Standard Deviation: {round(std_r_squared_last, 3)}")
print(f"Mean test/spearman: {round(mean_spearman_last, 3)}, Standard Deviation: {round(std_spearman_last, 3)}")
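The script above assumes a results file where each line is a standalone JSON object (as Ray-style loggers typically write). A minimal illustration of the same wrapping trick on hypothetical values:

```python
import json
import numpy as np

# Two result lines, one JSON object per line (hypothetical values, not from the repository)
raw = '{"eval/spearman": 0.61, "eval/comb_r_squared": 0.42}\n' \
      '{"eval/spearman": 0.65, "eval/comb_r_squared": 0.44}'

# Same trick as stats.py: insert commas between objects and wrap in brackets
data = json.loads("[" + raw.replace("}\n{", "},\n{") + "]")
spearman = [entry["eval/spearman"] for entry in data]
print(round(float(np.mean(spearman)), 3))  # prints 0.63
```

Note that the `replace` relies on each object ending exactly at a line break; a more robust alternative would be to `json.loads` each line separately.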
48 changes: 47 additions & 1 deletion recover/acquisition/acquisition.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,7 @@
import torch
import numpy as np
from scipy.special import erf
from scipy.stats import norm

########################################################################################################################
# Abstract Acquisition
@@ -18,23 +21,66 @@ def get_scores(self, output):
raise NotImplementedError

    def get_mean_and_std(self, output):
        output = torch.tensor(output)
        mean = output.mean(dim=1)
        std = output.std(dim=1)

        return mean, std

    def get_current_best(self, output):
        """
        The max synergy is considered the current best, since we do not have
        access to the ground truth.
        """
        best, _ = output.max(dim=1)

        return best


########################################################################################################################
# Acquisition functions
########################################################################################################################

class ExpectedImprovementAcquisition(AbstractAcquisition):
    def __init__(self, config):
        super().__init__(config)

    def get_scores(self, output):
        mean, std = self.get_mean_and_std(output)
        best = self.get_current_best(output)
        epsilon = 1e-6

        # phi and Phi are the standard normal pdf and cdf; epsilon guards
        # against zero std and enforces a minimum improvement margin
        z = (mean - best - epsilon) / (std + epsilon)
        phi = np.exp(-0.5 * (z ** 2)) / np.sqrt(2 * np.pi)
        Phi = 0.5 * (1 + erf(z / np.sqrt(2)))
        scores = (mean - best) * Phi + std * phi

        return scores.to("cpu")


class RandomAcquisition(AbstractAcquisition):
    def __init__(self, config):
        super().__init__(config)

    def get_scores(self, output):
        return torch.randn(output.shape[0])


class ProbabilityOfImprovementAcquisition(AbstractAcquisition):
    """
    Probability of Improvement acquisition function.
    """
    def __init__(self, config):
        super().__init__(config)

    def get_scores(self, output):
        mean, std = self.get_mean_and_std(output)
        current_best = self.get_current_best(output)

        # Phi((mean - best) / std): probability that a candidate improves on
        # the current best
        z = (mean - current_best) / std
        prob_of_improvement_scores = norm.cdf(z)

        return torch.tensor(prob_of_improvement_scores).to("cpu")


class UCB(AbstractAcquisition):
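For intuition, the two new acquisition functions can be evaluated on toy numbers (illustrative values, not from the repository). When a candidate's mean equals the current best, its EI reduces to `std * phi(0)` and its PoI to 0.5:

```python
import torch
from scipy.stats import norm

# Toy posterior summaries for two candidate drug combinations
mean = torch.tensor([0.0, 1.0])
std = torch.tensor([1.0, 1.0])
best = 1.0  # current best observed synergy

z = (mean - best) / std

# Probability of Improvement: Phi(z)
poi = torch.tensor(norm.cdf(z))

# Expected Improvement: (mean - best) * Phi(z) + std * phi(z)
ei = (mean - best) * torch.tensor(norm.cdf(z)) + std * torch.tensor(norm.pdf(z))
```

The repository's `ExpectedImprovementAcquisition` additionally adds a small epsilon to the denominator to guard against zero standard deviation, omitted here for clarity.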
112 changes: 112 additions & 0 deletions recover/config/active_learning_UCB_bayesian.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
from recover.datasets.drugcomb_matrix_data import DrugCombMatrix
from recover.models.models import Baseline, EnsembleModel
from recover.models.predictors import BilinearFilmMLPPredictor, \
BilinearMLPPredictor, BilinearFilmWithFeatMLPPredictor, BayesianBilinearMLPPredictor #, BilinearCellLineInputMLPPredictor
from recover.utils.utils import get_project_root
from recover.acquisition.acquisition import RandomAcquisition, GreedyAcquisition, UCB, ExpectedImprovementAcquisition
from recover.train import train_epoch_bayesian, eval_epoch, test_epoch, BayesianBasicTrainer, BayesianActiveTrainer
import os
from ray import tune

########################################################################################################################
# Configuration
########################################################################################################################


pipeline_config = {
    "use_tune": True,
    "num_epoch_without_tune": 500,  # Used only if "use_tune" == False
    "seed": tune.grid_search([1, 2, 3]),
    # Optimizer config
    "lr": 1e-4,
    "weight_decay": 1e-2,
    "batch_size": 128,
    # Train epoch and eval_epoch to use
    "train_epoch": train_epoch_bayesian,
    "eval_epoch": eval_epoch,
    "test_epoch": test_epoch,
}

predictor_config = {
    "predictor": BayesianBilinearMLPPredictor,
    "predictor_layers":
        [
            2048,
            128,
            64,
            1,
        ],
    "merge_n_layers_before_the_end": 2,  # Computation on the sum of the two drug embeddings for the last n layers
    "allow_neg_eigval": True,
    "stop": {"training_iteration": 1000, 'patience': 10}
}

model_config = {
    "model": Baseline,
    # Loading pretrained model
    "load_model_weights": False,  # tune.grid_search([True, False]),
    "model_weights_file": "",
}

"""
List of cell line names:

['786-0', 'A498', 'A549', 'ACHN', 'BT-549', 'CAKI-1', 'EKVX', 'HCT-15', 'HCT116', 'HOP-62', 'HOP-92', 'HS 578T', 'HT29',
'IGROV1', 'K-562', 'KM12', 'LOX IMVI', 'MALME-3M', 'MCF7', 'MDA-MB-231', 'MDA-MB-468', 'NCI-H226', 'NCI-H460',
'NCI-H522', 'NCIH23', 'OVCAR-4', 'OVCAR-5', 'OVCAR-8', 'OVCAR3', 'PC-3', 'RPMI-8226', 'SF-268', 'SF-295', 'SF-539',
'SK-MEL-2', 'SK-MEL-28', 'SK-MEL-5', 'SK-OV-3', 'SNB-75', 'SR', 'SW-620', 'T-47D', 'U251', 'UACC-257', 'UACC62',
'UO-31']
"""

dataset_config = {
    "dataset": DrugCombMatrix,
    "study_name": 'ALMANAC',
    "in_house_data": 'without',
    "rounds_to_include": [],
    "cell_line": 'MCF7',  # Restrict to a specific cell line
    "val_set_prop": 0.1,
    "test_set_prop": 0.,
    "test_on_unseen_cell_line": False,
    "split_valid_train": "pair_level",  # either "cell_line_level" or "pair_level"
    "cell_lines_in_test": None,  # ['MCF7', 'PC-3'],
    "target": "bliss_max",
    "fp_bits": 1024,
    "fp_radius": 2
}

active_learning_config = {
    "ensemble_size": 10,
    "acquisition": tune.grid_search([GreedyAcquisition, UCB, RandomAcquisition, ExpectedImprovementAcquisition]),
    "patience_max": 8,
    "kappa": 1,
    "kappa_decrease_factor": 1,
    "n_epoch_between_queries": 500,
    "acquire_n_at_a_time": 30,
    "n_initial": 30,
    "realizations": 10,  # number of realizations used instead of the Ensemble Model
}

########################################################################################################################
# Configuration that will be loaded
########################################################################################################################

configuration = {
    "trainer": BayesianActiveTrainer,
    "trainer_config": {
        **pipeline_config,
        **predictor_config,
        **model_config,
        **dataset_config,
        **active_learning_config
    },
    "summaries_dir": os.path.join(get_project_root(), "RayLogs"),
    "memory": 1800,
    "stop": {"training_iteration": 1000, 'all_space_explored': 1},
    "checkpoint_score_attr": 'eval/comb_r_squared',
    "keep_checkpoints_num": 1,
    "checkpoint_at_end": False,
    "checkpoint_freq": 1,
    "resources_per_trial": {"cpu": 32, "gpu": 2},
    "scheduler": None,
    "search_alg": None,
}
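The `realizations` setting above is what replaces the ensemble: one trained stochastic model is queried several times, and the spread of its outputs serves as the uncertainty. A minimal sketch of that aggregation, using a hypothetical stand-in model (`NoisyModel` is illustrative, not the repository's predictor):

```python
import torch


def predict_with_realizations(model, x, n_realizations=10):
    """Run a stochastic (Bayesian) model several times and aggregate.

    Assumes each forward pass draws fresh weights, so repeated calls on the
    same input produce different outputs from a single trained network.
    """
    with torch.no_grad():
        outs = torch.stack([model(x) for _ in range(n_realizations)], dim=1)
    return outs.mean(dim=1), outs.std(dim=1)


class NoisyModel(torch.nn.Module):
    """Toy stand-in: perturbs its output on every call, mimicking weight sampling."""

    def forward(self, x):
        return x.sum(dim=1, keepdim=True) + 0.1 * torch.randn(x.shape[0], 1)


mean, std = predict_with_realizations(NoisyModel(), torch.randn(5, 3), n_realizations=10)
```

The mean/std pair plays the same role the per-member predictions of `EnsembleModel` play in the original RECOVER active-learning loop, but requires only one training run.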
87 changes: 87 additions & 0 deletions recover/config/cell_line_transfer_bayesian.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
from recover.datasets.drugcomb_matrix_data import DrugCombMatrix
from recover.models.models import Baseline
from recover.models.predictors import BayesianBilinearFilmMLPPredictor, BayesianBilinearLinFilmWithFeatMLPPredictor  # Bayesian predictors
from recover.utils.utils import get_project_root
from recover.train import train_epoch_bayesian, eval_epoch, test_epoch, BayesianBasicTrainer  # Bayesian train function, trainer, and testing epoch
import os
from ray import tune
from importlib import import_module

########################################################################################################################
# Configuration
########################################################################################################################


pipeline_config = {
    "use_tune": True,
    "num_epoch_without_tune": 500,  # Used only if "use_tune" == False
    "seed": tune.grid_search([2, 3, 4]),
    # Optimizer config
    "lr": 1e-4,
    "weight_decay": 1e-2,
    "batch_size": 128,
    # Train epoch and eval_epoch to use
    "train_epoch": train_epoch_bayesian,  # updated train function that includes the KL divergence term
    "eval_epoch": eval_epoch,
    "test_epoch": test_epoch,  # Bayesian test epoch, used to run the different realizations
}

predictor_config = {
    "predictor": BayesianBilinearLinFilmWithFeatMLPPredictor,
    "predictor_layers":
        [
            2048,
            128,
            64,
            1,
        ],
    "merge_n_layers_before_the_end": 2,  # Computation on the sum of the two drug embeddings for the last n layers
    "allow_neg_eigval": True,
    "stop": {"training_iteration": 1000, 'patience': 10},  # passed so that we can check when training is over
    "realizations": 10  # number of realizations
}

model_config = {
    "model": Baseline,
    "load_model_weights": False,
}

dataset_config = {
    "dataset": DrugCombMatrix,
    "study_name": 'ALMANAC',
    "in_house_data": 'without',
    "rounds_to_include": [],
    "val_set_prop": 0.2,
    "test_set_prop": 0.1,
    "test_on_unseen_cell_line": True,
    "cell_lines_in_test": ['MCF7'],
    "split_valid_train": "cell_line_level",
    "cell_line": None,  # 'PC-3',
    "target": "bliss_max",  # tune.grid_search(["css", "bliss", "zip", "loewe", "hsa"]),
    "fp_bits": 1024,
    "fp_radius": 2
}

########################################################################################################################
# Configuration that will be loaded
########################################################################################################################

configuration = {
    "trainer": BayesianBasicTrainer,  # Bayesian trainer
    "trainer_config": {
        **pipeline_config,
        **predictor_config,
        **model_config,
        **dataset_config,
    },
    "summaries_dir": os.path.join(get_project_root(), "RayLogs"),
    "memory": 1800,
    "stop": {"training_iteration": 1000, 'patience': 10},
    "checkpoint_score_attr": 'eval/comb_r_squared',
    "keep_checkpoints_num": 1,
    "checkpoint_at_end": False,
    "checkpoint_freq": 1,
    "resources_per_trial": {"cpu": 8, "gpu": 0},
    "scheduler": None,
    "search_alg": None,
}
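The `train_epoch_bayesian` function referenced in these configs trains with a KL-weighted loss (data fit plus a weighted divergence between the weight posterior and prior). A minimal sketch of one such training step; the `kl_loss()` method and the toy model are hypothetical stand-ins, and the actual interface in `train.py` may differ:

```python
import torch


def train_step_bayesian(model, x, y, optimizer, kl_weight=1e-3):
    """One ELBO-style step: data loss plus a weighted KL term.

    Assumes the model exposes a kl_loss() method returning the KL divergence
    between its weight posterior and prior (hypothetical interface).
    """
    optimizer.zero_grad()
    pred = model(x)
    loss = torch.nn.functional.mse_loss(pred, y) + kl_weight * model.kl_loss()
    loss.backward()
    optimizer.step()
    return loss.item()


class ToyBayesianModel(torch.nn.Module):
    """Toy model whose 'KL' is an L2 penalty standing in for KL(posterior || prior)."""

    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(3, 1)

    def kl_loss(self):
        return (self.lin.weight ** 2).sum()

    def forward(self, x):
        return self.lin(x)


model = ToyBayesianModel()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss = train_step_bayesian(model, torch.randn(8, 3), torch.randn(8, 1), opt)
```

The "optimal kl weight" commit in this PR suggests `kl_weight` was tuned; here it is just a free parameter of the sketch.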