llnl · landajuela · Jan 21, 2025 · Jan 18, 2025
diff --git a/.gitignore b/.gitignore
@@ -3,4 +3,5 @@
 .bash_history
 lp_solution*
 .vscode
-.coverage
+.coverage
+combined_scores.csv
diff --git a/README.md b/README.md
@@ -8,38 +8,34 @@
   </h2>
 </div>
 
-
 ![Status](https://img.shields.io/badge/Status-Active-green.svg)
 ![Python](https://img.shields.io/badge/Python-3.9-blue.svg)
 [![Paper](https://img.shields.io/badge/Paper-Download-green.svg)](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1)
 ![CI](https://github.com/LLNL/protlib-designer/actions/workflows/ci.yml/badge.svg)
 
 ## Introduction
 
-Welcome to the `protlib-designer` repository! This repository contains a Python package that designs diverse protein libraries by seeding linear programming with deep mutational scanning data (or any other data that can be represented as a matrix of scores per single-point mutation). The software takes as input the score matrix, where each row corresponds to a mutation and each column corresponds to a different source of scores, and outputs a subset of mutations that maximize the diversity of the library while Pareto-optimizing the scores from the different sources. 
+Welcome to the `protlib-designer` repository! This repository contains a lightweight Python library for designing diverse protein libraries by seeding linear programming with deep mutational scanning data (or any other data that can be represented as a matrix of scores per single-point mutation). The software takes as input the score matrix, where each row corresponds to a mutation and each column corresponds to a different source of scores, and outputs a subset of mutations that maximizes the diversity of the library while Pareto-optimizing the scores from the different sources.
 
 The paper [Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1) uses this software to design diverse antibody libraries by seeding linear programming with scores computed by Protein Language Models (PLMs) and Inverse Folding models.
 
 <figure>
-  <img src="images/method_diagram.png" width="800">
-  <figcaption>
-    <p class="figure-caption text-center">
-	<em>
-	protlib-designer designs diverse protein libraries by seeding linear programming with deep mutational scanning data.
-	(a) The input to the method is an antibody-antigen complex and a target antibody sequence. (b) We generate in silico deep mutational scanning data using protein language and inverse folding models. (c) The result is fed into a multi-objective linear programming solver. (d) The solver generates a library of antibodies that are co-optimized for the in silico scores while satisfying diversity constraints.
-	</em>
-	</p>
-  </figcaption>
+<img src="images/method_diagram.png" width="800">
+<figcaption>
+<p class="figure-caption text-center">
+<em> protlib-designer designs diverse protein libraries by seeding linear programming with deep mutational scanning data. (a) The input to the method is a target protein sequence and, if available, a structure of the protein or protein complex (in this case, the antibody trastuzumab in complex with the HER2 receptor). (b) We generate in silico deep mutational scanning data using protein language and inverse folding models. (c) The result is fed into a multi-objective linear programming solver. (d) The solver generates a library of antibodies that are co-optimized for the in silico scores while satisfying diversity constraints.
+</em>
+</p>
+</figcaption>
 </figure>
 
-
 ## Getting Started
 
 In this section, we provide instructions on how to install the software and run the code.
 
 ### Installation
 
-Create an environment with Python >=3.9 and install the dependencies:
+Create an environment with Python >=3.7,<3.11 and install the dependencies:
 ```bash
 python -m venv .venv
 source .venv/bin/activate
@@ -52,12 +48,10 @@ pip install -e .[dev]
 ```
 which will allow you to run the tests and the linter. You can run the linting with:
 ```bash
-black -S -t py39 protlib_designer scripts 
+black -S -t py39 protlib_designer scripts && \
 flake8 --ignore=E501,E203,W503 protlib_designer scripts
 ```
 
-
-
 ### Run the code
 
 To run the code to create a diverse protein library of size 10 from the example data, run the following command:
@@ -66,26 +60,31 @@ To run the code to create a diverse protein library of size 10 from the example
 protlib-designer ./example_data/trastuzumab_spm.csv 10
 ```
 
-We provide a rich set of command-line arguments to customize the behavior of `protlib-designer`. For example, the following command runs `protlib-designer` with a range of 3 to 5 mutations per sequence, enforcing the interleaving of the mutant order and balancing the mutant order, and using a weighted multi-objective optimization:
+We provide a rich set of command-line arguments to customize the behavior of `protlib-designer`. For example, the following command runs `protlib-designer` with a range of 3 to 5 mutations per sequence, enforcing the interleaving of the mutant order and balancing the mutant order, allowing for each mutation to appear at most `1` time and a position to be mutated at most `4` times,
+and using a weighted multi-objective optimization:
 
 ```bash
 protlib-designer ./example_data/trastuzumab_spm.csv 10 \
---min-mut 3 --max-mut 5 --interleave-mutant-order True --force-mutant-order-balance True \
---weighted-multi-objective True
+  --min-mut 3 \
+  --max-mut 5 \
+  --interleave-mutant-order True \
+  --force-mutant-order-balance True \
+  --schedule 2 \
+  --schedule-param '1,4' \
+  --weighted-multi-objective True
 ```
 
-
 For more information on the command-line arguments, run:
 
 ```bash
 protlib-designer --help
 ```
 
-### Input data
+### Input data : In silico deep mutational scanning data
 
 The input to the software is a matrix of per-mutation scores (the csv file `trastuzumab_spm.csv` in the example above). Typically, the score matrix is defined by *in silico* deep mutational scanning data, where each row corresponds to a mutation and each column corresponds to the score computed by a deep learning model. See the example data in the `example_data` directory for an example of the input data format. The structure of the input data is shown below:
 
-| MutationHL | score-1 | score-2 | ... | score-N |
+| Mutation | score-1 | score-2 | ... | score-N |
 |------------|--------|--------|-----|--------|
 | AH106C     | -0.1    | 0.2    | ... | 0.3    |
 | AH106D     | 0.2    | -0.3    | ... | -0.4    |
@@ -95,7 +94,7 @@ The input to the software is a matrix of per-mutation scores (the csv file `tras
 
 Important notes about the input data:
 
-• The `MutationHL` column contains the mutation in the format : `WT_residue` + `chain` + `position_index` + `mutant_residue`. For example, `A+H+106+C = AH106C` represents the mutation of the residue at position 106 in chain H from alanine to cysteine.
+• The `Mutation` column contains the mutation in the format : `WT_residue` + `chain` + `position_index` + `mutant_residue`. For example, `A+H+106+C = AH106C` represents the mutation of the residue at position 106 in chain H from alanine to cysteine.
 
 • The `score-1`, `score-2`, ..., `score-N` columns contain the scores computed by the deep learning models for each mutation. Typically, each score is the negative log-likelihood ratio of the mutant residue to the wild-type residue, computed by the deep learning model:
 
@@ -105,27 +104,35 @@ s_{ij}^{\text{PLM}} =  -\log \left( \frac{p(x_i = a_j | w)}{p(x_i = w_i | w)} \r
 
 where $w$ is the wild-type sequence, and $p(x_i = a_j | w)$ is the probability of the mutant residue $a_j$ at position $i$ given the wild-type sequence $w$ as estimated by a Protein Language Model (PLM) or an Inverse Folding model (or any other deep learning model). For example, in [Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1), we used the scores computed by the [ProtBert](https://pubmed.ncbi.nlm.nih.gov/34232869/) and [AntiFold](https://arxiv.org/abs/2405.03370) models.
 
+### Scoring functions
+
+We provide a set of scoring functions that can be used to compute the scores for the input data. The scoring functions are defined in the `protlib_designer/scorer` module. To use this functionality, you need to install additional dependencies:
+
+```bash
+pip install -e .[plm]
+```
+
+After installing the dependencies, you can use the scoring functions to compute the scores for the input data. For example, we can compute the scores using the `Rostlab/prot_bert` and `facebook/esm2_t6_8M_UR50D` models and then call `protlib-designer` to design a diverse protein library of size 10:
+
+```bash
+protlib-plm-scorer EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS WH99 GH100 GH101 DH102 GH103 FH104 YH105 AH106 MH107 DH108 \
+--models Rostlab/prot_bert --models facebook/esm2_t6_8M_UR50D \
+--chain-type heavy \
+--score-type minus_llr \
+--mask \
+--output-file combined_scores.csv \
+&& protlib-designer combined_scores.csv 10 --weighted-multi-objective True
+```
+
 ## Contributing
 
 Please read [CONTRIBUTING.md](./CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.
 
-## Citing This Work
+## Citation
 
 If you use this software in your research, please cite the following paper:
 
-```latex
-@article {Hayes2024.11.03.621763,
-	author = {Hayes, Conor F. and Magana-Zook, Steven A. and Gon{\c c}alves, Andre and Solak, Ahmet Can and Faissol, Daniel and Landajuela, Mikel},
-	title = {Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models},
-	elocation-id = {2024.11.03.621763},
-	year = {2024},
-	doi = {10.1101/2024.11.03.621763},
-	publisher = {Cold Spring Harbor Laboratory},
-	URL = {https://www.biorxiv.org/content/early/2024/11/03/2024.11.03.621763},
-	eprint = {https://www.biorxiv.org/content/early/2024/11/03/2024.11.03.621763.full.pdf},
-	journal = {bioRxiv}
-}
-```
+Hayes, C. F., Magana-Zook, S. A., Gonçalves, A., Solak, A. C., Faissol, D., & Landajuela, M. (2024). *Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models*. **bioRxiv**. [https://doi.org/10.1101/2024.11.03.621763](https://doi.org/10.1101/2024.11.03.621763)
 
 ## License
 

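The input-data notes in the README diff above (the `WT_residue` + `chain` + `position_index` + `mutant_residue` identifier and the negative log-likelihood ratio score) can be illustrated with a minimal sketch. The helper names `split_mutation` and `minus_llr` are hypothetical and are not part of `protlib-designer`; the optional insertion-code letter in the position and the example probabilities are assumptions made for illustration only.

```python
import math
import re

# Assumed pattern: wild-type residue, chain letter, position (digits plus an
# optional insertion-code letter, which is an assumption), mutant residue.
_MUTATION_RE = re.compile(r"^([A-Z])([A-Za-z])(\d+[A-Za-z]?)([A-Z])$")


def split_mutation(mutation: str) -> dict:
    """Split an identifier such as 'AH106C' into its four components."""
    match = _MUTATION_RE.match(mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation identifier: {mutation!r}")
    wildtype, chain, position, mutant = match.groups()
    return {
        "wildtype": wildtype,
        "chain": chain,
        "position": position,
        "mutant": mutant,
    }


def minus_llr(p_mutant: float, p_wildtype: float) -> float:
    """Return -log(p_mutant / p_wildtype), the score form shown in the README."""
    if p_mutant <= 0 or p_wildtype <= 0:
        raise ValueError("Probabilities must be strictly positive.")
    return -math.log(p_mutant / p_wildtype)


# Hypothetical model probabilities for the mutant and wild-type residues.
parsed = split_mutation("AH106C")
score = minus_llr(p_mutant=0.01, p_wildtype=0.20)
```

Under this definition, a mutation that the model considers less likely than the wild-type residue receives a positive score, and one that it considers equally likely receives a score of zero.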
diff --git a/example_data/trastuzumab_spm.csv b/example_data/trastuzumab_spm.csv
@@ -1,4 +1,4 @@
-MutationHL,antifold_antigen_neg_llr,protbert_neg_llr
+Mutation,antifold_antigen_neg_llr,protbert_neg_llr
 AH106C,5.8979,5.614669799804688
 AH106D,5.1441,3.7915658950805664
 AH106E,6.1135,4.867663383483887

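Because this change renames the first column from `MutationHL` to `Mutation`, score files written for the earlier header would need their header updated. A minimal sketch of such a migration, using only the standard library, is given here; the function name `rename_mutation_column` is hypothetical and not part of the repository.

```python
import csv
import io

OLD_HEADER = "MutationHL"
NEW_HEADER = "Mutation"


def rename_mutation_column(csv_text: str) -> str:
    """Return the CSV text with a leading 'MutationHL' header renamed."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # Only touch the first cell of the header row; data rows are unchanged.
    if rows and rows[0] and rows[0][0] == OLD_HEADER:
        rows[0][0] = NEW_HEADER
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


legacy = (
    "MutationHL,antifold_antigen_neg_llr,protbert_neg_llr\n"
    "AH106C,5.8979,5.614669799804688\n"
)
migrated = rename_mutation_column(legacy)
```

Files that already use the `Mutation` header pass through unchanged, so the helper can be applied repeatedly without side effects.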
diff --git a/protlib_designer/dataloader.py b/protlib_designer/dataloader.py
@@ -18,9 +18,9 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
     df : pd.DataFrame
         The dataframe containing the data.
     """
-    mutation_full = df["MutationHL"].values.tolist()
-    positions = []  # Positions that have mutations
-    wildtype_position_amino = {}  # Position to wild type amino acid mapping
+    mutation_full = df["Mutation"].values.tolist()
+    positions = []  # Positions that have mutations.
+    wildtype_position_amino = {}  # Position to wild type amino acid mapping.
     for mutation in mutation_full:
 
         wildtype_amino, position, _ = parse_mutation(mutation)
@@ -36,14 +36,14 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
             )
             exit()
 
-        # Save the wild type amino acid at this position
+        # Save the wild type amino acid at this position.
         wildtype_position_amino[position] = wildtype_amino
 
-    # Get distinct positions
+    # Get distinct positions.
     positions = list(set(positions))
 
-    # Order the positions in ascending order
-    # Consider positions like H28 < H100A
+    # Order the positions in ascending order.
+    # Consider positions like H28 < H100A.
     positions_df = pd.DataFrame.from_dict(
         {
             i: {
@@ -61,7 +61,7 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
         ascending=[True, True, True],
     )
 
-    # Get the order by merging the strings
+    # Get the order by merging the strings.
     positions = [
         f"{row['chain']}{row['pos']}{row['pos_extra']}"
         for _, row in positions_df.iterrows()
@@ -92,7 +92,7 @@ def load_data(self):
         logger.info(f"Detected wild type amino acid: {self.wildtype_position_amino}")
 
     def update_config_with_data(self, config: Dict[str, Any]):
-        # Check that max_mut is less than the number of positions
+        # Check that max_mut is less than the number of positions.
         if (
             config["max_mut"] > len(self.positions)
             and config["interleave_mutant_order"]

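The ordering described in the comments above (positions such as `H28` sorting before `H100A`) cannot be obtained with a plain string sort, since `"H100A" < "H28"` lexicographically. The diff does this with a pandas sort over `chain`, `pos`, and `pos_extra` columns; the sketch below shows the same idea with a sort key, assuming a position is a chain letter, an integer, and an optional insertion-code letter. The name `position_sort_key` is hypothetical.

```python
import re

# Assumed position layout: chain letter, integer index, optional insertion code.
_POSITION_RE = re.compile(r"^([A-Za-z])(\d+)([A-Za-z]?)$")


def position_sort_key(position: str) -> tuple:
    """Map 'H100A' to ('H', 100, 'A') so numeric parts compare as integers."""
    match = _POSITION_RE.match(position)
    if match is None:
        raise ValueError(f"Unrecognized position: {position!r}")
    chain, number, insertion = match.groups()
    return (chain, int(number), insertion)


positions = ["H100A", "H28", "L5", "H100", "H28"]
# Deduplicate first, then order by chain, numeric index, and insertion code.
ordered = sorted(set(positions), key=position_sort_key)
```

Comparing the numeric part as an integer is what places `H28` before `H100`, and the empty insertion code sorts before any letter, so `H100` precedes `H100A`.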
diff --git a/protlib_designer/filter/filter.py b/protlib_designer/filter/filter.py
@@ -3,7 +3,7 @@
 
 class Filter(ABC):
     @abstractmethod
-    def filter(self):
+    def filter(self, solution):
         pass
 
     @abstractmethod

diff --git a/protlib_designer/generator/generator.py b/protlib_designer/generator/generator.py
@@ -1,5 +1,3 @@
-# write a generator abstract class
-
 from abc import ABC, abstractmethod
 
 

diff --git a/protlib_designer/generator/ilp_generator.py b/protlib_designer/generator/ilp_generator.py
@@ -1,5 +1,6 @@
 import time
 from pathlib import Path
+import warnings
 
 import numpy as np
 import pandas as pd
@@ -10,6 +11,9 @@
 from protlib_designer.generator.generator import Generator
 from protlib_designer.utils import amino_acids, aromatic_amino_acids, parse_mutation
 
+# Ignore UserWarnings from pulp
+warnings.filterwarnings("ignore", category=UserWarning, module="pulp")
+
 
 class ILPGenerator(Generator):
     def __init__(self, data_loader, config):
@@ -100,15 +104,15 @@ def _prepare_variables_and_zero_pad_matrix(self):
                     self.forbidden_vars.append(x_var)
                     self.forbidden_vars_dict[mutation_name] = x_var
                 # Check if row exists in the input dataframe.
-                if mutation_name in data_df["MutationHL"].values:
+                if mutation_name in data_df["Mutation"].values:
                     # Extract the row from the dataframe in a dictionary format.
-                    row = data_df[data_df["MutationHL"] == mutation_name].to_dict(
+                    row = data_df[data_df["Mutation"] == mutation_name].to_dict(
                         "records"
                     )[0]
                     data_df_padded.append(row)
                 else:  # The row does not exist in the input dataframe.
                     # Add 0-vector row for the new mutation.
-                    new_row = {"MutationHL": mutation_name}
+                    new_row = {"Mutation": mutation_name}
                     # Save the position and aa to add X_pos_a = 0 constraint later in the script.
                     zero_enforced_mutations.append((wt, position, aa))
                     self.missing_vars.append(x_var)
@@ -141,13 +145,13 @@ def _check_data_and_variables_consistency(self):
             )
             exit()
 
-        # Check that data_df["MutationHL"].values is equivalent (ordered in the same way) as x_vars.
+        # Check that data_df["Mutation"].values is equivalent (ordered in the same way) as x_vars.
         for index, x_var in enumerate(self.x_vars):
             mutation_name = x_var.getName().split("_")[1]
-            if mutation_name != self.data_df["MutationHL"].values[index]:
+            if mutation_name != self.data_df["Mutation"].values[index]:
                 logger.error(
                     f"Error adding missing position-amino acid pairs. Expected {mutation_name}. \
-                    Got {self.data_df['MutationHL'].values[index]}"
+                    Got {self.data_df['Mutation'].values[index]}"
                 )
                 exit()
 
@@ -339,7 +343,9 @@ def generate_one_solution(self, iteration: int):
         status = self.problem.solve(self.solver)
 
         if status != 1:
-            logger.error(f"Error Status: {pulp.LpStatus[status]}")
+            logger.error(
+                f"Error in ILPGenerator when solving the problem. Status: {pulp.LpStatus[status]}"
+            )
             return None
 
         cpu_time = time.time() - cpu_time_start
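The `warnings.filterwarnings(..., module="pulp")` call added in this file scopes the suppression by the name of the module that issues the warning, so `UserWarning`s raised elsewhere remain visible. The sketch below demonstrates that scoping with the standard library only; it uses `warnings.warn_explicit` to simulate warnings from a module named `pulp.apis` and from a hypothetical `my_package.core`, rather than importing PuLP itself.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    # Record every warning by default inside this block.
    warnings.simplefilter("always")
    # Same filter as in the diff: the module argument is a regular expression
    # matched against the start of the issuing module's name.
    warnings.filterwarnings("ignore", category=UserWarning, module="pulp")

    # Simulated warning attributed to a PuLP submodule: suppressed.
    warnings.warn_explicit(
        "solver message", UserWarning, filename="pulp/apis.py", lineno=1,
        module="pulp.apis",
    )
    # Simulated warning attributed to other code (hypothetical module): kept.
    warnings.warn_explicit(
        "other message", UserWarning, filename="my_package/core.py", lineno=1,
        module="my_package.core",
    )

recorded_messages = [str(w.message) for w in caught]
```

Only the second warning is recorded, which suggests the filter in the diff should not hide `UserWarning`s issued by `protlib_designer` modules themselves.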