Merged
3 changes: 2 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,5 @@
.bash_history
lp_solution*
.vscode
.coverage
.coverage
combined_scores.csv
81 changes: 44 additions & 37 deletions README.md
Expand Up @@ -8,38 +8,34 @@
</h2>
</div>


![Status](https://img.shields.io/badge/Status-Active-green.svg)
![Python](https://img.shields.io/badge/Python-3.9-blue.svg)
[![Paper](https://img.shields.io/badge/Paper-Download-green.svg)](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1)
![CI](https://github.com/LLNL/protlib-designer/actions/workflows/ci.yml/badge.svg)

## Introduction

Welcome to the `protlib-designer` repository! This repository contains a Python package that designs diverse protein libraries by seeding linear programming with deep mutational scanning data (or any other data that can be represented as a matrix of scores per single-point mutation). The software takes as input the score matrix, where each row corresponds to a mutation and each column corresponds to a different source of scores, and outputs a subset of mutations that maximize the diversity of the library while Pareto-optimizing the scores from the different sources.
Welcome to the `protlib-designer` repository! This repository contains a lightweight Python library for designing diverse protein libraries by seeding linear programming with deep mutational scanning data (or any other data that can be represented as a matrix of scores per single-point mutation). The software takes as input the score matrix, where each row corresponds to a mutation and each column corresponds to a different source of scores, and outputs a subset of mutations that maximize the diversity of the library while Pareto-optimizing the scores from the different sources.

The paper [Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1) uses this software to design diverse antibody libraries by seeding linear programming with scores computed by Protein Language Models (PLMs) and Inverse Folding models.

<figure>
<img src="images/method_diagram.png" width="800">
<figcaption>
<p class="figure-caption text-center">
<em>
protlib-designer designs diverse protein libraries by seeding linear programming with deep mutational scanning data.
(a) The input to the method is an antibody-antigen complex and a target antibody sequence. (b) We generate in silico deep mutational scanning data using protein language and inverse folding models. (c) The result is fed into a multi-objective linear programming solver. (d) The solver generates a library of antibodies that are co-optimized for the in silico scores while satisfying diversity constraints.
</em>
</p>
</figcaption>
<img src="images/method_diagram.png" width="800">
<figcaption>
<p class="figure-caption text-center">
<em> protlib-designer designs diverse protein libraries by seeding linear programming with deep mutational scanning data. (a) The input to the method is a target protein sequence and, if available, a structure of the protein or protein complex (in this case, the antibody trastuzumab in complex with the HER2 receptor). (b) We generate in silico deep mutational scanning data using protein language and inverse folding models. (c) The result is fed into a multi-objective linear programming solver. (d) The solver generates a library of antibodies that are co-optimized for the in silico scores while satisfying diversity constraints.
</em>
</p>
</figcaption>
</figure>


## Getting Started

In this section, we provide instructions on how to install the software and run the code.

### Installation

Create an environment with Python >=3.9 and install the dependencies:
Create an environment with Python >=3.7,<3.11 and install the dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
Expand All @@ -52,12 +48,10 @@ pip install -e .[dev]
```
which will allow you to run the tests and the linter. You can run the linting with:
```bash
black -S -t py39 protlib_designer scripts
black -S -t py39 protlib_designer scripts && \
flake8 --ignore=E501,E203,W503 protlib_designer scripts
```



### Run the code

To run the code to create a diverse protein library of size 10 from the example data, run the following command:
Expand All @@ -66,26 +60,31 @@ To run the code to create a diverse protein library of size 10 from the example
protlib-designer ./example_data/trastuzumab_spm.csv 10
```

We provide a rich set of command-line arguments to customize the behavior of `protlib-designer`. For example, the following command runs `protlib-designer` with a range of 3 to 5 mutations per sequence, enforcing the interleaving of the mutant order and balancing the mutant order, and using a weighted multi-objective optimization:
We provide a rich set of command-line arguments to customize the behavior of `protlib-designer`. For example, the following command runs `protlib-designer` with a range of 3 to 5 mutations per sequence, enforcing the interleaving of the mutant order and balancing the mutant order, allowing each mutation to appear at most `1` time and each position to be mutated at most `4` times, and using a weighted multi-objective optimization:

```bash
protlib-designer ./example_data/trastuzumab_spm.csv 10 \
--min-mut 3 --max-mut 5 --interleave-mutant-order True --force-mutant-order-balance True \
--weighted-multi-objective True
--min-mut 3 \
--max-mut 5 \
--interleave-mutant-order True \
--force-mutant-order-balance True \
--schedule 2 \
--schedule-param '1,4' \
--weighted-multi-objective True
```
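To make the usage limits in the command above concrete, here is a minimal Python sketch of what "each mutation at most `1` time, each position at most `4` times" means for a finished library. It is only an illustration: the function name `check_usage_limits` and the list-of-lists library representation are assumptions made for this example, and the tool itself enforces these limits inside the solver rather than by checking afterwards.

```python
from collections import Counter


def check_usage_limits(library, max_per_mutation=1, max_per_position=4):
    """Return True if no mutation or position is used more often than allowed.

    `library` is a list of variants; each variant is a list of mutation
    strings such as "AH106C" (wild type + chain + position + mutant).
    This helper is hypothetical and not part of protlib-designer.
    """
    mutation_counts = Counter()
    position_counts = Counter()
    for variant in library:
        for mutation in variant:
            mutation_counts[mutation] += 1
            # Drop the wild-type and mutant letters to get the position, e.g. "H106".
            position_counts[mutation[1:-1]] += 1
    return all(c <= max_per_mutation for c in mutation_counts.values()) and all(
        c <= max_per_position for c in position_counts.values()
    )


library = [
    ["AH106C", "WH99F", "GH100A"],
    ["AH106D", "WH99Y", "GH101S"],
]
print(check_usage_limits(library))  # True

# Reusing AH106C in a third variant breaks the per-mutation limit of 1.
library.append(["AH106C", "DH102E", "GH103T"])
print(check_usage_limits(library))  # False
```

Position H106 is mutated in every variant here, which is allowed as long as it stays within the per-position limit; only the repeated mutation `AH106C` causes the check to fail.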


For more information on the command-line arguments, run:

```bash
protlib-designer --help
```

### Input data
### Input data: In silico deep mutational scanning data

The input to the software is a matrix of per-mutation scores (the csv file `trastuzumab_spm.csv` in the example above). Typically, the score matrix is defined by *in silico* deep mutational scanning data, where each row corresponds to a mutation and each column corresponds to the score computed by a deep learning model. See the example data in the `example_data` directory for an example of the input data format. The structure of the input data is shown below:

| MutationHL | score-1 | score-2 | ... | score-N |
| Mutation | score-1 | score-2 | ... | score-N |
|------------|--------|--------|-----|--------|
| AH106C | -0.1 | 0.2 | ... | 0.3 |
| AH106D | 0.2 | -0.3 | ... | -0.4 |
Expand All @@ -95,7 +94,7 @@ The input to the software is a matrix of per-mutation scores (the csv file `tras

Important notes about the input data:

• The `MutationHL` column contains the mutation in the format: `WT_residue` + `chain` + `position_index` + `mutant_residue`. For example, `A+H+106+C = AH106C` represents the mutation of the residue at position 106 in chain H from alanine to cysteine.
• The `Mutation` column contains the mutation in the format: `WT_residue` + `chain` + `position_index` + `mutant_residue`. For example, `A+H+106+C = AH106C` represents the mutation of the residue at position 106 in chain H from alanine to cysteine.

• The `score-1`, `score-2`, ..., `score-N` columns contain the scores computed by the deep learning models for each mutation. Typically, the scores are the negative log-likelihood ratios of the mutant residue and the wild-type residue, computed by the deep learning model:

Expand All @@ -105,27 +104,35 @@ s_{ij}^{\text{PLM}} = -\log \left( \frac{p(x_i = a_j | w)}{p(x_i = w_i | w)} \r

where $w$ is the wild-type sequence, and $p(x_i = a_j | w)$ is the probability of the mutant residue $a_j$ at position $i$ given the wild-type sequence $w$ as estimated by a Protein Language Model (PLM) or an Inverse Folding model (or any other deep learning model). For example, in [Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models](https://www.biorxiv.org/content/10.1101/2024.11.03.621763v1), we used the scores computed by the [ProtBert](https://pubmed.ncbi.nlm.nih.gov/34232869/) and [AntiFold](https://arxiv.org/abs/2405.03370) models.
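The two notes above (the mutation naming scheme and the negative log-likelihood ratio) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's own code: `split_mutation` and `minus_llr` are hypothetical helpers, the regular expression assumes one-letter chain identifiers and an optional insertion-code letter after the position number, and the probabilities are made-up values standing in for a model's output.

```python
import math
import re

# Assumed pattern: WT residue, chain letter, position number,
# optional insertion-code letter (e.g. H100A), mutant residue.
MUTATION_PATTERN = re.compile(r"^([A-Z])([A-Z])(\d+)([A-Z]?)([A-Z])$")


def split_mutation(mutation):
    """Split e.g. "AH106C" into (wild type, chain + position, mutant)."""
    match = MUTATION_PATTERN.match(mutation)
    if match is None:
        raise ValueError(f"Unrecognized mutation string: {mutation!r}")
    wildtype, chain, number, insertion, mutant = match.groups()
    return wildtype, f"{chain}{number}{insertion}", mutant


def minus_llr(p_mutant, p_wildtype):
    """Negative log-likelihood ratio of the mutant vs. the wild-type residue."""
    return -math.log(p_mutant / p_wildtype)


print(split_mutation("AH106C"))  # ('A', 'H106', 'C')

# Hypothetical probabilities: the model prefers the wild-type residue,
# so the mutant gets a positive (unfavorable) score.
print(round(minus_llr(0.02, 0.5), 4))  # 3.2189
```

With this sign convention, a mutant that the model considers more likely than the wild-type residue receives a negative score, and a less likely mutant receives a positive one.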

### Scoring functions

We provide a set of scoring functions that can be used to compute the scores for the input data. The scoring functions are defined in the `protlib_designer/scorer` module. To use this functionality, you need to install additional dependencies:

```bash
pip install -e .[plm]
```

After installing the dependencies, you can use the scoring functions to compute the scores for the input data. For example, we can compute the scores using the `Rostlab/prot_bert` and `facebook/esm2_t6_8M_UR50D` models and then call `protlib-designer` to design a diverse protein library of size 10:

```bash
protlib-plm-scorer EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS WH99 GH100 GH101 DH102 GH103 FH104 YH105 AH106 MH107 DH108 \
--models Rostlab/prot_bert --models facebook/esm2_t6_8M_UR50D \
--chain-type heavy \
--score-type minus_llr \
--mask \
--output-file combined_scores.csv \
&& protlib-designer combined_scores.csv 10 --weighted-multi-objective True
```
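Whether a score matrix comes from the scorer or from another source, it can be useful to sanity-check the file before passing it to `protlib-designer`. The following sketch uses only the Python standard library; `read_score_matrix` is a hypothetical helper, and the only facts it relies on are the documented layout (a `Mutation` column plus one numeric column per score source). The sample rows are taken from `example_data/trastuzumab_spm.csv`.

```python
import csv
import io


def read_score_matrix(handle):
    """Read a score matrix and return (score column names, {mutation: scores}).

    Hypothetical helper: it only checks that a `Mutation` column exists and
    that every other column parses as a float.
    """
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    if "Mutation" not in fieldnames:
        raise ValueError("The score matrix needs a 'Mutation' column.")
    score_columns = [name for name in fieldnames if name != "Mutation"]
    scores = {}
    for row in reader:
        scores[row["Mutation"]] = [float(row[name]) for name in score_columns]
    return score_columns, scores


# In-memory stand-in for a CSV file on disk.
example = io.StringIO(
    "Mutation,antifold_antigen_neg_llr,protbert_neg_llr\n"
    "AH106C,5.8979,5.614669799804688\n"
    "AH106D,5.1441,3.7915658950805664\n"
)
columns, scores = read_score_matrix(example)
print(columns)  # ['antifold_antigen_neg_llr', 'protbert_neg_llr']
print(scores["AH106D"])  # [5.1441, 3.7915658950805664]
```

To check a real file, pass an open file object instead of the `io.StringIO` buffer, for example `open("combined_scores.csv", newline="")`.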

## Contributing

Please read [CONTRIBUTING.md](./CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.

## Citing This Work
## Citation

If you use this software in your research, please cite the following paper:

```latex
@article {Hayes2024.11.03.621763,
author = {Hayes, Conor F. and Magana-Zook, Steven A. and Gon{\c c}alves, Andre and Solak, Ahmet Can and Faissol, Daniel and Landajuela, Mikel},
title = {Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models},
elocation-id = {2024.11.03.621763},
year = {2024},
doi = {10.1101/2024.11.03.621763},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/11/03/2024.11.03.621763},
eprint = {https://www.biorxiv.org/content/early/2024/11/03/2024.11.03.621763.full.pdf},
journal = {bioRxiv}
}
```
Hayes, C. F., Magana-Zook, S. A., Gonçalves, A., Solak, A. C., Faissol, D., & Landajuela, M. (2024). *Antibody Library Design by Seeding Linear Programming with Inverse Folding and Protein Language Models*. **bioRxiv**. [https://doi.org/10.1101/2024.11.03.621763](https://doi.org/10.1101/2024.11.03.621763)

## License

Expand Down
2 changes: 1 addition & 1 deletion example_data/trastuzumab_spm.csv
@@ -1,4 +1,4 @@
MutationHL,antifold_antigen_neg_llr,protbert_neg_llr
Mutation,antifold_antigen_neg_llr,protbert_neg_llr
AH106C,5.8979,5.614669799804688
AH106D,5.1441,3.7915658950805664
AH106E,6.1135,4.867663383483887
Expand Down
18 changes: 9 additions & 9 deletions protlib_designer/dataloader.py
Expand Up @@ -18,9 +18,9 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
df : pd.DataFrame
The dataframe containing the data.
"""
mutation_full = df["MutationHL"].values.tolist()
positions = [] # Positions that have mutations
wildtype_position_amino = {} # Position to wild type amino acid mapping
mutation_full = df["Mutation"].values.tolist()
positions = [] # Positions that have mutations.
wildtype_position_amino = {} # Position to wild type amino acid mapping.
for mutation in mutation_full:

wildtype_amino, position, _ = parse_mutation(mutation)
Expand All @@ -36,14 +36,14 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
)
exit()

# Save the wild type amino acid at this position
# Save the wild type amino acid at this position.
wildtype_position_amino[position] = wildtype_amino

# Get distinct positions
# Get distinct positions.
positions = list(set(positions))

# Order the positions in ascending order
# Consider positions like H28 < H100A
# Order the positions in ascending order.
# Consider positions like H28 < H100A.
positions_df = pd.DataFrame.from_dict(
{
i: {
Expand All @@ -61,7 +61,7 @@ def extract_positions_and_wildtype_amino_from_data(df: pd.DataFrame):
ascending=[True, True, True],
)

# Get the order by merging the strings
# Get the order by merging the strings.
positions = [
f"{row['chain']}{row['pos']}{row['pos_extra']}"
for _, row in positions_df.iterrows()
Expand Down Expand Up @@ -92,7 +92,7 @@ def load_data(self):
logger.info(f"Detected wild type amino acid: {self.wildtype_position_amino}")

def update_config_with_data(self, config: Dict[str, Any]):
# Check that max_mut is less than the number of positions
# Check that max_mut is less than the number of positions.
if (
config["max_mut"] > len(self.positions)
and config["interleave_mutant_order"]
Expand Down
2 changes: 1 addition & 1 deletion protlib_designer/filter/filter.py
Expand Up @@ -3,7 +3,7 @@

class Filter(ABC):
@abstractmethod
def filter(self):
def filter(self, solution):
pass

@abstractmethod
Expand Down
2 changes: 0 additions & 2 deletions protlib_designer/generator/generator.py
@@ -1,5 +1,3 @@
# write a generator abstract class

from abc import ABC, abstractmethod


Expand Down
20 changes: 13 additions & 7 deletions protlib_designer/generator/ilp_generator.py
@@ -1,5 +1,6 @@
import time
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
Expand All @@ -10,6 +11,9 @@
from protlib_designer.generator.generator import Generator
from protlib_designer.utils import amino_acids, aromatic_amino_acids, parse_mutation

# Ignore UserWarnings from pulp
warnings.filterwarnings("ignore", category=UserWarning, module="pulp")


class ILPGenerator(Generator):
def __init__(self, data_loader, config):
Expand Down Expand Up @@ -100,15 +104,15 @@ def _prepare_variables_and_zero_pad_matrix(self):
self.forbidden_vars.append(x_var)
self.forbidden_vars_dict[mutation_name] = x_var
# Check if row exists in the input dataframe.
if mutation_name in data_df["MutationHL"].values:
if mutation_name in data_df["Mutation"].values:
# Extract the row from the dataframe in a dictionary format.
row = data_df[data_df["MutationHL"] == mutation_name].to_dict(
row = data_df[data_df["Mutation"] == mutation_name].to_dict(
"records"
)[0]
data_df_padded.append(row)
else: # The row does not exist in the input dataframe.
# Add 0-vector row for the new mutation.
new_row = {"MutationHL": mutation_name}
new_row = {"Mutation": mutation_name}
# Save the position and aa to add X_pos_a = 0 constraint later in the script.
zero_enforced_mutations.append((wt, position, aa))
self.missing_vars.append(x_var)
Expand Down Expand Up @@ -141,13 +145,13 @@ def _check_data_and_variables_consistency(self):
)
exit()

# Check that data_df["MutationHL"].values is equivalent (ordered in the same way) as x_vars.
# Check that data_df["Mutation"].values is equivalent (ordered in the same way) as x_vars.
for index, x_var in enumerate(self.x_vars):
mutation_name = x_var.getName().split("_")[1]
if mutation_name != self.data_df["MutationHL"].values[index]:
if mutation_name != self.data_df["Mutation"].values[index]:
logger.error(
f"Error adding missing position-amino acid pairs. Expected {mutation_name}. \
Got {self.data_df['MutationHL'].values[index]}"
Got {self.data_df['Mutation'].values[index]}"
)
exit()

Expand Down Expand Up @@ -339,7 +343,9 @@ def generate_one_solution(self, iteration: int):
status = self.problem.solve(self.solver)

if status != 1:
logger.error(f"Error Status: {pulp.LpStatus[status]}")
logger.error(
f"Error in ILPGenerator when solving the problem. Status: {pulp.LpStatus[status]}"
)
return None

cpu_time = time.time() - cpu_time_start
Expand Down