ComIN is a universal framework that learns biomolecular interface representations via contrastive learning on interface atom graphs, jointly trained on protein-protein, protein-peptide, and protein-small molecule interactions. ComIN uses a geometry-aware VisNet encoder to extract invariant representations from atomic graphs of receptor-ligand interfaces, optimized via InfoNCE loss to discriminate binding patterns.
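As a rough illustration of the InfoNCE objective mentioned above, the following pure-Python sketch computes the loss for a small batch of paired embeddings. The function names (`cosine`, `info_nce`), the cosine similarity, the temperature value, and the use of the other items in the batch as negatives are all assumptions made for illustration; the actual training code in `src` may define pairs and similarities differently.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss where keys[i] is the positive for queries[i].

    Every other key in the batch acts as a negative (an assumption of
    this sketch, not a statement about ComIN's sampling scheme).
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy embeddings: matched pairs point in similar directions.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys_matched = [[0.9, 0.1], [0.1, 0.9]]
keys_shuffled = keys_matched[::-1]

loss_matched = info_nce(queries, keys_matched)
loss_shuffled = info_nce(queries, keys_shuffled)
print(loss_matched < loss_shuffled)
```

The loss is low when each query is most similar to its own key and high when the pairing is scrambled, which is the behavior a contrastive encoder is trained toward.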
You can install the environment from env.yaml or requirements.txt, or set it up manually with the following steps:
```bash
# Create and activate environment
conda create -n graph python=3.8
conda activate graph

# Basic packages
pip install pandas==2.0.3 numpy==1.24.4 scikit-learn tqdm

# PyTorch (CUDA 11.8)
pip install torch==2.1.1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

# PyG extensions
pip install torch_scatter torch_cluster -f https://pytorch-geometric.com/whl/torch-2.1.1+cu118.html
pip install torch-geometric

# Plotting and notebooks
pip install matplotlib seaborn jupyter notebook

# Structural tools
pip install freesasa
conda install mmseqs2=17.* -c conda-forge -c bioconda
```
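After installation, it can help to confirm that the pinned versions are the ones actually present in the active environment. The sketch below uses only the standard-library `importlib.metadata` module; the helper name `check_versions` and the `EXPECTED` dictionary are hypothetical and simply mirror the pins from the steps above.

```python
from importlib import metadata

# Version pins copied from the manual installation steps.
EXPECTED = {
    "pandas": "2.0.3",
    "numpy": "1.24.4",
    "torch": "2.1.1+cu118",
}


def check_versions(expected):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={EXPECTED[name]} [{status}]")
```

Run it inside the activated `graph` environment; any line marked `MISMATCH` points to a package that is missing or installed at a different version.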
Main Datasets
- Main datasets are curated from the sequence non-redundant set of Q-BioLiP. See the `data_preparation` folder.
- `data_preparation/pkls`: Data for ComIN-Base (4.5 Å proximity threshold).
- `data_preparation/pkls_large`: Data for ComIN-Large (6.0 Å proximity threshold).
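To make the proximity threshold concrete, here is a minimal sketch of how interface atoms could be selected by a distance cutoff. It assumes that an atom belongs to the interface when it lies within the cutoff of any atom on the other partner, and that the comparison is inclusive; the function name `interface_atoms` and the toy coordinates are hypothetical, and the actual preprocessing in `data_preparation` may differ (for example in which atoms are considered).

```python
import math


def interface_atoms(receptor_coords, ligand_coords, cutoff=4.5):
    """Return sorted indices of receptor and ligand atoms within `cutoff` Å
    of at least one atom on the other partner."""
    receptor_idx, ligand_idx = set(), set()
    for i, r in enumerate(receptor_coords):
        for j, l in enumerate(ligand_coords):
            if math.dist(r, l) <= cutoff:
                receptor_idx.add(i)
                ligand_idx.add(j)
    return sorted(receptor_idx), sorted(ligand_idx)


# Toy coordinates in Å (illustrative only).
receptor = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(3.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

base_interface = interface_atoms(receptor, ligand, cutoff=4.5)
large_interface = interface_atoms(receptor, ligand, cutoff=6.0)
print(base_interface)
print(large_interface)
```

With these toy coordinates, the 6.0 Å cutoff captures atoms that the 4.5 Å cutoff leaves out, which is why the larger threshold yields larger interface graphs.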
Downstream Datasets
- Processed data for downstream tasks, sourced from LIT-PCBA, HLA3DB, CPset, and SAbDab. See the `downstream_data` folder.
Due to size limits, the datasets are hosted on Google Drive.
All source code and configurations are located in the `src` directory.
- `src/configs/`: YAML files for model hyperparameters.
- `src/scripts/`: Training via `bash train.sh`.
Trained weights are stored in the `ckpts` directory.

| Model | Configuration | Checkpoint Path |
|---|---|---|
| ComIN-Base | `src/configs/train.yaml` | `ckpts/default` |
| ComIN-Large | `src/configs/train_large.yaml` | `ckpts/large` |
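For scripts that need to switch between the two variants, the table can be mirrored as a small lookup. This is only a convenience sketch: the `MODELS` dictionary and the `resolve_model` helper are hypothetical and not part of the repository; only the paths are taken from the table.

```python
from pathlib import Path

# Paths copied from the table above, relative to the repository root.
MODELS = {
    "ComIN-Base": {"config": "src/configs/train.yaml", "checkpoint": "ckpts/default"},
    "ComIN-Large": {"config": "src/configs/train_large.yaml", "checkpoint": "ckpts/large"},
}


def resolve_model(name, repo_root="."):
    """Return the config and checkpoint paths for a model variant."""
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODELS)}")
    root = Path(repo_root)
    return {key: root / rel for key, rel in MODELS[name].items()}


if __name__ == "__main__":
    for model_name in MODELS:
        paths = resolve_model(model_name)
        print(model_name, paths["config"], paths["checkpoint"])
```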
To evaluate the model on the test sets:
```bash
cd src/scripts
bash test.sh
```
We provide notebooks in the `notebooks` directory for specialized evaluations:
- Protein-small molecule: Pocket classification & Virtual screening (LIT-PCBA).
- Peptide-HLA: Binding prediction (HLA3DB).
- Protein-cyclic peptide: Target region prediction (CPset).
- Antibody-antigen: Antibody-specific epitope identification (SAbDab).
To-do list:
- [ ] Upload `downstream_data` and notebooks
