SUPERMAGO: Protein Function Prediction based on Transformers Embeddings

The paper is available here.

Description

SUPERMAGO is a machine learning-based approach designed for protein function prediction using embeddings generated by Transformer-based models, multilayer perceptrons trained on these embeddings, and a stacking classifier. SUPERMAGO+ is an ensemble method that combines predictions from SUPERMAGO and DIAMOND, a local alignment tool. Both approaches predict protein function for the Biological Process Ontology (BPO), Cellular Component Ontology (CCO), and Molecular Function Ontology (MFO).

Installation

To install and set up SUPERMAGO and SUPERMAGO+, follow the steps below:

Clone the repository:

git clone https://github.com/your-username/supermago.git
cd supermago

Install the dependencies:

pip install -r requirements.txt

Dataset

The dataset for this work is available here. The IC values used in evaluation are available here.

Models

Our layer-based models for each ontology are available here.

Predictions

The predictions of SUPERMAGO and SUPERMAGO+ are available here.

Reproducibility

Navigate to the src folder and run setup.py.
Download the dataset and IC values, and place them in the base folder.
From the src folder, run the following command:

python main.py --ont ontology

where ontology can be bp, cc, or mf for Biological Process, Cellular Component and Molecular Function, respectively.
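To run all three ontologies in one go, the command above can be wrapped in a small driver. The following is a minimal sketch, assuming it is started from the src folder; the helper names `build_command` and `run_all` are hypothetical and not part of the repository. Only the `--ont` flag and its three values come from this README.

```python
import subprocess
import sys

# Values accepted by --ont according to this README.
ONTOLOGIES = ("bp", "cc", "mf")


def build_command(ontology):
    """Return the argv list for one run of main.py (hypothetical helper)."""
    if ontology not in ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return [sys.executable, "main.py", "--ont", ontology]


def run_all(dry_run=True):
    """Print (dry_run=True) or execute the pipeline for every ontology."""
    commands = [build_command(ont) for ont in ONTOLOGIES]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first ontology whose run fails.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the three commands, which is a cheap way to check the setup before starting a long run.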

main.py executes the pipeline of SUPERMAGO and SUPERMAGO+ as follows:

extract.py extracts the embeddings given the model name (esm or t5) and ontology (bp, cc or mf).
layer_classificaton.py runs the neural network for a specific layer (36, 35, 34, 33, 32 for ESM2 T36; 24, 23, 22, 21, 20 for ProtT5) and ontology (bp, cc or mf).
stacking.py runs the stacking model for a specific ontology (bp, cc or mf) and generates the prediction of SUPERMAGO.
diamond.py runs DIAMOND for a specific ontology (bp, cc or mf).
ensemble.py generates the final prediction of SUPERMAGO+ for a specific ontology (bp, cc or mf).
evaluate.py evaluates the predictions.
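The model, layer, and ontology options listed above define a grid of per-layer classifiers. The sketch below only enumerates that grid; it assumes every listed layer is trained for every ontology, and the names `LAYERS`, `layer_jobs`, and `all_jobs` are hypothetical rather than taken from the repository.

```python
# Layers named in this README for each embedding model.
LAYERS = {
    "esm": [36, 35, 34, 33, 32],  # ESM2 T36
    "t5": [24, 23, 22, 21, 20],   # ProtT5
}
ONTOLOGIES = ["bp", "cc", "mf"]


def layer_jobs():
    """List the (model, layer) pairs that feed the stacking model."""
    return [(model, layer) for model, layers in LAYERS.items() for layer in layers]


def all_jobs():
    """List every (ontology, model, layer) combination in the grid."""
    return [(ont, model, layer) for ont in ONTOLOGIES for model, layer in layer_jobs()]


if __name__ == "__main__":
    for ont, model, layer in all_jobs():
        print(ont, model, layer)
```

Under this assumption each ontology has ten layer classifiers (five per model), which gives a rough sense of the training cost before the stacking step.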

Dataset Adaptation

If you need to run SUPERMAGO and SUPERMAGO+ on your own dataset, you must create a dataset with the same structure as ours. This includes a CSV file for each ontology, with the first column containing the protein ID, the second column containing the protein sequence, and the remaining columns containing terms in one-hot encoding format. You should also calculate the IC values for evaluation and save them in a CSV file.
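The layout described above can be produced with the standard library alone. This is a minimal sketch under several assumptions: the header names `protein_id` and `sequence`, the presence of a header row, and the helper names are all hypothetical, so compare against the published dataset before relying on them. The IC formula shown is the simple frequency-based one, IC(t) = log2(N / n_t), where N is the number of proteins and n_t the number annotated with term t; the README does not say which IC definition the authors use, so treat this as a placeholder.

```python
import csv
import io
import math


def write_dataset(rows, terms, handle):
    """Write one ontology's CSV: ID, sequence, then one 0/1 column per term.

    rows is a list of (protein_id, sequence, set_of_terms) tuples.
    """
    writer = csv.writer(handle)
    writer.writerow(["protein_id", "sequence", *terms])  # header names are assumed
    for protein_id, sequence, annotated in rows:
        one_hot = [int(term in annotated) for term in terms]
        writer.writerow([protein_id, sequence, *one_hot])


def information_content(rows, terms):
    """Frequency-based IC per term; terms with no annotations are skipped."""
    total = len(rows)
    ic = {}
    for term in terms:
        count = sum(term in annotated for _, _, annotated in rows)
        if count:
            ic[term] = math.log2(total / count)
    return ic


if __name__ == "__main__":
    # Toy data; the sequences and term assignments are placeholders.
    terms = ["GO:0000001", "GO:0000002"]
    rows = [
        ("P1", "MKT", {"GO:0000001"}),
        ("P2", "MAA", {"GO:0000001", "GO:0000002"}),
    ]
    buffer = io.StringIO()
    write_dataset(rows, terms, buffer)
    print(buffer.getvalue())
    print(information_content(rows, terms))
```

Writing to a file instead of `io.StringIO` only requires passing a handle opened with `open(path, "w", newline="")`.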
