The paper is available here.
SUPERMAGO is a machine learning-based approach designed for protein function prediction using embeddings generated by Transformer-based models, multilayer perceptrons trained on these embeddings, and a stacking classifier. SUPERMAGO+ is an ensemble method that combines predictions from SUPERMAGO and DIAMOND, a local alignment tool. Both approaches predict protein function for the Biological Process Ontology (BPO), Cellular Component Ontology (CCO), and Molecular Function Ontology (MFO).
To install and set up SUPERMAGO and SUPERMAGO+, follow the steps below:
- Clone the repository:

      git clone https://github.com/your-username/supermago.git
      cd supermago

- Install the dependencies:

      pip install -r requirements.txt

The dataset for this work is available here. The IC values used in evaluation are available here.
Our layer-base models for each ontology are available here.
The predictions of SUPERMAGO and SUPERMAGO+ are available here.
- Navigate to the `src` folder and run `setup.py`.
- Download the dataset and IC values, and place them into the `base` folder.
- In the `src` folder, run the following command:

      python main.py --ont ontology

where `ontology` can be `bp`, `cc`, or `mf` for Biological Process, Cellular Component, and Molecular Function, respectively.
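As an illustration of the `--ont` argument described above, the following is a minimal sketch of how the three ontology codes could be validated with `argparse`. The `ONTOLOGIES` mapping and the `parse_args` helper are hypothetical names introduced for this example; the actual argument handling in `main.py` may differ.

```python
import argparse

# Ontology codes accepted by --ont, mapped to their full names.
ONTOLOGIES = {
    "bp": "Biological Process",
    "cc": "Cellular Component",
    "mf": "Molecular Function",
}


def parse_args(argv=None):
    """Parse the --ont flag (hypothetical stand-in for the handling in main.py)."""
    parser = argparse.ArgumentParser(
        description="Run SUPERMAGO and SUPERMAGO+ for one ontology."
    )
    parser.add_argument(
        "--ont",
        required=True,
        choices=sorted(ONTOLOGIES),
        help="ontology code: bp, cc, or mf",
    )
    return parser.parse_args(argv)


# Example: the equivalent of `python main.py --ont mf`.
args = parse_args(["--ont", "mf"])
print(f"Selected ontology: {ONTOLOGIES[args.ont]}")
```

Restricting the flag with `choices` makes any code other than `bp`, `cc`, or `mf` fail immediately with a usage message, before any embeddings are extracted.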
main.py executes the pipeline of SUPERMAGO and SUPERMAGO+ as follows:
- `extract.py` extracts the embeddings given the model name (`esm` or `t5`) and ontology (`bp`, `cc` or `mf`).
- `layer_classificaton.py` runs the neural network for a specific layer (`36`, `35`, `34`, `33`, `32` for ESM2 T36; `24`, `23`, `22`, `21`, `20` for ProtT5) and ontology (`bp`, `cc` or `mf`).
- `stacking.py` runs the stacking model for a specific ontology (`bp`, `cc` or `mf`) and generates the prediction of SUPERMAGO.
- `diamond.py` runs DIAMOND for a specific ontology (`bp`, `cc` or `mf`).
- `ensemble.py` generates the final prediction of SUPERMAGO+ for a specific ontology (`bp`, `cc` or `mf`).
- `evaluate.py` evaluates the predictions.
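To make the layer classification step more concrete, the sketch below enumerates the (model, layer, ontology) combinations implied by the list above: five layers for ESM2 T36 and five for ProtT5. The `LAYERS` table and the `layer_jobs` function are hypothetical names used only for illustration; they are not part of the repository, and the scripts may organize these runs differently.

```python
# Layers listed above for each embedding model (the last five of each).
LAYERS = {
    "esm": [36, 35, 34, 33, 32],  # ESM2 T36
    "t5": [24, 23, 22, 21, 20],   # ProtT5
}

VALID_ONTOLOGIES = ("bp", "cc", "mf")


def layer_jobs(ontology):
    """Return one (model, layer, ontology) triple per layer-level network."""
    if ontology not in VALID_ONTOLOGIES:
        raise ValueError(f"unknown ontology: {ontology!r}")
    return [
        (model, layer, ontology)
        for model, layers in LAYERS.items()
        for layer in layers
    ]


jobs = layer_jobs("bp")
for model, layer, ont in jobs:
    print(f"model={model} layer={layer} ontology={ont}")
print(f"{len(jobs)} (model, layer) combinations for ontology 'bp'")
```

Under this reading, each ontology involves ten layer-level networks whose outputs are then available to the stacking step.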
If you need to run SUPERMAGO and SUPERMAGO+ on your own dataset, you must create a dataset with the same structure as ours. This includes a CSV file for each ontology, with the first column containing the protein ID, the second column containing the protein sequence, and the remaining columns containing the terms in one-hot encoding format. You should also calculate the IC values for evaluation and save them in a CSV file.
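The sketch below shows one way to produce files with this layout from a small set of annotations. Everything specific in it is an assumption made for illustration: the protein IDs, sequences, and GO terms are invented, and the column headers (`protein_id`, `sequence`, `term`, `ic`) and file names (`bp.csv`, `ic_bp.csv`) are hypothetical, so check the provided dataset for the exact names expected by the code. The IC here is a simple frequency-based estimate, log2(N / count); the definition used for the released IC values may differ (for example, by conditioning on parent terms).

```python
import csv
import math
import tempfile
from pathlib import Path

# Toy annotations: protein ID -> (sequence, set of GO terms). All values are invented.
proteins = {
    "P00001": ("MKTAYIAK", {"GO:0000001", "GO:0000002"}),
    "P00002": ("MSLLTEVE", {"GO:0000001"}),
    "P00003": ("MADEEKLP", {"GO:0000001", "GO:0000003"}),
}
terms = sorted({t for _, annotated in proteins.values() for t in annotated})

# Dataset layout: protein ID, sequence, then one 0/1 column per term.
header = ["protein_id", "sequence", *terms]
rows = [
    [pid, seq, *(int(t in annotated) for t in terms)]
    for pid, (seq, annotated) in proteins.items()
]

# Frequency-based IC (an assumption): log2(N / number of proteins with the term).
counts = {t: sum(t in annotated for _, annotated in proteins.values()) for t in terms}
ic = {t: math.log2(len(proteins) / counts[t]) for t in terms}


def write_csv(path, header, rows):
    """Write a header line followed by the given rows to a CSV file."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)


# A temporary directory is used here; replace it with your own dataset folder.
out_dir = Path(tempfile.mkdtemp())
write_csv(out_dir / "bp.csv", header, rows)
write_csv(out_dir / "ic_bp.csv", ["term", "ic"], ic.items())
print(f"Wrote {out_dir / 'bp.csv'} and {out_dir / 'ic_bp.csv'}")
```

With this estimate, a term annotated on every protein gets an IC of 0, and rarer terms get larger values.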