A Machine Learning Framework for Comprehensive Classification of Plant Transcription Factors

Bioinformatics-UM6P/MegaPlantTF

Webserver available online at: https://bioinformatics.um6p.ma/MegaPlantTF
MegaPlantTF: a comprehensive machine learning framework for the identification and classification of plant transcription factors.

DOI: 10.1093/bioinformatics/btaf678


MegaPlantTF

MegaPlantTF is the first machine learning–based framework designed to identify and classify plant transcription factors (TFs) across multiple species. The project leverages curated data from PlantTFDB and advanced k-mer–based feature representations to train robust, family-specific binary classifiers. With MegaPlantTF, you can:

  • Predict Transcription Factors: Identify and classify TF families from plant proteomes using pretrained binary and stacking models.
  • Comprehensive Evaluation: Generate detailed classification reports with accuracy, precision, recall, F1-score, and confidence thresholds.
  • Flexible Inference Options: Apply max-voting or two-stage stacking classifiers for improved family-level predictions.
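To make the two ideas above concrete, here is a minimal sketch of k-mer counting on a protein sequence and of max-voting over per-family scores. It is not the project's implementation: the function names `kmer_counts` and `max_vote`, the 20-letter alphabet, the 0.5 threshold, and the example scores are all assumptions chosen for illustration.

```python
from collections import Counter

# Assumption: only the 20 standard amino acids are counted.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers, skipping any k-mer with a non-standard residue."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(residue in AMINO_ACIDS for residue in kmer):
            counts[kmer] += 1
    return counts


def max_vote(family_scores, threshold=0.5):
    """Return the family with the highest score, or None if it is below the threshold."""
    if not family_scores:
        return None
    family, score = max(family_scores.items(), key=lambda item: item[1])
    return family if score >= threshold else None


if __name__ == "__main__":
    # Toy sequence and hypothetical scores, one per family-specific binary classifier.
    print(kmer_counts("MKVLAMKV", k=3))
    print(max_vote({"bHLH": 0.91, "MYB": 0.40, "WRKY": 0.12}))
```

In this sketch a sequence whose best score stays below the threshold is left unassigned, which is one simple way to treat it as a non-TF; the actual decision rule used by MegaPlantTF may differ.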

Step 1 - Install MegaPlantTF Conda Environment

Step 1: Create & Activate Conda Environment

Open a terminal, change into the MegaPlantTF folder, and create the MegaPlantTF environment from the provided YAML file.

cd MegaPlantTF
conda env create -f MegaPlantTF.yml

Activate the MegaPlantTF environment.

conda activate MegaPlantTF

Step 2: Register Environment in Jupyter

python -m ipykernel install --user --name MegaPlantTF --display-name "MegaPlantTF"

Step 2 - Use MegaPlantTF for TF prediction in plants

1. Using the online web server

📦 Quick Start

The easiest way to use MegaPlantTF is through the online web server available at: https://bioinformatics.um6p.ma/MegaPlantTF. You can also watch a short demo showing how it works below:

Watch the video

2. Running locally

Before proceeding, make sure you’ve completed Step 1 and correctly set up the MegaPlantTF conda environment.
In this step, you’ll download the pretrained model weights, copy them to the right folders, and start the prediction workflow.

Install Git Large File Storage (Git LFS) so that the model weights can be downloaded. The command below is for Debian/Ubuntu systems.

sudo apt-get install git-lfs
git lfs install

Download pretrained model weights and test set

# cd into the MegaPlantTF folder if you are not already there
cd MegaPlantTF

# Download all the pretrained models into a temporary folder
# (with Git LFS installed, a plain `git clone` fetches the LFS files; `git lfs clone` is deprecated)
tmpdir="$(mktemp -d)"
git clone https://huggingface.co/Genereux-akotenou/genomics-tf-prediction "$tmpdir/repo"

# Copy the models from the temporary folder into the `models` folder in MegaPlantTF
# (note: --delete removes files in the destination that are not in the source)
rsync -av --delete "$tmpdir/repo/Binary-Classifier/" "./models/Binary-Classifier/"
rsync -av --delete "$tmpdir/repo/MetaClassifier/"   "./models/MetaClassifier/"
rm -rf "$tmpdir"
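After copying, it can be worth confirming that the two model folders actually contain weights. The sketch below is an optional check, not part of MegaPlantTF: the helper names are hypothetical, and it only counts files and flags Git LFS pointer stubs (small text files that begin with the LFS spec line and appear when the real binaries were not fetched).

```python
from pathlib import Path

# Git LFS pointer files start with this line instead of the real binary content.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like an unfetched Git LFS pointer."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def summarize_model_dirs(root=".", subdirs=("Binary-Classifier", "MetaClassifier")):
    """Count files and LFS pointer stubs under <root>/models/<subdir>."""
    summary = {}
    for name in subdirs:
        folder = Path(root) / "models" / name
        # A missing folder is reported as zero files.
        files = [p for p in folder.rglob("*") if p.is_file()] if folder.is_dir() else []
        summary[name] = {
            "files": len(files),
            "lfs_pointers": sum(is_lfs_pointer(p) for p in files),
        }
    return summary


if __name__ == "__main__":
    # Run from the MegaPlantTF root folder.
    for name, info in summarize_model_dirs().items():
        print(f"models/{name}: {info['files']} files, {info['lfs_pointers']} LFS pointer stubs")
```

A folder with zero files, or with any pointer stubs, suggests the download step should be repeated after `git lfs install`.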

Start JupyterLab

Once setup is complete, start JupyterLab to explore the example notebooks.

jupyter-lab

Identify and classify TFs

You can start directly with the notebook test/1-Start-With-MegaPlantTF.ipynb, or you can create your own Python script or notebook. In the latter case, first make sure the project's root directory is added to sys.path:

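import os
import sys

# Assumes the script or notebook runs from a subfolder of the repository (for example test/),
# so the parent of the working directory is the project root.
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)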

Then, import the predictor classes and run inference:

from pretrained.predictor import SingleKModel, MultiKModel

# Example for SingleKModel
kmodel = SingleKModel(kmer_size=3)
kmodel.load("Ach_pep_kiwi.fas", format="fasta")
genboard = kmodel.predict()
genboard.display()

[Image: genboard output (beta)]
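Before passing a proteome to the predictor, it can help to confirm that the FASTA file parses and contains the sequences you expect. The following is a minimal standalone sketch and does not use or describe MegaPlantTF's own loader; `read_fasta` is a hypothetical helper, and duplicate headers simply overwrite earlier records.

```python
import sys


def read_fasta(path):
    """Parse a FASTA file into a {header: sequence} dictionary."""
    records = {}
    header = None
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                # Sequence lines are accumulated until the next header.
                records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    # Usage: python check_fasta.py Ach_pep_kiwi.fas
    if len(sys.argv) > 1:
        sequences = read_fasta(sys.argv[1])
        lengths = [len(seq) for seq in sequences.values()]
        print(f"{len(sequences)} sequences")
        if lengths:
            print(f"shortest: {min(lengths)}, longest: {max(lengths)}")
```

The script name `check_fasta.py` in the usage comment is only a placeholder.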



Step 3 - Inspect and Reproduce our Results / Train on your own data

Results are stored in the models/Train-Reports and notebook/Output directories. Here is what you will find:

  1. Reports:
    • Located in models/Train-Reports (the actual reports are available there).
    • Each report is specific to a gene family.
    • Reports include:
      • Model architecture and parameters.
      • Learning curve.
      • Train set class distribution.
      • Classification metrics: F1 score, recall, accuracy, precision.
      • Confusion matrix for each k-mer size.
  2. Model Files:
    • Located in notebook/Output/Model (after training).
    • Inside this directory, you will find folders named after gene families.
    • Each gene family folder contains:
      • Model .h5 files for various k-mer sizes.
      • feature_mask.json files.
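For readers interpreting the classification metrics listed in the reports, the sketch below shows how accuracy, precision, recall, and F1 score follow from the four cells of a binary confusion matrix. It uses the standard definitions and is not taken from the project's code; the function name and the example counts are invented for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts for one family-specific classifier.
    for name, value in binary_metrics(tp=8, fp=2, fn=2, tn=88).items():
        print(f"{name}: {value:.3f}")
```

With imbalanced classes, as is typical when one family is compared against all others, accuracy alone can look high even when recall is poor, which is why the reports list all four metrics.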

Build the pretrained models

Move into the notebook folder, which contains the Python file named pyrunner.

cd notebook

The Python file should look like the listing below. Set multiprocess=True to run the notebooks in parallel, or multiprocess=False to run them one after another.

import os
import json
import multiprocessing
import papermill as pm

# Utils
def run_notebook(gene):
    input_notebook = "01-approach2_kmer_neural_network.ipynb"
    notebook_name = os.path.splitext(input_notebook)[0]
    gene_ = gene.replace('/', '__')
    output_notebook = f"AutoSave/{notebook_name}-{gene_}.ipynb"

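    # Run the notebook for the specified gene family.
    # `gene_familly` must match the parameter name defined in the notebook.
    pm.execute_notebook(
        input_notebook,
        output_notebook,
        parameters=dict(gene_familly=gene),
        timeout=-1,
        kernel_name='pygenomics'  # must be an installed Jupyter kernel; Step 1 registers one named MegaPlantTF
    )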

if __name__ == "__main__":
    # List of genes 
    gene_info_path = "../data/gene_info.json"
    with open(gene_info_path, 'r') as json_file:
        gene_info = json.load(json_file)

    # Output directory
    os.makedirs("AutoSave", exist_ok=True)

    # Execution mode: True runs the notebooks in parallel, False runs them sequentially
    multiprocess = False

    if multiprocess:
        # Run notebooks concurrently using multiprocessing
        num_processes = multiprocessing.cpu_count()
        print('NUMBER OF PROCESSES: ', num_processes)
        with multiprocessing.Pool(num_processes) as pool:
            pool.map(run_notebook, gene_info.keys())
    else:
        # Run notebooks sequentially
        for gene in gene_info.keys():
            run_notebook(gene)

Then run the file and wait until the program finishes.

python pyrunner
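Training every gene family can take a long time, so it may be useful to see which families still lack an output notebook in AutoSave. The sketch below reuses the output naming scheme from the runner script; the helper names are hypothetical. Because the output notebook may be written before execution completes, an existing file does not by itself prove that the run finished.

```python
import json
import os

# Same notebook base name and output folder as in the runner script.
NOTEBOOK_NAME = "01-approach2_kmer_neural_network"


def expected_output(gene, out_dir="AutoSave"):
    """Build the output notebook path the runner would use for this gene family."""
    gene_ = gene.replace("/", "__")
    return os.path.join(out_dir, f"{NOTEBOOK_NAME}-{gene_}.ipynb")


def pending_genes(genes, out_dir="AutoSave"):
    """Return the gene families that have no output notebook yet."""
    return [g for g in genes if not os.path.exists(expected_output(g, out_dir))]


if __name__ == "__main__":
    # Run from the notebook folder, like the runner script.
    gene_info_path = "../data/gene_info.json"
    if os.path.exists(gene_info_path):
        with open(gene_info_path, "r") as json_file:
            gene_info = json.load(json_file)
        remaining = pending_genes(gene_info.keys())
        print(f"{len(remaining)} gene families without an output notebook")
        for gene in remaining:
            print(" -", gene)
```

The list returned by `pending_genes` could be passed to the runner's loop in place of `gene_info.keys()` to resume an interrupted run.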

Citation

If you have used MegaPlantTF in your research, please kindly cite the following publication:

@article{10.1093/bioinformatics/btaf678,
    author = {Akotenou, Genereux and Hassan, Asmaa H and Mokhtar, Morad M and El Allali, Achraf},
    title = {MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors},
    journal = {Bioinformatics},
    volume = {42},
    number = {1},
    pages = {btaf678},
    year = {2025},
    month = {12},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btaf678},
    url = {https://doi.org/10.1093/bioinformatics/btaf678},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/42/1/btaf678/66129808/btaf678.pdf},
}
