GitHub - Bioinformatics-UM6P/MegaPlantTF: A Machine Learning Framework for Comprehensive Classification of Plant Transcription Factors

Webserver available online at: https://bioinformatics.um6p.ma/MegaPlantTF
MegaPlantTF: a comprehensive machine learning framework for the identification and classification of plant transcription factors.

MegaPlantTF

MegaPlantTF is the first machine learning–based framework designed to identify and classify plant transcription factors (TFs) across multiple species. The project leverages curated data from PlantTFDB and advanced k-mer–based feature representations to train robust, family-specific binary classifiers. With MegaPlantTF, you can:

Predict Transcription Factors: Identify and classify TF families from plant proteomes using pretrained binary and stacking models.
Comprehensive Evaluation: Generate detailed classification reports with accuracy, precision, recall, F1-score, and confidence thresholds.
Flexible Inference Options: Apply max-voting or two-stage stacking classifiers for improved family-level predictions.
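
To make the max-voting option more concrete, here is a minimal sketch of the idea: each family-specific binary classifier yields a score for a sequence, and the family with the highest score wins if it clears a confidence threshold. The function name `max_vote`, the 0.5 threshold, and the scores are illustrative assumptions, not MegaPlantTF's actual implementation or output.

```python
from typing import Dict, Optional, Tuple


def max_vote(family_scores: Dict[str, float],
             threshold: float = 0.5) -> Tuple[Optional[str], float]:
    """Pick the family whose binary classifier gave the highest score.

    Returns (None, best_score) when no classifier reaches `threshold`,
    i.e. the sequence is not assigned to any TF family.
    """
    if not family_scores:
        return None, 0.0
    best_family = max(family_scores, key=family_scores.get)
    best_score = family_scores[best_family]
    if best_score < threshold:
        return None, best_score
    return best_family, best_score


# Made-up scores for one protein sequence (illustration only).
scores = {"bHLH": 0.91, "MYB": 0.40, "WRKY": 0.05}
print(max_vote(scores))                          # confident call
print(max_vote({"bHLH": 0.30, "MYB": 0.20}))     # below threshold
```

A two-stage stacking classifier would instead feed the full vector of per-family scores to a second model rather than taking the maximum directly.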

Step 1: Create & Activate Conda Environment

Open a terminal in the folder containing the cloned MegaPlantTF repository, move into it, and create the MegaPlantTF environment from the provided YAML file.

cd MegaPlantTF
conda env create -f MegaPlantTF.yml

Activate the MegaPlantTF environment.

conda activate MegaPlantTF

Step 2: Register Environment in Jupyter

python -m ipykernel install --user --name MegaPlantTF --display-name "MegaPlantTF"
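
If you want to confirm that both steps worked, the following sketch may help. It reads the `CONDA_DEFAULT_ENV` variable that conda sets on activation, and checks a kernel name against the JSON printed by `jupyter kernelspec list --json`. The helper names are hypothetical, and the comparison is case-insensitive on the assumption that Jupyter may normalize the case of the kernel name.

```python
import json
import os


def active_conda_env() -> str:
    """Return the active conda environment name, or '' if none is active."""
    return os.environ.get("CONDA_DEFAULT_ENV", "")


def kernel_is_registered(kernelspec_json: str, name: str) -> bool:
    """Check `name` against the output of `jupyter kernelspec list --json`."""
    specs = json.loads(kernelspec_json).get("kernelspecs", {})
    return name.lower() in {kernel.lower() for kernel in specs}


# Sample JSON in the shape produced by `jupyter kernelspec list --json`
# (trimmed; real entries also carry "resource_dir" and "spec" fields).
sample = '{"kernelspecs": {"python3": {}, "megaplanttf": {}}}'
print("active env:", active_conda_env() or "(none)")
print("kernel found:", kernel_is_registered(sample, "MegaPlantTF"))
```

In practice you would pass the real command output, for example via `subprocess.run(["jupyter", "kernelspec", "list", "--json"], capture_output=True, text=True).stdout`.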

1. Running online Webserver

📦 Quick Start

The easiest way to use MegaPlantTF is through the online web server available at: https://bioinformatics.um6p.ma/MegaPlantTF. A short demo in the repository README shows how it works.

2. Running locally

Before proceeding, make sure you’ve completed Step 1 and correctly set up the MegaPlantTF conda environment.
In this step, you’ll download the pretrained model weights, copy them to the right folders, and start the prediction workflow.

Install Git Large File Storage to be able to download the model weights

sudo apt-get install git-lfs
git lfs install

Download Pretrained Model Weights & testset for lab

# cd into the MegaPlantTF folder if you are not already there
cd MegaPlantTF

# Download all the binary models in temp folder
tmpdir="$(mktemp -d)"
# `git lfs clone` is deprecated; with Git LFS installed, a plain clone also fetches the LFS files
git clone https://huggingface.co/Genereux-akotenou/genomics-tf-prediction "$tmpdir/repo"

# Copy models from the temp folder to the `models` folder in MegaPlantTF
rsync -av --delete "$tmpdir/repo/Binary-Classifier/" "./models/Binary-Classifier/"
rsync -av --delete "$tmpdir/repo/MetaClassifier/"   "./models/MetaClassifier/"
rm -rf "$tmpdir"
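
A common failure mode at this stage is ending up with Git LFS pointer stubs instead of the real weights, for example when Git LFS was not installed before cloning. The sketch below, run from the MegaPlantTF folder, counts the files in the two model folders and flags any that still start with the LFS pointer header. The function names are hypothetical helpers, not part of MegaPlantTF.

```python
from pathlib import Path
from typing import List

# Git LFS pointer files are small text files that begin with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(folder: Path) -> List[Path]:
    """Return files under `folder` that are still Git LFS pointer stubs."""
    pointers = []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            with path.open("rb") as handle:
                if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                    pointers.append(path)
    return pointers


def check_models(root: Path = Path("models")) -> None:
    """Print a short summary for the two folders copied by rsync."""
    for sub in ("Binary-Classifier", "MetaClassifier"):
        folder = root / sub
        if not folder.is_dir():
            print(f"missing: {folder}")
            continue
        n_files = sum(1 for p in folder.rglob("*") if p.is_file())
        stubs = find_lfs_pointers(folder)
        print(f"{folder}: {n_files} files, {len(stubs)} unresolved LFS pointers")


if __name__ == "__main__":
    check_models()
```

If any unresolved pointers are reported, re-running `git lfs install` and repeating the download is a reasonable first thing to try.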

Start Jupyter-lab

Once setup is complete, start JupyterLab to explore the example notebooks.

jupyter-lab

Identify and classify TFs

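You can start directly with the notebook test/1-Start-With-MegaPlantTF.ipynb, or create your own Python script or notebook. In that case, first make sure the project's root directory is added to sys.path. The snippet below assumes your script or notebook lives one level below the project root (for example in test/):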

import sys, os
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

Then, import the predictor classes and run inference:

from pretrained.predictor import SingleKModel, MultiKModel

# Example for SingleKModel
kmodel = SingleKModel(kmer_size=3)
kmodel.load("Ach_pep_kiwi.fas", format="fasta")
genboard = kmodel.predict()
genboard.display()
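
To give an intuition for what `kmer_size=3` refers to, here is a small, self-contained sketch that parses FASTA text and counts overlapping 3-mers in a protein sequence. It is only an illustration of k-mer counting in general: the helper names and the toy sequence are assumptions, and the feature extraction actually used inside `SingleKModel` may differ.

```python
from collections import Counter
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def kmer_counts(sequence: str, k: int = 3) -> Dict[str, int]:
    """Count overlapping k-mers; sequences shorter than k give an empty dict."""
    return dict(Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1)))


# Toy protein sequence (not real data).
fasta = ">seq1 toy protein\nMKVLMKV\n"
for name, seq in read_fasta(fasta):
    print(name, kmer_counts(seq, k=3))
```

Running the same check on your own FASTA file before calling `kmodel.load(...)` can also be a quick way to spot empty or truncated records.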

You can find the results in the notebook/Output directory. Here's what you will find:

Reports:
- Located in models/Train-Reports (the actual reports are available there).
- Each report is specific to a gene family.
- Reports include:
  - Model architecture and parameters.
  - Learning curve.
  - Train set class distribution.
  - Classification metrics: F1 score, recall, accuracy, precision.
  - Confusion matrix for each k-mer size.
Model Files:
- Located in notebook/Output/Model (after training).
- Inside this directory, you will find folders named after gene families.
- Each gene family folder contains:
  - Model .h5 files for various k-mer sizes.
  - feature_mask.json files.
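
For readers less familiar with the classification metrics listed in the reports, the sketch below shows how accuracy, precision, recall, and F1 score follow from the four cells of a binary confusion matrix. The counts are invented for illustration and do not come from any MegaPlantTF report; the function name is a hypothetical helper.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts for one family-vs-rest classifier.
metrics = binary_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because each family classifier sees far more negatives than positives, accuracy alone can look high even when precision or recall is modest, which is why the reports list all four.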

Build pretrained model

Move into the notebook folder, then execute the Python file named pyrunner.

cd notebook

The Python file should look like this. Set multiprocess=True to run the notebooks in parallel with multiprocessing, or multiprocess=False to run them sequentially.

import os
import json
import multiprocessing
import papermill as pm

# Utils
def run_notebook(gene):
    input_notebook = "01-approach2_kmer_neural_network.ipynb"
    notebook_name = os.path.splitext(input_notebook)[0]
    gene_ = gene.replace('/', '__')
    output_notebook = f"AutoSave/{notebook_name}-{gene_}.ipynb"

    # Run the notebook with the specified gene
    pm.execute_notebook(
        input_notebook,
        output_notebook,
        parameters=dict(gene_familly=gene),
        timeout=-1,
        kernel_name='pygenomics'  # must match a registered Jupyter kernel; Step 2 registers "MegaPlantTF"
    )

if __name__ == "__main__":
    # List of genes 
    gene_info_path = "../data/gene_info.json"
    with open(gene_info_path, 'r') as json_file:
        gene_info = json.load(json_file)

    # Output directory
    os.makedirs("AutoSave", exist_ok=True)

    # Execution mode: True runs the notebooks in parallel, False runs them sequentially
    multiprocess = False

    if multiprocess:
        # Run notebooks concurrently using multiprocessing
        num_processes = multiprocessing.cpu_count()
        print('NUMBER OF PROCESSES: ', num_processes)
        with multiprocessing.Pool(num_processes) as pool:
            pool.map(run_notebook, gene_info.keys())
    else:
        # Run notebooks sequentially
        for gene in gene_info.keys():
            run_notebook(gene)

Next, run this file and wait until the program finishes.

python pyrunner
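
Since training every family can take a long time, it can be useful to see which executed notebooks already exist in AutoSave. The sketch below mirrors the output naming used in `run_notebook` and lists the families that have no output notebook yet. The helper names and the family name `AP2/ERF` are hypothetical, and an existing file does not guarantee a successful run, since papermill may write the output notebook before execution completes.

```python
import os
from typing import Iterable, List


def output_notebook_path(gene: str,
                         input_notebook: str = "01-approach2_kmer_neural_network.ipynb",
                         out_dir: str = "AutoSave") -> str:
    """Rebuild the output path the same way run_notebook does."""
    notebook_name = os.path.splitext(input_notebook)[0]
    gene_ = gene.replace('/', '__')  # '/' is not allowed in file names
    return f"{out_dir}/{notebook_name}-{gene_}.ipynb"


def pending_genes(genes: Iterable[str], out_dir: str = "AutoSave") -> List[str]:
    """Return the gene families with no output notebook in `out_dir`."""
    return [gene for gene in genes
            if not os.path.exists(output_notebook_path(gene, out_dir=out_dir))]


# Hypothetical family name containing a slash.
print(output_notebook_path("AP2/ERF"))
print(pending_genes(["AP2/ERF", "WRKY"]))
```

One possible use is to replace `gene_info.keys()` with `pending_genes(gene_info.keys())` when restarting after an interruption, after deleting any output notebooks from runs known to have failed.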

Citation

If you have used MegaPlantTF in your research, please kindly cite the following publication:

@article{10.1093/bioinformatics/btaf678,
    author = {Akotenou, Genereux and Hassan, Asmaa H and Mokhtar, Morad M and El Allali, Achraf},
    title = {MegaPlantTF: a machine learning framework for comprehensive identification and classification of plant transcription factors},
    journal = {Bioinformatics},
    volume = {42},
    number = {1},
    pages = {btaf678},
    year = {2025},
    month = {12},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btaf678},
    url = {https://doi.org/10.1093/bioinformatics/btaf678},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/42/1/btaf678/66129808/btaf678.pdf},
}
