Skip to content

BDP-Identifier: Genomic Language Model-Based Prediction of Bidirectional Promoter Activity in Plant Species

Notifications You must be signed in to change notification settings

AIBreeding/BDP-identifier

Repository files navigation

BDP-Identifier: Genomic Language Model-Based Prediction of Bidirectional Promoter Activity

This repository contains the code and data associated with the study "Genomic language model-driven decoding of gene regulation: a case-study on predicting bidirectional promoter activity across plant species" (https://github.com/AIBreeding/BDP-identifier). The workflow leverages genomic language models (gLMs) and machine learning (ML) to predict bidirectional promoter activity (BDP activity) from DNA sequences in plant species.


📁 Directory Structure

BDP-Identifier/  
├── 01.generate_input_sequence.py        # Generate input DNA sequences for BDP candidates  
├── 02.generate_embedding.py             # Generate Evo2 embeddings for DNA sequences  
├── 03.grid_search_emb_semiML.py         # Grid search for optimal ML model parameters using embeddings  
├── 04.emb_ML.py                         # Train ML models using Evo2 embeddings  
├── 05.freq_ML.py                        # Train ML models using di-nucleotide frequency features  
├── BDP_candidates.csv                   # Dataset of candidate BDPs from Arabidopsis, rice, maize, and wheat  
└── README.md                            # This file

🧪 Overview of the Workflow

• Identify BDP Candidates:
   • Extract DNA sequences flanking bidirectional promoters (≤1,000 bp) from plant genomes (Arabidopsis, rice, maize, wheat).
   • Quantify BDP activity using Pearson correlation coefficients (PCC) from RNA-seq data.
• Feature Generation:
   • Evo2 Embeddings: Use the pre-trained genomic language model Evo2 to generate position-specific embeddings.
   • Di-Nucleotide Frequencies: Compute di-nucleotide frequencies from six genomic regions (CDS, UTRs, introns, etc.).
• Model Training:
   • Train LightGBM models to predict BDP activity using embeddings or di-nucleotide frequency features.
   • Evaluate performance using AUROC, Accuracy, and F1-score.
• Cross-Species Generalization:
   • Test models trained on 1-3 species to predict BDP activity in a held-out species (e.g., Spartina alterniflora).
• Experimental Validation:
   • Validate predicted high-activity BDPs using transient dual-reporter assays in Nicotiana benthamiana.


🛠️ Installation & Requirements

• Python Environment:
   • Python 3.8+
   • Required packages: numpy, pandas, scikit-learn, lightgbm, torch, biopython
• Evo2 Model:
   • Download the pre-trained Evo2_7B model (https://github.com/your-repo/evo2-models).


🚀 Usage

• Generate Input Sequences:

python 01.generate_input_sequence.py --species Arabidopsis --output_dir data/sequences  

• Generate Evo2 Embeddings:

python 02.generate_embedding.py --input_file data/sequences/Arabidopsis.bed --output_file data/embeddings/Arabidopsis.npy

• Train ML Models:   • Using embeddings:

python 04.emb_ML.py --embedding_file data/embeddings/Arabidopsis.npy --label_file data/labels/Arabidopsis.csv  

  • Using di-nucleotide frequencies:

python 05.freq_ML.py --frequency_file data/frequencies/Arabidopsis.csv --label_file data/labels/Arabidopsis.csv  

📊 Data Description

• BDP_candidates.csv:
   • Contains 4,524 high-confidence BDP candidates from Arabidopsis, rice, maize, and wheat.
• Public RNA-seq Data:
   • Expression data from the Plant Public RNA-seq Database (PPRD) (Yu et al., 2022).


📝 Cite This Work

If you use this code or data, please cite our paper:
Genomic language model-driven decoding of gene regulation: a case-study on predicting bidirectional promoter activity across plant species [DOI: XXXXXXX]


📧 Contact

For questions or collaboration requests, contact:
Huihui Li - lihuihui@caas.cn


🌟 Acknowledgments

• Evo2 (Benegas et al., 2025) for the genomic language model.
• PPRD (Yu et al., 2022) for public RNA-seq data.
• LightGBM (Ke et al., 2017) for efficient gradient boosting.


License: MIT License.

About

BDP-Identifier: Genomic Language Model-Based Prediction of Bidirectional Promoter Activity in Plant Species

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages