BDP-Identifier: Genomic Language Model-Based Prediction of Bidirectional Promoter Activity

This repository contains the code and data associated with the study "Genomic language model-driven decoding of gene regulation: a case-study on predicting bidirectional promoter activity across plant species" (https://github.com/AIBreeding/BDP-identifier). The workflow leverages genomic language models (gLMs) and machine learning (ML) to predict bidirectional promoter activity (BDP activity) from DNA sequences in plant species.

📁 Directory Structure

BDP-Identifier/  
├── 01.generate_input_sequence.py        # Generate input DNA sequences for BDP candidates  
├── 02.generate_embedding.py             # Generate Evo2 embeddings for DNA sequences  
├── 03.grid_search_emb_semiML.py         # Grid search for optimal ML model parameters using embeddings  
├── 04.emb_ML.py                         # Train ML models using Evo2 embeddings  
├── 05.freq_ML.py                        # Train ML models using di-nucleotide frequency features  
├── BDP_candidates.csv                   # Dataset of candidate BDPs from Arabidopsis, rice, maize, and wheat  
└── README.md                            # This file

🧪 Overview of the Workflow

• Identify BDP Candidates:
  • Extract DNA sequences flanking bidirectional promoters (≤1,000 bp) from plant genomes (Arabidopsis, rice, maize, wheat).
  • Quantify BDP activity using Pearson correlation coefficients (PCC) from RNA-seq data.
• Feature Generation:
  • Evo2 Embeddings: Use the pre-trained genomic language model Evo2 to generate position-specific embeddings.
  • Di-Nucleotide Frequencies: Compute di-nucleotide frequencies from six genomic regions (CDS, UTRs, introns, etc.).
• Model Training:
  • Train LightGBM models to predict BDP activity using embeddings or di-nucleotide frequency features.
  • Evaluate performance using AUROC, Accuracy, and F1-score.
• Cross-Species Generalization:
  • Test models trained on 1-3 species to predict BDP activity in a held-out species (e.g., Spartina alterniflora).
• Experimental Validation:
  • Validate predicted high-activity BDPs using transient dual-reporter assays in Nicotiana benthamiana.
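The cross-species generalization step can be pictured as a leave-one-species-out split. The following is a minimal sketch only: the record layout and the `species` key are hypothetical and are not taken from the repository scripts, and the study may also hold out species that are absent from the training table (such as Spartina alterniflora).

```python
from collections import defaultdict


def leave_one_species_out(records, species_key="species"):
    """Yield (held_out, train, test) splits, holding out one species at a time.

    `records` is a list of dicts; `species_key` is a hypothetical column name.
    """
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec[species_key]].append(rec)

    for held_out in sorted(by_species):
        # Training set: every record whose species differs from the held-out one.
        train = [r for sp, recs in by_species.items() if sp != held_out for r in recs]
        test = list(by_species[held_out])
        yield held_out, train, test


# Toy records standing in for rows of a candidate table (illustrative only).
records = [
    {"id": "bdp1", "species": "Arabidopsis"},
    {"id": "bdp2", "species": "rice"},
    {"id": "bdp3", "species": "maize"},
    {"id": "bdp4", "species": "wheat"},
    {"id": "bdp5", "species": "rice"},
]

for held_out, train, test in leave_one_species_out(records):
    print(held_out, len(train), len(test))
```

Each split keeps the held-out species entirely out of training, so the test score reflects transfer to an unseen genome rather than memorized species-specific sequence composition.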

🛠️ Installation & Requirements

• Python Environment:
  • Python 3.8+
  • Required packages: numpy, pandas, scikit-learn, lightgbm, torch, biopython
• Evo2 Model:
  • Download the pre-trained Evo2_7B model (https://github.com/your-repo/evo2-models).

🚀 Usage

• Generate Input Sequences:

python 01.generate_input_sequence.py --species Arabidopsis --output_dir data/sequences

• Generate Evo2 Embeddings:

python 02.generate_embedding.py --input_file data/sequences/Arabidopsis.bed --output_file data/embeddings/Arabidopsis.npy
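Position-specific embeddings have one vector per nucleotide, while a gradient-boosting model expects one fixed-length feature vector per sequence. The README does not say how `02.generate_embedding.py` reduces the per-position output, so the mean pooling below is an assumption used only to illustrate one common option. It is written in plain Python and does not call the Evo2 API.

```python
def mean_pool(position_embeddings):
    """Average an L x D list of per-position vectors into a single D-dim vector.

    Mean pooling is an assumed reduction, not necessarily the one the
    repository uses.
    """
    if not position_embeddings:
        raise ValueError("need at least one position")
    n_positions = len(position_embeddings)
    # zip(*rows) iterates over embedding dimensions (columns).
    return [sum(column) / n_positions for column in zip(*position_embeddings)]


# Toy 2-position, 3-dimensional embedding (real embeddings are far larger).
toy = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
print(mean_pool(toy))  # [2.0, 3.0, 4.0]
```

With this kind of reduction, sequences of different lengths (up to 1,000 bp here) all map to vectors of the same dimensionality.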

• Train ML Models:
  • Using embeddings:

python 04.emb_ML.py --embedding_file data/embeddings/Arabidopsis.npy --label_file data/labels/Arabidopsis.csv
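The workflow reports AUROC, Accuracy, and F1-score. As a reference for what those numbers mean, here is a dependency-free sketch of the three metrics for binary labels. The toy labels, the scores, and the 0.5 decision threshold are illustrative assumptions; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as 0.5. This pairwise form is O(n_pos * n_neg), fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (illustrative values only).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold

print(round(accuracy(y_true, y_pred), 4))
print(round(f1_score(y_true, y_pred), 4))
print(round(auroc(y_true, scores), 4))
```

AUROC is computed from the raw scores and does not depend on the threshold, whereas Accuracy and F1 change with it.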

  • Using di-nucleotide frequencies:

python 05.freq_ML.py --frequency_file data/frequencies/Arabidopsis.csv --label_file data/labels/Arabidopsis.csv
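For readers unfamiliar with the baseline features, the sketch below shows one way to compute the 16 di-nucleotide frequencies of a single sequence. It is an illustration, not the code in `05.freq_ML.py`: counting overlapping pairs and skipping pairs that contain ambiguous bases such as N are assumptions made here.

```python
from itertools import product

# All 16 di-nucleotides in a fixed order: AA, AC, AG, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq):
    """Return the relative frequency of each di-nucleotide in `seq`.

    Pairs are counted with a sliding window of step 1 (overlapping).
    Pairs containing characters other than A/C/G/T are ignored.
    """
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


freqs = dinucleotide_frequencies("ACGTACGT")
print(freqs["AC"], freqs["TA"], freqs["AA"])
```

If one such vector were computed for each of the six genomic regions and the vectors concatenated, each candidate would be described by 6 × 16 = 96 features; whether the repository combines regions this way is not stated in this README.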

📊 Data Description

• BDP_candidates.csv:
  • Contains 4,524 high-confidence BDP candidates from Arabidopsis, rice, maize, and wheat.
• Public RNA-seq Data:
  • Expression data from the Plant Public RNA-seq Database (PPRD) (Yu et al., 2022).
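BDP activity is quantified as the Pearson correlation coefficient (PCC) between expression profiles taken from RNA-seq data. The sketch below computes the PCC for one pair of genes across samples; the gene names and expression values are made up for illustration, and any normalization or thresholding applied in the study is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Returns NaN when either input has zero variance.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values of two divergently transcribed genes
# across four RNA-seq samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(gene_a, gene_b))  # 1.0 (perfectly co-expressed)
```

A PCC near 1 indicates that the two genes flanking a candidate promoter rise and fall together across samples, while a value near 0 indicates no linear relationship.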

📝 Cite This Work

If you use this code or data, please cite our paper:
Genomic language model-driven decoding of gene regulation: a case-study on predicting bidirectional promoter activity across plant species [DOI: XXXXXXX]

📧 Contact

For questions or collaboration requests, contact:
Huihui Li - lihuihui@caas.cn

🌟 Acknowledgments

• Evo2 (Brixi et al., 2025) for the genomic language model.
• PPRD (Yu et al., 2022) for public RNA-seq data.
• LightGBM (Ke et al., 2017) for efficient gradient boosting.

License: MIT License.
