TropicBERT-LLMs_One_stop_tutorial is a fully open-source pipeline that transfers the "pre-training and fine-tuning" paradigm from natural language processing (NLP) to genomics.
Based on this training pipeline, we developed the TropicBERT model series. TropicBERT is the first genomic foundation model designed specifically for 10 tropical fruit species, bringing Transformer techniques from NLP to DNA sequence analysis. The original genomic data is sourced from the tropicalfruit-omics database.
Key Features:
- 🧬 Genome Pre-training: Learns genomic sequence features via Masked Language Modeling (MLM).
- 🎯 Downstream Task Fine-tuning: Supports regression and classification tasks such as promoter strength prediction and chromatin accessibility prediction.
- 🚀 End-to-End Support: Provides complete code for raw FASTA processing, model training, and result visualization, so researchers from diverse backgrounds can get started quickly.
- Complete Workflow: Raw FASTA → Standardization → Pre-training → Fine-tuning → Visualization
- Fully Open Source: Code, model weights, tokenizer, and tutorials are all publicly available.
- Reproducible: Standardized data processing and training scripts for researchers of all backgrounds.
- Tropical Fruit Pre-trained Model: The first genomic pre-trained model for tropical fruit crops.
- Multi-scale Models: Provides pre-trained model variants based on 1 / 5 / 10 species, with strong cross-species transferability.
```
TropicBERT-LLMs_One_stop_tutorial/
├── 00-info/                          # Environment setup and data preprocessing
├── 01-pretrain_all/                  # Pre-training module
│   ├── 01-code-pretrain/             # Core pre-training code
│   ├── 02-model_pretrain_example/    # Pre-trained model examples
│   └── 03-data_pretrain_example/     # Pre-training data examples
├── 02-finetune_all/                  # Fine-tuning module
│   ├── 01-code-finetune/             # Core fine-tuning code
│   ├── 02-model_finetune_example/    # Fine-tuned model examples
│   └── 03-data_finetune_example/     # Fine-tuning data examples
├── 03-Result-Processing_Plotting/    # Result processing and visualization
└── 04-Data-processing/               # Raw data preprocessing scripts
```
Recommended Configuration:
- Python >= 3.11
- Conda >= 25.7.0
- NVIDIA GPU with ≥ 16 GB VRAM
- CUDA 12.1 (or compatible with your PyTorch version)
Main Dependencies:
- Conda 25.7.0
- Python 3.11.11
- PyTorch 2.8.0+cu128
- Transformers 4.57.0
See `environment.yml` for the full list.
You can set up the environment using either conda (recommended) or pip.
```bash
# Go to the directory containing environment files
cd 00-info/
# Option 1: Create a new environment with conda (recommended)
conda env create -f environment.yml -n AI_env
conda activate AI_env
# Option 2: Install dependencies with pip
pip install -r requirements.txt
```
Use Masked Language Modeling (MLM) for self-supervised learning on genomic sequences.
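As a rough illustration of the idea (not the actual tokenizer or data collator), a sequence can be split into non-overlapping 6-mer tokens, matching the `bert-6MER-retokenizer` naming used in the examples, and a fraction of tokens is hidden for the model to recover. The 15% mask ratio is the conventional BERT default and is an assumption here; the real pipeline uses the HuggingFace tokenizer and collator shipped with the repo.

```python
import random

def to_kmers(seq, k=6):
    """Split a DNA sequence into non-overlapping k-mer tokens (a tail shorter than k is dropped)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a random fraction of tokens with [MASK]; MLM trains the model to recover them."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    idx = set(rng.sample(range(len(tokens)), n_mask))
    return [("[MASK]" if i in idx else t) for i, t in enumerate(tokens)], idx

tokens = to_kmers("ATGCGT" * 510)   # a 3060 bp chunk -> 510 six-mer tokens
masked, idx = mask_tokens(tokens)
print(len(tokens), len(idx))        # 510 76
```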
Preparation Work:
- Model and Tokenizer Download: Download the model files and tokenizer from HuggingFace.
- Model and Tokenizer Adaptation:
  - If the downloaded model is a TropicBERT model, there is no need to modify the tokenizer; use the downloaded tokenizer directly.
  - If using another NLP model, take only its model files and use the tokenizer provided in the example files.
- File Placement: Place the downloaded model files and tokenizer (or the tokenizer from the example files) in the `01-pretrain_all/02-model_pretrain_example/` directory. TropicBERT models follow the same path.
- Data Preparation: Use the example dataset in `01-pretrain_all/03-data_pretrain_example/`, or replace it with your own dataset as needed.
Example Run:
```bash
cd 01-pretrain_all/01-code-pretrain
# Option 1: Run directly from the command line
python 01-pretrain.py \
    --model_name_or_path ../02-model_pretrain_example/bert-6MER-retokenizer \
    --train_data ../03-data_pretrain_example/pretrain_data.txt \
    --output_dir output_pretrain
# Option 2: Run via bash script (recommended for project management)
bash 02-pretrain.sh
```
Fine-tune the model on downstream tasks (e.g., promoter prediction).
Preparation:
- Copy the pre-trained model checkpoint to `02-finetune_all/02-model_finetune_example/`.
- Use the sample data in `02-finetune_all/03-data_finetune_example/`, or replace it with your own dataset.
Run Fine-tuning:
```bash
cd 02-finetune_all/01-code-finetune
# Option 1: Run directly from the command line
python 01-finetune.py \
    --model_name_or_path ../02-model_finetune_example/pretrained_bert_6MER_ckpt40 \
    --train_task regression \
    --reinit_classifier_layer True \
    --train_data ../03-data_finetune_example/regression_train_data.csv \
    --eval_data ../03-data_finetune_example/regression_dev_data.csv \
    --test_data ../03-data_finetune_example/regression_test_data.csv \
    --output_dir output_finetune \
    --run_name runs_finetune \
    --model_max_length 512 \
    --gradient_accumulation_steps 4 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --num_train_epochs 5 \
    --logging_steps 50 \
    --eval_steps 50 \
    --save_steps 1000 \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --learning_rate 1e-5 \
    --save_total_limit 5
# Option 2: Run via bash script (recommended for project management)
bash 02-finetune.sh
```
Provides scripts for extracting training logs and evaluation results, and for plotting figures:
```bash
cd 03-Result-Processing_Plotting
# Extract metrics and plot (modify paths in scripts as needed)
python 01_sort_out_result_files.py
python 02_test_metrics_extract_csv.py
python 03_Extract_draw_trainer_state.py
```
(You may need to modify input/output paths in the scripts according to your directory structure.)
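For orientation, the HuggingFace Trainer writes a `trainer_state.json` into each checkpoint directory whose `log_history` list holds the logged training and evaluation metrics; the exact fields depend on your training arguments. A minimal parsing sketch (the numbers below are made up):

```python
import json

# Minimal excerpt of the structure the HuggingFace Trainer writes to
# checkpoint-*/trainer_state.json; field names follow its log_history entries.
state = json.loads("""
{"log_history": [
  {"step": 50,  "epoch": 0.5, "loss": 1.92},
  {"step": 50,  "epoch": 0.5, "eval_loss": 1.85},
  {"step": 100, "epoch": 1.0, "loss": 1.40},
  {"step": 100, "epoch": 1.0, "eval_loss": 1.52}
]}
""")

# Separate training and evaluation curves by which key each entry carries.
train_curve = [(e["step"], e["loss"]) for e in state["log_history"] if "loss" in e]
eval_curve  = [(e["step"], e["eval_loss"]) for e in state["log_history"] if "eval_loss" in e]
print(train_curve)   # [(50, 1.92), (100, 1.4)]
```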
The 04-Data-processing/ directory contains scripts for pre-training data processing. If you want to use your own FASTA data for training, preprocess it first: this step cleans, splits, and converts FASTA sequences into a TXT format readable by the model. Sequences are extracted from genome FASTA files, sequences shorter than 10,000 bp are filtered out, all bases are converted to uppercase, 'N' bases are randomly replaced with one of A/T/G/C, and sequences are split into non-overlapping 3060 bp chunks (matching BERT's maximum input tokens), then saved as TXT files for model training, one sequence per line:
```
ATCGATCGATCG...
GCTAGCTAGCTA...
```
```bash
cd 04-Data-processing
# Run scripts in order (edit file_paths in the scripts as needed)
python 01_pretrain_data_one_step.py    # Sequence extraction and filtering
python 02_pretrain_data_two_step.py    # Base normalization and replacement
python 03_pretrain_data_three_step.py  # Sequence splitting (sliding window)
python 04_pretrain_data_txt_merge.py   # Merge into training set
```
Data Format: The final pre-training data is in `.txt` format, one sequence per line (e.g., 512 bp or 3060 bp), all bases uppercase.
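The filtering, normalization, and chunking steps above can be condensed into one sketch. This is a simplified stand-in for the actual scripts, with the 10,000 bp filter and 3060 bp chunk size taken from the description above:

```python
import random

def preprocess(records, min_len=10_000, chunk=3060, seed=0):
    """Filter, uppercase, resolve 'N' bases, and split sequences into fixed-size chunks."""
    rng = random.Random(seed)
    chunks = []
    for seq in records:
        seq = seq.upper()
        if len(seq) < min_len:
            continue                                  # drop short sequences
        seq = "".join(rng.choice("ATGC") if b == "N" else b for b in seq)
        # non-overlapping 3060 bp windows; a shorter tail is discarded
        chunks += [seq[i:i + chunk] for i in range(0, len(seq) - chunk + 1, chunk)]
    return chunks

out = preprocess(["atgcn" * 3000, "ATGC" * 100])      # 15,000 bp kept, 400 bp dropped
print(len(out))                                        # 4 chunks of 3060 bp each
```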
(You may need to modify file input/output paths in the scripts according to your directory structure), for example:
```python
file_paths = [
    "~/01_fasta_data/genome_ID1.fasta",
    "~/01_fasta_data/genome_ID2.fasta",
]
output_path = "~/01_fasta_data/01_reads_output"
```
Open-source datasets are standardized to retain only the sequence and label columns; other descriptive information is omitted.
```
label,sequence
0.1,ATCGATCGATCGATCG...
0.9,GCTAGCTAGCTAGCTA...
```
Downstream Tasks:
Fine-tuning data and the corresponding base models are used for downstream genomic analysis tasks. Based on plant-genomic-benchmark and PDLLMs, six types of downstream tasks (plant genomics) are supported. Datasets are split into train:dev:test ≈ 8:1:1, covering 9 datasets:
| Task Type | Dataset Type | Sample Size | Description |
|---|---|---|---|
| Regression | Promoter strength | 72158/75808 | Promoter strength prediction (leaf/protoplast) |
| 3-class | Open chromatin prediction | 148500 | Open chromatin prediction |
| Binary | lncRNA prediction | 50308 | Long non-coding RNA prediction |
| Binary | Sequence conservation | 70400 | Sequence conservation prediction |
| Binary | Core promoter detection | 83200 | Core promoter detection |
| Binary | Histone modification | 102400/102400/99684 | Histone modification prediction (H3K4me3/H3K27ac/H3K27me3) |
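The ≈ 8:1:1 train/dev/test split can be sketched as follows. This is illustrative only; the published example datasets already ship pre-split, and the exact shuffling used upstream is not specified here:

```python
import random

def split_811(rows, seed=42):
    """Shuffle labeled rows and split them into train/dev/test at roughly 8:1:1."""
    rng = random.Random(seed)
    rows = rows[:]                      # avoid mutating the caller's list
    rng.shuffle(rows)
    n = len(rows)
    n_dev = n // 10
    n_test = n // 10
    train = rows[: n - n_dev - n_test]  # remainder goes to train
    dev = rows[n - n_dev - n_test : n - n_test]
    test = rows[n - n_test :]
    return train, dev, test

rows = [(i % 2, "ATGC" * 8) for i in range(1000)]   # toy (label, sequence) rows
train, dev, test = split_811(rows)
print(len(train), len(dev), len(test))              # 800 100 100
```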
🔗 HuggingFace Hub 🤗: https://huggingface.co/yang0104/TropicBERT
Includes:
- All 13 pre-trained variants (1 / 5 / 10 species)
- Custom DNA Tokenizer
If you use TropicBERT or this pipeline, please cite:
For questions, please submit an Issue or contact:
📩 zqiangx@gmail.com, 1264894293yl@gmail.com
Democratizing plant genomic LLMs, starting from tropical fruits.
