Porcine MutBERT: A Family of Lightweight Genomic Foundation Models for Functional Element Prediction in Pigs
This repo contains the official implementation of "Porcine MutBERT: A Family of Lightweight Genomic Foundation Models for Functional Element Prediction in Pigs".
The pig (Sus scrofa) is both an economically important livestock species and a valuable biomedical model. Its genome bears regulatory features shaped by domestication and selection that are often poorly captured by genomic language models (gLMs) trained on human or model-organism data. To address these challenges, we developed Porcine MutBERT, a lightweight gLM with 86 million parameters that employs a probabilistic masking strategy targeting evolutionarily informative single nucleotide polymorphisms (SNPs). This design captures population-specific variation while reducing computational cost. We further propose PorcineBench, a benchmark that evaluates gLM performance across porcine functional genomics tasks, including chromatin accessibility (ATAC-seq), CTCF binding, and histone modifications (H3K27ac, H3K4me1, H3K4me3). Results show that Porcine MutBERT consistently outperforms larger generalist models while offering faster training and greater deployment feasibility. These findings underscore the advantages of species-adapted, efficient architectures in agricultural genomics and demonstrate that compact gLMs can expand accessibility and impact in resource-constrained settings.
Both pre-trained models are available on Hugging Face as CompBioDSA/pig-mutbert-var and CompBioDSA/pig-mutbert-ref. Link to HuggingFace ModelHub.
# create and activate virtual python environment
conda create -n pigmutbert python=3.12
conda activate pigmutbert
# install required packages
pip install -r requirements.txt
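To verify the environment, an optional quick check (this snippet assumes requirements.txt installs torch and transformers):
# Optional sanity check: confirm the key packages import and report GPU availability.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())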
Our model is easy to use with the transformers package. To load the model from Hugging Face:
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
model_name = "CompBioDSA/pig-mutbert-var"
# Optional: CompBioDSA/pig-mutbert-ref
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
cls_model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=2)
To get the embeddings of a DNA sequence:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
model_name = "CompBioDSA/pig-mutbert-var"
# Optional: CompBioDSA/pig-mutbert-ref
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
dna = "ATCGGGGCCCATTA"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
mut_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float().to("cpu") # len(tokenizer) is vocab size
last_hidden_state = model(mut_inputs).last_hidden_state # [1, sequence_length, 768]
# or: last_hidden_state = model(mut_inputs)[0] # [1, sequence_length, 768]
# embedding with mean pooling
embedding_mean = torch.mean(last_hidden_state[0], dim=0)
print(embedding_mean.shape) # expect to be 768
# embedding with max pooling
embedding_max = torch.max(last_hidden_state[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768
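The sequence-classification head loaded earlier (cls_model) can be used in the same way. A minimal sketch, assuming it accepts the same one-hot inputs as the base encoder and exposes standard logits (attribute names may differ for this remote-code model); before fine-tuning, the predictions are not meaningful:
# Illustrative forward pass with cls_model from the loading example above.
# Assumes the classification head takes the same one-hot inputs and returns .logits.
cls_inputs = F.one_hot(inputs, num_classes=len(tokenizer)).float()
with torch.no_grad():
    logits = cls_model(cls_inputs).logits  # [1, num_labels]
print(torch.softmax(logits, dim=-1))  # class probabilities (head is untrained, so not meaningful yet)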
Allowed types for RoPE scaling are linear and dynamic. To extend the model's context window, add the rope_scaling parameter. For example, to scale the model context by 2x:
from transformers import AutoModel
model_name = "CompBioDSA/pig-mutbert-var"
# Optional: CompBioDSA/pig-mutbert-ref
model = AutoModel.from_pretrained(model_name,
trust_remote_code=True,
rope_scaling={'type': 'dynamic','factor': 2.0}
) # 2.0 for 2x scaling, 4.0 for 4x, etc.
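Once loaded with rope_scaling, the model is used exactly as before. A minimal sketch, assuming the same one-hot input convention shown above (the sequence length here is illustrative only):
# Illustrative: embed a longer sequence with the context-extended model.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
long_dna = "ATCG" * 1024  # length chosen for illustration only
ids = tokenizer(long_dna, return_tensors="pt")["input_ids"]
one_hot = F.one_hot(ids, num_classes=len(tokenizer)).float()
with torch.no_grad():
    hidden = model(one_hot).last_hidden_state
print(hidden.shape)  # [1, sequence_length, 768]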
The RAW training data is available:
- Sus scrofa Reference Genome: Download (Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz)
- Mutation data (40 samples)
After downloading the raw data, we used seqkit to process the .fa file. Link to script
We used bcftools to process the VCF files. Link to script
You can follow these 5 steps to prepare the data (a sketch chaining them together follows the list):
- csv_post_process(): add headers to the CSV files
- fa2npy(): extract sequence data from the reference genome .fa.gz (Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz), save as chr_name.npy
- split_by_n(): split the sequence data at "N" runs in chr_name.npy, save as chr_name_part_i.npy
- create_sm_matrix(): map nucleotide strings to floats and build the smooth matrix from chr_name_part_i.npy (str) and clean.chr_name.csv, save as chr_name_part_i.npy (float)
- cat_all_npy(): concatenate all interval smooth matrices from chr_name_part_i.npy (float), save as train_data.npy and test_data.npy
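A minimal sketch of how these steps could be chained together; the module name, function signatures, and paths below are hypothetical and should be adapted to the repo's actual preprocessing script:
# Hypothetical driver for the five preprocessing steps above.
# Module name, argument names, and paths are assumptions, not the repo's actual API.
from data_prep import csv_post_process, fa2npy, split_by_n, create_sm_matrix, cat_all_npy

csv_post_process("variant_csv/")                                      # 1. add headers to CSV files
fa2npy("Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz", out_dir="npy/")   # 2. chr_name.npy per chromosome
split_by_n("npy/")                                                    # 3. chr_name_part_i.npy, split at "N" runs
create_sm_matrix("npy/", "clean_csv/")                                # 4. smooth matrix, str -> float
cat_all_npy("npy/", "train_data.npy", "test_data.npy")                # 5. final train/test arrays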
We used a modified version of run_mlm_no_trainer.py, available here.
Firstly, open your terminal and run:
accelerate config
Follow the prompts to configure accelerate.
After that, run the pretraining script:
bash train_ref.sh
# or bash train_mut40.sh
We prepared 7 tasks in finetune/data/{tasks} on Google Drive.
You should download the model configs and checkpoints into finetune/models/{model_name}
Run the scripts directly; they will automatically load the datasets and perform finetuning.
cd finetune
# bash run_pig.sh {model_name} {cuda_id} {use_lora}
bash run_pig.sh pig_mutbert_ref 0 0
# bash run_pig.sh DNABERT2 0 0
# bash run_pig.sh ntv2_500m 0 0
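After fine-tuning, the resulting checkpoint can be loaded for inference. A minimal sketch, assuming the run saves a standard Hugging Face checkpoint directory (the output path below is a placeholder) and that the fine-tuned head follows the same one-hot input convention shown earlier:
# Hypothetical inference with a fine-tuned checkpoint; the output directory,
# input convention, and label mapping depend on your actual fine-tuning run.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ckpt_dir = "finetune/output/pig_mutbert_ref"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForSequenceClassification.from_pretrained(ckpt_dir, trust_remote_code=True)

ids = tokenizer("ATCGGGGCCCATTA", return_tensors="pt")["input_ids"]
one_hot = F.one_hot(ids, num_classes=len(tokenizer)).float()  # same one-hot convention as above
with torch.no_grad():
    logits = model(one_hot).logits
print(torch.softmax(logits, dim=-1))  # predicted class probabilities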

