PyTorch framework for uncertainty-aware genomic sequence classification with genomic language models (GLMs). The epinet implementation can also be reused in other projects.
This project supports:
- fine-tuning a pretrained genomic language model for classification,
- training an epinet uncertainty head on top of a frozen base model,
- fitting a temperature scaling factor for calibration,
- running inference with multiple uncertainty methods.
Supported backbones:
- DNABERT2
- NT_transformer
- hyenaDNA
- CARMANIA
Supported uncertainty methods:
- base: no additional uncertainty estimation
- base_scaled: calibration via temperature scaling
- mc_dropout: Monte Carlo dropout on all dropout layers
- epinet: custom epinet implementation of an epistemic neural network
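Sampling-based methods such as mc_dropout produce several probability vectors per input, one per stochastic forward pass. The sketch below shows one common way to summarize them, using predictive entropy and mutual information. It is framework-independent and illustrative only: the function names and toy values are assumptions, and the metrics actually written by this project's inference.py are not shown here.

```python
import math

def mean_probs(samples):
    """Average K probability vectors (one per stochastic forward pass)."""
    k = len(samples)
    return [sum(p[c] for p in samples) / k for c in range(len(samples[0]))]

def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def summarize(samples):
    """Return (predicted class, predictive entropy, mutual information)."""
    avg = mean_probs(samples)
    total = entropy(avg)  # entropy of the averaged prediction
    expected = sum(entropy(p) for p in samples) / len(samples)
    mutual_info = total - expected  # grows when the passes disagree
    pred = max(range(len(avg)), key=avg.__getitem__)
    return pred, total, mutual_info

# Toy inputs (hypothetical): three passes that agree, three that disagree.
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
print(summarize(agree))
print(summarize(disagree))
```

When all passes agree, the mutual information is close to zero; when they disagree, it rises, which is the usual signal of model (epistemic) uncertainty.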
Main folders:
- nn_proj/common/: shared utilities
- nn_proj/models/<MODEL>/: model-specific train_base.py, train_epinet.py, scaling.py, and inference.py
- nn_proj/models/epinet/: shared Epinet implementation
- scripts/: shell scripts for the main workflow
Run all shell scripts from inside the scripts/ directory.
Edit train_base_model.sh and set:
- DATA
- CHECKPOINT
- MODEL
- SEED
- LR
- MAX_LENGTH
Then run:
bash train_base_model.sh

Edit train_epinet_model.sh and set:
- DATA
- BASE_CKPT
- EPI_CKPT
- MODEL
- SEED
- LR
- MAX_LENGTH
Then run:
bash train_epinet_model.sh

Edit get_temp_factor.sh and set:
- DATA
- BASE_CKPT
- MODEL
- MAX_LENGTH
- SEED
Then run:
bash get_temp_factor.sh

Copy the printed temperature value into test_model.sh when using UQ_method="base_scaled".
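As background, temperature scaling divides the logits by a single scalar T before the softmax, with T chosen to minimize the negative log-likelihood on held-out data. The following is a minimal sketch of that idea in plain Python. It substitutes a simple grid search for whatever optimizer scaling.py actually uses, and every name and value in it (softmax, nll, fit_temperature, the toy logits) is an illustrative assumption rather than this project's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature with the lowest NLL from a grid (0.5 to 5.0)."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy, overconfident validation logits (hypothetical values).
val_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
val_labels = [0, 0, 1, 1]  # the last example is confidently wrong
T = fit_temperature(val_logits, val_labels)
print(T, softmax([4.0, 0.0]), softmax([4.0, 0.0], T))
```

A fitted T above 1 softens overconfident predictions without changing the predicted class, since dividing by a positive scalar preserves the ordering of the logits.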
Edit test_model.sh and set:
- DATA
- NUM_LABELS
- BASE_CKPT
- OUT_PATH
- MODEL
- MAX_LENGTH
- SEED
- UQ_method
- TEMP
- K_SAMPLES
Then run:
bash test_model.sh

Results are written to:
<OUT_PATH>/inference_uncertainty.csv
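To inspect the results file, a small helper like the one below may be enough. It is a sketch that assumes only that the file is a regular comma-separated file with a header row; the column names are not documented here, so the helper returns whatever header it finds rather than assuming any. The name load_results is hypothetical.

```python
import csv

def load_results(csv_path):
    """Return (column names, rows as dicts) from an inference results CSV."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return list(reader.fieldnames or []), rows

# Example call (replace <OUT_PATH> with the value set in test_model.sh):
# columns, rows = load_results("<OUT_PATH>/inference_uncertainty.csv")
# print(columns, len(rows))
```

All values come back as strings, so numeric columns need an explicit float() conversion before any analysis.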
Use one of these in test_model.sh:
- base: standard prediction
- base_scaled: temperature-scaled base prediction
- mc_dropout: dropout-based uncertainty with repeated forward passes
- epinet: Epinet-based uncertainty
For epinet, make sure BASE_CKPT points to the trained Epinet checkpoint, not the original base checkpoint.
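For orientation, the general epinet formulation from the epistemic neural network literature (Osband et al., "Epistemic Neural Networks") can be written as below. This is the textbook form, given under the assumption that the custom implementation here follows it in outline; details may differ. Reading K as the K_SAMPLES setting is likewise an assumption.

```latex
% Base logits plus an index-dependent correction from the epinet head.
f_\theta(x, z) = \mu_\zeta(x) + \sigma_\eta\big(\mathrm{sg}[\phi_\zeta(x)],\, z\big),
\qquad z \sim \mathcal{N}(0, I)

% Prediction averaged over K sampled epistemic indices.
\hat{p}(y \mid x) = \frac{1}{K} \sum_{k=1}^{K}
  \mathrm{softmax}\big(f_\theta(x, z_k)\big)_y
```

Here \mu_\zeta is the frozen base model, \phi_\zeta(x) are its features, sg is the stop-gradient operator, and z is the epistemic index. Because the epinet output is added to the base output, an Epinet checkpoint is needed at inference time, which is consistent with the note on BASE_CKPT above.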
The provided scripts use:
DATA="InstaDeepAI/nucleotide_transformer_downstream_tasks_revised/promoter_all"
MODEL="DNABERT2"

Typical run order:
cd scripts
bash train_base_model.sh
bash train_epinet_model.sh
bash get_temp_factor.sh
bash test_model.sh

- Train the base model first.
- Train Epinet second using the saved base checkpoint.
- Fit temperature scaling after base training.
- Run all scripts from the scripts/ directory.
- This project has only been tested with Python 3.11 and NVIDIA A100 GPUs. Results on other configurations may vary.
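The typical run order can also be driven from Python, which makes it easier to stop at the first failing step. The sketch below assumes only the four script names listed in this README and that bash is available; run_pipeline and its runner parameter are hypothetical names introduced for illustration.

```python
import subprocess

# Script names are taken from this README, in the typical run order.
PIPELINE = [
    "train_base_model.sh",
    "train_epinet_model.sh",
    "get_temp_factor.sh",
    "test_model.sh",
]

def run_pipeline(scripts_dir="scripts", runner=subprocess.run):
    """Run each script from inside scripts_dir, stopping at the first failure."""
    for script in PIPELINE:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a missing checkpoint.
        runner(["bash", script], cwd=scripts_dir, check=True)

# Example call from the repository root:
# run_pipeline("scripts")
```

The driver does not copy the printed temperature into test_model.sh, so for UQ_method="base_scaled" that value still has to be set by hand before the last step.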