# CARMANIA: Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis
CARMANIA is a self-supervised genomic language model framework that augments next-token prediction with a transition-matrix regularization loss. By aligning the model's predicted transitions with empirical bigram (2-mer) statistics, this integration improves biological sequence modeling, enabling better long-range dependency modeling and functional interpretation.
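The idea behind the regularization term can be sketched in a few lines: build an empirical bigram transition matrix from the input sequence and penalize the distance between it and the transitions implied by the model's predictions. The sketch below is a minimal illustration using an L1 distance; the helper names and the choice of distance are assumptions for exposition, not the paper's exact implementation.

```python
BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def empirical_transition_matrix(seq):
    """Row-normalized 4x4 bigram (2-mer) transition matrix of a DNA sequence."""
    counts = [[0.0] * 4 for _ in range(4)]
    for a, b in zip(seq, seq[1:]):
        counts[IDX[a]][IDX[b]] += 1.0
    for row in counts:
        total = sum(row)
        if total:
            for j in range(4):
                row[j] /= total
    return counts

def tm_loss(predicted, empirical):
    """L1 distance between predicted and empirical transition matrices --
    a stand-in for the transition-matrix regularization term."""
    return sum(abs(p - e)
               for prow, erow in zip(predicted, empirical)
               for p, e in zip(prow, erow))

emp = empirical_transition_matrix("ACGTACGTAC")
print(emp[IDX["A"]])      # in this sequence A is always followed by C
print(tm_loss(emp, emp))  # identical matrices incur zero penalty
```

In training, the `predicted` matrix would be aggregated from the model's next-token probabilities, so the loss pushes the language model's local transition behavior toward the sequence's observed 2-mer statistics.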
The following models are already available on the Hugging Face Hub:
- 🦠🧬 `MsAlEhR/carmania-big-10k-prok-genome`
- 🦠🧬 `MsAlEhR/carmania-4k-scp-gene-taxa`
- 👤🧬 `MsAlEhR/carmania-160k-seqlen-human`
```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained(
    "MsAlEhR/carmania-160k-seqlen-human",
    trust_remote_code=True,
    torch_dtype=torch.float16,  # fixed dtype (or autocast)
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "MsAlEhR/carmania-160k-seqlen-human",
    trust_remote_code=True,
    model_max_length=160000,
)

inputs = tokenizer("ACGTAGGCTA", return_tensors="pt").to("cuda")
outputs = model(**inputs)
```

An experimental notebook exploring CARMANIA-driven sequence optimization using Enformer scores is now available.
This lightweight module perturbs input DNA sequences and uses Enformer’s predicted regulatory signals as a scoring function to iteratively generate variants with improved activity.
📄 Notebook: `carmania_enformer_guided_generation.ipynb`
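The perturb-and-score pattern described above can be sketched as a simple greedy search. In the sketch below a toy GC-content scorer stands in for Enformer's predicted regulatory signal, and all function names are illustrative rather than taken from the notebook:

```python
import random

BASES = "ACGT"

def mutate(seq, rng):
    """Substitute one random position with a different base."""
    i = rng.randrange(len(seq))
    alt = rng.choice([b for b in BASES if b != seq[i]])
    return seq[:i] + alt + seq[i + 1:]

def optimize(seq, score_fn, steps=200, seed=0):
    """Greedy hill-climbing: keep a perturbation only if it improves the score."""
    rng = random.Random(seed)
    best, best_score = seq, score_fn(seq)
    for _ in range(steps):
        candidate = mutate(best, rng)
        s = score_fn(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best

# Toy stand-in for an Enformer-derived activity score: reward GC content.
def gc_score(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

result = optimize("ACGTAGGCTA", gc_score)
print(result, gc_score(result))
```

In the actual notebook the scoring function is Enformer's predicted regulatory signal over the candidate sequence, which is far more expensive per call, so batching candidates per scoring step is a natural refinement.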
```bibtex
@article{refahi2025context,
  title={Context-Aware Regularization with Markovian Integration for Attention-Based Nucleotide Analysis},
  author={Refahi, Mohammadsaleh and Abavisani, Mahdi and Sokhansanj, Bahrad A. and Brown, James R. and Rosen, Gail},
  journal={arXiv preprint arXiv:2507.09378},
  year={2025}
}
```
