Releases: genbio-ai/ModelGenerator
v0.1.3.post0
What's Changed
- AIDO.Cell FlashAttention fallback by @Markkunitomi in #21
- Cell utils clarification by @cnellington in #24
- Major refactor by @cnellington in #26
Key Changes
Infrastructure & Developer Experience
- Pre-commit hooks: Ruff formatter, YAML validation, trailing whitespace checks
- Poetry migration: Replaced pip-compile with Poetry for reproducible dependency management
- Docker: Upgraded to CUDA 12.4.0 with matching PyTorch Geometric dependencies
- Documentation: New kwargs docstring inheritance system for API documentation
New Features
- Scimilarity model: Full integration with 28K+ gene mappings for cell/gene expression analysis
- RNA inverse folding: Zero-shot analysis pipeline with iterative denoising and ablation studies
- Protein stability: New stability prediction experiment configuration
- Flash-attn fallback: Allow AIDO.Cell to run on CPU
Code Quality
- Applied Ruff formatting (100 char line limit) across entire codebase
- Fixed trailing whitespace and EOF issues throughout
- Updated all README examples with proper data column specifications
- New embedding caching documentation guide
Refactoring
- Major backbone architecture updates (1,996 lines across key files)
  - Two-phase initialization (`__init__` + `setup()`)
  - Structured outputs via `SequenceBackboneOutput` dataclass
  - Embedding caching infrastructure (10-100x speedup)
- Data module improvements (1,522 lines)
  - New `ClassificationDataModule` base class
  - Unified column handling with `rename_cols`
  - Automatic class weighting for imbalanced datasets
- Task system enhancements (1,133 lines)
  - `@once_only` decorator for reliability
  - Data-dependent loss configuration
  - Stage-specific data requirements
Cleanup
- Removed obsolete AIDO.Cell Jupyter notebook
- Updated experiment configurations across all domains
Testing
- Added new backbone base tests (311 lines)
- Updated existing test suites for refactored code
- All tests passing
Breaking Changes
None for most users: changes are backward compatible via the legacy adapter system.
Developers of custom backbones, tasks, or data modules will need to implement the new required methods.
Migration Notes for Contributors
- Run `pre-commit install` to enable the new commit hooks
- Use `poetry lock` instead of `pip-compile` for dependency updates
For Developers
| Component | Must Implement | Why |
|---|---|---|
| Custom backbones | `setup()`, `process_batch()`, `required_data_columns()` | New architecture |
| Custom tasks | Use `@once_only`, extract `.last_hidden_state`, implement `required_data_columns(stage)` | Reliability + validation |
| Custom data modules | `provided_columns()` property | Column validation |
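The table above can be made concrete with a minimal sketch of a custom backbone implementing the three required methods. This is an illustration only: `SequenceBackboneInterface` is a stand-in base class here, not an import from ModelGenerator, and the method bodies are toys.

```python
from typing import List

class SequenceBackboneInterface:
    """Stand-in base; the three methods below are the new requirements."""
    def setup(self): ...
    def process_batch(self, batch, device=None): ...
    def required_data_columns(self, stage: str) -> List[str]: ...

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, model_name: str):
        self.model_name = model_name  # store config only; no weights loaded yet
        self.model = None

    def setup(self):
        # A real backbone would load pretrained weights here (lazy loading).
        self.model = object()

    def process_batch(self, batch, device=None):
        # Single entry point for tokenization + device movement.
        return {"input_ids": [list(s) for s in batch["sequences"]]}

    def required_data_columns(self, stage: str) -> List[str]:
        return ["sequences"] if stage != "predict" else []
```

Instantiation stays cheap; `setup()` is called once before the first forward pass.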
Full Changelog: v0.1.2...v0.1.3
🔧 Refactored Components
1. Backbone Architecture Refactoring (1,996 lines)
Old vs New Pattern
BEFORE:
```python
class MyBackbone(SequenceBackboneInterface):
    def __init__(self, ...):
        super().__init__()
        self.model = AutoModel.from_pretrained(...)  # Loaded immediately

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask)  # Returns Tensor
```

AFTER:
```python
class MyBackbone(SequenceBackboneInterface):
    def __init__(self, model_name, ...):
        super().__init__()
        self.model_name = model_name  # Store config only

    def setup(self):
        self.model = AutoModel.from_pretrained(self.model_name)  # Lazy loading

    def process_batch(self, batch, device):
        return self.tokenize(batch["sequences"])  # Unified interface

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids, attention_mask)
        return SequenceBackboneOutput(  # Structured output
            last_hidden_state=output.last_hidden_state,
            hidden_states=output.hidden_states,
            attention_mask=attention_mask,
        )
```

Key Changes
1. Two-Phase Initialization
- `__init__()`: Store configuration only
- `setup()`: Load actual models/weights
- Benefit: Faster instantiation, better memory management, distributed training support
2. Structured Output: SequenceBackboneOutput
```python
from dataclasses import dataclass
from typing import List, Optional
from torch import Tensor

@dataclass
class SequenceBackboneOutput:
    last_hidden_state: Tensor
    hidden_states: Optional[List[Tensor]] = None
    attention_mask: Optional[Tensor] = None
    special_tokens_mask: Optional[Tensor] = None

    @classmethod
    def concat(cls, outputs, padding_value=0): ...  # Batch concatenation

    def __getitem__(self, idx): ...  # Slicing support

    def to_device(self, device): ...  # Device movement

    def to_dict(self): ...  # Serialization

    @classmethod
    def from_dict(cls, d): ...  # Deserialization
```
- Benefit: Type safety, IDE autocomplete, consistent interface
3. Embedding Caching Infrastructure
```python
# Enable caching for frozen backbones
backbone = aido_rna_650m(
    cache_config={
        "cache_dir": "/path/to/cache",
        "storage_backend": "lmdb",  # or "indexed"
        "enable_profiling": True,
    }
)
```
- Two backends: LMDB (memory-mapped) and Indexed (append-only)
- Built-in profiling: `_CacheProfiler` tracks hits/misses
- Benefit: 10-100x speedup for frozen backbone experiments
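The core caching idea can be sketched in a few lines: key each frozen-backbone embedding by a sample UID and reuse it on subsequent epochs. The real implementation uses LMDB or an indexed file store; a plain dict stands in here, and the hit/miss counters mimic what the release notes describe `_CacheProfiler` tracking.

```python
class EmbeddingCache:
    """Toy stand-in for the LMDB/indexed embedding cache."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, uid, compute_fn):
        # Cache hit: skip the (expensive) frozen-backbone forward pass.
        if uid in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[uid] = compute_fn()
        return self.store[uid]

cache = EmbeddingCache()
embed = lambda: [0.1, 0.2, 0.3]  # stand-in for a frozen backbone forward pass
first = cache.get_or_compute("sample-1", embed)
second = cache.get_or_compute("sample-1", embed)  # served from cache
```

Because frozen backbones produce identical embeddings every epoch, everything after the first pass is a cache hit, which is where the reported 10-100x speedup comes from.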
4. Required Data Columns
```python
def required_data_columns(self, stage: str) -> List[str]:
    return ["sequences"] if stage != "predict" else []
```
- Benefit: Framework validates data availability before training
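The validation this enables amounts to a fail-fast check before any weights are loaded. A hypothetical sketch (`validate_columns` is not a ModelGenerator API, just an illustration of the idea):

```python
def validate_columns(required, provided):
    """Raise before training if the data module cannot supply what is needed."""
    missing = [c for c in required if c not in provided]
    if missing:
        raise ValueError(f"Data module is missing required columns: {missing}")

# Passes: the data module provides everything the backbone asks for.
validate_columns(["sequences"], ["sequences", "labels"])
```

Catching a missing column at configuration time is far cheaper than discovering it mid-epoch.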
Migration Impact
- ⚠️ Custom backbones must implement `setup()`, `process_batch()`, and `required_data_columns()`
- ⚠️ All backbone fixtures updated to call `.setup()`
2. Data Module Refactoring (1,522 lines)
Architecture Change
BEFORE:
```python
SequenceClassificationDataModule(
    x_col="sequence",
    extra_cols=["metadata", "organism"],
    extra_col_aliases=["meta", "org"],  # Parallel list - error-prone
)
```

AFTER:
```python
# Introduced base class for reusability
ClassificationDataModule(...)          # Generic base
SequenceClassificationDataModule(...)  # Inherits from base

# Cleaner API
SequenceClassificationDataModule(
    x_col=["sequence", "metadata", "organism"],           # Multiple inputs
    rename_cols={"metadata": "meta", "organism": "org"},  # Explicit mapping
    generate_uid=True,                                    # For caching support
)
```

Key Improvements
1. Class Hierarchy
BaseDataModule
└── ClassificationDataModule (NEW)
└── SequenceClassificationDataModule
- Benefit: Easier to create non-sequence classification tasks
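As an illustration of why the factored hierarchy helps, label-side logic can live once in the generic base while subclasses add only their input handling. Class names mirror the release notes, but the bodies below are toys, not the real implementations.

```python
class ClassificationDataModule:
    """Generic base: owns label-side logic shared by all modalities."""

    def __init__(self, labels):
        self.labels = labels

    def num_classes(self):
        return len(set(self.labels))

class SequenceClassificationDataModule(ClassificationDataModule):
    """Sequence-specific subclass: adds only input handling."""

    def __init__(self, sequences, labels):
        super().__init__(labels)
        self.sequences = sequences

dm = SequenceClassificationDataModule(["ACGT", "GGCC"], [0, 1])
```

A non-sequence classification module (images, expression matrices, graphs) would subclass the same base and inherit the label machinery for free.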
2. Unified Column Handling
- `x_col`: Can be a string or a list (multi-input support)
- `rename_cols`: Dictionary mapping (clearer than parallel lists)
- Benefit: Supports multi-modal inputs, less error-prone
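The `rename_cols` semantics reduce to applying an explicit old-name-to-new-name mapping per batch. A toy sketch of that behavior (`apply_rename_cols` is a hypothetical helper, not the library's internal name):

```python
def apply_rename_cols(batch, rename_cols):
    # Columns not in the mapping pass through unchanged.
    return {rename_cols.get(k, k): v for k, v in batch.items()}

batch = {"sequence": "ACGT", "metadata": "m1", "organism": "human"}
renamed = apply_rename_cols(batch, {"metadata": "meta", "organism": "org"})
```

Because each source column names its target directly, there is no positional pairing to get out of sync, which is exactly the failure mode of the old parallel `extra_cols`/`extra_col_aliases` lists.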
3. New Features
- `provided_columns()` property: Declares available columns
- `generate_uid` parameter: Auto-generates unique IDs for caching
- `class_weight` property: Automatic class weighting for imbalanced datasets

```python
dm = SequenceClassificationDataModule(..., generate_uid=True)
dm.class_weight  # Tensor([0.5, 2.0, 1.0]) for weighted loss
```

Migration Impact
- ⚠️ Replace `extra_cols` + `extra_col_aliases` with `rename_cols`
- ⚠️ Custom data modules must implement `provided_columns()`
- ✅ All README examples updated with the new syntax
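One common way to derive a `class_weight` like the one above is inverse class frequency, normalized so the weights average to 1. The exact formula ModelGenerator uses may differ, so treat this as an assumption-labeled sketch of the general technique.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in sorted(counts.items())}

# Class 1 is rarest (1 of 7 samples), so it receives the largest weight.
weights = class_weights([0, 0, 0, 0, 1, 2, 2])
```

Feeding such weights to `nn.CrossEntropyLoss(weight=...)` makes the loss penalize mistakes on rare classes more heavily.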
3. Task System Refactoring (1,133 lines)
Initialization Pattern Change
BEFORE:
```python
class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone_fn = backbone        # Store factory
        self.loss = nn.CrossEntropyLoss()  # Created in __init__

    def configure_model(self):
        self.backbone = self.backbone_fn(...)  # Create here
        # Could be called multiple times → bug!
```

AFTER:
```python
class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone = backbone(...)  # Create immediately
        # Loss moved to configure_model

    @once_only  # Ensures single execution
    def configure_model(self):
        self.backbone.setup()  # Load weights
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight  # Data-dependent
        )
```

Key Changes
1. @once_only Decorator
- Prevents double-initialization in distributed training
- Benefit: Eliminates subtle bugs from multiple `configure_model()` calls
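A minimal sketch of what such a decorator can look like: the wrapped method runs on the first call and becomes a no-op afterwards, tracked per instance. The actual ModelGenerator implementation may differ in details.

```python
import functools

def once_only(fn):
    """Run the wrapped method at most once per instance."""
    attr = f"_{fn.__name__}_has_run"

    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if getattr(self, attr, False):
            return None  # already configured; skip re-initialization
        setattr(self, attr, True)
        return fn(self, *args, **kwargs)

    return wrapper

class Task:
    def __init__(self):
        self.configured = 0

    @once_only
    def configure_model(self):
        self.configured += 1

t = Task()
t.configure_model()
t.configure_model()  # second call is a no-op
```

Guarding on the instance (rather than the function) matters because distributed training frameworks may invoke `configure_model()` once per strategy hook, per stage, or per process.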
2. Backbone Created in `__init__`
- Enables early introspection
- Actual model loading deferred to `setup()`
- Benefit: Configuration validation without loading weights
3. Unified Batch Processing
```python
# BEFORE
def transform(self, batch, batch_idx):
    tokenized = self.backbone.tokenize(batch["sequences"])
    return {"input_ids": tokenized["input_ids"].to(self.device), ...}

# AFTER
def transform(self, batch, batch_idx):
    # Handles tokenization + device movement
    return self.backbone.process_batch(batch, device=self.device)
```
- Benefit: Less boilerplate, consistent across tasks
4. Data-Dependent Loss Configuration
```python
class SequenceClassification(TaskInterface):
    def __init__(self, ..., weighted_loss=False):
        self.weighted_loss = weighted_loss

    @once_only
    def configure_model(self):
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight if self.weighted_loss else None
        )
```
- Benefit: Automatic class weighting for imbalanced datasets
5. Stage-Specific Data Requirements
def required_data_columns(self, s...

v0.1.2 - AIDO.Tissue, AIDO.StructurePrediction, AIDO.Protein-RAG, and open-source backbones
What's Changed
- Added AIDO.Tissue, a context-aware cell model incorporating spatial information for SOTA tissue modeling
- Added AIDO.StructurePrediction, an efficient complex structure prediction model setting a new SOTA for antibodies and nanobodies.
- Added AIDO.Protein-RAG, an evolved protein language model setting a new SOTA on the ProteinGym leaderboard
- Added open-source backbones for ESM, Enformer, Borzoi, Flashzoi, Geneformer, and scFoundation
- Added `Huggingface` backbone for no-code HF integration
- Added tutorials for:
  - Ultra-efficient protein folding using AIDO.Protein2StructureToken and AIDO.StructureTokenizer
  - Predicting complex multi-molecular structures using AIDO.StructurePrediction
  - mRNA vaccine development using AIDO.RNA, AIDO.Protein, AIDO.Protein2StructureToken, and AIDO.StructureTokenizer
  - Simulated knockouts for therapeutic target identification using AIDO.Cell
  - A new AIDO.Cell quickstart, now hosted in collaboration with CZI
- Multi-modal fusion for isoform expression prediction using AIDO.RNA, Enformer, and ESM2
Minor Updates
- API Reference overhaul to expose no-code convenience classes for backbones, datasets, tasks, and adapters.
- New dev tools including pytest tests and continuous-integration testing.
Full Changelog: v0.1.1...v0.1.2
Inverse Folding
- Added tools, documentation, and experiments for Protein and RNA inverse folding
- Linked to new AIDO.RNAIF-16B and AIDO.ProteinIF-16B model releases on HF
- Added documentation and experiments for RNA secondary structure prediction and mean ribosome load prediction
AIDO.ModelGenerator
AIDO.ModelGenerator is a software stack powering the development of an AI-driven Digital Organism by enabling researchers to adapt pretrained models and generate finetuned models for downstream tasks.
To read more about AIDO.ModelGenerator's integral role in building the world's first AI-driven Digital Organism, see AIDO.
AIDO.ModelGenerator is open-sourced as an opinionated plug-and-play research framework for cross-disciplinary teams in ML & Bio.
It is designed to enable rapid and reproducible prototyping with four kinds of experiments in mind:
- Applying pre-trained foundation models to new data
- Developing new finetuning and inference tasks for foundation models
- Benchmarking foundation models and creating leaderboards
- Testing new architectures for finetuning performance
while also scaling with hardware and integrating with larger data pipelines or research workflows.
AIDO.ModelGenerator is built on PyTorch, HuggingFace, and Lightning, and works seamlessly with these ecosystems.
See the AIDO.ModelGenerator documentation for installation, usage, tutorials, and API reference.
Test release, please ignore
v0.1.1-3 update description, remove direct dependencies for pypi release
Test release, please ignore
v0.1.1-2 update version for pypi release
Test release, please ignore
v0.1.1-1 update version, remove direct dependencies for release
Test release, please ignore
v0.1.1 upgrade version
Test release, please ignore
v0.1.0 temporarily remove openfold and dllogger for pypi upload