Releases: genbio-ai/ModelGenerator
v0.1.3.post0
What's Changed
- AIDO.Cell FlashAttention fallback by @Markkunitomi in #21
- Cell utils clarification by @cnellington in #24
- Major refactor by @cnellington in #26
Key Changes
Infrastructure & Developer Experience
- Pre-commit hooks: Ruff formatter, YAML validation, trailing whitespace checks
- Poetry migration: Replaced pip-compile with Poetry for reproducible dependency management
- Docker: Upgraded to CUDA 12.4.0 with matching PyTorch Geometric dependencies
- Documentation: New kwargs docstring inheritance system for API documentation
New Features
- Scimilarity model: Full integration with 28K+ gene mappings for cell/gene expression analysis
- RNA inverse folding: Zero-shot analysis pipeline with iterative denoising and ablation studies
- Protein stability: New stability prediction experiment configuration
- Flash-attn fallback: Allow AIDO.Cell to run on CPU
Code Quality
- Applied Ruff formatting (100 char line limit) across entire codebase
- Fixed trailing whitespace and EOF issues throughout
- Updated all README examples with proper data column specifications
- New embedding caching documentation guide
Refactoring
- Major backbone architecture updates (1,996 lines across key files)
  - Two-phase initialization (`__init__` + `setup()`)
  - Structured outputs via `SequenceBackboneOutput` dataclass
  - Embedding caching infrastructure (10-100x speedup)
- Data module improvements (1,522 lines)
  - New `ClassificationDataModule` base class
  - Unified column handling with `rename_cols`
  - Automatic class weighting for imbalanced datasets
- Task system enhancements (1,133 lines)
  - `@once_only` decorator for reliability
  - Data-dependent loss configuration
  - Stage-specific data requirements
Cleanup
- Removed obsolete AIDO.Cell Jupyter notebook
- Updated experiment configurations across all domains
Testing
- Added new backbone base tests (311 lines)
- Updated existing test suites for refactored code
- All tests passing
Breaking Changes
None for most users: changes are backward compatible via the legacy adapter system.
Developers of custom backbones, tasks, or data modules will need to implement the new required methods.
Migration Notes for Contributors
- Run `pre-commit install` to enable the new commit hooks
- Use `poetry lock` instead of `pip-compile` for dependency updates
For Developers
| Component | Must Implement | Why |
|---|---|---|
| Custom backbones | `setup()`, `process_batch()`, `required_data_columns()` | New architecture |
| Custom tasks | Use `@once_only`, extract `.last_hidden_state`, implement `required_data_columns(stage)` | Reliability + validation |
| Custom data modules | `provided_columns()` property | Column validation |
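The table above can be made concrete with a minimal sketch of a custom backbone implementing the three required methods. This is an illustration only: `SequenceBackboneInterface` is a stand-in base class here, not an import from ModelGenerator, and the method bodies are toys.

```python
from typing import List

class SequenceBackboneInterface:
    """Stand-in base; the three methods below are the new requirements."""
    def setup(self): ...
    def process_batch(self, batch, device=None): ...
    def required_data_columns(self, stage: str) -> List[str]: ...

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, model_name: str):
        self.model_name = model_name  # store config only; no weights loaded yet
        self.model = None

    def setup(self):
        # A real backbone would load pretrained weights here (lazy loading).
        self.model = object()

    def process_batch(self, batch, device=None):
        # Single entry point for tokenization + device movement.
        return {"input_ids": [list(s) for s in batch["sequences"]]}

    def required_data_columns(self, stage: str) -> List[str]:
        return ["sequences"] if stage != "predict" else []
```

Instantiation stays cheap; `setup()` is called once before the first forward pass.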
Full Changelog: v0.1.2...v0.1.3
🔧 Refactored Components
1. Backbone Architecture Refactoring (1,996 lines)
Old vs New Pattern
BEFORE:
```python
class MyBackbone(SequenceBackboneInterface):
    def __init__(self, ...):
        super().__init__()
        self.model = AutoModel.from_pretrained(...)  # Loaded immediately

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask)  # Returns Tensor
```

AFTER:
```python
class MyBackbone(SequenceBackboneInterface):
    def __init__(self, model_name, ...):
        super().__init__()
        self.model_name = model_name  # Store config only

    def setup(self):
        self.model = AutoModel.from_pretrained(self.model_name)  # Lazy loading

    def process_batch(self, batch, device):
        return self.tokenize(batch["sequences"])  # Unified interface

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids, attention_mask)
        return SequenceBackboneOutput(  # Structured output
            last_hidden_state=output.last_hidden_state,
            hidden_states=output.hidden_states,
            attention_mask=attention_mask,
        )
```

Key Changes
1. Two-Phase Initialization
- `__init__()`: Store configuration only
- `setup()`: Load actual models/weights
- Benefit: Faster instantiation, better memory management, distributed training support
2. Structured Output: SequenceBackboneOutput
```python
from dataclasses import dataclass
from typing import List, Optional
from torch import Tensor

@dataclass
class SequenceBackboneOutput:
    last_hidden_state: Tensor
    hidden_states: Optional[List[Tensor]] = None
    attention_mask: Optional[Tensor] = None
    special_tokens_mask: Optional[Tensor] = None

    @classmethod
    def concat(cls, outputs, padding_value=0): ...  # Batch concatenation

    def __getitem__(self, idx): ...  # Slicing support

    def to_device(self, device): ...  # Device movement

    def to_dict(self): ...  # Serialization

    @classmethod
    def from_dict(cls, d): ...  # Deserialization
```
- Benefit: Type safety, IDE autocomplete, consistent interface
3. Embedding Caching Infrastructure
```python
# Enable caching for frozen backbones
backbone = aido_rna_650m(
    cache_config={
        "cache_dir": "/path/to/cache",
        "storage_backend": "lmdb",  # or "indexed"
        "enable_profiling": True,
    }
)
```
- Two backends: LMDB (memory-mapped) and Indexed (append-only)
- Built-in profiling: `_CacheProfiler` tracks hits/misses
- Benefit: 10-100x speedup for frozen backbone experiments
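The core caching idea can be sketched in a few lines: key each frozen-backbone embedding by a sample UID and reuse it on subsequent epochs. The real implementation uses LMDB or an indexed file store; a plain dict stands in here, and the hit/miss counters mimic what the release notes describe `_CacheProfiler` tracking.

```python
class EmbeddingCache:
    """Toy stand-in for the LMDB/indexed embedding cache."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, uid, compute_fn):
        # Cache hit: skip the (expensive) frozen-backbone forward pass.
        if uid in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[uid] = compute_fn()
        return self.store[uid]

cache = EmbeddingCache()
embed = lambda: [0.1, 0.2, 0.3]  # stand-in for a frozen backbone forward pass
first = cache.get_or_compute("sample-1", embed)
second = cache.get_or_compute("sample-1", embed)  # served from cache
```

Because frozen backbones produce identical embeddings every epoch, everything after the first pass is a cache hit, which is where the reported 10-100x speedup comes from.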
4. Required Data Columns
```python
def required_data_columns(self, stage: str) -> List[str]:
    return ["sequences"] if stage != "predict" else []
```
- Benefit: Framework validates data availability before training
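The validation this enables amounts to a fail-fast check before any weights are loaded. A hypothetical sketch (`validate_columns` is not a ModelGenerator API, just an illustration of the idea):

```python
def validate_columns(required, provided):
    """Raise before training if the data module cannot supply what is needed."""
    missing = [c for c in required if c not in provided]
    if missing:
        raise ValueError(f"Data module is missing required columns: {missing}")

# Passes: the data module provides everything the backbone asks for.
validate_columns(["sequences"], ["sequences", "labels"])
```

Catching a missing column at configuration time is far cheaper than discovering it mid-epoch.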
Migration Impact
- ⚠️ Custom backbones must implement `setup()`, `process_batch()`, and `required_data_columns()`
- ⚠️ All backbone fixtures updated to call `.setup()`
2. Data Module Refactoring (1,522 lines)
Architecture Change
BEFORE:
```python
SequenceClassificationDataModule(
    x_col="sequence",
    extra_cols=["metadata", "organism"],
    extra_col_aliases=["meta", "org"],  # Parallel list - error-prone
)
```

AFTER:
```python
# Introduced base class for reusability
ClassificationDataModule(...)          # Generic base
SequenceClassificationDataModule(...)  # Inherits from base

# Cleaner API
SequenceClassificationDataModule(
    x_col=["sequence", "metadata", "organism"],           # Multiple inputs
    rename_cols={"metadata": "meta", "organism": "org"},  # Explicit mapping
    generate_uid=True,                                    # For caching support
)
```

Key Improvements
1. Class Hierarchy
BaseDataModule
└── ClassificationDataModule (NEW)
└── SequenceClassificationDataModule
- Benefit: Easier to create non-sequence classification tasks
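As an illustration of why the factored hierarchy helps, label-side logic can live once in the generic base while subclasses add only their input handling. Class names mirror the release notes, but the bodies below are toys, not the real implementations.

```python
class ClassificationDataModule:
    """Generic base: owns label-side logic shared by all modalities."""

    def __init__(self, labels):
        self.labels = labels

    def num_classes(self):
        return len(set(self.labels))

class SequenceClassificationDataModule(ClassificationDataModule):
    """Sequence-specific subclass: adds only input handling."""

    def __init__(self, sequences, labels):
        super().__init__(labels)
        self.sequences = sequences

dm = SequenceClassificationDataModule(["ACGT", "GGCC"], [0, 1])
```

A non-sequence classification module (images, expression matrices, graphs) would subclass the same base and inherit the label machinery for free.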
2. Unified Column Handling
- `x_col`: Can be a string or a list (multi-input support)
- `rename_cols`: Dictionary mapping (clearer than parallel lists)
- Benefit: Supports multi-modal inputs, less error-prone
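The `rename_cols` semantics reduce to applying an explicit old-name-to-new-name mapping per batch. A toy sketch of that behavior (`apply_rename_cols` is a hypothetical helper, not the library's internal name):

```python
def apply_rename_cols(batch, rename_cols):
    # Columns not in the mapping pass through unchanged.
    return {rename_cols.get(k, k): v for k, v in batch.items()}

batch = {"sequence": "ACGT", "metadata": "m1", "organism": "human"}
renamed = apply_rename_cols(batch, {"metadata": "meta", "organism": "org"})
```

Because each source column names its target directly, there is no positional pairing to get out of sync, which is exactly the failure mode of the old parallel `extra_cols`/`extra_col_aliases` lists.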
3. New Features
- `provided_columns()` property: Declares available columns
- `generate_uid` parameter: Auto-generates unique IDs for caching
- `class_weight` property: Automatic class weighting for imbalanced datasets

```python
dm = SequenceClassificationDataModule(..., generate_uid=True)
dm.class_weight  # Tensor([0.5, 2.0, 1.0]) for weighted loss
```

Migration Impact
- ⚠️ Replace `extra_cols` + `extra_col_aliases` with `rename_cols`
- ⚠️ Custom data modules must implement `provided_columns()`
- ✅ All README examples updated with the new syntax
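One common way to derive a `class_weight` like the one above is inverse class frequency, normalized so the weights average to 1. The exact formula ModelGenerator uses may differ, so treat this as an assumption-labeled sketch of the general technique.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in sorted(counts.items())}

# Class 1 is rarest (1 of 7 samples), so it receives the largest weight.
weights = class_weights([0, 0, 0, 0, 1, 2, 2])
```

Feeding such weights to `nn.CrossEntropyLoss(weight=...)` makes the loss penalize mistakes on rare classes more heavily.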
3. Task System Refactoring (1,133 lines)
Initialization Pattern Change
BEFORE:
```python
class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone_fn = backbone        # Store factory
        self.loss = nn.CrossEntropyLoss()  # Created in __init__

    def configure_model(self):
        self.backbone = self.backbone_fn(...)  # Create here
        # Could be called multiple times → bug!
```

AFTER:
```python
class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone = backbone(...)  # Create immediately
        # Loss moved to configure_model

    @once_only  # Ensures single execution
    def configure_model(self):
        self.backbone.setup()  # Load weights
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight  # Data-dependent
        )
```

Key Changes
1. @once_only Decorator
- Prevents double-initialization in distributed training
- Benefit: Eliminates subtle bugs from multiple `configure_model()` calls
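A minimal sketch of what such a decorator can look like: the wrapped method runs on the first call and becomes a no-op afterwards, tracked per instance. The actual ModelGenerator implementation may differ in details.

```python
import functools

def once_only(fn):
    """Run the wrapped method at most once per instance."""
    attr = f"_{fn.__name__}_has_run"

    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if getattr(self, attr, False):
            return None  # already configured; skip re-initialization
        setattr(self, attr, True)
        return fn(self, *args, **kwargs)

    return wrapper

class Task:
    def __init__(self):
        self.configured = 0

    @once_only
    def configure_model(self):
        self.configured += 1

t = Task()
t.configure_model()
t.configure_model()  # second call is a no-op
```

Guarding on the instance (rather than the function) matters because distributed training frameworks may invoke `configure_model()` once per strategy hook, per stage, or per process.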
2. Backbone Created in `__init__`
- Enables early introspection
- Actual model loading deferred to `setup()`
- Benefit: Configuration validation without loading weights
3. Unified Batch Processing
```python
# BEFORE
def transform(self, batch, batch_idx):
    tokenized = self.backbone.tokenize(batch["sequences"])
    return {"input_ids": tokenized["input_ids"].to(self.device), ...}

# AFTER
def transform(self, batch, batch_idx):
    # Handles tokenization + device movement
    return self.backbone.process_batch(batch, device=self.device)
```
- Benefit: Less boilerplate, consistent across tasks
4. Data-Dependent Loss Configuration
```python
class SequenceClassification(TaskInterface):
    def __init__(self, ..., weighted_loss=False):
        self.weighted_loss = weighted_loss

    @once_only
    def configure_model(self):
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight if self.weighted_loss else None
        )
```
- Benefit: Automatic class weighting for imbalanced datasets
5. Stage-Specific Data Requirements
def required_data_columns(self, s...

v0.1.2 - AIDO.Tissue, AIDO.StructurePrediction, AIDO.Protein-RAG, and open-source backbones
What's Changed
- Added AIDO.Tissue, a context-aware cell model incorporating spatial information for SOTA tissue modeling
- Added AIDO.StructurePrediction, an efficient complex structure prediction model setting a new SOTA for antibodies and nanobodies.
- Added AIDO.Protein-RAG, an evolved protein language model setting a new SOTA on the ProteinGym leaderboard
- Added open-source backbones for ESM, Enformer, Borzoi, Flashzoi, Geneformer, and scFoundation
- Added `Huggingface` backbone for no-code HF integration
- Added tutorials for:
  - Ultra-efficient protein folding using AIDO.Protein2StructureToken and AIDO.StructureTokenizer
  - Predicting complex multi-molecular structures using AIDO.StructurePrediction
  - mRNA vaccine development using AIDO.RNA, AIDO.Protein, AIDO.Protein2StructureToken, and AIDO.StructureTokenizer
  - Simulated knockouts for therapeutic target identification using AIDO.Cell
  - A new AIDO.Cell quickstart, now hosted in collaboration with CZI
- Multi-modal fusion for isoform expression prediction using AIDO.RNA, Enformer, and ESM2
Minor Updates
- API Reference overhaul to expose no-code convenience classes for backbones, datasets, tasks, and adapters.
- New dev tools including pytest tests and continuous-integration testing.
Full Changelog: v0.1.1...v0.1.2
Inverse Folding
- Added tools, documentation, and experiments for Protein and RNA inverse folding
- Linked to new AIDO.RNAIF-16B and AIDO.ProteinIF-16B model releases on HF
- Added documentation and experiments for RNA secondary structure prediction and mean ribosome load prediction
AIDO.ModelGenerator
AIDO.ModelGenerator is a software stack powering the development of an AI-driven Digital Organism by enabling researchers to adapt pretrained models and generate finetuned models for downstream tasks.
To read more about AIDO.ModelGenerator's integral role in building the world's first AI-driven Digital Organism, see AIDO.
AIDO.ModelGenerator is open-sourced as an opinionated plug-and-play research framework for cross-disciplinary teams in ML & Bio.
It is designed to enable rapid and reproducible prototyping with four kinds of experiments in mind:
- Applying pre-trained foundation models to new data
- Developing new finetuning and inference tasks for foundation models
- Benchmarking foundation models and creating leaderboards
- Testing new architectures for finetuning performance
while also scaling with hardware and integrating with larger data pipelines or research workflows.
AIDO.ModelGenerator is built on PyTorch, HuggingFace, and Lightning, and works seamlessly with these ecosystems.
See the AIDO.ModelGenerator documentation for installation, usage, tutorials, and API reference.
Test release, please ignore
v0.1.1-3 update description, remove direct dependencies for pypi release
Test release, please ignore
v0.1.1-2 update version for pypi release
Test release, please ignore
v0.1.1-1 update version, remove direct dependencies for release
Test release, please ignore
v0.1.1 upgrade version
Test release, please ignore
v0.1.0 temporarily remove openfold and dllogger for pypi upload