Releases: genbio-ai/ModelGenerator

v0.1.3.post0

12 Dec 22:15


What's Changed

Key Changes

Infrastructure & Developer Experience

  • Pre-commit hooks: Ruff formatter, YAML validation, trailing whitespace checks
  • Poetry migration: Replaced pip-compile with Poetry for reproducible dependency management
  • Docker: Upgraded to CUDA 12.4.0 with matching PyTorch Geometric dependencies
  • Documentation: New kwargs docstring inheritance system for API documentation

New Features

  • Scimilarity model: Full integration with 28K+ gene mappings for cell/gene expression analysis
  • RNA inverse folding: Zero-shot analysis pipeline with iterative denoising and ablation studies
  • Protein stability: New stability prediction experiment configuration
  • Flash-attn fallback: Allow AIDO.Cell to run on CPU

Code Quality

  • Applied Ruff formatting (100 char line limit) across entire codebase
  • Fixed trailing whitespace and EOF issues throughout
  • Updated all README examples with proper data column specifications
  • New embedding caching documentation guide

Refactoring

  • Major backbone architecture updates (1,996 lines across key files)
    • Two-phase initialization (__init__ + setup())
    • Structured outputs via SequenceBackboneOutput dataclass
    • Embedding caching infrastructure (10-100x speedup)
  • Data module improvements (1,522 lines)
    • New ClassificationDataModule base class
    • Unified column handling with rename_cols
    • Automatic class weighting for imbalanced datasets
  • Task system enhancements (1,133 lines)
    • @once_only decorator for reliability
    • Data-dependent loss configuration
    • Stage-specific data requirements

Cleanup

  • Removed obsolete AIDO.Cell Jupyter notebook
  • Updated experiment configurations across all domains

Testing

  • Added new backbone base tests (311 lines)
  • Updated existing test suites for refactored code
  • All tests passing

Breaking Changes

None for most users: changes are backward compatible via the legacy adapter system.

Developers of custom backbones, tasks, or data modules will need to implement the new required methods.

Migration Notes for Contributors

  1. Run pre-commit install to enable new commit hooks
  2. Use poetry lock instead of pip-compile for dependency updates

For Developers

| Component | Must Implement | Why |
| --- | --- | --- |
| Custom backbones | setup(), process_batch(), required_data_columns() | New architecture |
| Custom tasks | Use @once_only, extract .last_hidden_state, implement required_data_columns(stage) | Reliability + validation |
| Custom data modules | provided_columns() property | Column validation |

Full Changelog: v0.1.2...v0.1.3

🔧 Refactored Components

1. Backbone Architecture Refactoring (1,996 lines)

Old vs New Pattern

BEFORE:

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, ...):
        super().__init__()
        self.model = AutoModel.from_pretrained(...)  # Loaded immediately

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask)  # Returns Tensor

AFTER:

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, ...):
        super().__init__()
        self.model_name = model_name  # Store config only

    def setup(self):
        self.model = AutoModel.from_pretrained(self.model_name)  # Lazy loading

    def process_batch(self, batch, device):
        return self.tokenize(batch["sequences"])  # Unified interface

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids, attention_mask)
        return SequenceBackboneOutput(  # Structured output
            last_hidden_state=output.last_hidden_state,
            hidden_states=output.hidden_states,
            attention_mask=attention_mask
        )

Key Changes

1. Two-Phase Initialization

  • __init__(): Store configuration only
  • setup(): Load actual models/weights
  • Benefit: Faster instantiation, better memory management, distributed training support

2. Structured Output: SequenceBackboneOutput

from dataclasses import dataclass
from typing import List, Optional

from torch import Tensor

@dataclass
class SequenceBackboneOutput:
    last_hidden_state: Tensor
    hidden_states: Optional[List[Tensor]] = None
    attention_mask: Optional[Tensor] = None
    special_tokens_mask: Optional[Tensor] = None

    @classmethod
    def concat(cls, outputs, padding_value=0): ...  # Batch concatenation

    def __getitem__(self, idx): ...  # Slicing support

    def to_device(self, device): ...  # Device movement

    def to_dict(self): ...  # Serialization
    @classmethod
    def from_dict(cls, d): ...  # Deserialization
  • Benefit: Type safety, IDE autocomplete, consistent interface

3. Embedding Caching Infrastructure

# Enable caching for frozen backbones
backbone = aido_rna_650m(
    cache_config={
        "cache_dir": "/path/to/cache",
        "storage_backend": "lmdb",  # or "indexed"
        "enable_profiling": True
    }
)
  • Two backends: LMDB (memory-mapped) and Indexed (append-only)
  • Built-in profiling: _CacheProfiler tracks hits/misses
  • Benefit: 10-100x speedup for frozen backbone experiments
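The release notes don't show the caching internals; the sketch below only illustrates the hit/miss flow with a plain in-memory dict. The `EmbeddingCache` class and its names are hypothetical, not ModelGenerator's API, and the real backends persist embeddings to LMDB or an append-only indexed store.

```python
class EmbeddingCache:
    """Minimal in-memory sketch of an embedding cache keyed by sample UID.

    Illustrative only: the actual infrastructure uses LMDB or indexed
    file storage and a _CacheProfiler, but the hit/miss logic is the same idea.
    """

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn  # e.g. a frozen backbone's forward pass
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get(self, uid, sequence):
        if uid in self.store:
            self.hits += 1  # Cached: skip the expensive forward pass
        else:
            self.misses += 1
            self.store[uid] = self.compute_fn(sequence)  # Compute once, reuse after
        return self.store[uid]
```

Because a frozen backbone produces identical embeddings for the same input on every epoch, each sample's forward pass runs once and every later epoch is a cache hit, which is where the reported 10-100x speedup comes from.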

4. Required Data Columns

def required_data_columns(self, stage: str) -> List[str]:
    return ["sequences"] if stage != "predict" else []
  • Benefit: Framework validates data availability before training
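Before training starts, the framework can cross-check these declarations against what the data module actually provides. A hypothetical validation helper sketching that check (`validate_columns` is an illustrative name, not part of the actual API):

```python
def validate_columns(required: list, provided: list, stage: str) -> None:
    """Fail fast if the data module is missing columns a backbone/task needs."""
    missing = set(required) - set(provided)
    if missing:
        raise ValueError(
            f"Stage '{stage}' requires columns {sorted(missing)}, "
            f"but the data module only provides {sorted(provided)}"
        )
```

Failing at configuration time like this surfaces a missing column immediately, rather than as a KeyError deep inside the first training batch.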

Migration Impact

  • ⚠️ Custom backbones must implement: setup(), process_batch(), required_data_columns()
  • ⚠️ All backbone fixtures updated to call .setup()

2. Data Module Refactoring (1,522 lines)

Architecture Change

BEFORE:

SequenceClassificationDataModule(
    x_col="sequence",
    extra_cols=["metadata", "organism"],
    extra_col_aliases=["meta", "org"]  # Parallel list - error-prone
)

AFTER:

# Introduced base class for reusability
ClassificationDataModule(...)  # Generic base
SequenceClassificationDataModule(...)  # Inherits from base

# Cleaner API
SequenceClassificationDataModule(
    x_col=["sequence", "metadata", "organism"],  # Multiple inputs
    rename_cols={"metadata": "meta", "organism": "org"},  # Explicit mapping
    generate_uid=True  # For caching support
)

Key Improvements

1. Class Hierarchy

BaseDataModule
└── ClassificationDataModule (NEW)
    └── SequenceClassificationDataModule
  • Benefit: Easier to create non-sequence classification tasks

2. Unified Column Handling

  • x_col: Can be string or list (multi-input support)
  • rename_cols: Dictionary mapping (clearer than parallel lists)
  • Benefit: Supports multi-modal inputs, less error-prone
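Conceptually, rename_cols is just an explicit key mapping applied to each example; a small sketch under that assumption (`apply_rename_cols` is a hypothetical helper, not ModelGenerator's function):

```python
def apply_rename_cols(example: dict, rename_cols: dict) -> dict:
    """Rename dataset columns via an explicit mapping; unmapped keys pass through."""
    return {rename_cols.get(k, k): v for k, v in example.items()}
```

A dictionary makes each source-to-target pair visible at the call site, whereas the old parallel lists (extra_cols + extra_col_aliases) could silently misalign if one list was edited without the other.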

3. New Features

  • provided_columns() property: Declares available columns
  • generate_uid parameter: Auto-generates unique IDs for caching
  • class_weight property: Automatic class weighting for imbalanced datasets

dm = SequenceClassificationDataModule(..., generate_uid=True)
dm.class_weight  # Tensor([0.5, 2.0, 1.0]) for weighted loss
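A common way to derive such weights is inverse class frequency, normalized so the mean weight is 1. The sketch below assumes that scheme; it is not necessarily the exact formula used by the class_weight property:

```python
from collections import Counter

def inverse_frequency_weights(labels: list) -> list:
    """One weight per class, inversely proportional to how often the class occurs.

    Normalized so the weights average to 1 across classes; rare classes
    get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[c]) for c in sorted(counts)]
```

Feeding these weights into nn.CrossEntropyLoss(weight=...) makes mistakes on rare classes cost more, counteracting the imbalance.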

Migration Impact

  • ⚠️ Replace extra_cols + extra_col_aliases with rename_cols
  • ⚠️ Custom data modules must implement provided_columns()
  • ✅ All README examples updated with new syntax

3. Task System Refactoring (1,133 lines)

Initialization Pattern Change

BEFORE:

class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone_fn = backbone  # Store factory
        self.loss = nn.CrossEntropyLoss()  # Created in __init__

    def configure_model(self):
        self.backbone = self.backbone_fn(...)  # Create here
        # Could be called multiple times → bug!

AFTER:

class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone = backbone(...)  # Create immediately
        # Loss moved to configure_model

    @once_only  # Ensures single execution
    def configure_model(self):
        self.backbone.setup()  # Load weights
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight  # Data-dependent
        )

Key Changes

1. @once_only Decorator

  • Prevents double-initialization in distributed training
  • Benefit: Eliminates subtle bugs from multiple configure_model() calls
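The release doesn't include the decorator's source; a minimal sketch of how such a run-at-most-once guard could be implemented with a per-instance flag (illustrative only):

```python
import functools

def once_only(method):
    """Run the wrapped method at most once per instance; later calls are no-ops."""
    flag = f"_{method.__name__}_has_run"

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, flag, False):
            return None  # Already ran; skip to avoid double-initialization
        setattr(self, flag, True)
        return method(self, *args, **kwargs)

    return wrapper
```

Storing the flag on the instance (rather than in the decorator's closure) means each task object gets its own guard, so constructing a second task still runs its configure_model() once.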

2. Backbone Created in __init__

  • Enables early introspection
  • Actual model loading deferred to setup()
  • Benefit: Configuration validation without loading weights

3. Unified Batch Processing

# BEFORE
def transform(self, batch, batch_idx):
    tokenized = self.backbone.tokenize(batch["sequences"])
    return {"input_ids": tokenized["input_ids"].to(self.device), ...}

# AFTER
def transform(self, batch, batch_idx):
    return self.backbone.process_batch(batch, device=self.device)
    # Handles tokenization + device movement
  • Benefit: Less boilerplate, consistent across tasks

4. Data-Dependent Loss Configuration

class SequenceClassification(TaskInterface):
    def __init__(self, ..., weighted_loss=False):
        self.weighted_loss = weighted_loss

    @once_only
    def configure_model(self):
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight if self.weighted_loss else None
        )
  • Benefit: Automatic class weighting for imbalanced datasets

5. Stage-Specific Data Requirements

def required_data_columns(self, s...

v0.1.2 - AIDO.Tissue, AIDO.StructurePrediction, AIDO.Protein-RAG, and open-source backbones

13 May 21:22


What's Changed

Minor Updates

  • API Reference overhaul to expose no-code convenience classes for backbones, datasets, tasks, and adapters.
  • New dev tools including pytest tests and continuous-integration testing.

Full Changelog: v0.1.1...v0.1.2

Inverse Folding

21 Dec 06:16


  • Added tools, documentation, and experiments for Protein and RNA inverse folding
  • Linked to new AIDO.RNAIF-16B and AIDO.ProteinIF-16B model releases on HF
  • Added documentation and experiments for RNA secondary structure prediction and mean ribosome load prediction

AIDO.ModelGenerator

11 Dec 07:11


AIDO.ModelGenerator is a software stack powering the development of an AI-driven Digital Organism by enabling researchers to adapt pretrained models and generate finetuned models for downstream tasks.
To read more about AIDO.ModelGenerator's integral role in building the world's first AI-driven Digital Organism, see AIDO.

AIDO.ModelGenerator is open-sourced as an opinionated plug-and-play research framework for cross-disciplinary teams in ML & Bio.
It is designed to enable rapid and reproducible prototyping with four kinds of experiments in mind:

  1. Applying pre-trained foundation models to new data
  2. Developing new finetuning and inference tasks for foundation models
  3. Benchmarking foundation models and creating leaderboards
  4. Testing new architectures for finetuning performance

while also scaling with hardware and integrating with larger data pipelines or research workflows.

AIDO.ModelGenerator is built on PyTorch, HuggingFace, and Lightning, and works seamlessly with these ecosystems.

See the AIDO.ModelGenerator documentation for installation, usage, tutorials, and API reference.

Test release, please ignore

11 Dec 06:42


v0.1.1-3

update description, remove direct dependencies for pypi release

Test release, please ignore

10 Dec 21:37


v0.1.1-2

update version for pypi release

Test release, please ignore

10 Dec 18:20


v0.1.1-1

update version, remove direct dependencies for release

Test release, please ignore

10 Dec 07:38


v0.1.1

upgrade version

Test release, please ignore

10 Dec 05:58


v0.1.0

temporarily remove openfold and dllogger for pypi upload