@cnellington (Contributor)
Summary

November 2025 update consolidating infrastructure improvements, Scimilarity model support, RNA inverse folding analysis, and comprehensive code quality enhancements.

Scale: 208 files changed, 42,130 insertions(+), 3,924 deletions(-)

Key Changes

Infrastructure & Developer Experience

  • Pre-commit hooks: Ruff formatter, YAML validation, trailing whitespace checks
  • Poetry migration: Replaced pip-compile with Poetry for reproducible dependency management
  • Docker: Upgraded to CUDA 12.4.0 with matching PyTorch Geometric dependencies
  • Documentation: New kwargs docstring inheritance system for API documentation

New Features

  • Scimilarity model: Full integration with 28K+ gene mappings for cell/gene expression analysis
  • RNA inverse folding: Zero-shot analysis pipeline with iterative denoising and ablation studies
  • Protein stability: New xTrimo stability prediction experiment configuration

Code Quality

  • Applied Ruff formatting (100 char line limit) across entire codebase
  • Fixed trailing whitespace and EOF issues throughout
  • Updated all README examples with proper data column specifications
  • New embedding caching documentation guide

Refactoring

  • Major backbone architecture updates (1,996 lines across key files)
    • Two-phase initialization (__init__ + setup())
    • Structured outputs via SequenceBackboneOutput dataclass
    • Embedding caching infrastructure (10-100x speedup)
  • Data module improvements (1,522 lines)
    • New ClassificationDataModule base class
    • Unified column handling with rename_cols
    • Automatic class weighting for imbalanced datasets
  • Task system enhancements (1,133 lines)
    • @once_only decorator for reliability
    • Data-dependent loss configuration
    • Stage-specific data requirements

Cleanup

  • Removed obsolete AIDO.Cell Jupyter notebook
  • Updated experiment configurations across all domains

Testing

  • Added new backbone base tests (311 lines)
  • Updated existing test suites for refactored code
  • All tests passing

Breaking Changes

None for most users: changes are backward compatible via the legacy adapter system.

Custom backbone/task/data module developers will need to implement new required methods. See migration guide in PR_SUMMARY_nov25.md.

Migration Notes for Contributors

  1. Run pre-commit install to enable new commit hooks
  2. Use poetry lock instead of pip-compile for dependency updates

For Developers

| Component | Must implement | Why |
| --- | --- | --- |
| Custom backbones | setup(), process_batch(), required_data_columns() | New architecture |
| Custom tasks | Use @once_only, extract .last_hidden_state, implement required_data_columns(stage) | Reliability + validation |
| Custom data modules | provided_columns() property | Column validation |

🔧 Refactored Components

1. Backbone Architecture Refactoring (1,996 lines)

Old vs New Pattern

BEFORE:

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, ...):
        super().__init__()
        self.model = AutoModel.from_pretrained(...)  # Loaded immediately

    def forward(self, input_ids, attention_mask):
        return self.model(input_ids, attention_mask)  # Returns Tensor

AFTER:

class MyBackbone(SequenceBackboneInterface):
    def __init__(self, model_name, ...):
        super().__init__()
        self.model_name = model_name  # Store config only

    def setup(self):
        self.model = AutoModel.from_pretrained(self.model_name)  # Lazy loading

    def process_batch(self, batch, device):
        return self.tokenize(batch["sequences"])  # Unified interface

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids, attention_mask)
        return SequenceBackboneOutput(  # Structured output
            last_hidden_state=output.last_hidden_state,
            hidden_states=output.hidden_states,
            attention_mask=attention_mask
        )

Key Changes

1. Two-Phase Initialization

  • __init__(): Store configuration only
  • setup(): Load actual models/weights
  • Benefit: Faster instantiation, better memory management, distributed training support

2. Structured Output: SequenceBackboneOutput

@dataclass
class SequenceBackboneOutput:
    last_hidden_state: Tensor
    hidden_states: Optional[List[Tensor]] = None
    attention_mask: Optional[Tensor] = None
    special_tokens_mask: Optional[Tensor] = None

    @classmethod
    def concat(cls, outputs, padding_value=0): ...  # Batch concatenation
    def __getitem__(self, idx): ...  # Slicing support
    def to_device(self, device): ...  # Device movement
    def to_dict(self): ...  # Serialization
    @classmethod
    def from_dict(cls, d): ...  # Deserialization
  • Benefit: Type safety, IDE autocomplete, consistent interface
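The slicing support mentioned for `__getitem__` can be sketched with plain Python lists standing in for tensors. This is a hedged illustration only: the real class operates on torch tensors, and only the field and method names are taken from the PR summary.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass
class SequenceBackboneOutput:
    # Nested lists stand in for torch tensors to keep the sketch dependency-free.
    last_hidden_state: List[list]
    hidden_states: Optional[List[list]] = None
    attention_mask: Optional[List[list]] = None

    def __getitem__(self, idx):
        """Slice every non-None field along the batch dimension."""
        kwargs = {}
        for f in fields(self):
            value = getattr(self, f.name)
            kwargs[f.name] = value[idx] if value is not None else None
        return SequenceBackboneOutput(**kwargs)

out = SequenceBackboneOutput(
    last_hidden_state=[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    attention_mask=[[1, 1], [1, 0], [1, 1]],
)
sub = out[0:2]
print(len(sub.last_hidden_state))  # 2
print(sub.hidden_states)           # None (unset fields stay None)
```

Slicing every field together keeps hidden states and masks aligned, which is the main point of bundling them in one structured output.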

3. Embedding Caching Infrastructure

# Enable caching for frozen backbones
backbone = aido_rna_650m(
    cache_config={
        "cache_dir": "/path/to/cache",
        "storage_backend": "lmdb",  # or "indexed"
        "enable_profiling": True
    }
)
  • Two backends: LMDB (memory-mapped) and Indexed (append-only)
  • Built-in profiling: _CacheProfiler tracks hits/misses
  • Benefit: 10-100x speedup for frozen backbone experiments
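The caching idea above can be sketched as a memoized forward pass keyed by a per-example UID. This is a simplified stand-in: a dict replaces the LMDB/indexed backends, `CacheProfiler` is a hypothetical stand-in for the PR's internal `_CacheProfiler`, and the real implementation lives inside the backbone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CacheProfiler:
    # Hypothetical stand-in for the PR's _CacheProfiler hit/miss tracking.
    hits: int = 0
    misses: int = 0

class CachedForward:
    """Memoize expensive forward passes by a per-example UID."""

    def __init__(self, forward_fn: Callable[[str], List[int]]):
        self.forward_fn = forward_fn
        self.store: Dict[str, List[int]] = {}  # dict stands in for LMDB/indexed storage
        self.profiler = CacheProfiler()

    def __call__(self, uid: str, sequence: str) -> List[int]:
        if uid in self.store:
            self.profiler.hits += 1
            return self.store[uid]
        self.profiler.misses += 1
        embedding = self.forward_fn(sequence)  # the expensive model call
        self.store[uid] = embedding
        return embedding

# Toy "embedding" (character codes) stands in for a frozen backbone.
cached = CachedForward(lambda seq: [ord(c) for c in seq])
cached("uid-1", "ACGU")
cached("uid-1", "ACGU")  # served from cache
print(cached.profiler.hits, cached.profiler.misses)  # 1 1
```

Because frozen backbones produce identical embeddings for identical inputs, skipping the forward pass entirely is what yields the reported order-of-magnitude speedups.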

4. Required Data Columns

def required_data_columns(self, stage: str) -> List[str]:
    return ["sequences"] if stage != "predict" else []
  • Benefit: Framework validates data availability before training
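The validation step can be sketched as a simple set comparison between what a backbone/task requires and what the data module provides. The function name here is hypothetical; the PR only describes the `required_data_columns()` / `provided_columns()` contract.

```python
from typing import Iterable

def validate_columns(required: Iterable[str], provided: Iterable[str]) -> None:
    """Fail fast if the data module cannot supply what the model needs."""
    missing = sorted(set(required) - set(provided))
    if missing:
        raise ValueError(f"Data module is missing required columns: {missing}")

validate_columns(["sequences"], ["sequences", "labels"])  # OK, no error
try:
    validate_columns(["sequences", "labels"], ["sequences"])
except ValueError as e:
    print(e)  # Data module is missing required columns: ['labels']
```

Running this check before training starts turns a confusing mid-epoch KeyError into an immediate, readable configuration error.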

Migration Impact

  • ⚠️ Custom backbones must implement: setup(), process_batch(), required_data_columns()
  • ⚠️ All backbone fixtures updated to call .setup()

2. Data Module Refactoring (1,522 lines)

Architecture Change

BEFORE:

SequenceClassificationDataModule(
    x_col="sequence",
    extra_cols=["metadata", "organism"],
    extra_col_aliases=["meta", "org"]  # Parallel list - error-prone
)

AFTER:

# Introduced base class for reusability
ClassificationDataModule(...)  # Generic base
SequenceClassificationDataModule(...)  # Inherits from base

# Cleaner API
SequenceClassificationDataModule(
    x_col=["sequence", "metadata", "organism"],  # Multiple inputs
    rename_cols={"metadata": "meta", "organism": "org"},  # Explicit mapping
    generate_uid=True  # For caching support
)

Key Improvements

1. Class Hierarchy

BaseDataModule
└── ClassificationDataModule (NEW)
    └── SequenceClassificationDataModule
  • Benefit: Easier to create non-sequence classification tasks

2. Unified Column Handling

  • x_col: Can be string or list (multi-input support)
  • rename_cols: Dictionary mapping (clearer than parallel lists)
  • Benefit: Supports multi-modal inputs, less error-prone
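The `rename_cols` mapping described above can be sketched as a dictionary-driven rename over dataset rows. The helper name is hypothetical; only the parameter semantics come from the PR.

```python
from typing import Dict, List

def apply_rename_cols(rows: List[dict], rename_cols: Dict[str, str]) -> List[dict]:
    """Rename dataset columns with an explicit old-name -> new-name mapping."""
    return [
        {rename_cols.get(col, col): value for col, value in row.items()}
        for row in rows
    ]

rows = [{"sequence": "ACGT", "metadata": "m1", "organism": "human"}]
renamed = apply_rename_cols(rows, {"metadata": "meta", "organism": "org"})
print(sorted(renamed[0]))  # ['meta', 'org', 'sequence']
```

An explicit dict cannot silently mispair a column with its alias the way two parallel lists can, which is the error-proneness the refactor removes.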

3. New Features

  • provided_columns() property: Declares available columns
  • generate_uid parameter: Auto-generates unique IDs for caching
  • class_weight property: Automatic class weighting for imbalanced datasets
dm = SequenceClassificationDataModule(..., generate_uid=True)
dm.class_weight  # Tensor([0.5, 2.0, 1.0]) for weighted loss
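A common way to compute such weights (the PR does not state its exact formula, so this assumes the standard n_samples / (n_classes * n_c) inverse-frequency heuristic, as in scikit-learn's "balanced" mode):

```python
from collections import Counter
from typing import List, Sequence

def class_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Inverse-frequency class weights: rare classes get larger weights."""
    counts = Counter(labels)
    return [len(labels) / (num_classes * counts[c]) for c in range(num_classes)]

labels = [0, 0, 0, 0, 1, 2, 2]  # class 1 is rare
weights = class_weights(labels, num_classes=3)
print(weights)  # class 1 gets the largest weight
```

Passing such a tensor to `nn.CrossEntropyLoss(weight=...)` upweights the loss contribution of rare classes during training.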

Migration Impact

  • ⚠️ Replace extra_cols + extra_col_aliases with rename_cols
  • ⚠️ Custom data modules must implement provided_columns()
  • ✅ All README examples updated with new syntax

3. Task System Refactoring (1,133 lines)

Initialization Pattern Change

BEFORE:

class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone_fn = backbone  # Store factory
        self.loss = nn.CrossEntropyLoss()  # Created in __init__

    def configure_model(self):
        self.backbone = self.backbone_fn(...)  # Create here
        # Could be called multiple times → bug!

AFTER:

class MLM(TaskInterface):
    def __init__(self, backbone, ...):
        self.backbone = backbone(...)  # Create immediately
        # Loss moved to configure_model

    @once_only  # Ensures single execution
    def configure_model(self):
        self.backbone.setup()  # Load weights
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight  # Data-dependent
        )

Key Changes

1. @once_only Decorator

  • Prevents double-initialization in distributed training
  • Benefit: Eliminates subtle bugs from multiple configure_model() calls
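A minimal `@once_only` can be sketched as a per-instance guard flag. This is an illustrative implementation under stated assumptions; the PR's actual decorator may differ (e.g. in how it handles return values or concurrency).

```python
import functools

def once_only(method):
    """Run a method at most once per instance; later calls are no-ops."""
    flag = f"_{method.__name__}_has_run"

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        if getattr(self, flag, False):
            return None  # already configured; skip silently
        setattr(self, flag, True)
        return method(self, *args, **kwargs)

    return wrapper

class Task:
    def __init__(self):
        self.setup_calls = 0

    @once_only
    def configure_model(self):
        self.setup_calls += 1  # pretend this loads weights / builds the loss

task = Task()
task.configure_model()
task.configure_model()  # second call is skipped
print(task.setup_calls)  # 1
```

Storing the flag on the instance (not the function) means each task object still gets exactly one initialization, even when the framework invokes `configure_model()` from multiple hooks.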

2. Backbone Created in __init__

  • Enables early introspection
  • Actual model loading deferred to setup()
  • Benefit: Configuration validation without loading weights

3. Unified Batch Processing

# BEFORE
def transform(self, batch, batch_idx):
    tokenized = self.backbone.tokenize(batch["sequences"])
    return {"input_ids": tokenized["input_ids"].to(self.device), ...}

# AFTER
def transform(self, batch, batch_idx):
    return self.backbone.process_batch(batch, device=self.device)
    # Handles tokenization + device movement
  • Benefit: Less boilerplate, consistent across tasks

4. Data-Dependent Loss Configuration

class SequenceClassification(TaskInterface):
    def __init__(self, ..., weighted_loss=False):
        self.weighted_loss = weighted_loss

    @once_only
    def configure_model(self):
        self.loss = nn.CrossEntropyLoss(
            weight=self.data_module.class_weight if self.weighted_loss else None
        )
  • Benefit: Automatic class weighting for imbalanced datasets

5. Stage-Specific Data Requirements

def required_data_columns(self, stage: str) -> List[str]:
    if stage == "predict":
        return []  # No labels needed
    return ["target_sequences"]  # Labels required for train/val
  • Benefit: Framework validates before training starts

Migration Impact

  • ⚠️ Custom tasks must update transform() to use process_batch()
  • ⚠️ forward() must extract .last_hidden_state from SequenceBackboneOutput
  • ⚠️ Must implement required_data_columns(stage)
  • ⚠️ Move loss instantiation to configure_model() if data-dependent

4. Adapter System Changes (215 lines)

Changes

  • Mostly code formatting (Ruff 100-char limit)
  • MLPAdapterWithoutOutConcat: Memory optimization for pairwise tasks
  • Import cleanup: Removed unused imports
  • Type hints: Improved annotations
  • Docstring updates: Better documentation

Migration Impact

  • ✅ No breaking changes
  • ✅ Maintains compatibility with refactored backbone/task interfaces

5. Documentation System Refactoring (776 lines)

Problem Being Solved

OLD SITUATION:

class Parent:
    def __init__(self, arg1: int, arg2: str):
        """Parent class.

        Args:
            arg1: First argument
            arg2: Second argument
        """

class Child(Parent):
    def __init__(self, arg3: bool, **kwargs):
        """Child class.

        Args:
            arg3: Third argument
            **kwargs: ??? (User has no idea what's valid)
        """
        super().__init__(**kwargs)

# API docs show Child.__init__(arg3, **kwargs)
# User doesn't know arg1, arg2 are accepted!

NEW SOLUTION:

class Child(Parent, metaclass=GoogleKwargsDocstringInheritanceInitMeta):
    def __init__(self, arg3: bool, **kwargs):
        """Child class.

        Args:
            arg3: Third argument
        """
        super().__init__(**kwargs)

# API docs now show: Child.__init__(arg3, arg1, arg2)
# Docstring automatically includes arg1, arg2 descriptions!

How It Works

Two-Part System:

  1. Runtime: GoogleKwargsDocstringInheritanceInitMeta metaclass

    • Inspects parent __init__ signatures
    • Merges parent parameters into child
    • Updates __signature__ and __doc__
    • Stores in __griffe_signature__ for docs
  2. Build Time: KwargsDocstringInheritance Griffe extension

    • Reads __griffe_signature__ during doc generation
    • Updates MkDocs API reference
    • Shows complete parameter list
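The runtime half of this system can be sketched as a metaclass that replaces a child's bare `**kwargs` with the parent's explicit `__init__` parameters in the reported signature. This is a simplified illustration: the real `GoogleKwargsDocstringInheritanceInitMeta` also merges docstring Args sections and stores a `__griffe_signature__` for the docs build.

```python
import inspect

class KwargsSignatureInheritMeta(type):
    """Merge parent __init__ parameters into the child's reported signature."""

    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        init = ns.get("__init__")
        if init is None or not bases:
            return cls
        # Child parameters, minus self and the bare **kwargs.
        params = [
            p for p in inspect.signature(init).parameters.values()
            if p.name != "self" and p.kind is not inspect.Parameter.VAR_KEYWORD
        ]
        seen = {"self"} | {p.name for p in params}
        for base in bases:
            base_init = getattr(base, "__init__", object.__init__)
            if base_init is object.__init__:
                continue
            for p in inspect.signature(base_init).parameters.values():
                if p.name in seen or p.kind is inspect.Parameter.VAR_KEYWORD:
                    continue
                # Keyword-only keeps the merged signature valid regardless of defaults.
                params.append(p.replace(kind=inspect.Parameter.KEYWORD_ONLY))
                seen.add(p.name)
        cls.__signature__ = inspect.Signature(params)
        return cls

class Parent:
    def __init__(self, arg1: int, arg2: str = "x"):
        self.arg1, self.arg2 = arg1, arg2

class Child(Parent, metaclass=KwargsSignatureInheritMeta):
    def __init__(self, arg3: bool, **kwargs):
        super().__init__(**kwargs)
        self.arg3 = arg3

print(list(inspect.signature(Child).parameters))  # ['arg3', 'arg1', 'arg2']
```

`inspect.signature()` honors a class's `__signature__` attribute, so IDEs and doc generators that rely on it see the inherited parameters without any change to call behavior.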

Benefits

  • DRY: Update parent docs once, all children inherit
  • IDE Support: Autocomplete shows all valid parameters
  • User Experience: Complete API documentation automatically
  • Maintainability: No duplicate docstrings

Used Throughout Codebase

  • All SequenceBackboneInterface subclasses
  • Data modules with **kwargs
  • Task classes inheriting complex parameters

6. Test Infrastructure Changes (868 lines)

New Tests

File: tests/backbones/test_base.py (311 lines)

1. SequenceBackboneOutput Tests

test_sequence_backbone_output_getitem()  # Indexing/slicing
test_sequence_backbone_output_concat()   # Batch concatenation
test_sequence_backbone_output_to_device()  # Device movement
test_sequence_backbone_output_with_none_fields()  # Optional handling

2. Caching Tests

test_backbone_cache_forward_first_call()  # Cache miss
test_backbone_cache_forward_cached()      # Cache hit
test_indexed_store_*()                    # Indexed backend
test_lmdb_store_*()                       # LMDB backend
test_cache_profiler_*()                   # Performance tracking

3. Fixture Updates

# All fixtures updated to call setup()
@pytest.fixture
def genbiobert(genbiobert_cls):
    backbone = genbiobert_cls(None, None)
    backbone.setup()  # NEW
    return backbone

Benefits

  • ✅ Comprehensive coverage for new features
  • ✅ Regression prevention
  • ✅ Tests serve as usage examples
  • ✅ CI/CD validation

📊 Breaking Changes & Migration Guide

For End Users

| Area | Old | New | Migration |
| --- | --- | --- | --- |
| Backbone usage | output = backbone(...) returns Tensor | Returns SequenceBackboneOutput | Extract .last_hidden_state |
| Backbone setup | Automatic | Must call .setup() | Add backbone.setup() before use |
| Data modules | extra_cols + extra_col_aliases | x_col list + rename_cols dict | Update config files |
| Column names | Hardcoded "sequences" | Configurable via rename_cols | Use explicit mapping |

Example Migration

OLD CODE:

# Backbone
backbone = aido_rna_650m(None, None)
output = backbone(input_ids, attention_mask)  # Returns Tensor
embeddings = output

# Data module
dm = SequenceClassificationDataModule(
    x_col="sequence",
    extra_cols=["metadata"],
    extra_col_aliases=["meta"]
)

NEW CODE:

# Backbone
backbone = aido_rna_650m(None, None)
backbone.setup()  # Required!
output = backbone(input_ids, attention_mask)  # Returns SequenceBackboneOutput
embeddings = output.last_hidden_state

# Data module
dm = SequenceClassificationDataModule(
    x_col=["sequence", "metadata"],
    rename_cols={"metadata": "meta"}
)

Files Changed

1. Development Infrastructure Modernization

Pre-commit Hooks Integration

  • ✅ Added .pre-commit-config.yaml with:
    • Ruff formatter (max line length 100)
    • Trailing whitespace checks
    • YAML validation
    • Poetry dependency validation
  • ✅ Updated CONTRIBUTING.md with pre-commit setup instructions

Dependency Management Migration

  • ✅ Migrated from pip-compile to Poetry
  • ✅ Added poetry.lock file (7,934 lines) for reproducible builds
  • ✅ Better dependency resolution and lock file management

Docker Updates

  • ✅ Updated base CUDA image from 12.1.0 to 12.4.0
  • ✅ Updated PyTorch Geometric dependencies from cu121 to cu124
  • ✅ Fixed missing RUN command in Dockerfile (line 21)

2. New Model Support: Scimilarity

Model Integration

  • ✅ Added modelgenerator/huggingface_models/scimilarity/nn_models.py (206 lines)
  • ✅ Added modelgenerator/huggingface_models/scimilarity/model_v1.1/layer_sizes.json
  • ✅ Added gene list: modelgenerator/cell/gene_lists/scimilarity_genes.tsv (28,232 gene names)
  • ✅ Updated backbone support to include Scimilarity in cell/gene expression models
  • ✅ Modified modelgenerator/cell/utils.py (109 line changes)

3. RNA Inverse Folding Enhancements

New Zero-shot Analysis Pipeline

  • ✅ Added modelgenerator/rna_inv_fold/zeroshot_analyses.py (163 lines)
  • ✅ Implements iterative denoising for RNA inverse folding
  • ✅ Uses masked language modeling with configurable hyperparameters
  • ✅ Supports ablation studies for denoising steps and mask ratios
  • ✅ Uses AIDO.RNA-1.6B model with log-mixing approach (inspired by LM-Design paper)
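The iterative-denoising loop can be sketched with a pluggable predictor standing in for AIDO.RNA-1.6B. This is a heavily hedged illustration: the masking schedule, function names, and fill rule here are assumptions, and the PR's actual log-mixing rule is not reproduced.

```python
import random
from typing import Callable, List, Tuple

def iterative_denoise(
    length: int,
    predict: Callable[[List[str]], List[Tuple[str, float]]],
    steps: int = 3,
    mask_ratio: float = 0.5,
    mask_token: str = "?",
    seed: int = 0,
) -> List[str]:
    """Fill a fully masked sequence, then repeatedly re-mask a shrinking
    fraction of positions and refill them from the predictor's
    (token, confidence) outputs."""
    rng = random.Random(seed)
    seq = [mask_token] * length
    seq = [token for token, _confidence in predict(seq)]  # initial fill
    for step in range(1, steps + 1):
        k = int(length * mask_ratio * (1 - step / steps))  # anneal masking to zero
        if k == 0:
            break
        positions = rng.sample(range(length), k)
        masked = list(seq)
        for i in positions:
            masked[i] = mask_token
        predictions = predict(masked)
        for i in positions:
            seq[i], _confidence = predictions[i]  # accept the refined token
    return seq

# Toy predictor: always proposes "A" with full confidence.
result = iterative_denoise(4, lambda s: [("A", 1.0)] * len(s))
print("".join(result))  # AAAA with this toy predictor
```

Re-masking lets early low-confidence guesses be revisited in light of the rest of the sequence, which is the core idea behind iterative MLM-based inverse folding.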

RNA Task Updates

  • ✅ Modified modelgenerator/rna_ss/rna_ss_task.py (51 line changes)
  • ✅ Updated modelgenerator/rna_ss/rna_ss_data.py (11 line changes)
  • ✅ Minor fix in modelgenerator/rna_inv_fold/data_inverse_folding/dataset.py

4. Code Quality & Documentation

Documentation System Enhancements

  • ✅ Added modelgenerator/utils/kwargs_doc.py (465 lines)
  • ✅ Added modelgenerator/utils/griffe_kwargs_extension.py (311 lines)
  • ✅ Implements GoogleKwargsDocstringInheritanceInitMeta for automatic kwargs parameter inheritance
  • ✅ Added new documentation page: docs/docs/usage/embedding_caching.md
  • ✅ Extensive formatting fixes across all documentation (trailing whitespace, end-of-file fixes)
  • ✅ Updated all README examples to include proper data column mappings (x_col, y_col, rename_cols)

Code Formatting

  • ✅ Applied Ruff formatting across entire codebase (100 character line limit)
  • ✅ Fixed trailing whitespace and end-of-file issues throughout
  • ✅ Many files show only whitespace/formatting changes

5. Core Architecture Refactoring

Backbone Architecture (1,996 lines changed)

  • ✅ Major refactoring of modelgenerator/backbones/backbones.py (1,333 line changes)
  • ✅ Updated modelgenerator/backbones/base.py (663 line changes)
  • ✅ Updated modelgenerator/backbones/__init__.py (127 line changes)

Data Module Refactoring (1,522 lines changed)

  • ✅ Updated modelgenerator/data/data.py (1,522 line changes)
  • ✅ Updated modelgenerator/data/__init__.py (380 line changes)
  • ✅ Modified modelgenerator/data/base.py (33 line changes)

Task System Updates (1,133 lines changed)

  • ✅ Updated modelgenerator/tasks/tasks.py (1,133 line changes)
  • ✅ Updated modelgenerator/tasks/base.py (159 line changes)

Adapter Updates (215 lines changed)

  • ✅ Modified modelgenerator/adapters/adapters.py (79 line changes)
  • ✅ Updated modelgenerator/adapters/fusion.py (136 line changes)

6. Experiment Configuration Updates

New Configurations

  • ✅ Added experiments/AIDO.Protein/xTrimo/configs/stability_prediction.yaml

Configuration Updates

  • ✅ Updated numerous YAML configs across AIDO.Cell, AIDO.DNA, AIDO.Protein, AIDO.RNA, AIDO.Tissue
  • ✅ Generally minor formatting and parameter adjustments
  • ✅ Removed deleted Jupyter notebook from AIDO.Cell experiments

7. Testing Infrastructure

Test Organization

  • ✅ Added __init__.py files to test directories for better organization
  • ✅ Added tests/backbones/test_base.py (311 lines)
  • ✅ Updated existing test files with new assertions and test cases
  • ✅ Modified tests/conftest.py (57 line changes)

@cnellington cnellington merged commit f1f40e5 into main Nov 8, 2025
3 checks passed
@cnellington cnellington deleted the update/nov25 branch November 8, 2025 20:49