# MOE Baseline

Ridge, LightGBM, and MLP regression models trained on MOE (Molecular Operating Environment) molecular descriptors with nested cross-validation.

## Description

This baseline uses pre-computed MOE molecular descriptors to predict five antibody biophysical properties. Each property uses an optimized model configuration with **per-fold feature selection** to prevent data leakage. Features are selected independently for each cross-validation fold using LASSO, XGBoost, and Consensus (union of LASSO and XGBoost) methodologies.

**Model configurations:**
- **HIC**: Ridge regression (~30 Consensus features per fold)
- **PR_CHO**: LightGBM (~30 Consensus features per fold)
- **AC-SINS_pH7.4**: Ridge regression (~30 Consensus features per fold)
- **Titer**: MLP neural network (~31 Consensus features per fold)
- **Tm2**: LightGBM (~30 Consensus features per fold)
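
The per-property configurations above can be collected in a single registry; a minimal sketch (the names `MODEL_CONFIGS` and `make_model` are illustrative, not the baseline's actual code, and scikit-learn estimators stand in where possible):

```python
# Hypothetical registry mapping each property to the model family and
# feature set listed above (illustrative, not the baseline's real code).
MODEL_CONFIGS = {
    "HIC":           {"model": "ridge",    "feature_set": "Consensus"},
    "PR_CHO":        {"model": "lightgbm", "feature_set": "Consensus"},
    "AC-SINS_pH7.4": {"model": "ridge",    "feature_set": "Consensus"},
    "Titer":         {"model": "mlp",      "feature_set": "Consensus"},
    "Tm2":           {"model": "lightgbm", "feature_set": "Consensus"},
}

def make_model(name: str, seed: int = 42):
    """Instantiate one model family for a property."""
    if name == "ridge":
        from sklearn.linear_model import Ridge
        return Ridge()
    if name == "mlp":
        from sklearn.neural_network import MLPRegressor
        return MLPRegressor(hidden_layer_sizes=(64,), early_stopping=True,
                            random_state=seed)
    if name == "lightgbm":
        # Deferred import so the sketch runs without lightgbm installed.
        from lightgbm import LGBMRegressor
        return LGBMRegressor(random_state=seed)
    raise ValueError(f"unknown model family: {name}")
```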

## Expected Performance

Based on 5-fold nested cross-validation on GDPa1 (unbiased estimates with per-fold feature selection):

| Property | Model | Avg Features | Spearman ρ (test) |
|----------|-------|--------------|-------------------|
| HIC | Ridge | ~30 | 0.656 |
| PR_CHO | LightGBM | ~30 | 0.353 |
| AC-SINS_pH7.4 | Ridge | ~30 | 0.424 |
| Titer | MLP | ~31 | 0.184 |
| Tm2 | LightGBM | ~30 | 0.107 |
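
The table reports the mean test-fold Spearman ρ; aggregating per-fold scores is a short SciPy computation (synthetic data here, for illustration only):

```python
import numpy as np
from scipy.stats import spearmanr

def fold_spearman(y_true, y_pred):
    """Spearman rank correlation for a single held-out fold."""
    rho, _ = spearmanr(y_true, y_pred)
    return float(rho)

# Synthetic stand-in for the five held-out folds of one property.
rng = np.random.default_rng(42)
fold_scores = []
for _ in range(5):
    y_true = rng.normal(size=50)
    y_pred = y_true + rng.normal(scale=0.8, size=50)  # noisy predictions
    fold_scores.append(fold_spearman(y_true, y_pred))

mean_rho, std_rho = float(np.mean(fold_scores)), float(np.std(fold_scores))
```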

## Requirements

MOE molecular descriptors capture structural, electrostatic, hydrophobic, and geometric properties.

### Feature Selection

Features are selected independently for each cross-validation fold using a nested CV approach to prevent data leakage:

1. **Per-fold selection**: For each of the 5 folds, features are selected using only the training data (80% of samples) from that fold
2. **Three feature sets tested**:
- **All_MOE**: All 246 MOE descriptors (baseline)
- **LASSO**: Features selected by L1 regularization (alpha tuned via internal CV on training data)
- **Consensus**: Union of LASSO and XGBoost-derived features (XGBoost features: top features covering 90% cumulative SHAP importance)
3. **Pre-computed features**: Selected features for each fold are stored in JSON files (`*_fold_features_updated_feature_selection.json`)

The best-performing configurations used Consensus features for most properties (HIC, PR_CHO, Tm2), LASSO for AC-SINS, and All_MOE for Titer. This approach ensures test data never influences feature selection, producing unbiased performance estimates.
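
The per-fold Consensus selection can be sketched as follows. This is a minimal illustration on toy data: `LassoCV` plays the LASSO step, and a gain-based `GradientBoostingRegressor` importance stands in for the XGBoost SHAP ranking used by the baseline.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

def consensus_features(X_train, y_train, feature_names, cum_importance=0.90):
    """Union of LASSO-selected features and top tree features covering
    `cum_importance` of total importance (gain importance stands in for SHAP)."""
    Xs = StandardScaler().fit_transform(X_train)

    # LASSO: alpha tuned via internal CV on the training fold only.
    lasso = LassoCV(cv=5, random_state=42).fit(Xs, y_train)
    lasso_feats = {f for f, c in zip(feature_names, lasso.coef_) if c != 0.0}

    # Tree importance: keep top-ranked features up to the cumulative threshold.
    gbm = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    order = np.argsort(gbm.feature_importances_)[::-1]
    cum = np.cumsum(gbm.feature_importances_[order])
    k = int(np.searchsorted(cum, cum_importance)) + 1
    tree_feats = {feature_names[i] for i in order[:k]}

    return sorted(lasso_feats | tree_feats)

# Toy fold: 80 training samples, 20 descriptors, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=80)
names = [f"desc_{i}" for i in range(20)]
selected = consensus_features(X, y, names)
```

Because selection runs only on each fold's training split, the held-out 20% never influences which descriptors enter the model.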

### Model Selection

For each property, Ridge, XGBoost, LightGBM, and MLP models were compared across the three feature sets (All_MOE, LASSO, Consensus). Best configurations were selected based on 5-fold nested cross-validation performance:
- **Ridge**: Best for HIC and AC-SINS (linear relationships)
- **LightGBM**: Best for PR_CHO and Tm2 (captures non-linear patterns)
- **MLP**: Best for Titer (complex interactions, single hidden layer)

Models use Consensus features by default, which combine LASSO and XGBoost-derived selections for robust feature coverage.
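
The comparison loop can be sketched like this (synthetic data; only Ridge and MLP are included so the example runs with scikit-learn alone — LightGBM and XGBoost would slot into the same `models` dict):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MOE features (the real data has 246 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 15))
y = X @ rng.normal(size=15) + rng.normal(scale=0.5, size=100)

# Candidate feature sets as column indices (illustrative only).
feature_sets = {"All_MOE": list(range(15)), "Consensus": [0, 1, 2, 3, 4]}
models = {
    "ridge": lambda: Ridge(),
    "mlp": lambda: MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                                random_state=1),
}

def cv_spearman(make_model, cols):
    """Mean Spearman rho over 5 outer folds for one (model, feature set) pair."""
    scores = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
        scaler = StandardScaler().fit(X[np.ix_(tr, cols)])
        model = make_model().fit(scaler.transform(X[np.ix_(tr, cols)]), y[tr])
        pred = model.predict(scaler.transform(X[np.ix_(te, cols)]))
        scores.append(spearmanr(y[te], pred)[0])
    return float(np.mean(scores))

results = {(m, f): cv_spearman(mk, cols)
           for m, mk in models.items() for f, cols in feature_sets.items()}
best_model, best_features = max(results, key=results.get)
```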

### Prediction

The model automatically detects which fold is being predicted based on the fold column in the input data:
1. **Fold detection**: Identifies test samples by missing values in the fold column
2. **Feature loading**: Loads the appropriate pre-selected features for that fold from JSON files
3. **Standardization**: Features are standardized using training set statistics from that fold
4. **Model application**: Applies the trained model (Ridge, LightGBM, or MLP) to generate predictions

This ensures each prediction uses only features that were selected from its corresponding training data, maintaining the integrity of nested cross-validation.
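
Step 1 above can be sketched as follows. `detect_fold_id` is a hypothetical stand-in for the baseline's `_get_fold_id`; the rule that the test fold's id is the one label absent from the training rows is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def detect_fold_id(df: pd.DataFrame, n_folds: int = 5) -> int:
    """Infer which fold is being predicted (stand-in for `_get_fold_id`).

    Test rows carry NaN in the fold column; the test fold's id is assumed
    to be the one label absent from the remaining training rows.
    """
    seen = set(df["fold"].dropna().astype(int))
    missing = set(range(n_folds)) - seen
    if len(missing) != 1:
        raise ValueError(f"cannot identify a unique test fold: {sorted(missing)}")
    return missing.pop()

# Toy input: training rows labelled 0-3, held-out fold-4 rows left as NaN.
df = pd.DataFrame({"fold": [0, 1, 2, 3, np.nan, np.nan]})
fold_id = detect_fold_id(df)  # -> 4
```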

## Implementation

This baseline implements the `BaseModel` interface from `abdev_core` with nested cross-validation support:

```python
from abdev_core import BaseModel, load_features
class MoeBaselineModel(BaseModel):
    def train(self, df: pd.DataFrame, run_dir: Path, *, seed: int = 42) -> None:
        # Load MOE features from centralized store
        moe_features = load_features("MOE_properties")

        # Detect current fold from data
        fold_id = self._get_fold_id(df)

        # Train one model per property with its fold-specific features
        for property_name in self.properties:
            # Load pre-selected features for this fold
            fold_features = self._get_fold_features(property_name, fold_id)
            # Fit the configured model (Ridge/LightGBM/MLP)
            # ...

    def predict(self, df: pd.DataFrame, run_dir: Path) -> pd.DataFrame:
        # Load models and MOE features
        # Automatically detect fold and use corresponding features
        # Generate predictions for all 5 properties
        # ...
```

**Key files:**
- `model.py`: Main implementation with fold-aware feature loading
- `*_fold_features_updated_feature_selection.json`: Pre-computed per-fold features (5 files, one per property)
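
A plausible shape for the per-fold feature files — the keys, nesting, and descriptor names here are assumptions for illustration, not the files' verified schema:

```python
import json
from pathlib import Path

# Assumed (hypothetical) schema: fold id -> list of selected MOE descriptors.
schema_example = {
    "0": ["FCharge", "logP(o/w)", "vsurf_Wp3"],
    "1": ["FCharge", "ASA_H"],
}

path = Path("example_fold_features.json")  # illustrative filename
path.write_text(json.dumps(schema_example, indent=2))

# At predict time, the features for the detected fold are a simple lookup.
fold_0_features = json.loads(path.read_text())["0"]
```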

Features are managed centrally by `abdev_core`. See the [abdev_core documentation](../../libs/abdev_core/README.md) for details.

## Output
Predictions are written to `<out-dir>/predictions.csv` with columns:
## References

- **MOE descriptors**: Nels Thorsteinsen
- **GDPa1 dataset**: [ginkgo-datapoints/GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1)