diff --git a/models/moe_baseline/README.md b/models/moe_baseline/README.md
index 5c0b8fa..15f154b 100644
--- a/models/moe_baseline/README.md
+++ b/models/moe_baseline/README.md
@@ -1,29 +1,29 @@
 # MOE Baseline
 
-Ridge and MLP regression models trained on MOE (Molecular Operating Environment) molecular descriptors.
+Ridge, LightGBM, and MLP regression models trained on MOE (Molecular Operating Environment) molecular descriptors with nested cross-validation.
 
 ## Description
 
-This baseline uses pre-computed MOE molecular descriptors to predict five antibody biophysical properties. Each property uses an optimized model configuration (Ridge or MLP) with property-specific feature sets selected through LASSO and Stabl methodologies.
+This baseline uses pre-computed MOE molecular descriptors to predict five antibody biophysical properties. Each property uses an optimized model configuration with **per-fold feature selection** to prevent data leakage. Features are selected independently for each cross-validation fold using LASSO, XGBoost, and Consensus (union of LASSO and XGBoost) methodologies.
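As a side note for reviewers, the per-fold Consensus selection described in the new description can be sketched as follows. This is a hypothetical illustration, not the repository's code: it uses scikit-learn's `LassoCV` for the LASSO step and gain-based importances from `GradientBoostingRegressor` as a stand-in for XGBoost SHAP values, and `consensus_features` plus the synthetic descriptor data are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

def consensus_features(X_train: pd.DataFrame, y_train: np.ndarray) -> list:
    """Union of LASSO-selected descriptors and the smallest tree-model
    feature set covering 90% cumulative importance (a SHAP stand-in)."""
    # LASSO: alpha tuned via internal CV on the fold's training data only
    lasso = LassoCV(cv=5, random_state=0).fit(
        StandardScaler().fit_transform(X_train), y_train
    )
    lasso_feats = set(X_train.columns[lasso.coef_ != 0])

    # Tree model: rank features, keep the top ones covering 90% importance
    gbm = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    order = np.argsort(gbm.feature_importances_)[::-1]
    cum = np.cumsum(gbm.feature_importances_[order])
    k = int(np.searchsorted(cum, 0.9)) + 1
    tree_feats = set(X_train.columns[order[:k]])

    return sorted(lasso_feats | tree_feats)

# Synthetic stand-in for one fold's training split of MOE descriptors
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(120, 20)),
                 columns=[f"desc_{i}" for i in range(20)])
y = X["desc_0"].to_numpy() * 2.0 + rng.normal(scale=0.1, size=120)
selected = consensus_features(X, y)
print(len(selected), "desc_0" in selected)
```

Because selection runs on each fold's training split only, the resulting feature lists differ per fold, matching the JSON-per-fold storage the README describes.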
 
 **Model configurations:**
-- **HIC**: Ridge regression (11 features, α=79.1)
-- **PR_CHO**: Ridge regression (10 features, α=59.6)
-- **AC-SINS_pH7.4**: Ridge regression (23 features, α=1.5)
-- **Titer**: Ridge regression (4 features, α=59.6)
-- **Tm2**: MLP neural network (8 features, 1 hidden layer)
+- **HIC**: Ridge regression (~30 Consensus features per fold)
+- **PR_CHO**: LightGBM (~30 Consensus features per fold)
+- **AC-SINS_pH7.4**: Ridge regression (~30 Consensus features per fold)
+- **Titer**: MLP neural network (~31 Consensus features per fold)
+- **Tm2**: LightGBM (~30 Consensus features per fold)
 
 ## Expected Performance
 
-Based on 5-fold cross-validation on GDPa1:
+Based on 5-fold nested cross-validation on GDPa1 (unbiased estimates with per-fold feature selection):
 
-| Property | Model | Features | Spearman ρ (mean ± std) |
-|----------|-------|----------|-------------------------|
-| HIC | Ridge | 11 | 0.684 ± 0.091 |
-| PR_CHO | Ridge | 10 | 0.436 ± 0.116 |
-| AC-SINS_pH7.4 | Ridge | 23 | 0.514 ± 0.066 |
-| Titer | Ridge | 4 | 0.308 ± 0.248 |
-| Tm2 | MLP | 8 | 0.184 ± 0.181 |
+| Property | Model | Avg Features | Spearman ρ (test) |
+|----------|-------|--------------|-------------------|
+| HIC | Ridge | ~30 | 0.656 |
+| PR_CHO | LightGBM | ~30 | 0.353 |
+| AC-SINS_pH7.4 | Ridge | ~30 | 0.424 |
+| Titer | MLP | ~31 | 0.184 |
+| Tm2 | LightGBM | ~30 | 0.107 |
 
 ## Requirements
 
@@ -97,23 +97,39 @@ MOE molecular descriptors capture structural, electrostatic, hydrophobic, geomet
 ### Feature Selection
 
-Features were selected using two complementary methods:
-- **LASSO**: L1 regularization for sparse feature selection
-- **Stabl**: Bootstrap LASSO with FDR control for stable selection (Hédou et al., 2024)
+Features are selected independently for each cross-validation fold using a nested CV approach to prevent data leakage:
 
-Final feature sets range from 4 features (Titer) to 23 features (AC-SINS_pH7.4), balancing predictive power with model simplicity.
+1. **Per-fold selection**: For each of the 5 folds, features are selected using only the training data (80% of samples) from that fold
+2. **Three feature sets tested**:
+   - **All_MOE**: All 246 MOE descriptors (baseline)
+   - **LASSO**: Features selected by L1 regularization (alpha tuned via internal CV on training data)
+   - **Consensus**: Union of LASSO and XGBoost-derived features (XGBoost features: top features covering 90% cumulative SHAP importance)
+3. **Pre-computed features**: Selected features for each fold are stored in JSON files (`*_fold_features_updated_feature_selection.json`)
+
+Best models selected Consensus for most properties (HIC, PR_CHO, Tm2), LASSO for AC-SINS, and All_MOE for Titer. This approach ensures test data never influences feature selection, producing unbiased performance estimates.
 
 ### Model Selection
 
-For each property, Ridge, XGBoost, LightGBM, and MLP models were compared across multiple feature sets. Best configurations were selected based on 5-fold cross-validation performance.
+For each property, Ridge, XGBoost, LightGBM, and MLP models were compared across the three feature sets (All_MOE, LASSO, Consensus). Best configurations were selected based on 5-fold nested cross-validation performance:
+- **Ridge**: Best for HIC and AC-SINS (linear relationships)
+- **LightGBM**: Best for PR_CHO and Tm2 (captures non-linear patterns)
+- **MLP**: Best for Titer (complex interactions, single hidden layer)
+
+Models use Consensus features by default, which combine LASSO and XGBoost-derived selections for robust feature coverage.
 
 ### Prediction
 
-Features are standardized using training set statistics. Ridge models apply linear regression; MLP models use a single hidden layer with early stopping for Tm2.
+The model automatically detects which fold is being predicted based on the fold column in the input data:
+1. **Fold detection**: Identifies test samples by missing values in the fold column
+2. **Feature loading**: Loads the appropriate pre-selected features for that fold from JSON files
+3. **Standardization**: Features are standardized using training set statistics from that fold
+4. **Model application**: Applies the trained model (Ridge, LightGBM, or MLP) to generate predictions
+
+This ensures each prediction uses only features that were selected from its corresponding training data, maintaining the integrity of nested cross-validation.
 
 ## Implementation
 
-This baseline implements the `BaseModel` interface from `abdev_core`:
+This baseline implements the `BaseModel` interface from `abdev_core` with nested cross-validation support:
 
 ```python
 from abdev_core import BaseModel, load_features
 
@@ -122,15 +138,27 @@ class MoeBaselineModel(BaseModel):
     def train(self, df: pd.DataFrame, run_dir: Path, *, seed: int = 42) -> None:
         # Load MOE features from centralized store
         moe_features = load_features("MOE_properties")
-        # Train 5 separate models with optimized configs
+
+        # Detect current fold from data
+        fold_id = self._get_fold_id(df)
+
+        # Load pre-selected features for this fold
+        fold_features = self._get_fold_features(property_name, fold_id)
+
+        # Train model (Ridge/LightGBM/MLP) with fold-specific features
         # ...
 
     def predict(self, df: pd.DataFrame, run_dir: Path) -> pd.DataFrame:
         # Load models and MOE features
+        # Automatically detect fold and use corresponding features
         # Generate predictions for all 5 properties
         # ...
 ```
 
+**Key files:**
+- `model.py`: Main implementation with fold-aware feature loading
+- `*_fold_features_updated_feature_selection.json`: Pre-computed per-fold features (5 files, one per property)
+
 Features are managed centrally by `abdev_core`. See the [abdev_core documentation](../../libs/abdev_core/README.md) for details.
 
 ## Output
 
@@ -143,5 +171,4 @@ Predictions are written to `/predictions.csv` with columns:
 ## References
 
 - **MOE descriptors**: Nels Thorsteinsen
-- **Stabl selection**: Hédou et al. (2024), "Discovery of sparse, reliable omic biomarkers with Stabl", Nature Biotechnology
 - **GDPa1 dataset**: [ginkgo-datapoints/GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1)
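Reviewer note: the fold-aware prediction flow described in the patched README (detect test rows by a missing fold label, then load that fold's pre-selected features from JSON) can be sketched as below. All concrete names here (the `fold` column, the JSON layout mapping fold id to a feature list, `detect_fold_and_features`) are assumptions for illustration; the real interface lives in `model.py`.

```python
import json
import tempfile

import pandas as pd

def detect_fold_and_features(df: pd.DataFrame, features_path: str):
    """Return the held-out fold id and its pre-selected feature list."""
    # Test samples are the rows with a missing value in the fold column
    test_mask = df["fold"].isna()
    # The held-out fold id is the one absent from the labelled training rows
    fold_id = (set(range(5)) - set(df.loc[~test_mask, "fold"].astype(int))).pop()
    # Load the features that were pre-selected on this fold's training data
    with open(features_path) as fh:
        return fold_id, json.load(fh)[str(fold_id)]

# Toy data: folds 0-4, with fold 3 held out (its fold label is missing)
demo = pd.DataFrame({"fold": [0.0, 1.0, 2.0, 4.0, None]})
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as fh:
    json.dump({str(i): [f"desc_{i}"] for i in range(5)}, fh)
    path = fh.name

fold_id, feats = detect_fold_and_features(demo, path)
print(fold_id, feats)  # → 3 ['desc_3']
```

Keying every prediction to the fold it belongs to is what preserves the nested-CV guarantee: the feature list applied at test time was chosen without ever seeing the test rows.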