# MOE Baseline

Ridge, LightGBM, and MLP regression models trained on MOE (Molecular Operating Environment) molecular descriptors with nested cross-validation.

## Description

This baseline uses pre-computed MOE molecular descriptors to predict five antibody biophysical properties. Each property uses an optimized model configuration with **per-fold feature selection** to prevent data leakage. Features are selected independently for each cross-validation fold using LASSO, XGBoost, and Consensus (union of LASSO and XGBoost) methodologies.

**Model configurations:**
- **HIC**: Ridge regression (~30 Consensus features per fold)
- **PR_CHO**: LightGBM (~30 Consensus features per fold)
- **AC-SINS_pH7.4**: Ridge regression (~30 Consensus features per fold)
- **Titer**: MLP neural network (~31 Consensus features per fold)
- **Tm2**: LightGBM (~30 Consensus features per fold)
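
The per-property configurations above can be collected in a single registry; a minimal sketch (the names `MODEL_CONFIGS` and `make_model` are illustrative, not the baseline's actual code, and scikit-learn estimators stand in where possible):

```python
# Hypothetical registry mapping each property to the model family and
# feature set listed above (illustrative, not the baseline's real code).
MODEL_CONFIGS = {
    "HIC":           {"model": "ridge",    "feature_set": "Consensus"},
    "PR_CHO":        {"model": "lightgbm", "feature_set": "Consensus"},
    "AC-SINS_pH7.4": {"model": "ridge",    "feature_set": "Consensus"},
    "Titer":         {"model": "mlp",      "feature_set": "Consensus"},
    "Tm2":           {"model": "lightgbm", "feature_set": "Consensus"},
}

def make_model(name: str, seed: int = 42):
    """Instantiate one model family for a property."""
    if name == "ridge":
        from sklearn.linear_model import Ridge
        return Ridge()
    if name == "mlp":
        from sklearn.neural_network import MLPRegressor
        return MLPRegressor(hidden_layer_sizes=(64,), early_stopping=True,
                            random_state=seed)
    if name == "lightgbm":
        # Deferred import so the sketch runs without lightgbm installed.
        from lightgbm import LGBMRegressor
        return LGBMRegressor(random_state=seed)
    raise ValueError(f"unknown model family: {name}")
```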

## Expected Performance

Based on 5-fold nested cross-validation on GDPa1 (unbiased estimates with per-fold feature selection):

| Property | Model | Avg Features | Spearman ρ (test) |
|----------|-------|--------------|-------------------|
| HIC | Ridge | ~30 | 0.656 |
| PR_CHO | LightGBM | ~30 | 0.353 |
| AC-SINS_pH7.4 | Ridge | ~30 | 0.424 |
| Titer | MLP | ~31 | 0.184 |
| Tm2 | LightGBM | ~30 | 0.107 |
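
The table reports the mean test-fold Spearman ρ; aggregating per-fold scores is a short SciPy computation (synthetic data here, for illustration only):

```python
import numpy as np
from scipy.stats import spearmanr

def fold_spearman(y_true, y_pred):
    """Spearman rank correlation for a single held-out fold."""
    rho, _ = spearmanr(y_true, y_pred)
    return float(rho)

# Synthetic stand-in for the five held-out folds of one property.
rng = np.random.default_rng(42)
fold_scores = []
for _ in range(5):
    y_true = rng.normal(size=50)
    y_pred = y_true + rng.normal(scale=0.8, size=50)  # noisy predictions
    fold_scores.append(fold_spearman(y_true, y_pred))

mean_rho, std_rho = float(np.mean(fold_scores)), float(np.std(fold_scores))
```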

## Requirements

MOE molecular descriptors capture structural, electrostatic, hydrophobic, and geometric properties.

### Feature Selection

Features are selected independently for each cross-validation fold using a nested CV approach to prevent data leakage:

1. **Per-fold selection**: For each of the 5 folds, features are selected using only the training data (80% of samples) from that fold
2. **Three feature sets tested**:
- **All_MOE**: All 246 MOE descriptors (baseline)
- **LASSO**: Features selected by L1 regularization (alpha tuned via internal CV on training data)
- **Consensus**: Union of LASSO and XGBoost-derived features (XGBoost features: top features covering 90% cumulative SHAP importance)
3. **Pre-computed features**: Selected features for each fold are stored in JSON files (`*_fold_features_updated_feature_selection.json`)

The best-performing configurations used Consensus features for most properties (HIC, PR_CHO, Tm2), LASSO for AC-SINS, and All_MOE for Titer. This approach ensures test data never influences feature selection, producing unbiased performance estimates.
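
The per-fold Consensus selection can be sketched as follows. This is a minimal illustration on toy data: `LassoCV` plays the LASSO step, and a gain-based `GradientBoostingRegressor` importance stands in for the XGBoost SHAP ranking used by the baseline.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler

def consensus_features(X_train, y_train, feature_names, cum_importance=0.90):
    """Union of LASSO-selected features and top tree features covering
    `cum_importance` of total importance (gain importance stands in for SHAP)."""
    Xs = StandardScaler().fit_transform(X_train)

    # LASSO: alpha tuned via internal CV on the training fold only.
    lasso = LassoCV(cv=5, random_state=42).fit(Xs, y_train)
    lasso_feats = {f for f, c in zip(feature_names, lasso.coef_) if c != 0.0}

    # Tree importance: keep top-ranked features up to the cumulative threshold.
    gbm = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    order = np.argsort(gbm.feature_importances_)[::-1]
    cum = np.cumsum(gbm.feature_importances_[order])
    k = int(np.searchsorted(cum, cum_importance)) + 1
    tree_feats = {feature_names[i] for i in order[:k]}

    return sorted(lasso_feats | tree_feats)

# Toy fold: 80 training samples, 20 descriptors, only the first 3 informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=80)
names = [f"desc_{i}" for i in range(20)]
selected = consensus_features(X, y, names)
```

Because selection runs only on each fold's training split, the held-out 20% never influences which descriptors enter the model.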

### Model Selection

For each property, Ridge, XGBoost, LightGBM, and MLP models were compared across the three feature sets (All_MOE, LASSO, Consensus). Best configurations were selected based on 5-fold nested cross-validation performance:
- **Ridge**: Best for HIC and AC-SINS (linear relationships)
- **LightGBM**: Best for PR_CHO and Tm2 (captures non-linear patterns)
- **MLP**: Best for Titer (complex interactions, single hidden layer)

Models use Consensus features by default, which combine LASSO and XGBoost-derived selections for robust feature coverage.
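
The comparison loop can be sketched like this (synthetic data; only Ridge and MLP are included so the example runs with scikit-learn alone — LightGBM and XGBoost would slot into the same `models` dict):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for MOE features (the real data has 246 descriptors).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 15))
y = X @ rng.normal(size=15) + rng.normal(scale=0.5, size=100)

# Candidate feature sets as column indices (illustrative only).
feature_sets = {"All_MOE": list(range(15)), "Consensus": [0, 1, 2, 3, 4]}
models = {
    "ridge": lambda: Ridge(),
    "mlp": lambda: MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                                random_state=1),
}

def cv_spearman(make_model, cols):
    """Mean Spearman rho over 5 outer folds for one (model, feature set) pair."""
    scores = []
    for tr, te in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
        scaler = StandardScaler().fit(X[np.ix_(tr, cols)])
        model = make_model().fit(scaler.transform(X[np.ix_(tr, cols)]), y[tr])
        pred = model.predict(scaler.transform(X[np.ix_(te, cols)]))
        scores.append(spearmanr(y[te], pred)[0])
    return float(np.mean(scores))

results = {(m, f): cv_spearman(mk, cols)
           for m, mk in models.items() for f, cols in feature_sets.items()}
best_model, best_features = max(results, key=results.get)
```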

### Prediction

The model automatically detects which fold is being predicted based on the fold column in the input data:
1. **Fold detection**: Identifies test samples by missing values in the fold column
2. **Feature loading**: Loads the appropriate pre-selected features for that fold from JSON files
3. **Standardization**: Features are standardized using training set statistics from that fold
4. **Model application**: Applies the trained model (Ridge, LightGBM, or MLP) to generate predictions

This ensures each prediction uses only features that were selected from its corresponding training data, maintaining the integrity of nested cross-validation.
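
Step 1 above can be sketched as follows. `detect_fold_id` is a hypothetical stand-in for the baseline's `_get_fold_id`; the rule that the test fold's id is the one label absent from the training rows is an assumption for illustration.

```python
import numpy as np
import pandas as pd

def detect_fold_id(df: pd.DataFrame, n_folds: int = 5) -> int:
    """Infer which fold is being predicted (stand-in for `_get_fold_id`).

    Test rows carry NaN in the fold column; the test fold's id is assumed
    to be the one label absent from the remaining training rows.
    """
    seen = set(df["fold"].dropna().astype(int))
    missing = set(range(n_folds)) - seen
    if len(missing) != 1:
        raise ValueError(f"cannot identify a unique test fold: {sorted(missing)}")
    return missing.pop()

# Toy input: training rows labelled 0-3, held-out fold-4 rows left as NaN.
df = pd.DataFrame({"fold": [0, 1, 2, 3, np.nan, np.nan]})
fold_id = detect_fold_id(df)  # -> 4
```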

## Implementation

This baseline implements the `BaseModel` interface from `abdev_core` with nested cross-validation support:

```python
from abdev_core import BaseModel, load_features
class MoeBaselineModel(BaseModel):
    def train(self, df: pd.DataFrame, run_dir: Path, *, seed: int = 42) -> None:
        # Load MOE features from centralized store
        moe_features = load_features("MOE_properties")

        # Detect current fold from data
        fold_id = self._get_fold_id(df)

        # Train one model per property with its fold-specific features
        for property_name in self.properties:
            # Load pre-selected features for this fold
            fold_features = self._get_fold_features(property_name, fold_id)
            # Fit the configured model (Ridge/LightGBM/MLP)
            # ...

    def predict(self, df: pd.DataFrame, run_dir: Path) -> pd.DataFrame:
        # Load models and MOE features
        # Automatically detect fold and use corresponding features
        # Generate predictions for all 5 properties
        # ...
```

**Key files:**
- `model.py`: Main implementation with fold-aware feature loading
- `*_fold_features_updated_feature_selection.json`: Pre-computed per-fold features (5 files, one per property)
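
A plausible shape for the per-fold feature files — the keys, nesting, and descriptor names here are assumptions for illustration, not the files' verified schema:

```python
import json
from pathlib import Path

# Assumed (hypothetical) schema: fold id -> list of selected MOE descriptors.
schema_example = {
    "0": ["FCharge", "logP(o/w)", "vsurf_Wp3"],
    "1": ["FCharge", "ASA_H"],
}

path = Path("example_fold_features.json")  # illustrative filename
path.write_text(json.dumps(schema_example, indent=2))

# At predict time, the features for the detected fold are a simple lookup.
fold_0_features = json.loads(path.read_text())["0"]
```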

Features are managed centrally by `abdev_core`. See the [abdev_core documentation](../../libs/abdev_core/README.md) for details.

## Output
Predictions are written to `<out-dir>/predictions.csv` with columns:
## References

- **MOE descriptors**: Nels Thorsteinsen
- **GDPa1 dataset**: [ginkgo-datapoints/GDPa1](https://huggingface.co/datasets/ginkgo-datapoints/GDPa1)