Summary
Introduce an optional, cross-validation (CV)–based model selection mode in PERSEO to choose the best GAMLSS family for each feature by predictive performance rather than by in-sample information criteria alone. This mode will evaluate candidate families using out-of-sample log-score (and optionally CRPS) computed on the original data scale, ensuring fair comparisons even when families require different pre-fit transformations.
This complements the current Jacobian-corrected AIC/BIC/GAIC approach and provides a robust, model-agnostic fallback when information criteria are inconclusive or assumptions are borderline.
Motivation
PERSEO’s core philosophy is to let the model adapt to the data, not force the data to fit a fixed distribution or global normalization. Today we select families using penalized likelihood (AIC/BIC/GAIC). That’s fast and statistically principled, especially with Jacobian correction when transformations are applied.
However, even with corrections, predictive validity can be a stronger practical criterion in complex omics settings (e.g., small *n*, outliers, distributional heterogeneity). Cross-validation directly asks: which family best predicts unseen observations given the covariates? This aligns tightly with the downstream goal of stable inference and generalization.
Why cross-validation?
- Cross-validation splits the samples (for one gene/feature) into folds. We train on some folds and test on the held-out fold, rotating so every sample is predicted exactly once.
- For each candidate GAMLSS family, we fit the model on the training folds, then compute how well it predicts the held-out values. We summarize predictive quality via:
  - Log-score: the sum of log predictive densities the model assigns to the held-out observations. Higher is better. It rewards models that assign high probability to what actually occurs and penalizes overconfident mistakes.
  - CRPS (Continuous Ranked Probability Score, optional): a proper scoring rule that compares the full predictive distribution to the observed value on the original scale. Lower is better.
- Why this is fair across families: if a family requires transforming the response (e.g., z-scoring or min–max for Beta), we still evaluate predictions on the original scale using the change-of-variables rule (i.e., adding the log of the absolute derivative of the transform). That way, all families are judged by how well they model the real data we care about, not an internal working scale.
- Result: the selected family is the one that generalizes best, not merely the one that fits the training data most tightly.
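The fold-rotation and log-score comparison described above can be sketched as follows. This is a minimal illustration, not PERSEO's implementation: scipy distributions stand in for GAMLSS families, and the names `cv_log_score` and `kfold_indices` are hypothetical. The lognormal candidate is fitted on a log-transformed response, so its log-score includes the Jacobian term (−log y) that puts it back on the original data scale, exactly as the change-of-variables rule above requires.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def kfold_indices(n, k, rng):
    """Shuffle the sample indices and split them into k disjoint folds."""
    return np.array_split(rng.permutation(n), k)

# Candidate "families" (illustrative stand-ins for GAMLSS families).
# fit() returns parameters; logpdf() scores held-out y on the ORIGINAL scale.
# A family fitted on a transformed response adds the log-Jacobian of the
# transform, so every candidate is judged on the same scale.
families = {
    "gaussian": {
        "fit": lambda y: (y.mean(), y.std(ddof=1)),
        "logpdf": lambda y, p: stats.norm.logpdf(y, loc=p[0], scale=p[1]),
    },
    "lognormal": {  # Gaussian fit to log(y); Jacobian of the log transform is -log(y)
        "fit": lambda y: (np.log(y).mean(), np.log(y).std(ddof=1)),
        "logpdf": lambda y, p: stats.norm.logpdf(np.log(y), loc=p[0], scale=p[1])
                               - np.log(y),
    },
}

def cv_log_score(y, family, k=5, rng=rng):
    """Mean out-of-sample log-score: every sample is predicted exactly once."""
    total = 0.0
    for test_idx in kfold_indices(len(y), k, rng):
        train = np.delete(y, test_idx)          # train on the remaining folds
        params = family["fit"](train)
        total += family["logpdf"](y[test_idx], params).sum()
    return total / len(y)

# Example feature: right-skewed positive data (so both candidates are valid).
y = rng.lognormal(mean=1.0, sigma=0.8, size=200)
scores = {name: cv_log_score(y, fam) for name, fam in families.items()}
best = max(scores, key=scores.get)  # higher log-score is better
```

Here the skewed feature favors the lognormal candidate even though the Gaussian fits its own working scale perfectly well; the Jacobian term is what makes that comparison legitimate.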
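For the optional CRPS criterion, some predictive distributions admit a closed form; the Gaussian case is a standard result (CRPS(N(μ, σ²), y) = σ[z(2Φ(z) − 1) + 2φ(z) − 1/√π] with z = (y − μ)/σ). A small sketch, with the function name `crps_gaussian` chosen here for illustration:

```python
import numpy as np
from scipy import stats

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

# CRPS is smallest when the observation sits at the center of the forecast
# and grows as the observation moves into the tails.
at_center = crps_gaussian(0.0, mu=0.0, sigma=1.0)   # ~0.2337
in_tail = crps_gaussian(2.0, mu=0.0, sigma=1.0)     # larger
```

Families without a closed form can be scored by averaging |F(x) − 1{y ≤ x}|² over a grid or by sampling, which keeps CRPS usable for any candidate family.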