Summary
Introduce an optional, cross-validation (CV)–based model selection mode in PERSEO to choose the best GAMLSS family for each feature by predictive performance rather than by in-sample information criteria alone. This mode will evaluate candidate families using out-of-sample log-score (and optionally CRPS) computed on the original data scale, ensuring fair comparisons even when families require different pre-fit transformations.
This complements the current Jacobian-corrected AIC/BIC/GAIC approach and provides a robust, model-agnostic fallback when information criteria are inconclusive or assumptions are borderline.
Motivation
PERSEO’s core philosophy is to let the model adapt to the data, not force the data to fit a fixed distribution or global normalization. Today we select families using penalized likelihood (AIC/BIC/GAIC). That’s fast and statistically principled, especially with Jacobian correction when transformations are applied.
However, even with corrections, predictive validity can be a stronger practical criterion in complex omics settings (e.g., small *n*, outliers, distributional heterogeneity). Cross-validation directly asks: which family best predicts unseen observations given the covariates? This aligns tightly with the downstream goal of stable inference and generalization.
Why cross-validation?
- Cross-validation splits the samples (for one gene/feature) into folds. We train on some folds and test on the held-out fold, rotating so every sample is predicted exactly once.
- For each candidate GAMLSS family, we fit the model on the training folds, then compute how well it predicts the held-out values. We summarize predictive quality via:
  - Log-score: the sum of log predictive densities the model assigns to the held-out observations. Higher is better. It rewards models that assign high probability to what actually occurs and penalizes overconfident mistakes.
  - CRPS (Continuous Ranked Probability Score, optional): a proper scoring rule that compares the full predictive distribution to the observed value on the original scale. Lower is better.
- Why this is fair across families: if a family requires transforming the response (e.g., z-scoring or min–max for Beta), we still evaluate predictions on the original scale using the change-of-variables rule (i.e., adding the log of the absolute derivative of the transform). That way, all families are judged by how well they model the real data we care about, not an internal working scale.
- Result: the selected family is the one that generalizes best, not merely the one that fits the training data most tightly.
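The fold-rotation and log-score comparison described above can be sketched as follows. This is a minimal illustration, not PERSEO's implementation: scipy distributions stand in for GAMLSS families, and the names `cv_log_score` and `kfold_indices` are hypothetical. The lognormal candidate is fitted on a log-transformed response, so its log-score includes the Jacobian term (−log y) that puts it back on the original data scale, exactly as the change-of-variables rule above requires.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def kfold_indices(n, k, rng):
    """Shuffle the sample indices and split them into k disjoint folds."""
    return np.array_split(rng.permutation(n), k)

# Candidate "families" (illustrative stand-ins for GAMLSS families).
# fit() returns parameters; logpdf() scores held-out y on the ORIGINAL scale.
# A family fitted on a transformed response adds the log-Jacobian of the
# transform, so every candidate is judged on the same scale.
families = {
    "gaussian": {
        "fit": lambda y: (y.mean(), y.std(ddof=1)),
        "logpdf": lambda y, p: stats.norm.logpdf(y, loc=p[0], scale=p[1]),
    },
    "lognormal": {  # Gaussian fit to log(y); Jacobian of the log transform is -log(y)
        "fit": lambda y: (np.log(y).mean(), np.log(y).std(ddof=1)),
        "logpdf": lambda y, p: stats.norm.logpdf(np.log(y), loc=p[0], scale=p[1])
                               - np.log(y),
    },
}

def cv_log_score(y, family, k=5, rng=rng):
    """Mean out-of-sample log-score: every sample is predicted exactly once."""
    total = 0.0
    for test_idx in kfold_indices(len(y), k, rng):
        train = np.delete(y, test_idx)          # train on the remaining folds
        params = family["fit"](train)
        total += family["logpdf"](y[test_idx], params).sum()
    return total / len(y)

# Example feature: right-skewed positive data (so both candidates are valid).
y = rng.lognormal(mean=1.0, sigma=0.8, size=200)
scores = {name: cv_log_score(y, fam) for name, fam in families.items()}
best = max(scores, key=scores.get)  # higher log-score is better
```

Here the skewed feature favors the lognormal candidate even though the Gaussian fits its own working scale perfectly well; the Jacobian term is what makes that comparison legitimate.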
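For the optional CRPS criterion, some predictive distributions admit a closed form; the Gaussian case is a standard result (CRPS(N(μ, σ²), y) = σ[z(2Φ(z) − 1) + 2φ(z) − 1/√π] with z = (y − μ)/σ). A small sketch, with the function name `crps_gaussian` chosen here for illustration:

```python
import numpy as np
from scipy import stats

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z)
                    - 1 / np.sqrt(np.pi))

# CRPS is smallest when the observation sits at the center of the forecast
# and grows as the observation moves into the tails.
at_center = crps_gaussian(0.0, mu=0.0, sigma=1.0)   # ~0.2337
in_tail = crps_gaussian(2.0, mu=0.0, sigma=1.0)     # larger
```

Families without a closed form can be scored by averaging |F(x) − 1{y ≤ x}|² over a grid or by sampling, which keeps CRPS usable for any candidate family.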