Fix AUC cross-fitting: use GRF for better calibration

boyercb · boyercb · commit 237278fce463 · 2026-01-23T10:29:43.000-05:00
Key changes:
- Fix .predict_grf() to work with formula containing '.'
- Update propensity score truncation bounds to [0.025, 0.975]
- Truncate outcome probabilities (q_hat) for stability
- Add warning when &gt;10% of propensity scores at truncation bounds
- Update vignette to recommend GRF over ranger for AUC estimation
- Document that GRF's 'honest' estimation produces well-calibrated
  probabilities needed for the doubly robust AUC estimator

The DR AUC estimator is sensitive to poorly calibrated probability
estimates due to its pairwise/outer-product structure. Standard
random forests (ranger) can produce extreme predictions that
destabilize the estimator. GRF's probability_forest with honesty
produces estimates nearly identical to GLM in our tests.
diff --git a/R/ml_learner.R b/R/ml_learner.R
@@ -330,8 +330,17 @@ print.ml_learner <- function(x, ...) {
 }
 
 .predict_grf <- function(fit, newdata) {
-  formula <- attr(fit, "formula")
-  x <- model.matrix(formula, data = newdata)[, -1, drop = FALSE]
+  # GRF stores the training data columns - use those names
+  # For probability_forest and regression_forest, the X matrix column names are stored
+  train_cols <- colnames(fit$X.orig)
+
+  if (is.null(train_cols)) {
+    # Fallback: use all columns from newdata
+    x <- as.matrix(newdata)
+  } else {
+    # Use the same columns as training
+    x <- as.matrix(newdata[, train_cols, drop = FALSE])
+  }
 
   pred <- predict(fit, newdata = x)
 
diff --git a/R/variance.R b/R/variance.R
@@ -553,7 +553,10 @@ NULL
     if (treatment_level == 0) {
       ps_pred <- 1 - ps_pred
     }
-    ps_cf[val_idx] <- pmax(pmin(ps_pred, 0.99), 0.01)
+
+    # Truncate propensity scores for stability
+    # Use 0.025-0.975 bounds to avoid extreme weights while allowing flexibility
+    ps_cf[val_idx] <- pmax(pmin(ps_pred, 0.975), 0.025)
 
     # Fit outcome model on training fold (among treated)
     subset_train <- train_idx[treatment[train_idx] == treatment_level]
@@ -577,7 +580,8 @@ NULL
     }
 
     # Store outcome probability (q_hat) for AUC
-    q_cf[val_idx] <- pY
+    # Truncate to avoid extreme values that cause instability in DR estimator
+    q_cf[val_idx] <- pmax(pmin(pY, 0.99), 0.01)
 
     # Compute conditional loss: E[(Y - pred)^2 | X, A=a] = p(1-p) + (p - pred)^2
     # For binary Y: E[Y^2] = p, so E[(Y - pred)^2] = p - 2*p*pred + pred^2
@@ -700,8 +704,18 @@ NULL
   # Treatment indicator
   I_a <- as.numeric(treatment == treatment_level)
 
-  # Truncate propensity scores for stability
-  ps <- pmax(pmin(ps, 0.99), 0.01)
+  # Note: propensity scores are already truncated in .cross_fit_nuisance()
+  # Additional truncation here is defensive
+
+  # Check for extreme propensity scores and warn
+  ps_extreme <- sum(ps <= 0.025 | ps >= 0.975)
+  if (ps_extreme > 0.1 * n) {
+    warning(sprintf(
+      "%.0f%% of propensity scores are at truncation bounds. ",
+      100 * ps_extreme / n
+    ), "Consider using a simpler propensity model or more regularization.",
+    call. = FALSE)
+  }
 
   # Concordance indicator matrix (f_i > f_j)
   ind_f <- outer(predictions, predictions, ">")
diff --git a/vignettes/ml-integration.Rmd b/vignettes/ml-integration.Rmd
@@ -293,17 +293,23 @@ print(result_mse)
 
 ### AUC with ML Learners
 
-```{r auc-final-example, eval=requireNamespace("ranger", quietly = TRUE)}
-# AUC estimation with ML nuisance models
+For AUC estimation, we recommend using **GRF (Generalized Random Forests)** 
+rather than standard random forests. GRF's "honesty" property produces 
+well-calibrated probability estimates, which is critical for the doubly robust 
+AUC estimator. Standard random forests can produce extreme probability 
+predictions that destabilize the estimator.
+
+```{r auc-final-example, eval=requireNamespace("grf", quietly = TRUE)}
+# AUC estimation with GRF nuisance models (recommended)
 result_auc <- cf_auc(
   predictions = pred,
   outcomes = y,
   treatment = a,
   covariates = df,
   treatment_level = 0,
   estimator = "dr",
-  propensity_model = ml_learner("ranger", num.trees = 100),
-  outcome_model = ml_learner("ranger", num.trees = 100),
+  propensity_model = ml_learner("grf", num.trees = 500),
+  outcome_model = ml_learner("grf", num.trees = 500),
   cross_fit = TRUE,
   n_folds = 5,
   se_method = "influence"
@@ -328,7 +334,12 @@ print(result_auc)
    exploration, use fewer trees/rounds, then increase for final analysis.
 
 5. **Check for extreme propensity scores**: ML methods can produce very extreme 
-   propensity scores. The package truncates these at [0.01, 0.99] by default.
+   propensity scores. The package truncates these at [0.025, 0.975] by default.
+
+6. **Use GRF for AUC estimation**: The doubly robust AUC estimator is sensitive 
+   to poorly calibrated probability estimates. GRF's "honest" estimation 
+   produces well-calibrated probabilities, while standard random forests 
+   (ranger) can produce extreme predictions that destabilize the estimator.
 
 ## References