---
title: "Machine Learning Integration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Machine Learning Integration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Overview

The `cfperformance` package supports flexible machine learning methods for
nuisance model estimation (propensity scores and outcome models). When ML
methods are used, cross-fitting is essential for valid inference; this is
handled automatically when you use the `ml_learner()` specification.

This vignette demonstrates how to use ML learners with the counterfactual
performance estimators.

## Why Use ML for Nuisance Models?

The doubly robust (DR) estimator requires models for:

1. **Propensity score**: $P(A = a \mid X)$, the probability of treatment given
   covariates
2. **Outcome model**: $E[Y \mid X, A = a]$, the expected outcome given
   covariates and treatment

Traditional approaches use logistic regression, which may be misspecified when
the true relationships are complex. Machine learning methods can capture
nonlinear relationships and interactions without explicit specification.

However, flexible ML methods introduce a challenge: if the same data are used
both to fit the nuisance models and to compute the estimator, the resulting
inference can be invalid. **Cross-fitting** (sample splitting) solves this by:

1. Splitting the data into K folds
2. For each fold, fitting the nuisance models on the other K - 1 folds
3. Using the held-out fold for estimation

This approach, known as "double/debiased machine learning" (Chernozhukov et
al., 2018), maintains valid inference while leveraging flexible estimation.
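
The three steps above can be sketched by hand in base R for a propensity
model. This is purely illustrative (the fold loop below mimics what the
package does internally when `cross_fit = TRUE`; it is not the package's
actual implementation):

```{r cross-fit-sketch}
# Illustrative manual 5-fold cross-fitting of a propensity model
set.seed(42)
n_cf <- 500
dat <- data.frame(x = rnorm(n_cf))
dat$a <- rbinom(n_cf, 1, plogis(0.5 * dat$x))

K <- 5
fold <- sample(rep(seq_len(K), length.out = n_cf))  # random fold assignment
ps_hat <- numeric(n_cf)

for (k in seq_len(K)) {
  # Fit on the other K - 1 folds...
  fit_k <- glm(a ~ x, family = binomial(), data = dat[fold != k, ])
  # ...then predict only on the held-out fold
  ps_hat[fold == k] <- predict(fit_k, newdata = dat[fold == k, ],
                               type = "response")
}
```

Each observation's propensity score is predicted by a model that never saw
that observation during fitting, which is what breaks the overfitting bias.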

## The `ml_learner()` Interface

The `ml_learner()` function creates a learner specification that can be passed
to the `propensity_model` or `outcome_model` arguments:

```{r setup, message=FALSE}
library(cfperformance)

# Create a learner specification
ranger_learner <- ml_learner("ranger", num.trees = 500)
print(ranger_learner)
```

### Supported Learners

| Method | Package | Description |
|--------|---------|-------------|
| `"ranger"` | ranger | Fast random forest implementation |
| `"xgboost"` | xgboost | Gradient boosting (XGBoost) |
| `"grf"` | grf | Generalized random forests with honesty |
| `"glmnet"` | glmnet | Elastic net with cross-validated λ |
| `"superlearner"` | SuperLearner | Ensemble of multiple learners |
| `"custom"` | — | User-supplied fit/predict functions |

## Basic Usage

Here's a complete example using random forests for nuisance estimation:

```{r basic-example}
# Generate example data
set.seed(123)
n <- 1000

# Covariates with nonlinear relationships
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Treatment depends on covariates (nonlinearly)
ps_true <- plogis(-0.5 + 0.8 * x1 - 0.5 * x2 + 0.3 * x1 * x2)
a <- rbinom(n, 1, ps_true)

# Outcome depends on covariates and treatment
y_prob <- plogis(-1 + 0.5 * x1 + 0.3 * x2^2 + 0.2 * x3 - 0.4 * a)
y <- rbinom(n, 1, y_prob)

# Prediction model (intentionally simplified)
pred <- plogis(-1 + 0.4 * x1 + 0.2 * x2)

# Create covariate data frame
df <- data.frame(x1 = x1, x2 = x2, x3 = x3)
```

### Using GLM (Default)

```{r glm-example}
# Standard approach with GLM nuisance models
result_glm <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_glm)
```

### Using Random Forest

```{r ranger-example, eval=requireNamespace("ranger", quietly = TRUE)}
# With random forest nuisance models
result_rf <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 200),
  outcome_model = ml_learner("ranger", num.trees = 200),
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_rf)
```

## Learner-Specific Examples

### ranger (Random Forest)

```{r ranger-detailed, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "ranger",
    num.trees = 500,     # Number of trees
    mtry = 3,            # Variables per split
    min.node.size = 10   # Minimum node size
  ),
  cross_fit = TRUE
)
```

### xgboost (Gradient Boosting)

```{r xgboost-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "xgboost",
    nrounds = 100,   # Boosting rounds
    max_depth = 4,   # Tree depth
    eta = 0.1        # Learning rate
  ),
  cross_fit = TRUE
)
```

### glmnet (Elastic Net)

```{r glmnet-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "glmnet",
    alpha = 0.5,   # 0 = ridge, 1 = lasso, 0.5 = elastic net
    nfolds = 10    # CV folds for lambda selection
  ),
  cross_fit = TRUE
)
```

### grf (Generalized Random Forest)

```{r grf-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "grf",
    num.trees = 2000,   # More trees for honesty
    honesty = TRUE      # Honest estimation (default)
  ),
  cross_fit = TRUE
)
```

### SuperLearner (Ensemble)

```{r sl-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "superlearner",
    SL.library = c("SL.glm", "SL.ranger", "SL.xgboost")
  ),
  cross_fit = TRUE
)
```

## Custom Learners

You can define custom learners by providing `fit_fun` and `predict_fun`:

```{r custom-example}
# Define a custom learner (e.g., a GAM via mgcv)
custom_fit <- function(formula, data, family, ...) {
  if (!requireNamespace("mgcv", quietly = TRUE)) {
    stop("Package 'mgcv' is required for this custom learner")
  }
  # Note: the family argument is ignored here; a binomial GAM is assumed
  mgcv::gam(formula, data = data, family = binomial())
}

custom_predict <- function(object, newdata, ...) {
  predict(object, newdata = newdata, type = "response")
}

# Use in estimation
custom_learner <- ml_learner(
  "custom",
  fit_fun = custom_fit,
  predict_fun = custom_predict
)
```

## Mixing Learners

You can use different learners for the propensity and outcome models:

```{r mixing-example, eval=FALSE}
# Random forest for propensity, elastic net for outcome
cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 500),
  outcome_model = ml_learner("glmnet", alpha = 0.5),
  cross_fit = TRUE
)
```

## Works with All Estimators

ML learners work with `cf_mse()`, `cf_auc()`, `cf_calibration()`, and their
transportability variants (`tr_mse()`, `tr_auc()`, `tr_calibration()`):

```{r auc-example, eval=requireNamespace("ranger", quietly = TRUE)}
# AUC with ML nuisance models
result_auc <- cf_auc(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 100),
  outcome_model = ml_learner("ranger", num.trees = 100),
  cross_fit = TRUE,
  se_method = "influence"
)

print(result_auc)
```

## Best Practices

1. **Always use cross-fitting with ML**: The package will warn you if you pass
   an `ml_learner` without `cross_fit = TRUE`.

2. **Choose an appropriate number of folds**: 5 to 10 folds is typical. More
   folds mean more data for training each nuisance model, but also more
   computation.

3. **Tune hyperparameters thoughtfully**: Default values work reasonably well,
   but tuning can improve performance. Consider using external cross-validation
   for hyperparameter selection.

4. **Consider computational cost**: ML methods are slower than GLM. For initial
   exploration, use fewer trees/rounds, then increase for the final analysis.

5. **Check for extreme propensity scores**: ML methods can produce very extreme
   propensity scores. The package truncates these to [0.01, 0.99] by default.
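
That truncation rule can be written as a one-liner in base R. The function
below is a hypothetical stand-alone version for illustration (the package
applies the equivalent bounding internally):

```{r truncation-sketch}
# Bound estimated propensity scores away from 0 and 1
truncate_ps <- function(ps, lower = 0.01, upper = 0.99) {
  pmin(pmax(ps, lower), upper)
}

truncate_ps(c(0.001, 0.25, 0.999))
#> [1] 0.01 0.25 0.99
```

Extreme scores near 0 or 1 produce huge inverse-probability weights, so a few
observations can otherwise dominate the DR estimate.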

## References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W.,
& Robins, J. (2018). Double/debiased machine learning for treatment and
structural parameters. *The Econometrics Journal*, 21(1), C1-C68.

Boyer, C. B., Dahabreh, I. J., & Steingrimsson, J. A. (2025). Estimating and
evaluating counterfactual prediction models. *Statistics in Medicine*,
44(23-24), e70287.