Commit e199d86

Add ML integration vignette and README section

1 parent fefac1b commit e199d86

3 files changed: 348 additions & 0 deletions

README.md

Lines changed: 29 additions & 0 deletions
@@ -103,6 +103,35 @@ tr_calibration(
 
 See `vignette("transportability", package = "cfperformance")` for details.
 
+## Machine Learning Integration
+
+The package supports flexible ML methods for nuisance model estimation with
+automatic cross-fitting for valid inference:
+
+```r
+# Use random forest for propensity scores and outcome models
+cf_mse(
+  predictions = cvd_sim$risk_score,
+  outcomes = cvd_sim$event,
+  treatment = cvd_sim$treatment,
+  covariates = cvd_sim[, c("age", "bp", "chol")],
+  treatment_level = 0,
+  estimator = "dr",
+  propensity_model = ml_learner("ranger", num.trees = 500),
+  outcome_model = ml_learner("xgboost", nrounds = 100),
+  cross_fit = TRUE
+)
+```
+
+**Supported learners:**
+- `ranger` - Fast random forests
+- `xgboost` - Gradient boosting
+- `grf` - Generalized random forests (honest estimation)
+- `glmnet` - Elastic net with cross-validated λ
+- `superlearner` - Ensemble learning
+- `custom` - User-supplied fit/predict functions
+
+See `vignette("ml-integration", package = "cfperformance")` for details.
 
 ## Example

_pkgdown.yml

Lines changed: 7 additions & 0 deletions

@@ -28,6 +28,8 @@ navbar:
         href: articles/introduction.html
       - text: Transportability Analysis
         href: articles/transportability.html
+      - text: Machine Learning Integration
+        href: articles/ml-integration.html
 
 reference:
   - title: Counterfactual Performance (Single Population)
@@ -47,6 +49,11 @@ reference:
     contents:
     - cf_cv
    - cf_compare
+  - title: Machine Learning
+    desc: ML learner specifications for nuisance models
+    contents:
+    - ml_learner
+    - print.ml_learner
   - title: Nuisance Models
     desc: Fit propensity and outcome models
     contents:
vignettes/ml-integration.Rmd

Lines changed: 312 additions & 0 deletions (new file)

@@ -0,0 +1,312 @@
---
title: "Machine Learning Integration"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Machine Learning Integration}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```
## Overview

The `cfperformance` package supports flexible machine learning methods for
nuisance model estimation (propensity scores and outcome models). When using
ML methods, cross-fitting is essential for valid inference—this is handled
automatically when you use the `ml_learner()` specification.

This vignette demonstrates how to use ML learners with the counterfactual
performance estimators.
## Why Use ML for Nuisance Models?

The doubly robust (DR) estimator requires models for:

1. **Propensity scores**: $P(A = a | X)$ — the probability of treatment given
   covariates
2. **Outcome model**: $E[Y | X, A = a]$ — the expected outcome given covariates
   and treatment

Traditional approaches use logistic regression, which may be misspecified when
the true relationships are complex. Machine learning methods can capture
nonlinear relationships and interactions without explicit specification.

However, using flexible ML methods introduces a challenge: if the same data are
used both to fit the nuisance models and to compute the estimator, the resulting
inference can be invalid. **Cross-fitting** (sample splitting) solves this by:

1. Splitting the data into K folds
2. For each fold, fitting the nuisance models on the other K-1 folds
3. Using the held-out fold for estimation

This approach, known as "double/debiased machine learning" (Chernozhukov et al.,
2018), maintains valid inference while leveraging flexible estimation.
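To make the three steps concrete, here is a minimal cross-fitting sketch in
base R, outside the package. It is an illustration only, assuming
logistic-regression nuisance models and a made-up data frame `dat` with
treatment `a` and two covariates; `cf_mse()` and friends do all of this
internally when `cross_fit = TRUE`.

```r
# Minimal cross-fitting sketch (illustration only; the package handles this)
set.seed(1)
dat <- data.frame(a = rbinom(200, 1, 0.5),
                  x1 = rnorm(200), x2 = rnorm(200))

K <- 5
fold <- sample(rep(seq_len(K), length.out = nrow(dat)))  # step 1: assign folds

ps_hat <- numeric(nrow(dat))
for (k in seq_len(K)) {
  train <- dat[fold != k, ]                              # step 2: fit on K-1 folds
  fit   <- glm(a ~ x1 + x2, data = train, family = binomial())
  held  <- fold == k                                     # step 3: predict held-out fold
  ps_hat[held] <- predict(fit, newdata = dat[held, ], type = "response")
}

# ps_hat now holds out-of-fold propensity scores for every observation
summary(ps_hat)
```

Every observation's nuisance prediction comes from a model that never saw that
observation, which is what restores valid inference with flexible learners.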
## The `ml_learner()` Interface

The `ml_learner()` function creates a learner specification that can be passed
to the `propensity_model` or `outcome_model` arguments:

```{r setup, message=FALSE}
library(cfperformance)

# Create a learner specification
ranger_learner <- ml_learner("ranger", num.trees = 500)
print(ranger_learner)
```
### Supported Learners

| Method | Package | Description |
|--------|---------|-------------|
| `"ranger"` | ranger | Fast random forest implementation |
| `"xgboost"` | xgboost | Gradient boosting (XGBoost) |
| `"grf"` | grf | Generalized random forests with honesty |
| `"glmnet"` | glmnet | Elastic net with cross-validated λ |
| `"superlearner"` | SuperLearner | Ensemble of multiple learners |
| `"custom"` | (none) | User-supplied fit/predict functions |
## Basic Usage

Here's a complete example using random forests for nuisance estimation:

```{r basic-example}
# Generate example data
set.seed(123)
n <- 1000

# Covariates with nonlinear relationships
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)

# Treatment depends on covariates (nonlinearly)
ps_true <- plogis(-0.5 + 0.8 * x1 - 0.5 * x2 + 0.3 * x1 * x2)
a <- rbinom(n, 1, ps_true)

# Outcome depends on covariates and treatment
y_prob <- plogis(-1 + 0.5 * x1 + 0.3 * x2^2 + 0.2 * x3 - 0.4 * a)
y <- rbinom(n, 1, y_prob)

# Prediction model (intentionally simplified)
pred <- plogis(-1 + 0.4 * x1 + 0.2 * x2)

# Create data frame
df <- data.frame(x1 = x1, x2 = x2, x3 = x3)
```
### Using GLM (Default)

```{r glm-example}
# Standard approach with GLM
result_glm <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_glm)
```
### Using Random Forest

```{r ranger-example, eval=requireNamespace("ranger", quietly = TRUE)}
# With random forest nuisance models
result_rf <- cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 200),
  outcome_model = ml_learner("ranger", num.trees = 200),
  cross_fit = TRUE,
  n_folds = 5,
  se_method = "influence"
)

print(result_rf)
```
## Learner-Specific Examples

### ranger (Random Forest)

```{r ranger-detailed, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "ranger",
    num.trees = 500,      # Number of trees
    mtry = 3,             # Variables per split
    min.node.size = 10    # Minimum node size
  ),
  cross_fit = TRUE
)
```

### xgboost (Gradient Boosting)

```{r xgboost-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "xgboost",
    nrounds = 100,    # Boosting rounds
    max_depth = 4,    # Tree depth
    eta = 0.1         # Learning rate
  ),
  cross_fit = TRUE
)
```

### glmnet (Elastic Net)

```{r glmnet-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "glmnet",
    alpha = 0.5,    # 0 = ridge, 1 = lasso, 0.5 = elastic net
    nfolds = 10     # CV folds for lambda selection
  ),
  cross_fit = TRUE
)
```

### grf (Generalized Random Forest)

```{r grf-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "grf",
    num.trees = 2000,    # More trees for honesty
    honesty = TRUE       # Honest estimation (default)
  ),
  cross_fit = TRUE
)
```

### SuperLearner (Ensemble)

```{r sl-example, eval=FALSE}
cf_mse(
  ...,
  propensity_model = ml_learner(
    "superlearner",
    SL.library = c("SL.glm", "SL.ranger", "SL.xgboost")
  ),
  cross_fit = TRUE
)
```
## Custom Learners

You can define custom learners by providing `fit_fun` and `predict_fun`:

```{r custom-example}
# Define a custom learner (e.g., a GAM via mgcv)
custom_fit <- function(formula, data, family, ...) {
  if (!requireNamespace("mgcv", quietly = TRUE)) {
    stop("mgcv required for this custom learner")
  }
  mgcv::gam(formula, data = data, family = family)
}

custom_predict <- function(object, newdata, ...) {
  predict(object, newdata = newdata, type = "response")
}

# Use in estimation
custom_learner <- ml_learner(
  "custom",
  fit_fun = custom_fit,
  predict_fun = custom_predict
)
```
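A custom specification plugs in exactly like the built-in ones. The sketch
below reuses the simulated objects (`pred`, `y`, `a`, `df`) and the
`custom_learner` defined above; it mirrors the `cf_mse()` calls shown earlier
rather than documenting any additional API.

```{r custom-usage, eval=FALSE}
# Sketch: use the custom GAM learner for the outcome model
cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  outcome_model = custom_learner,
  cross_fit = TRUE
)
```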
## Mixing Learners

You can use different learners for propensity and outcome models:

```{r mixing-example, eval=FALSE}
# Random forest for propensity, elastic net for outcome
cf_mse(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 500),
  outcome_model = ml_learner("glmnet", alpha = 0.5),
  cross_fit = TRUE
)
```
## Works with All Estimators

ML learners work with `cf_mse()`, `cf_auc()`, `cf_calibration()`, and their
transportability variants (`tr_mse()`, `tr_auc()`, `tr_calibration()`):

```{r auc-example, eval=requireNamespace("ranger", quietly = TRUE)}
# AUC with ML nuisance models
result_auc <- cf_auc(
  predictions = pred,
  outcomes = y,
  treatment = a,
  covariates = df,
  treatment_level = 0,
  estimator = "dr",
  propensity_model = ml_learner("ranger", num.trees = 100),
  outcome_model = ml_learner("ranger", num.trees = 100),
  cross_fit = TRUE,
  se_method = "influence"
)

print(result_auc)
```
## Best Practices

1. **Always use cross-fitting with ML**: The package will warn you if you pass
   an `ml_learner` without `cross_fit = TRUE`.

2. **Choose an appropriate number of folds**: 5-10 folds is typical. More folds
   means more training data for each nuisance model, but more computation.

3. **Tune hyperparameters thoughtfully**: Default values work reasonably well,
   but tuning can improve performance. Consider using external cross-validation
   for hyperparameter selection.

4. **Consider computational cost**: ML methods are slower than GLM. For initial
   exploration, use fewer trees/rounds, then increase for the final analysis.

5. **Check for extreme propensity scores**: ML methods can produce very extreme
   propensity scores. The package truncates these to [0.01, 0.99] by default.
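The truncation in point 5 simply clamps each estimated score into the stated
interval. A base-R equivalent (illustration only, computed outside the
package on made-up scores):

```r
# Clamp estimated propensity scores to [0.01, 0.99], as the package does by default
ps_hat <- c(0.001, 0.25, 0.999)
ps_trunc <- pmin(pmax(ps_hat, 0.01), 0.99)
ps_trunc  # 0.01 0.25 0.99
```

Clamping keeps the inverse-probability weights `1 / ps_trunc` bounded, which
is why extreme ML-estimated scores are less destabilizing after truncation.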
## References

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W.,
& Robins, J. (2018). Double/debiased machine learning for treatment and
structural parameters. *The Econometrics Journal*, 21(1), C1-C68.

Boyer, C. B., Dahabreh, I. J., & Steingrimsson, J. A. (2025). Estimating and
evaluating counterfactual prediction models. *Statistics in Medicine*,
44(23-24), e70287.
