In practice, machine learning (ML) algorithms may fail to ascertain
+heterogeneous treatment effects due to small sample sizes, high
+dimensionality, and arbitrary parameter tuning. The
+test_itr function allows users to empirically validate the
+GATE estimates obtained under various ML algorithms with statistical
+testing. In particular, there are two types of nonparametric
+statistical tests: (1) a test of treatment effect heterogeneity across
+groups and (2) a test of rank consistency of the GATEs. The tests are
+based on the idea that, if an ML algorithm produces a reasonable
+scoring rule (obtained via the estimate_itr function), we should
+expect that (1) the GATEs differ across groups; and (2) the rank
+ordering of the GATEs by magnitude is monotonic.
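+
+To build intuition, below is a minimal conceptual sketch of a
+simulation-based heterogeneity test; it is not the package's internal
+implementation. The inputs gate_est and gate_se are hypothetical
+stand-ins for the quintile GATE estimates and their standard errors.
+
+# conceptual sketch (hypothetical inputs): compare the observed spread
+# of the group estimates against its distribution under the null of
+# homogeneous effects
+het_test_sketch <- function(gate_est, gate_se, nsim = 1000) {
+  stat_obs <- sum(((gate_est - mean(gate_est)) / gate_se)^2)
+  stat_null <- replicate(nsim, {
+    draws <- rnorm(length(gate_est), mean = 0, sd = gate_se)
+    sum(((draws - mean(draws)) / gate_se)^2)
+  })
+  mean(stat_null >= stat_obs)  # simulation-based p-value
+}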
+
Following the previous examples, we first estimate GATEs using causal
+forest (causal_forest), Bayesian Additive Regression Trees
+(bartc), LASSO (lasso), and random forest
+(rf) under cross-validation with the
+estimate_itr function. We specify the number of groups into which the
+sample is divided through the ngates argument. By
+setting ngates = 5 in the example below, we estimate the
+heterogeneous impact of small class sizes on students’ writing scores
+across 5 groups of students.
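+
+The GATEs are estimated as follows (reproduced from the vignette
+source):
+
+# specify the caret trainControl method
+fitControl <- caret::trainControl(
+  method = "repeatedcv",
+  number = 2,
+  repeats = 2)
+
+# estimate ITR
+set.seed(2023)
+fit_cv <- estimate_itr(
+  treatment = "treatment",
+  form = user_formula,
+  data = star_data,
+  trControl = fitControl,
+  algorithms = c(
+    "causal_forest",
+    "bartc",
+    "lasso",
+    "rf"),
+  budget = 0.2, # 20% budget constraint
+  n_folds = 5,  # 5-fold cross-validation
+  ngates = 5)   # 5 groups
+
+# evaluate ITR and extract GATE estimates
+est_cv <- evaluate_itr(fit_cv)
+summary(est_cv)$GATE
+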
The table reports the quintile GATE (\(K =
+5\)) estimates for each ML algorithm. We find that the Random
+Forest produces a statistically significant negative GATE estimate for
+the lowest quintile group (group 1) under cross-validation. This
+provides evidence that the Random Forest identifies a 20% subgroup
+whose writing scores are negatively impacted by small class sizes.
+
We now conduct the statistical tests of treatment effect
+heterogeneity and rank consistency to validate these GATE estimates.
+We use the model object est_cv returned by the
+evaluate_itr function as the input to the
+test_itr function, which conducts both tests simultaneously. We can
+summarize the test statistics and the p-values using the
+summary function. Lastly, we use the nsim
+argument to specify the number of simulations to conduct for each
+test; the default is 1000 simulations.
+
+# conduct nonparametric tests
+test_est_cv <- test_itr(est_cv,
+                        nsim = 5000)
+#> Conduct hypothesis tests for GATEs under cross-validation ...
+
+# summarize test statistics and p-values
+summary(test_est_cv)
+#> ── The Consistency Test Results for GATEs ──────────────────────────────────────
+#> No consistency results available (sample-splitting).
+#>
+#> ── The Heterogeneity Test Results for GATEs ────────────────────────────────────
+#> No heterogeneity results available (sample-splitting).
+#>
+#> ── The Consistency Test Results for GATEs (Cross-validation) ───────────────────
+#> algorithm statistic p.value
+#> 1 causal_forest 0.83 0.74
+#> 2 bartc 1.03 0.63
+#> 3 lasso 0.24 0.82
+#> 4 rf 1.32 0.66
+#>
+#> ── The Heterogeneity Test Results for GATEs (Cross-validation) ─────────────────
+#> algorithm statistic p.value
+#> 1 causal_forest 1.9 0.86
+#> 2 bartc 2.3 0.81
+#> 3 lasso 3.3 0.65
+#> 4 rf 6.6 0.25
+
The table reports the resulting values of the test statistics and the
+p-values for each test under each algorithm. We find that none of the
+ML algorithms rejects the null hypothesis of treatment effect
+homogeneity under cross-validation, which indicates that these
+algorithms fail to identify statistically significant effect
+heterogeneity across subgroups. In addition, none of the ML algorithms
+rejects the rank consistency hypothesis under cross-validation. Thus,
+there is no strong statistical evidence that these algorithms produce
+unreliable GATE estimates.
+
diff --git a/man/.DS_Store b/man/.DS_Store
deleted file mode 100644
index 467497b..0000000
Binary files a/man/.DS_Store and /dev/null differ
diff --git a/man/figures/README-caret_model-1 2.png b/man/figures/README-caret_model-1 2.png
new file mode 100644
index 0000000..e2a173a
Binary files /dev/null and b/man/figures/README-caret_model-1 2.png differ
diff --git a/man/figures/README-caret_model-2 2.png b/man/figures/README-caret_model-2 2.png
new file mode 100644
index 0000000..bb0e12a
Binary files /dev/null and b/man/figures/README-caret_model-2 2.png differ
diff --git a/man/figures/README-compare_itr_aupec-1 2.png b/man/figures/README-compare_itr_aupec-1 2.png
new file mode 100644
index 0000000..db3b005
Binary files /dev/null and b/man/figures/README-compare_itr_aupec-1 2.png differ
diff --git a/man/figures/README-compare_itr_gate-1 2.png b/man/figures/README-compare_itr_gate-1 2.png
new file mode 100644
index 0000000..7d0a0a7
Binary files /dev/null and b/man/figures/README-compare_itr_gate-1 2.png differ
diff --git a/man/figures/README-compare_itr_model_summary-1 2.png b/man/figures/README-compare_itr_model_summary-1 2.png
new file mode 100644
index 0000000..2ef4008
Binary files /dev/null and b/man/figures/README-compare_itr_model_summary-1 2.png differ
diff --git a/man/figures/README-cv_estimate-1 2.png b/man/figures/README-cv_estimate-1 2.png
new file mode 100644
index 0000000..a1e0d32
Binary files /dev/null and b/man/figures/README-cv_estimate-1 2.png differ
diff --git a/man/figures/README-est_extract-1 2.png b/man/figures/README-est_extract-1 2.png
new file mode 100644
index 0000000..a418cb1
Binary files /dev/null and b/man/figures/README-est_extract-1 2.png differ
diff --git a/man/figures/README-sl_plot-1 2.png b/man/figures/README-sl_plot-1 2.png
new file mode 100644
index 0000000..158837e
Binary files /dev/null and b/man/figures/README-sl_plot-1 2.png differ
diff --git a/man/figures/README-user_itr_aupec-1 2.png b/man/figures/README-user_itr_aupec-1 2.png
new file mode 100644
index 0000000..63304d2
Binary files /dev/null and b/man/figures/README-user_itr_aupec-1 2.png differ
diff --git a/man/figures/README-user_itr_gate-1 2.png b/man/figures/README-user_itr_gate-1 2.png
new file mode 100644
index 0000000..094375a
Binary files /dev/null and b/man/figures/README-user_itr_gate-1 2.png differ
diff --git a/man/figures/gate 2.png b/man/figures/gate 2.png
new file mode 100644
index 0000000..3fc77af
Binary files /dev/null and b/man/figures/gate 2.png differ
diff --git a/man/figures/plot_5folds 2.png b/man/figures/plot_5folds 2.png
new file mode 100644
index 0000000..bd59a4f
Binary files /dev/null and b/man/figures/plot_5folds 2.png differ
diff --git a/man/figures/rf 2.png b/man/figures/rf 2.png
new file mode 100644
index 0000000..4eb1ba2
Binary files /dev/null and b/man/figures/rf 2.png differ
diff --git a/tests/testthat/star 2.rda b/tests/testthat/star 2.rda
new file mode 100644
index 0000000..5baf7ba
Binary files /dev/null and b/tests/testthat/star 2.rda differ
diff --git a/tests/testthat/test-high_level 2.R b/tests/testthat/test-high_level 2.R
new file mode 100644
index 0000000..b58529b
--- /dev/null
+++ b/tests/testthat/test-high_level 2.R
@@ -0,0 +1,42 @@
+library(evalITR)
+library(dplyr)
+test_that("Sample Splitting Works", {
+ load("star.rda")
+ # specifying the outcome
+ outcomes <- "g3tlangss"
+
+ # specifying the treatment
+ treatment <- "treatment"
+
+ # specifying the data (remove other outcomes)
+ star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss))
+
+ # specifying the formula
+ user_formula <- as.formula(
+ "g3tlangss ~ treatment + gender + race + birthmonth +
+ birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH +
+ GKBUSED + GKWHITE ")
+
+
+  # estimate ITR (wrapped in expect_no_error so the fit can be reused)
+  expect_no_error(
+    fit <- estimate_itr(
+      treatment = treatment,
+      form = user_formula,
+      data = star_data,
+      algorithms = c("lasso"),
+      budget = 0.2,
+      split_ratio = 0.7))
+
+  # evaluate ITR
+  expect_no_error(est <- evaluate_itr(fit))
+})
+
diff --git a/tests/testthat/test-low_level 2.R b/tests/testthat/test-low_level 2.R
new file mode 100644
index 0000000..dd0993b
--- /dev/null
+++ b/tests/testthat/test-low_level 2.R
@@ -0,0 +1,59 @@
+library(evalITR)
+
+test_that("Non Cross-Validated Functions Work", {
+  T = c(1,0,1,0,1,0,1,0)                  # treatment assignment
+  That = c(0,1,1,0,0,1,1,0)               # first ITR's unit-level recommendations
+  That2 = c(1,0,0,1,1,0,0,1)              # second ITR (for the PAPD comparison)
+  tau = c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7)  # estimated individual treatment effects
+  Y = c(4,5,0,2,4,1,-4,3)                 # observed outcomes
+ papelist <- PAPE(T,That,Y)
+ pavlist <- PAV(T,That,Y)
+ papdlist <- PAPD(T,That,That2,Y,0.5)
+ aupeclist <- AUPEC(T,tau,Y)
+ gatelist <- GATE(T,tau,Y,ngates=2)
+ expect_type(papelist,"list")
+ expect_type(pavlist,"list")
+ expect_type(papdlist,"list")
+ expect_type(aupeclist,"list")
+ expect_type(gatelist,"list")
+ expect_type(papelist$pape,"double")
+ expect_type(pavlist$pav,"double")
+ expect_type(papdlist$papd,"double")
+ expect_type(aupeclist$aupec,"double")
+ expect_type(gatelist$gate,"double")
+ expect_type(papelist$sd,"double")
+ expect_type(pavlist$sd,"double")
+ expect_type(papdlist$sd,"double")
+ expect_type(aupeclist$sd,"double")
+ expect_type(gatelist$sd,"double")
+})
+
+test_that("Cross-Validated Functions Work", {
+  T = c(1,0,1,0,1,0,1,0)      # treatment assignment
+  # ITR recommendations and estimated effects, one column per fold
+  That = matrix(c(0,1,1,0,0,1,1,0,1,0,0,1,1,0,0,1), nrow = 8, ncol = 2)
+  That2 = matrix(c(0,0,1,1,0,0,1,1,1,1,0,0,1,1,0,0), nrow = 8, ncol = 2)
+  tau = matrix(c(0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9), nrow = 8, ncol = 2)
+  Y = c(4,5,0,2,4,1,-4,3)     # observed outcomes
+  ind = c(rep(1,4),rep(2,4))  # fold membership indicator
+ papelist <- PAPEcv(T,That,Y,ind,budget = 0.5)
+ pavlist <- PAVcv(T,That,Y,ind)
+ papdlist <- PAPDcv(T,That,That2,Y,ind,budget = 0.5)
+ aupeclist <- AUPECcv(T,tau,Y,ind)
+ gatelist <- GATEcv(T,tau,Y,ind,ngates=2)
+ expect_type(papelist,"list")
+ expect_type(pavlist,"list")
+ expect_type(papdlist,"list")
+ expect_type(aupeclist,"list")
+ expect_type(gatelist,"list")
+ expect_type(papelist$pape,"double")
+ expect_type(pavlist$pav,"double")
+ expect_type(papdlist$papd,"double")
+ expect_type(aupeclist$aupec,"double")
+ expect_type(gatelist$gate,"double")
+ expect_type(papelist$sd,"double")
+ expect_type(pavlist$sd,"double")
+ expect_type(papdlist$sd,"double")
+ expect_type(aupeclist$sd,"double")
+ expect_type(gatelist$sd,"double")
+})
+
diff --git a/vignettes/test_itr.Rmd b/vignettes/test_itr.Rmd
new file mode 100644
index 0000000..c8c045c
--- /dev/null
+++ b/vignettes/test_itr.Rmd
@@ -0,0 +1,84 @@
+---
+title: "Nonparametric statistical tests for treatment heterogeneity and rank consistency across multiple ML algorithms"
+output: rmarkdown::html_vignette
+vignette: >
+ %\VignetteIndexEntry{Nonparametric statistical tests with multiple ML algorithms}
+ %\VignetteEngine{knitr::rmarkdown}
+ %\VignetteEncoding{UTF-8}
+---
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+ collapse = TRUE,
+ comment = "#>",
+ fig.path = "../man/figures/README-"
+ )
+
+library(dplyr)
+
+load("../data/star.rda")
+
+# specifying the outcome
+outcomes <- "g3tlangss"
+
+# specifying the treatment
+treatment <- "treatment"
+
+# specifying the data (remove other outcomes)
+star_data <- star %>% dplyr::select(-c(g3treadss,g3tmathss))
+
+# specifying the formula
+user_formula <- as.formula(
+ "g3tlangss ~ treatment + gender + race + birthmonth +
+ birthyear + SCHLURBN + GRDRANGE + GKENRMNT + GKFRLNCH +
+ GKBUSED + GKWHITE ")
+```
+
+In practice, machine learning (ML) algorithms may fail to ascertain heterogeneous treatment effects due to small sample sizes, high dimensionality, and arbitrary parameter tuning. The `test_itr` function allows users to empirically validate the GATE estimates obtained under various ML algorithms with statistical testing. In particular, there are two types of nonparametric statistical tests: (1) a test of treatment effect heterogeneity across groups and (2) a test of rank consistency of the GATEs. The tests are based on the idea that, if an ML algorithm produces a reasonable scoring rule (obtained via the `estimate_itr` function), we should expect that (1) the GATEs differ across groups; and (2) the rank ordering of the GATEs by magnitude is monotonic.
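+
+As a conceptual illustration of the rank-consistency idea (not the package's internal implementation), the sketch below asks how often noise-perturbed draws of a set of GATE estimates preserve a monotone ordering; `gate_est` and `gate_se` are hypothetical toy inputs standing in for the quintile estimates and their standard errors.
+
+```{r rank_sketch}
+# toy quintile GATE estimates and standard errors (hypothetical values)
+gate_est <- c(-0.5, -0.1, 0.2, 0.4, 0.8)
+gate_se  <- rep(0.3, 5)
+
+# proportion of noise-perturbed draws whose ordering stays weakly increasing
+rank_consistency_sketch <- function(gate_est, gate_se, nsim = 1000) {
+  draws <- replicate(nsim, rnorm(length(gate_est), gate_est, gate_se))
+  mean(apply(draws, 2, function(x) !is.unsorted(x)))
+}
+
+set.seed(2023)
+rank_consistency_sketch(gate_est, gate_se)
+```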
+
+Following the previous examples, we first estimate GATEs using causal forest (`causal_forest`), Bayesian Additive Regression Trees (`bartc`), LASSO (`lasso`), and random forest (`rf`) under cross-validation with the `estimate_itr` function. We specify the number of groups into which the sample is divided through the `ngates` argument. By setting `ngates = 5` in the example below, we estimate the heterogeneous impact of small class sizes on students’ writing scores across 5 groups of students.
+
+```{r multiple, warning=FALSE, message=FALSE}
+library(evalITR)
+
+# specify the trainControl method
+fitControl <- caret::trainControl(
+ method = "repeatedcv",
+ number = 2,
+ repeats = 2)
+# estimate ITR
+set.seed(2023)
+fit_cv <- estimate_itr(
+ treatment = "treatment",
+ form = user_formula,
+ data = star_data,
+ trControl = fitControl,
+  algorithms = c(
+    "causal_forest", # from grf
+    "bartc",         # from bartCause
+    "lasso",
+    "rf"),
+ budget = 0.2, # 20% budget constraint
+ n_folds = 5, # 5-fold cross-validation
+ ngates = 5) # 5 groups
+
+# evaluate ITR
+est_cv <- evaluate_itr(fit_cv)
+
+# extract GATEs estimates
+summary(est_cv)$GATE
+```
+The table reports the quintile GATE ($K = 5$) estimates for each ML algorithm. We find that the Random Forest produces a statistically significant negative GATE estimate for the lowest quintile group (group 1) under cross-validation. This provides evidence that the Random Forest identifies a 20% subgroup whose writing scores are negatively impacted by small class sizes.
+
+We now conduct the statistical tests of treatment effect heterogeneity and rank consistency to validate these GATE estimates. We use the model object `est_cv` returned by the `evaluate_itr` function as the input to the `test_itr` function, which conducts both tests simultaneously. We can summarize the test statistics and the p-values using the `summary` function. Lastly, we use the `nsim` argument to specify the number of simulations to conduct for each test; the default is 1000 simulations.
+
+```{r warning=FALSE, message=FALSE}
+# conduct nonparametric tests
+test_est_cv <- test_itr(est_cv,
+ nsim = 5000)
+
+# summarize test statistics and p-values
+summary(test_est_cv)
+```
+The table reports the resulting values of the test statistics and the p-values for each test under each algorithm. We find that none of the ML algorithms rejects the null hypothesis of treatment effect homogeneity under cross-validation, which indicates that these algorithms fail to identify statistically significant effect heterogeneity across subgroups. In addition, none of the ML algorithms rejects the rank consistency hypothesis under cross-validation. Thus, there is no strong statistical evidence that these algorithms produce unreliable GATE estimates.