Skip to content

Conversation

@mollyow
Copy link

@mollyow mollyow commented Jan 7, 2025

This PR modifies get_X() in predict.lm_robust() to appropriately create model matrices for new data predicted from lm_lin() models with multi-valued and factorial treatments, and fixes #415.

Changes in this PR:

  • allows for prediction with lm_lin() when treatment is a factor and/or multi-valued (primary goal)
    • adds saved treatment_levels to the returned lm_lin model object
    • stops prediction for lm_lin if the treatment values in new data are not a subset of treatment_levels. 1
  • standardizes model fit for lm_lin() models with no intercept 2
  • adds tests to ensure identical predictions from lm_lin() models where treatment is either numeric or factorial, and fit with/without an intercept (there may be extremely small floating point differences)
  • adds relevant examples to predict.lm_robust and lm_lin documentation.

Notes:
1: This change has consequences for use with margins::margins(), which now must be instructed that treatment is a factor.

  • Why? margins() perturbs values of the variable to get marginal effects. The perturbed values will not be a subset of the original treatment levels, which now throws an error.
  • If we only ever had binary treatments, this would not be an issue. However, lm_lin() allows users to input multi-valued treatments as a numeric variable, but recognizes each distinct numeric value as a different treatment level. Meanwhile, margins() would still treat these variables as continuous, resulting in different errors or weird behavior. The margins() package already has intended behavior for factor variables following Stata implementation (margins #6), users just need to explicitly instruct margins that treatment is a factor to get correct behavior.

2:
There is a small change to modify behavior of lm_lin() for binary 0/1 treatments with no intercept. I think it's an open question what correct behavior should be here, as in Winston's original paper all models have intercepts. Previously, with binary treatment, if there's no intercept you would get a model with a treatment indicator, de-meaned covariates, and treatment interacted with covariates:

set.seed(60637)

N <- 40
dat <- data.frame(
  x = rnorm(N, mean = 2.3),
  x2 = rpois(N, lambda = 2),
  x3 = runif(N)
)

dat$y0 <- rnorm(N) + dat$x
dat$y1 <- dat$y0 + 0.35

dat$z_bin <- sample(0:1, size = nrow(dat), replace = TRUE)
dat$y <- (dat$z_bin == 0)*dat$y0 + (dat$z_bin == 1)*dat$y1

# Binary 01 treatments with lm_lin and no intercept
lmlin_bin <- lm_lin(y ~ z_bin-1, covariates = ~ x, data = dat)

Produces:

           Estimate Std. Error   t value     Pr(>|t|)   CI Lower CI Upper DF
z_bin     2.5165919  0.2647700 9.5048223 1.797926e-11  1.9801169 3.053067 37
x_c       0.6103258  0.6999342 0.8719759 3.888449e-01 -0.8078756 2.028527 37
z_bin:x_c 0.5577459  0.7818820 0.7133377 4.801122e-01 -1.0264974 2.141989 37

This is difficult to interpret in terms of a treatment effect.

This PR changes behavior for binary 0/1 treatment to be the same as what you would see when treatment is multi-valued or otherwise treated as a factor. It also allows you to back out the Lin estimate of the ATE.

> devtools::install_github("mollyow/estimatr")
> library(estimatr)
> lm_lin(y ~ z_bin-1, covariates = ~ x, data = dat)
            Estimate Std. Error   t value     Pr(>|t|)  CI Lower  CI Upper DF
z_bin0     2.3921681  0.1136183 21.054425 7.934125e-22 2.1617395 2.6225967 36
z_bin1     2.5165919  0.2647700  9.504822 2.370080e-11 1.9796134 3.0535704 36
z_bin0:x_c 0.7151286  0.1249358  5.723967 1.625271e-06 0.4617470 0.9685102 36
z_bin1:x_c 1.1680717  0.3484703  3.351998 1.896537e-03 0.4613412 1.8748022 36

A few comments:

  • (Edit: resolved) If the new data has additional factor levels/treatment values, prediction will fail as the model matrix will not be the correct dimension. If there are new treatment levels AND old treatment levels are dropped, there could be some bad behavior, as names of treatment levels are not checked.
  • My understanding is that expected behavior for predict.lm_robust() without new data is failure, because the model object does not save the original model matrix. This PR does not modify that behavior, and so doesn't address Luke's question in predict and residuals have odd behavior #403.

@mollyow mollyow changed the title Prediction with lm_lin() in #415 Prediction with lm_lin() fixes #415 Jan 7, 2025
data("alo_star_men")
lml <- lm_lin(GPA_year1 ~ ssp, ~ gpa0, data = alo_star_men, se_type = "classical")
# instruct margins to treat treatment as a factor
lml <- lm_lin(GPA_year1 ~ factor(ssp), ~ gpa0, data = alo_star_men, se_type = "classical")
Copy link
Author

@mollyow mollyow Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that margins needs to be instructed which variables are factors, to treat them accordingly. https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html#Using_the_at_Argument

Otherwise this results in an error in prediction, which is stopped if treatment values in new data are not a subset of treatment values in the old data (the consequence of margin's perturbing variables)

expect_equal(
lmlo$term,
c("z", "X1_c", "z:X1_c")
c("z0", "z1", "z0:X1_c", "z1:X1_c")
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR changes the expected behavior for binary treatment with no intercept.

Comment on lines +353 to +358
# Store unique treatment values
if(attr(terms(model_data), "dataClasses")[attr(terms(model_data),"term.labels")[1]] == "factor"){
return_list[["treatment_levels"]] <- model_data$xlevels[[1]]
} else {
return_list[["treatment_levels"]] <- sort(unique(design_matrix[, design_mat_treatment]))
}
Copy link
Author

@mollyow mollyow Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added so that when the model matrix is generated for predictions, we can ensure that the new data only includes a subset of treatment levels that were in the original model fit. Without being able to check this, weird behavior could result from predictions where the new data does not share identical treatment levels with the original data. This is saved in $xlevels in the model object if treatment is a factor, but if treatment is entered into the model as a numeric variable, this information is not otherwise saved.


X <- get_X(object, newdata, na.action)

# lm_lin scaling
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of lm_lin scaling is moved down to get_X()

Comment on lines -90 to -92
# Interacted with treatment
treat_name <- attr(object$terms, "term.labels")[1]
interacted_covars <- X[, treat_name] * demeaned_covars
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not have the desired behavior when there are multiple treatment levels.


# Check case where treatment is not factor and is not binary
if (any(!(treatment %in% c(0, 1)))) {
if (any(!(treatment %in% c(0, 1))) | (!has_intercept&ncol(treatment) ==1) ) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change and the subsequent one modify how lm_lin() fits without an intercept and with 0/1 treatment values.

Comment on lines -316 to -323
# If no intercept, but treatment is only one column,
# need to add base terms for covariates
if (n_treat_cols == 1) {
X <- cbind(
treatment,
demeaned_covars,
interacted_covars
)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This special case is resolved

@graemeblair graemeblair changed the base branch from main to lh-fixes February 28, 2025 17:19
@graemeblair graemeblair merged commit b5da215 into DeclareDesign:lh-fixes Feb 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Revisions to prediction with lm_lin()

2 participants