Prediction with lm_lin() fixes #415 #416
| data("alo_star_men") | ||
| lml <- lm_lin(GPA_year1 ~ ssp, ~ gpa0, data = alo_star_men, se_type = "classical") | ||
| # instruct margins to treat treatment as a factor | ||
| lml <- lm_lin(GPA_year1 ~ factor(ssp), ~ gpa0, data = alo_star_men, se_type = "classical") |
My understanding is that `margins` must be told which variables are factors so that it treats them accordingly: https://cran.r-project.org/web/packages/margins/vignettes/Introduction.html#Using_the_at_Argument
Otherwise this results in an error in prediction, which now stops if the treatment values in new data are not a subset of the treatment values in the old data (a consequence of `margins` perturbing variables).
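To illustrate why the perturbation trips the check, here is a minimal base-R sketch. It does not call `margins`; the step size `h` and the variable names are assumptions made for illustration only. A numeric treatment nudged by a small amount no longer matches any fitted level, whereas a factor can only take its existing levels.

```r
# Treatment values seen when the model was fit (hypothetical example data)
fitted_levels <- c(0, 1)
treatment <- c(0, 1, 1, 0)

# A numerical derivative nudges a numeric variable by a small step;
# h is an illustrative step size, not necessarily the one margins uses
h <- 1e-7
perturbed <- treatment + h

# The perturbed values are no longer among the fitted treatment levels
perturbed_ok <- all(perturbed %in% fitted_levels)
original_ok <- all(treatment %in% fitted_levels)

# A factor cannot be nudged: its values are restricted to its levels
treatment_f <- factor(treatment, levels = fitted_levels)
factor_ok <- all(as.character(treatment_f) %in% as.character(fitted_levels))
```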
| expect_equal( | ||
| lmlo$term, | ||
| c("z", "X1_c", "z:X1_c") | ||
| c("z0", "z1", "z0:X1_c", "z1:X1_c") |
The PR changes the expected behavior for binary treatment with no intercept.
| # Store unique treatment values | ||
| if(attr(terms(model_data), "dataClasses")[attr(terms(model_data),"term.labels")[1]] == "factor"){ | ||
| return_list[["treatment_levels"]] <- model_data$xlevels[[1]] | ||
| } else { | ||
| return_list[["treatment_levels"]] <- sort(unique(design_matrix[, design_mat_treatment])) | ||
| } |
This is added so that when the model matrix is generated for predictions, we can ensure that the new data only includes a subset of treatment levels that were in the original model fit. Without being able to check this, weird behavior could result from predictions where the new data does not share identical treatment levels with the original data. This is saved in $xlevels in the model object if treatment is a factor, but if treatment is entered into the model as a numeric variable, this information is not otherwise saved.
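A minimal sketch of the kind of subset check described here, in base R. The helper name `check_treatment_levels()` and the example values are hypothetical and are not part of estimatr; the sketch only shows the idea of comparing new treatment values against the stored levels.

```r
# Hypothetical helper, not part of estimatr: stop when new data contains
# treatment values that were not present when the model was fit
check_treatment_levels <- function(new_treatment, treatment_levels) {
  new_values <- unique(as.character(new_treatment))
  unseen <- setdiff(new_values, as.character(treatment_levels))
  if (length(unseen) > 0) {
    stop(
      "Treatment values in new data not found in original fit: ",
      paste(unseen, collapse = ", ")
    )
  }
  invisible(TRUE)
}

# Numeric treatment: levels recovered as the sorted unique values
treatment_levels <- sort(unique(c(2, 0, 1, 1, 0)))

# New data using only known levels passes the check
passed <- check_treatment_levels(c(0, 2), treatment_levels)

# New data with an unseen level raises an error; capture its message
result <- tryCatch(
  check_treatment_levels(c(0, 3), treatment_levels),
  error = function(e) conditionMessage(e)
)
```

Comparing as character is one way to handle numeric and factor treatments with the same code path; the PR itself may do this differently.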
| X <- get_X(object, newdata, na.action) | ||
|
|
||
| # lm_lin scaling |
all of lm_lin scaling is moved down to get_X()
| # Interacted with treatment | ||
| treat_name <- attr(object$terms, "term.labels")[1] | ||
| interacted_covars <- X[, treat_name] * demeaned_covars |
This does not have the desired behavior when there are multiple treatment levels.
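To make the issue concrete: multiplying by a single numeric treatment column scales the covariates by the treatment value (for example by 2 for level 2) instead of producing one interaction per level. A minimal base-R sketch of the per-level version follows, assuming one covariate and made-up data; the variable names are illustrative and do not mirror the internals of `get_X()`.

```r
# Hypothetical example data with a three-level treatment
treatment <- c(0, 1, 2, 1, 0, 2)
covar <- c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

# Center the covariate (for new data, the mean from the original fit
# would be reused rather than recomputed)
demeaned <- covar - mean(covar)

# One indicator column per non-baseline treatment level
treat_mat <- model.matrix(~ factor(treatment))[, -1, drop = FALSE]

# Interact every treatment column with the centered covariate;
# the vector is recycled down each column of the matrix
interacted <- treat_mat * demeaned
colnames(interacted) <- paste0(colnames(treat_mat), ":covar_c")

# Treatment indicators, centered covariate, and per-level interactions
X <- cbind(treat_mat, covar_c = demeaned, interacted)
```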
| # Check case where treatment is not factor and is not binary | ||
| if (any(!(treatment %in% c(0, 1)))) { | ||
| if (any(!(treatment %in% c(0, 1))) | (!has_intercept&ncol(treatment) ==1) ) { |
This change and the subsequent one modify how lm_lin() fits without an intercept and with 0/1 treatment values.
| # If no intercept, but treatment is only one column, | ||
| # need to add base terms for covariates | ||
| if (n_treat_cols == 1) { | ||
| X <- cbind( | ||
| treatment, | ||
| demeaned_covars, | ||
| interacted_covars | ||
| ) |
This special case is resolved
This PR modifies `get_X()` in `predict.lm_robust()` to appropriately create model matrices for new data predicted from `lm_lin()` models with multi-valued and factorial treatments, and fixes #415.

Changes in this PR:
- Fixes prediction with `lm_lin()` when treatment is a factor and/or multi-valued (primary goal)
- Adds `treatment_levels` to the returned `lm_lin` model object
- Throws an error when predicting with `lm_lin` if the treatment values in new data are not a subset of `treatment_levels` (note 1)
- Modifies behavior of `lm_lin()` models with no intercept (note 2)
- `lm_lin()` models where treatment is either numeric or factorial, and fit with/without an intercept (there may be extremely small floating point differences)
- Updates `predict.lm_robust` and `lm_lin` documentation.

Notes:
1: This change has consequences for use with `margins::margins()`, which now must be instructed that treatment is a factor. `margins()` perturbs values of the variable to get marginal effects. The perturbed values will not be a subset of the original treatment levels, which now throws an error. `lm_lin()` allows users to input multi-valued treatments as a numeric variable, but recognizes each distinct numeric value as a different treatment level. Meanwhile, `margins()` would still treat these variables as continuous, resulting in different errors or weird behavior. The `margins` package already has intended behavior for factor variables following the Stata implementation (margins#6); users just need to explicitly instruct `margins` that treatment is a factor to get correct behavior.

2: There is a small change to the behavior of `lm_lin()` for binary 0/1 treatments with no intercept. I think it's an open question what correct behavior should be here, as in Winston's original paper all models have intercepts. Previously, with binary treatment, if there's no intercept you would get a model with a treatment indicator, de-meaned covariates, and treatment interacted with covariates.
This is difficult to interpret in terms of a treatment effect.
This PR changes behavior for binary 0/1 treatment to be the same as what you would see when treatment is multi-valued or otherwise treated as a factor. It also allows you to back out the Lin estimate of the ATE.
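A minimal base-R sketch of how the ATE can be backed out of the no-intercept parameterization. It uses `lm()` on simulated data rather than `lm_lin()`, and the variable names (`z0`, `z1`, `x_c`) are assumptions chosen to resemble the terms in the updated test. The no-intercept design spans the same column space as the usual Lin specification, so the difference between the two treatment-level coefficients equals the coefficient on treatment in the model with an intercept.

```r
# Simulated data for illustration only
set.seed(1)
n <- 200
z <- rbinom(n, 1, 0.5)
x <- rnorm(n)
y <- 1 + 0.5 * z + 0.8 * x + 0.3 * z * x + rnorm(n)

# Center the covariate, as in the Lin specification
x_c <- x - mean(x)

# Usual specification with an intercept: the coefficient on z is the ATE estimate
fit_int <- lm(y ~ z + x_c + z:x_c)

# No-intercept parameterization: one indicator per treatment level,
# each interacted with the centered covariate
z0 <- as.numeric(z == 0)
z1 <- as.numeric(z == 1)
fit_noint <- lm(y ~ 0 + z0 + z1 + z0:x_c + z1:x_c)

# The ATE estimate is the difference between the level-specific coefficients
ate_int <- unname(coef(fit_int)["z"])
ate_noint <- unname(coef(fit_noint)["z1"] - coef(fit_noint)["z0"])
same_ate <- isTRUE(all.equal(ate_int, ate_noint))
```

This only checks the point estimates; standard errors for the difference would need the covariance between the two coefficients.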
A few comments:
- `predict.lm_robust()` without new data fails, because the model object does not save the original model matrix. This PR does not modify that behavior, and so doesn't address Luke's question in "predict and residuals have odd behavior" (#403).