Description
Consider the following:
tst <- data.frame(x = gl(5, 2), y = rnorm(10))
mdl_data <- tst |>
dplyr::filter(x %in% 1:3) |>
set_contrasts(x ~ scaled_sum_code)
mdl <- lm(y ~ x, data = mdl_data)
The lm() call will throw the following warning:
Warning message:
contrasts dropped from factor x due to missing levels
and the model contrasts will be reset to the default:
> mdl$contrasts
$x
[1] "contr.treatment"
A similar issue arises with enlist_contrasts(): the levels that were filtered out still exist in the factor's levels attribute, and therefore in the final contrast matrix:
> tst |>
+ dplyr::filter(x %in% 1:3) |>
+ enlist_contrasts(x ~ scaled_sum_code)
$x
     2    3    4    5
1 -0.2 -0.2 -0.2 -0.2
2  0.8 -0.2 -0.2 -0.2
3 -0.2  0.8 -0.2 -0.2
4 -0.2 -0.2  0.8 -0.2
5 -0.2 -0.2 -0.2  0.8
This arises because x is already a factor with associated levels. Filtering values out of a factor vector does not remove them from its levels attribute, which is why droplevels() exists.
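To make the mechanism concrete, here is a minimal base R sketch. It uses plain `[` subsetting rather than dplyr::filter() so that it has no dependencies; that substitution is an assumption made for illustration.

```r
# A factor with 5 levels, 2 observations per level (same shape as tst$x).
x <- gl(5, 2)

# Keep only the observations at levels 1 to 3.
kept <- x[x %in% 1:3]

# The values are gone, but the levels attribute still lists all 5 levels.
levels(kept)
table(kept)

# droplevels() rebuilds the factor using only the levels that still occur.
trimmed <- droplevels(kept)
levels(trimmed)
```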
We could call droplevels() by default on columns that are already of class factor. This is potentially a slightly breaking change. Existing analyses that already had unused levels should already have been forced to deal with them somehow (e.g., by calling droplevels() before set_contrasts()), so adding this would be redundant but harmless in those cases.
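The existing workaround can be sketched in base R as follows. Note the assumptions: base `contrasts<-` with contr.sum() stands in for set_contrasts() with scaled_sum_code so the example runs without the package, and `[` subsetting stands in for dplyr::filter().

```r
set.seed(1)
tst <- data.frame(x = gl(5, 2), y = rnorm(10))

# Filter to levels 1-3, then drop the unused levels 4 and 5
# BEFORE any contrasts are attached to the factor.
mdl_data <- droplevels(tst[tst$x %in% 1:3, ])

# Stand-in for set_contrasts(x ~ scaled_sum_code): attach a 3-level
# sum-coded contrast matrix directly to the factor.
contrasts(mdl_data$x) <- contr.sum(3)

# No levels are missing at fit time, so lm() has nothing to drop
# and the contrast matrix set above is carried into the model.
mdl <- lm(y ~ x, data = mdl_data)
mdl$contrasts$x
```

With the levels dropped up front, the stored contrasts are the matrix that was set rather than the string "contr.treatment".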
I'm struggling to identify a case where having unused factor levels in a data frame is actually desirable or exploitable behavior. For instance, if you rbind two data frames that share a factor column, the resulting factor gets the default contrasts over the union of the levels of the original data frames. So, if df has a factor column x with levels 1, 2, 3, 4, 5 that is set to scaled_sum_code, rbinding df2 with levels 4, 5 will yield the default contrasts for levels 1, 2, 3, 4, 5. If df2's levels were 4, 5, 6, then the contrasts would again be the default, with levels 1, 2, 3, 4, 5, 6.
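A small base R sketch of the rbind scenario, for anyone who wants to check it locally. The data frames df and df2 are made up for illustration, and contr.sum() is again only a stand-in for scaled_sum_code. The sketch prints the contrasts attribute of the combined column without asserting what it contains.

```r
set.seed(1)
df  <- data.frame(x = gl(5, 2), y = rnorm(10))            # levels 1..5
df2 <- data.frame(x = factor(c(4, 5, 6)), y = rnorm(3))   # levels 4, 5, 6

# Attach non-default contrasts to df$x (stand-in for scaled_sum_code).
contrasts(df$x) <- contr.sum(5)

combined <- rbind(df, df2)

# The combined factor carries the union of both sets of levels.
levels(combined$x)

# Inspect whether any explicitly set contrasts survived the rbind.
attr(combined$x, "contrasts")
```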
An additional scenario: if a factor in a data frame has been filtered down to a single level, you get both an error and a warning when fitting the model:
> mdldata2 <- tst |>
+ dplyr::filter(x == 1) |>
+ set_contrasts(x ~ scaled_sum_code)
> lm(y ~ 0 + x, data = mdldata2)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
contrasts dropped from factor x due to missing levels
Based on this I think I'm leaning towards dropping levels by default so that the results of the contrastable functions work as expected; it's frustrating to use them and then have a modeling function warn about contrasts (which should be handled perfectly by the package). A default option .droplevels = TRUE can be added, and there could be an additional message displayed when levels are dropped. This should also result in set_contrasts() and enlist_contrasts() throwing the expected error when there's only one factor level. This way the missing levels can be addressed prior to attempting to fit a model.
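One possible shape for this behavior, as a rough sketch only. The helper name drop_unused_levels() is hypothetical and not part of the package; the .droplevels argument mirrors the proposal above, and the message and error wording are placeholders.

```r
# Hypothetical helper, not an existing contrastable function.
# Drops unused levels from factor columns, reports what was dropped,
# and fails early if a factor is left with fewer than 2 levels.
drop_unused_levels <- function(data, .droplevels = TRUE) {
  if (!.droplevels) {
    return(data)
  }
  is_fct <- vapply(data, is.factor, logical(1))
  for (nm in names(data)[is_fct]) {
    old_levels <- levels(data[[nm]])
    data[[nm]] <- droplevels(data[[nm]])
    dropped <- setdiff(old_levels, levels(data[[nm]]))
    if (length(dropped) > 0) {
      message("Dropped unused levels from ", nm, ": ",
              paste(dropped, collapse = ", "))
    }
    if (nlevels(data[[nm]]) < 2) {
      stop("Factor ", nm, " has fewer than 2 levels; contrasts cannot be set.",
           call. = FALSE)
    }
  }
  data
}

tst <- data.frame(x = gl(5, 2), y = rnorm(10))

# Levels 4 and 5 are dropped, with a message.
out <- drop_unused_levels(tst[tst$x %in% 1:3, ])
levels(out$x)

# A single remaining level is caught here instead of at model-fitting time.
res <- tryCatch(drop_unused_levels(tst[tst$x == 1, ]),
                error = function(e) conditionMessage(e))
res
```

Running the check before contrasts are set means the single-level problem surfaces as one clear error, rather than as an error plus a warning from lm().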