Contrasts don't work as expected if levels have been filtered out #42

@tsostarics

Consider the following:

tst <- data.frame(x = gl(5, 2), y = rnorm(10))

mdl_data <- tst |> 
  dplyr::filter(x %in% 1:3) |> 
  set_contrasts(x ~ scaled_sum_code)

mdl <- lm(y ~ x, data = mdl_data)

The lm() call will throw the following warning:

Warning message:
contrasts dropped from factor x due to missing levels 

and the model contrasts will be reset to the default:

> mdl$contrasts
$x
[1] "contr.treatment"

A similar issue arises with enlist_contrasts(), where the levels that were filtered out still exist in the factor's levels attribute and thus in the final contrast matrix:

> tst |> 
+   dplyr::filter(x %in% 1:3) |> 
+   enlist_contrasts(x ~ scaled_sum_code)
$x
     2    3    4    5
1 -0.2 -0.2 -0.2 -0.2
2  0.8 -0.2 -0.2 -0.2
3 -0.2  0.8 -0.2 -0.2
4 -0.2 -0.2  0.8 -0.2
5 -0.2 -0.2 -0.2  0.8

This arises because x is already a factor with associated levels. Filtering values out of a factor vector does not remove the corresponding entries from its levels attribute, which is why droplevels() exists.
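The behavior described above can be seen in base R alone. This is only a minimal sketch: it uses base subsetting in place of dplyr::filter(), on the assumption that both leave the levels attribute untouched in the same way.

```r
# A factor with 5 levels, 2 observations per level (same as tst$x above)
x <- gl(5, 2)

# Keep only observations from levels 1 to 3
kept <- x[x %in% 1:3]

# The values are gone, but all five levels are still registered
levels(kept)
n_before <- nlevels(kept)

# droplevels() removes the levels that no longer occur in the data
dropped <- droplevels(kept)
levels(dropped)
n_after <- nlevels(dropped)
```

Here n_before is 5 and n_after is 3, which is the mismatch that the contrast matrix inherits.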

We could call droplevels() by default on columns that are already of class factor. This is potentially a slightly breaking change. Existing analyses that already had missing levels should already have been forced to deal with them somehow (e.g., calling droplevels() before set_contrasts()), so adding this would be redundant but harmless in those cases.
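The manual workaround mentioned above might look roughly like the following. To keep the sketch self-contained it uses only base R: `contrasts<-` with contr.sum() stands in for set_contrasts() with scaled_sum_code, which is an assumption made for illustration and not the package's own code path.

```r
set.seed(1)
tst <- data.frame(x = gl(5, 2), y = rnorm(10))

# Filter, then drop the unused levels BEFORE setting contrasts
mdl_data <- droplevels(tst[tst$x %in% 1:3, ])

# Stand-in for set_contrasts(x ~ scaled_sum_code): a 3-level sum coding
contrasts(mdl_data$x) <- contr.sum(3)

# The contrast matrix now matches the levels present in the data,
# so lm() has no reason to discard it
mdl <- lm(y ~ x, data = mdl_data)
mdl$contrasts$x
```

With the levels dropped first, mdl$contrasts$x should hold the contrast matrix that was set, not the string "contr.treatment".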

I'm struggling to identify a case where having unused factor levels in a data frame is actually desirable or exploitable behavior. For instance, if you rbind two data frames that share a factor column, the resulting factor gets the default contrasts over the union of the levels of the original data frames. So, if df has factor column x with levels 1, 2, 3, 4, 5 and contrasts set to scaled_sum_code, rbinding df2 with levels 4, 5 will yield the default contrasts for levels 1, 2, 3, 4, 5. If df2's levels were instead 4, 5, 6, the contrasts would again be the default, now with levels 1, 2, 3, 4, 5, 6.

An additional scenario: if a factor in a data frame has been filtered down to a single level, you get both an error and a warning when fitting the model:

> mdldata2 <- tst |> 
+   dplyr::filter(x == 1) |> 
+   set_contrasts(x ~ scaled_sum_code)
> lm(y ~ 0 + x, data = mdldata2)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
contrasts dropped from factor x due to missing levels 

Based on this, I'm leaning towards dropping levels by default so that the results of the contrastable functions work as expected. It's frustrating to use them and then have a modeling function warn about contrasts, which the package should handle entirely. A default argument .droplevels = TRUE can be added, and an additional message could be displayed when levels are dropped. Dropping levels should also make set_contrasts() and enlist_contrasts() throw the expected error when a factor has only one level, so missing levels can be addressed before attempting to fit a model.
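As a rough sketch of what the proposed behavior could look like, the helper below drops unused levels from factor columns and reports what it removed. The function name drop_unused_levels() is hypothetical and not part of contrastable; only the .droplevels argument name comes from the proposal above, and the message wording is made up for illustration.

```r
# Hypothetical helper, not the package's implementation
drop_unused_levels <- function(data, .droplevels = TRUE) {
  if (!.droplevels) {
    return(data)
  }

  # Only columns that are already factors are touched
  is_fct <- vapply(data, is.factor, logical(1))

  for (col in names(data)[is_fct]) {
    # Levels that are registered but never observed in the column
    unused <- setdiff(levels(data[[col]]), as.character(unique(data[[col]])))

    if (length(unused) > 0) {
      message("Dropping unused levels from ", col, ": ",
              paste(unused, collapse = ", "))
      data[[col]] <- droplevels(data[[col]])
    }
  }

  data
}

tst <- data.frame(x = gl(5, 2), y = rnorm(10))
filtered <- tst[tst$x %in% 1:3, ]
cleaned <- drop_unused_levels(filtered)
untouched <- drop_unused_levels(filtered, .droplevels = FALSE)
```

A step like this would run before the contrast matrices are built, so a factor reduced to one level would be caught when contrasts are set and not later at model fitting.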

Labels: enhancement, invalid
