STAT350-class-notes/multicollinearity.qmd at main · NUstat/STAT350-class-notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
# Multicollinearity

Collinearity refers to the situation when two or more predictor variables have a high linear association. Linear association between a pair of variables can be measured by the correlation coefficient. Thus the correlation matrix can indicate some potential collinearity problems.

## Why and how is collinearity a problem?
*(Source:  page 100-101 of [ISLR](https://drive.google.com/file/d/106d-rN7cXpyAkgrUqjcPONNCyO-rX7MQ/view))*

The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response.

Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for $\hat \beta_j$ to grow. Recall that the t-statistic for each predictor is calculated by dividing $\hat \beta_j$ by its standard error. Consequently, collinearity results in a decline in the $t$-statistic. As a result, **in the presence of collinearity, we may fail to reject $H_0: \beta_j = 0$. This means that the power of the hypothesis test—the probability of correctly detecting a non-zero coefficient—is reduced by collinearity.**

## How to measure collinearity/multicollinearity?
*(Source: page 102 of [ISLR](https://drive.google.com/file/d/106d-rN7cXpyAkgrUqjcPONNCyO-rX7MQ/view))*

Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity. Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF). The VIF is variance inflation factor the ratio of the variance of $\hat \beta_j$ when fitting the full model divided by the variance of $\hat \beta_j$ if fit on its own. The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. Typically in practice there is a small amount of collinearity among the predictors. As a rule of thumb, a **VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.**

The estimated variance of the coefficient $\beta_j$, of the $j^{th}$ predictor $X_j$, can be expressed as:

$$\hat{var}(\hat{\beta_j}) = \frac{(\hat{\sigma})^2}{(n-1)\hat{var}({X_j})}.\frac{1}{1-R^2_{X_j|X_{-j}}},$$

where $R^2_{X_j|X_{-j}}$ is the $R$-squared for the regression of $X_j$ on the other covariates (a regression that does not involve the response variable $Y$).

In case of simple linear regression, the variance expression in the equation above does not contain the term $\frac{1}{1-R^2_{X_j|X_{-j}}}$, as there is only one predictor. However, in case of multiple linear regression, the variance of the estimate of the $j^{th}$ coefficient ($\hat{\beta_j}$) gets inflated by a factor of $\frac{1}{1-R^2_{X_j|X_{-j}}}$ *(Note that in the complete absence of collinearity, $R^2_{X_j|X_{-j}}=0$, and the value of this factor will be 1).*

Thus, the Variance inflation factor, or the VIF for the estimated coefficient of the $j^{th}$ predictor $X_j$ is:

VIF$(\hat \beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}$

Let us consider the dataset *Credit.csv*. We'll compute VIF of the predictors when predicting the credit card balance of customers.

```{r}
#| message: false
#loading library for VIF
library(car)
```

```{r}
# Reading data
credit_data <- read.csv('./Datasets/Credit.csv')
head(credit_data)

# Developing model
model <- lm(Balance~Limit+Rating+Cards+Age, data = credit_data)
summary(model)
```

Note that the coefficients of `Limit` and `Rating` are statistically insignificant at 5\% significance level.

```{r}
#Computing VIF
vif(model)
```

Note that the VIF of `Income` and `Rating` is very high indicating presence of multicollinearity. This may have lead to a high standard error for the coefficients of these predictors, thereby making them appear statistically insignificant.

Let us remove `Rating` from the model as it seems to be redundant based on its high VIF.

```{r}
model <- lm(Balance~Limit+Cards+Age, data = credit_data)
summary(model)
```

Note that now the coefficient of `Limit` is highly statistically significant. This seems to be due to the reduction in standard error of the coefficient, which is potentially due to the reduction in multicollinearity from the model.

Let us re-check the VIF of the predictors of the updated model to see if multicollinearity has indeed been reduced.

```{r}
vif(model)
```

The model no longer suffers from multicollinearity. This is an example that shows that we may make incorrect inference in the presence of multicollinearity.

Note that the $R^2_{adj}$ of the updated model is almost the same as that of the previous model. Thus, the goodness-of-fit of the model seems to be unaffected by multicollinearity. However, this need not always be the case.

## Perfect multicollinearity

The `vif()` function fails to compute the variance inflation factor in case of perfect multicollinearity. In such a scenario, one would like to identify the perfectly multicollinear relationship, and remove the associated predictor(s) to eliminate perfect multicollinearity. The `alias()` function can be used to identify such relationships. Consider the simulated example below.

Let `x1`,...,`x5` be predictor, and `y` be the response.

```{r}
# Simulating the data
x1 = seq(0,1,0.01)
x2 = runif(101)
x3 = x1*2 - 0.5*x2
x4 = rnorm(101)
x5=rnorm(101)
y = x1+x3**2+rnorm(101)
```

From the above simulated data, we can see that the `x1`, `x2`, and `x3` are perfectly multicollinear. Let us try to find the VIF of the predictors.

```{r}
#| eval: false
model <- lm(y~x1+x2+x3+x4+x5)
vif(model)
```

On executing the above code, we obtain the error, *there are aliased coefficients in the model*. This indicates the presence of perfect multicollinearity. The perfectly multicollinear relationship can be identified using the `alias()` function as shown below.

```{r}
alias(lm(y~x1+x2+x3+x4+x5))
```

The above results shows that $x_3 = 2x_1-0.5x_2$. Thus, we can remove any one of $x_1, x_2, x_3$ from the model to eliminate perfect multicollinearity.

```{r}
cor(cbind(x1,x2,x3,y))
```

Let us remove `x1` as it is highly correlated with `x3`. Removing `x1` will reduce multicollinearity more as compared to removing `x2`. Among `x3` and `x1`, `x3` has a higher correlation with `y`, so it is better to remove `x1`.

```{r}
model <- lm(y~x2+x3+x4+x5)
vif(model)
```

## Addressing multicollinearity in polynomial models

In polynomial models, it is likely that $X$ will be highly correlated with $X^p$, resulting in imprecise estimates of the regression coefficents. Let us consider the following simulation example.

```{r}
# Simulating data
X <- seq(5,8,0.01)
y <- X + X^2 + rnorm(length(X))

model <- lm(y~X+I(X^2))
summary(model)
```

Note that the $X$ appears to be statistically insignificant in the model. This may be due to the presence of multicollinearity. Let us compute the correlation between $X$ and $X^2$.

```{r}
cor(X, X^2)
```

The high correlation indicates presence of multicollineairty. Multicollinearity can also be observed with the high value of VIF as shown below.

```{r}
vif(model)
```

Centering the predictors may help reduce multicollinearity as shown below

```{r}
x <- X-mean(X)
cor(x,x^2)
```

The correlation between $X$ and $X^2$, after centering the predictor, is almost 0.

```{r}
model_centered <- lm(y~x+I(x^2))
vif(model_centered)
```

Note that multicollinearity has been eliminated in the model with centered predictor.

Also, the coefficients of both $X$ and $X^2$ are statistically significant *(see model summary below)* as they should be as per the true model.

```{r}
summary(model_centered)
```


## When can we overlook multicollinearity?

- The severity of the problems increases with the degree of the multicollinearity. Therefore, if there is only moderate multicollinearity *(5 < VIF < 10)*, we may overlook it.

- Multicollinearity affects only the standard errors of the coefficients of collinear predictors. Therefore, if multicollinearity is not present for the predictors that we are particularly interested in, we may not need to resolve it.

- Multicollinearity affects the standard error of the coefficients and thereby their $p$-values, but in general, it does not influence the prediction accuracy, except in the case that the coefficients are so unstable that the predictions are outside of the domain space of the response. If our sole aim is prediction, and we don’t wish to infer the statistical significance of predictors, then we may avoid addressing multicollinearity. "*The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations, provided these inferences are made within the region of observations" - Neter, John, Michael H. Kutner, Christopher J. Nachtsheim, and William Wasserman. "Applied linear statistical models." (1996): 318.*