STAT350-class-notes/Diagnostics_Remedial_Measures.qmd at main · NUstat/STAT350-class-notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
# Diagnostics and Remedial measures (SLR)

Let us consider the `Auto` dataset from the `ISLR` library. The library is related to the book *[Introduction to Statistical Learning](https://hastie.su.domains/ISLR2/ISLRv2_website.pdf)*

## Loading libraries

```{r}
#| message: false
library(ISLR) #for the dataset
library(ggplot2) #for making plots
library(lmtest) #for bptest()
library(MASS) #for boxcox()
```


## Reading data

```{r}
auto_data <- Auto
```

## Developing model

Suppose we wish to predict `mpg` based on `horsepower`.

```{r}
linear_model <- lm(mpg~horsepower, data = auto_data)
```

## Model fit

Let us visualize the model fit to the data.

```{r}
ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
  geom_point()+
  geom_smooth(method = "lm")
```

The plot indicates that a quadratic relationship between `mpg` and `horsepower` may be more appropriate.

## Diagnostic plots & statistical tests

Let us make diagnostic plots that help us check the model assumptions.

```{r}
par(mfrow = c(2,2))
plot(linear_model)
```

### Linear relationship

#### Visual check

We can check the linearity assumption visually by analyzing the plot of residuals against fitted values. The plot indicates that a quadratic relationship between the response and the predictor may be more appropriate than a linear one.

#### Statistical test

We'll use the *F* test for lack of fit to check the linearity assumption

```{r}
#Full model
full_model <- lm(mpg~as.factor(horsepower), data = auto_data)
```

```{r}
# F test for lack of fit
anova(linear_model, full_model)
```

As expected based on the visual check, the linear model fails the linearity test with a very low $p$-value of the order of $10^{-14}$. Thus, we conclude that relationship is non-linear.

### Constant variance (Homoscedasticity)

#### Visual check

The plot of the residuals against fitted values indicates that the error variance is increasing with increasing values of predicted `mpg` or decreasing `horsepower`. Thus, there seems to be heteroscedasticity.

#### Statistical test

**Breusch-Pagan test**

Let us conduct the Breusch-Pagan test for homoscedasticity. This is a large sample test, which assumes that the error terms are independent and normally distruted, and that the variance of of the error term $\epsilon_i$, denoted by $\sigma_i^2$, is related to the level of the predictor $X$ in the following way:

$\log(\sigma_i^2) = \gamma_0 + \gamma_1X_i$.

This implies that $\sigma_i^2$ either increases or decreases with the level of $X$. Constant error variance corresponds to $\gamma_1 = 0$. The hypotheses for the test are:

$H_0: \gamma_1 = 0$
$H_a: \gamma_1 \ne 0$

```{r}
bptest(linear_model)
```

As expected based on the visual check, the model fails the homoscedasticity with a low $p$-value of 0.3\%.


**Brown-Forsythe test**

To conduct the Brown-Forsythe test, we divide the data into two groups, based on the predictor $X$. If the error variance is not constant, the residuals in one group will tend to be more variable than those in the other group. The test consists of a two-sample $t$-test to determine whether the mean of the absolute deviations from the median residual of one group differs significantly from the mean of the absolute deviations from the median residual of the other group.

```{r}
#Break the residuals into 2 groups
residuals_group1 <- linear_model$residuals[auto_data$horsepower<=100]
residuals_group2 <- linear_model$residuals[auto_data$horsepower>100]
```

```{r}
#Obtain the median of each group
median_group1 <- median(residuals_group1)
median_group2 <- median(residuals_group2)
```

```{r}
#Two-sample t-test
t.test(abs(residuals_group1-median_group1), abs(residuals_group2-median_group2), var.equal = TRUE)
```

Brown-Forsythe test also shows that there is heteroscedasticity.

### Normality of error terms

#### Visual check

The QQ plot indicates that the distribution of the residuals is slightly right-skewed. This right-skew can be visualized by making a histogram or a density plot of residuals:

```{r}
par(mfrow = c(1,1))
hist(linear_model$residuals, breaks = 20, prob = TRUE)
lines(density(linear_model$residuals), col = 'blue')
```

#### Statistical test

```{r}
shapiro.test(linear_model$residuals)
```

As expected based on the visual check, the model fails the test for normal distribution of error terms with a low $p$-value of the order of $10^{-5}$.

### Independence of error terms

This test is relevant if there is an order in the data, i.e., if the observations can be arranged based on time, space, etc. In the current example, there is no order, and so this test is not relevant.

## Remedial Measures

If the linear relationship assumption violated, it results in biased estimates of the regression coefficients resulting in a biased prediction. As the estimates are biased, their variance estimates may also be inaccurate. Thus, this is a very important assumption as it leads to inaccuracies in both point estimates and statistical inference.

If the homoscedasticity assumption is violated, but the linear relationship assumption holds, the OLS estimates are still unbiased resulting in an unbiased prediction. However, their standard error estimates are likely to be inaccurate.

If the assumption regarding the normal distribution of errors is violated, the OLS estimates are still BLUE (Best linear unbiased estimates). The confidence interval for a point estimate, though effected, is robust to departures from normality for large sample sizes. However, the prediction interval is sensitive to the same.

Our model above (`linear_model`) violates all the three assumptions.

Note that there is no one unique way of fixing all the model assumption violations. Its an iterative process, and there is some trial and error involved. Two different approaches may lead to two different models, and both models may be more or less similar *(in terms of point estimates / inference)*

### Addressing linear relationship assumption violation

Let us address the linear relationship assumption first.

As indicated in the diagnostic plots, the relationship between $X$ and $Y$ seemed to be quadratic. We'll change the simple linear regression model to the following linear regression model, where $Y$ and $X$ have a quadratic relationship:

$Y_i = \beta_0 + \beta_1X_i + \beta_2X_i^2$

Let us fit the above model to the data:

```{r}
quadratic_model <- lm(mpg~horsepower + I(horsepower**2), data = auto_data)
```

Note that in the formula specified within the `lm()` function, the `I()` operator isolates or insulates the contents within `I(…)` from the regular formula operators. Without the `I()` operator, `horsepower**2` will be treated as the interaction of horsepower with itself, which is horsepower. Thus, to add the square of horsepower as a separate predictor, we need to use the `I()` operator.

After every transformation or remedial measure, we should diagnoze the model to check the model assumptions are satisfied.

```{r}
par(mfrow=c(2,2))
plot(quadratic_model)
```

We observe that the departure from the linear relationship assumption has been reduced. However, there still seems to be heteroscedasticity and non-normal error terms. Let us verify this with statistical tests.

```{r}
#F test for lack of fit
anova(quadratic_model, full_model)
```

Even though the model still fails the linearity test, the extent of violation is reduced as the $p$-value is now of the order of $10^{-5}$ instead of $10^{-14}$.

```{r}
# Breusch-Pagan test
bptest(quadratic_model)
```

The quadratic model has a more severe heteroscedasticity as compared to the linear model based on the $p$-value.

```{r}
shapiro.test(quadratic_model$residuals)
```

The severity of departure from normality has been reduced in the quadratic model.

### Addressing heteroscedasticity

Let us address the heteroscedasticity assumption.

We can see from the diagnostic plot of the quadratic model that the error variance is higher for higher fitted values. A possible way to resolve this is to transform the response $Y$ using a concave function such as log($Y$) or $\sqrt{Y}$. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity. Let us try the log($Y$) transformation to address heteroscedasticity.

```{r}
log_model <- lm(log(mpg)~horsepower+I(horsepower**2), data = auto_data)
par(mfrow=c(2,2))
plot(log_model)
```

Based on the plot of residuals against fitted values, heteroscedasticity seems to be mitigated to some extent. Let us confirm with the statistical test.

```{r}
bptest(log_model)
```

Assuming a significance level of $5\%$, we do not reject the null hypothesis that the error variance is constant. Thus, the homoscedasticity assumption holds on the log transformed model.

The linear relationship assumptions seems to be satisfied. Let us check it via the $F$ lack of fit test.

```{r}
log_full_model <- lm(log(mpg)~as.factor(horsepower), data = auto_data)
anova(log_model, log_full_model)
```

Although the $p$-value is higher than both the previous models indicating that the departure from linear relationship has been reduced, the model fails the $F$ lack of fit test. However, from the residual plot, there didn't seem to be a strong departure from the linear relationship assumption. The failure in the statistical test may be due to the presence of outliers in the data. Let us visualize the model full model along with the

```{r}
colors <- c("Full log model" = "red", "Log linear model" = "blue")

ggplot(data = auto_data, aes(x = horsepower, y = log(mpg)))+
  geom_point(col = "orange")+
  geom_line(aes(y = log_full_model$fitted.values, color = "Full log model"))+
  geom_line(aes(y = log_model$fitted.values, color = "Log linear model"))+
    scale_color_manual(values = colors)+
  theme(legend.title = element_blank(),
        legend.position = c(0.85,0.9))
```

We observe that the full log model seems to be effected by outlying observations and explaining some of the noise. With the addition of more replicates at the $X$ values corresponding to the outlying observations, the model may have passed the $F$ lack of fit test. Thus, in this case, it appears that the relationship is indeed linear, but the failure in the statistical test is due to the full model over-fitting the data and explaining some noise as well. Another option is to bin the `horsepower` variable such that each bin has several replicates, and then do the $F$ lack of fit test.

From the QQ plot, it appears that the error distribution is normal, except for three outlying values. Let us confirm the same from the normality test.

```{r}
shapiro.test(log_model$residuals)
```

The $p$-value is high supporting the claim that the model satisfies the assumption of normality of error terms. The $p$-value would probably be higher if it was not for those three outlying points. The next step could be to analyze those three points to figure out the reason they are outlying values. For example, those three points may correspond to a particular car model that has only three observations in the data.

## Box-Cox transformation

Sometimes it may be difficult to determine the appropriate transformation from diagnostic plots to correct the skewness of the error terms, unequal error variances, and non-linearity of the error function. The Box-Cox procedure automatically identifies a transformation from the family of power transformations on $Y$. The family of power transformations is of the form:

$Y' = Y^{\lambda}$

The normal error regression model with the response variable a member of the family of power transformations becomes:

$Y_i^{\lambda} = \beta_0 + \beta_1X_i$

The Box-Cox procedure uses the method of Maximum likelihood to estimate $\lambda$, as well as other parameters $\beta_0, \beta_1$, and $\sigma^2$.

Let us use the Box-Cox procedure to identify the appropriate transformation for the `linear_model`.

```{r}
boxcox(linear_model)
```

A plot of the log-likelihood vs $\lambda$ is returned with a confidence interval for the optimal value of $\lambda$. We can zoom-in the plot to identify the appropriate value of $\lambda$ where the log-likelihood maximizes.

```{r}
boxcox(linear_model, lambda = seq(-1, 0, 0.1))
```

A value of $\lambda = -0.5$ seems appropriate. Note that the log-likelihood is based on a sample. Thus, typically we don't select the exact value of $\lambda$ where the log-likeliood maximizes, but any value that is easier to interpret within the confidence interval. Let us consider $\lambda = -0.5$ in this case.

```{r}
boxcox_model <- lm(mpg^-0.5~horsepower, data = auto_data)
par(mfrow=c(2,2))
plot(boxcox_model)
```

Based on the plots, the model seems to satisfy the normality of errors and homoscedasticity assumption, but there appears to be a quadratic relationship between the response and the predictor. Let us confirm our visual analysis with statistical tests.

```{r}
bptest(boxcox_model)
```

The homoscedasticity assumption is violated, but not too severely.

```{r}
shapiro.test(boxcox_model$residuals)
```

Box-Cox transformation indeed resulted in the normal distribution of error terms.

```{r}
power_full_model <- lm(mpg^-0.5~as.factor(horsepower), data = auto_data)
anova(boxcox_model, power_full_model)
```

As expected based on the visual analysis, the relationship has a large departure from linearity.

Note that the Box-Cox transformation can identify the appropriate transformation for the response, but not for the predictors. Thus, we need to figure out the appropriate predictor transformation based on the diagnostic plots and/or trial and error. Let us try the quadratic transformation of the predictor $X$ as the relationship appears to be quadratic based on the residual plot.

```{r}
quadratic_boxcox_model <- lm(mpg^-0.5~horsepower+I(horsepower**2), data = auto_data)
par(mfrow = c(2,2))
plot(quadratic_boxcox_model)
```

Note that the severity of the linearity assumption violation has been reduced, while the homoscedasticity and normality of errors assumptions seem to be satisfied. Note that transformation of the predictor $X$ doesn't change the distribution of errors. Thus, it was expected that the transformation on $X$ will not deteriorate the model in terms of the assumptions regarding the distribution of error terms.

Let us confirm our intuition based on diagnostic plots with statistical tests.

```{r}
bptest(quadratic_boxcox_model)
```

As expected, the errors are homoscedastic.

```{r}
shapiro.test(quadratic_boxcox_model$residuals)
```

There is a slight deviation from the normal distribution, but its only due to three points as we saw earlier.

```{r}
anova(quadratic_boxcox_model, power_full_model)
```

The model fails the linearity assumption test. But as noted earlier, this failure is due to the presence of very few or no replicates at some points. This potentially results in the full model explaining the noise, which in turn results in the model failing the linearity test. Again, we can bin the predictor values to have replicates for every bin, and then the model is likely to satisfy the linearity assumption.

As shown in the plot below, the full model is overfitting the data, and explaining some noise due to the absence of replicates for some values of horsepower.

```{r}
ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
  geom_point()+
  geom_line(aes(y = power_full_model$fitted.values^-2), color = "blue")
```

To resolve this issue, we will bin the predictor values, so that we have replicates for each distinct value of the predictor, and get a better fit for the full model

```{r}
#Binning horsepower into 20 equal width bins
horsepower_bins <- cut(auto_data$horsepower, breaks = 20)
```

```{r}
# Full model based on binned horsepower
power_full_model_binned_hp <- lm(mpg^-0.5~horsepower_bins, data = auto_data)
anova(quadratic_boxcox_model, power_full_model_binned_hp)
```

Note that now the `quadratic_boxcox_model` model saitisfies the linearity assumption *(assuming a significance level of 5\%)*.

Note that our full model based on the binned `horsepower` values seems to better fit the data as shown in the plot below.

```{r}
colors <- c("Full model (binned predictor)" = "red", "Quadratic Box-Cox model" = "blue")
ggplot(data = auto_data, aes(x = horsepower, y = mpg))+
  geom_point()+
  geom_line(aes(y = quadratic_boxcox_model$fitted.values^-2, color = "Quadratic Box-Cox model"))+
  geom_line(aes(y = power_full_model_binned_hp$fitted.values^-2, color = "Full model (binned predictor)"))+
    scale_color_manual(values = colors)+
  theme(legend.title = element_blank(),
        legend.position = c(0.75,0.85))
```