STAT350-class-notes/influential_points.qmd at main · NUstat/STAT350-class-notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
# Outliers, High leverage and Influential Points

Let us again consider the `Hitters` dataset to identify outliers, high leverage, and influential points. Recall the meaning of these terms:

**1. Outliers:** These are outlying observations with respect to the response ($Y$). If there is a relatively large gap between the prediction and the observed response, the observation may be an outlier. Or, in other words, observations that correspond to large residuals are likely to be outliers.

**2. High Leverage points:** These are outlying observations with respect to the predictors ($X$). If there is a relatively large gap between an observation and the center of the rest of the observations in terms of predictor values, then the observation has a high leverage in influencing the regression function. However, it may or may not influence it. It will influence it only if the observation is an outlier as well, i.e., outlying observation with respect to the response ($Y$) as well.

**3. Influential points:** Observations that are both outliers and having high leverage tend to influence the regression function. Such observations should be analyzed, and a decision should be made whether they should be included in model fitting or not. If they appear to be incorrect, then they may be discarded.

```{r}
#| message: false
# Importing libraries
rm(list=ls())
library(ISLR)
library(leaps)
library(car)
```

We will read the `Hitters` dataset, and use the best model based on the $BIC$ criterion as identified in the [previous chapter](https://nustat.github.io/STAT350-class-notes/model_selection.html#best-subset-selection).

```{r}
# Reading data
data = Hitters

# Removing missing values
data <- data[complete.cases(data),]

# Identifying the predictors of the best model based on on the BIC criterion
regfit.full <- regsubsets(Salary ~ ., data = Hitters,
                          nvmax = 19)
reg.summary <- summary(regfit.full)
coef(regfit.full, which.min(reg.summary$bic))
```

Let us fit the best model based on the above predictors.

```{r}
# Fitting the best model
model <- lm(Salary~AtBat+Hits+Walks+CRBI+Division+PutOuts, data = data)
```

## Outliers

Let us visualize the deleted studentized residuals to identify outliers.

```{r}
# Number of observations
n = nrow(data)

# Number of predictors
p = 7 # including the intercept
```

```{r}
# Studentized deleted residuals
plot(1:n, rstudent(model), type = "l")
text(1:n, rstudent(model))
```

We'll use the Bonferroni test to identify the value of the studentized deleted residual over which an observation will be considered an outlier

```{r}
# Critical value for the residual to be identified as an outlier
alpha <- 0.05
crit <- qt(1-alpha/2/n, n-p-1) #Bonferroni
which(abs(rstudent(model)) >=crit )
```

We have only one outlying observation, which is the $173rd$ observation in the data (also shown below).

```{r}
# Outliers
plot(1:n, rstudent(model), type = "l",
     ylab = "Deleted studentized residual", xlab = "Index")
text(1:n, rstudent(model))
abline(h=crit, col = 'red')
```

Let us analyze this observation.

```{r}
data[173,]
```

This observation is for Mike Schmidt whose salary is \$2127. However, his predicted salary based on the model is \$87 (shown below):

```{r}
predict(model, newdata = data[173,])
```

The observation is an outlier because players similar to Mike Schmidt, in terms of `AtBat`, `Hits`, `Walks`, `CRBI`, `Division`, and `PutOuts` have a low salary. Let us see the salary of players similar to Mike Schmidt.

```{r}
std_data <- data[,c('AtBat','Hits','Walks','CRBI','PutOuts')]
std_data <- (std_data-apply(std_data,2,mean))/apply(std_data,2,sd)
distance <- sqrt(apply(((std_data[rep(173,n),c('AtBat','Hits','Walks','CRBI','PutOuts')] - std_data[,c('AtBat','Hits','Walks','CRBI','PutOuts')])^2),1,sum))
data_distance <- cbind(data, distance, predicted_salary=model$fitted.values)
data_distance[order(by = distance),][1:5,c('AtBat','Hits','Walks','CRBI','PutOuts','Salary','predicted_salary')]
```

We can see salary of players having the most similar statistics to Mike Schmidt is much lower than him.

Let us see the players for which the predicted salary is more than \$1500

```{r}
cbind(data[model$fitted.values>1500,c('AtBat','Hits','Walks','CRBI','PutOuts','Salary')],predicted_salary = model$fitted.values[model$fitted.values>1500])
```

Note that these players have statistics much different than Mike Schmidt. Thus, Mike Schmidt is an outlier with respect to the response.

Note the model coefficients to see the effect of statistics on predicted salary.

```{r}
model$coefficients
```

The outlying observation can also be seen clearly if we plot the fitted values against salary.

```{r}
plot(data$Salary, model$fitted.values, col = 'orange', pch = 16, main = "Fitted vs true values")
points(data$Salary[173], model$fitted.values[173], col = 'red', pch = 19)
text(data$Salary[173], model$fitted.values[173], "Mike Schmidt", pos = 3)
abline(0,1, col = 'blue')
```


## Leverage

Let us visualize the leverage of all observations.

```{r}
# Leverage
plot(1:n, hatvalues(model), type = "l")
text(1:n, hatvalues(model))
```

```{r}
# Average leverage
p/n
```

Observations whose leverage is more than twice as large as the mean leverage can be considered as high leverage.

```{r}
sum(hatvalues(model)>2*p/n)
```

We have 24 observations that can be considered to have a high leverage (visualized below).

```{r}
# High leverage
plot(1:n, hatvalues(model), type = "l")
text(1:n, hatvalues(model))
abline(h=2*p/n, col = 'red')
```

## Influential observations

Observations that have a high leverage and are outliers are  influential. The degree of influence depends on the leverage and the magnitude of the residual.

### Cook's distance

Let us visualize influential observations based on Cook's distance.

```{r}
plot(1:n, cooks.distance(model), type = "l")
text(1:n, cooks.distance(model))
```

If Cook's distance for an observation is at 50th or higher percentile of $F(p,n-p)$ distribution, then the observation is highly influential, while if it is less than 20th percentile, then it is not influential.

```{r}
#50th percentile value of F(p,n-p) distribution
qf(0.5, 7, n-7)

#20th percentile value of F(p,n-p) distribution
qf(0.2, 7, n-7)
```

As the highest Cook's distance in our observations is 0.2, we don't have any influential observations. However, the 173rd observation is relatively more influential than the rest.

As the influence of an observation depends on its residual, and leverage, a plot of residual against leverage also indicates influence of observations. In the plot below, the size of the points is proportional to the influence.

```{r}
plot(hatvalues(model), rstudent(model), cex = 10*cooks.distance(model), ylab = "Deleted studentized residual", xlab = "Leverage")
abline(h=0,col=2)
```

We can see that the observation with the deleted studentized residual of more than 6 is the most influential. Let us compare the model fit with and without this observation.

```{r}
model_minus173 <- lm(Salary~AtBat+Hits+Walks+CRBI+Division+PutOuts, data = data[-173,])
```


```{r}
# Original model
summary(model)$r.squared
summary(model)$adj.r.squared

# Model after removing the 173rd observation
summary(model_minus173)$r.squared
summary(model_minus173)$adj.r.squared
```

As expected, the model fit does seem to improve if we remove the outlier, as the outlier corresponds to a high prediction error, and thereby a high unexplained variation.

However, as the 173rd observations is not influential, the regression function should not change substantially. Let us compute the average percentage change in fitted values if we fit the model without the 173rd observation.

```{r}
# fitted values using all observations
pred_all <- model$fitted.values
# fitted values using all but the 173rd observation
pred_all_minus173 <- model_minus173$fitted.values

mean(abs((pred_all_minus173-pred_all[-173])/pred_all[-173]*100))
```

The average change in predictions is 7\%, if the 173rd observation is removed.

### DFFITS

Let us visualize the influence of an observation on the prediction for that observation.

```{r}
plot(1:n, dffits(model), type="l")
text(1:n, dffits(model))
```

For the 173rd observation, $DFFITS$ is more than one indicating a large influence on the prediction (or fitted value) for that observation.

```{r}
# Prediction on the 173rd observation with model fitted on all data
predict(model, newdata = data[173,])
```

```{r}
# Prediction on the 173rd observation with model fitted on all data except the 173rd observation
predict(model_minus173, newdata = data[173,])
```

The change in the predicted value is 89\% if the 173rd observation is removed from the data. Although the 173rd observation doesn't have a high influence overall, it does have a high influence on prediction at the point itself.

### DFBETAS

Let us observe the influence of each observation on the regression coefficients.

Influence on the intercept:

```{r}
# Intercept is the 1st predictor
plot(1:n, dfbetas(model)[,1], type="l")
text(1:n, dfbetas(model)[,1])
```

As DFBETAS for the 173rd observation is greater than one, the observation has a high influence on the intercept term

```{r}
model$coefficients
```

```{r}
model_minus173$coefficients
```

The intercept changes by 89\% from \$91 to \$10 when the 173rd observation is removed.

Influence on the `AtBat`:

```{r}
# AtBat is the 2nd predictor
plot(1:n, dfbetas(model)[,2], type="l")
text(1:n, dfbetas(model)[,2])
```

None of the observations have a high influence on the coefficient of `AtBat`.

Similary, we can visualize the influence of observations on all the regression coefficients.