STAT350-class-notes/Assignment F.qmd at main · NUstat/STAT350-class-notes · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
---
title: "Assignment F"
subtitle: "Model selection"
format:
  html:
    toc: true
self-contained: true
link-external-newwindow: true
editor_options:
  chunk_output_type: console
---

## Instructions {.unnumbered}

1.  You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

2.  Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

3.  Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

4.  The assignment is worth 100 points, and is due on **21st November 2023 at 11:59 pm**.

5.  **Five points are properly formatting the assignment**. The breakdown is as follows:

-   Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
-   There aren't excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren't long printouts of which iteration a loop is on, there aren't long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
-   Final answers of each question are written clearly (1 pt).
-   The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

## Energy model

The datasets *ENB2012_Train.csv* and *ENB2012_Test.csv* provide data on energy analysis using different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. Below is the description of the data columns:

1. `X1`: Relative Compactness
2. `X2`: Surface Area
3. `X3`: Wall Area
4. `X4`: Roof Area
5. `X5`: Overall Height
6. `X6`: Orientation
7. `X7`: Glazing Area
8. `X8`: Glazing Area Distribution
9. `Y1`: Heating Load

### Number of models

Suppose that we want to implement the best subset selection algorithm to find the first order predictors (`X1`-`X8`) that can predict heating load (`y1`). How many models for $E$(`Y1`) are possible, if the model includes (i) one variable, (ii) three variables, and (iii) eight variables? Show your steps without running any code.

Note: The notation $E$(`Y1`) means the expected value of `Y1` or the mean of `y1`.

*(3 points)*

### Resolving perfect multicollinearity

Fit a linear regression model and attempt to find the VIF of each predictor. It will throw an error. This is because of the presence of perfect multicollinearity in the predictors. Identify the perfectly collinear relationship, and discard a predictor to eliminate perfect multicollinearity. Among the collinear predictors, discard the predictor that is the most highly correlated with the one of the predictors not in the set of collinear predictors. State the discarded predictor, and discard it for all the questions in this assignment.

**Hint:**

You may use the R function `alias()` to identify the perfectly collinear relationship. Check out an example of this function [here](https://nustat.github.io/STAT350-class-notes/multicollinearity.html#perfect-multicollinearity).

*(2+2 = 4 points)*


### Best model(s)

Implement the best subset selection algorithm to find the *best* first-order predictors of heating load `Y1`, with respect to $R^2$, Adj-$R^2$, $C_p$, and the $BIC$ criteria.

Note:

1. Use *ENB2012_Train.csv* and consider only the first-order terms.
2. Make plots visualizing the $R^2$, Adj-$R^2$, $C_p$, and the $BIC$ criteria for the best model corresponding to each possible value of number of predictors.

Print the coefficients of all the models that are the best based on at least one criteria among Adj-$R^2$, $C_p$, and $BIC$.

*(2 + 4 + 2 = 8 points)*

### Best among best?

We need to choose one among all the best models identified in the previous question. Use the following approach:

1. Compare the coefficients of each model when fitted on the training data with the coefficients when fitted on the test data. If the coefficients are similar, it means that the model is generalizable to unseen data. Otherwise, the model has issues, and you may discard it.

2. Check the prediction error *(root mean squared prediction error)* on the test data. If the predictor error of one model is much lower than the rest, then you may select that model.

3. Check the model assumptions regarding homoscedasticity and normal distribution of error terms using statistical tests. You may select a model if it has much lesser departure from assumptions as compared to the rest of them.

4. If there is no clear choice based on (1), (2), and (3), perform the general linear F test *(with a significance levlel of 5\%)* to choose the most parsimonious model.

Print the coefficients of the chosen model.

*(2x4 = 8 points)*


### Overfitting

Compute the estimate of the error standard deviation. Compare it with the RMSE on test data, and comment if the chosen model in the previous question ([F.1.4](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#best-among-best)) seems to overfit the data.

*(2 + 2 = 4 points)*

### Best model vs all-predictor model

Calculate the RMSE of the model selected in [F.1.4](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#best-among-best). Compare it with the RMSE of the model using all first-order predictors *(except the one dropped in [F.1.2](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#resolving-perfect-multicollinearity) to address perfect multicollinearity)*. You will find that the two RMSEs are similar. Seems like the best subset model didn't help improve prediction.

1. Why did the best subset model not help improve prediction accuracy as compared to the model with all the predictors?

2. Does the best subset model provide a more accurate inference as compared to the model with all the predictors? Why or why not?

*(3 + 3 = 6 points)*

### Two-factor Interactions

Let us consider adding all the 2-factor interactions of all the predictors in the model *(except the one dropped in [F.1.2](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#resolving-perfect-multicollinearity) to address perfect multicollinearity)*. Answer the following questions without running code.

1. How many predictors do we have in total?

2. Assume best subset selection is used. How many models are fitted in total?

3. Assume forward selection is used. How many models are fitted in total?

4. Assume backward selection is used. How many models are fitted in total?

5. How many models will be developed in the iteration that contains exactly 10 predictors in each model – for best subsets, fwd/bwd regression?

6. What approach would you choose for variable selection (amonst best subset, forward selection, backward selection)?

*(6x2 = 12 points)*

### Interaction model

Use best subset selection to find the *best* first-order predictors and 2-factor interactions of the predictors of `Y1` (Heating Load). In case different criteria indicate different models as best, choose the most parsimonious model.

Perform the general linear test to show that the model with interactions should be preferred over the additive model selected earlier (in [F.1.4](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#best-among-best)).

Find the RMSE of the best model with interactions on the test data. Is it lower than the RMSE of the additive model (selected earlier in [F.1.4](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#best-among-best))?

In case of numerical issues, check if there is perfect multicollinearity and remove the corresponding interaction(s) to eliminate the issue.

**Hint:** The R syntax to include 2 factor interactions for all the predictors in the dataset is:

```{r}
#| eval: false
model <- lm(Y~.^2, data = data)
```

where `Y` is the response, and `data` is the dataset that consists of all the predictors and the response.

If you wish to remove a certain interaction from the above model, say the interaction between `x1` and `x2`, you can use the following syntax:

```{r}
#| eval: false
model <- lm(Y~.^2-x1:x2, data = data)
```

*(2 + 2 + 2 + 2 = 8 points)*

### Polynomial terms

In the model developed in the previous question ([F.1.8](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#interaction-model)), consider including the polynomial transformation of degree 2 for each predictor, and then repeat the same procedure:

Use best subset selection to find the *best* first-order & second order predictors including 2-factor interactions of the predictors to predict `Y1` (Heating Load). In case different criteria indicate different models as best, choose the most parsimonious model.

Perform the general linear test to show that the model with interactions & polynomial terms of degree 2 should be preferred over the model with interactions but no polynomial terms developed in the previous question ([F.1.8](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#interaction-model)).

Find the RMSE of the best model with 2-factor interactions and polynomial terms of degree 2, on the test data. Is it lower than the RMSE of the model with 2-factor interactions but no polynomial terms developed in the previous question ([F.1.8](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#interaction-model))?

In case of numerical issues, check if there is perfect multicollinearity and remove the corresponding polynomial term(s) to eliminate the issue.

**Hint:** The R syntax to include the polynomial transformation of degree 2 of say, the predictor `x1`, in an additive model is:

```{r}
#| eval: false
model <- lm(Y~.+I(x1^2), data = data)
```

*(2 + 2 + 2 + 2 = 8 points)*

### Forward stepwise selection

Repeat the above question, but use forward stepwise selection instead of best subset selection. Compare the time taken in forward stepwise with that taken in best subset selection in the previous question ([F.1.9](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#polynomial-terms)). Which approach is faster, and by what factor?

Also, compare the RMSE on test data of the model chosen with the best subset selection approach ([F.1.9](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#polynomial-terms) with the model chosen with the forward stepwise selection procedure. Which approach seems to provide a more accurate model?

*(2 + 2 = 4 points)*

### Backward stepwise selection

Repeat the above question, but use backward stepwise selection instead of best subset selection. Compare the time taken in backward stepwise with that taken in best subset selection in the earlier question ([F.1.9](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#polynomial-terms)). Which approach is faster, and by what factor?

Also, compare the RMSE on test data of the model chosen with the best subset selection approach ([F.1.9](https://nustat.github.io/STAT350-class-notes/Assignment%20F.html#polynomial-terms) with the model chosen with the backward stepwise selection procedure. Which approach seems to provide a more accurate model?

*(2 + 2 = 4 points)*

## Forward vs backward vs best subset approach

Mention one advantage of each of the 3 approaches (forward stepwise selection, backward stepwise selection, best subset selection) over the other 2 approaches.

*(2x3 = 6 points)*

## K-fold cross validation approach

We will use a new approach to develop a model for predicting `Y1`. The approach is based on considering terms based on the effect hierarchy principle and assessing if a term should be added to the model based on the K-fold cross-validation error.

In this question, you will use $K$-fold cross validation to find out the relevant interactions of these predictors and the relevant interactions of the polynomial transformations of these predictors for predicting `Y1`. We'll consider quadratic and cubic transfromations of each predictor, and the interactions of these transformed predictors. For example, some of the interaction terms that you will consider are (`X1`$^2$), (`X1`)(`X3`), (`X1`$^2$)(`X3`), (`X1`)(`X3`)(`X4`), (`X1`)(`X3`$^2$)(`X4`), (`X1`)(`X3`$^2$)(`X4`)(`X5`$^3$), etc. The highest degree interaction term will be of degree 21 - (`X1`$^3$)(`X3`$^3$)(`X4`$^3$)(`X5`$^3$)(`X6`$^3$)(`X7`$^3$)(`X8`$^3$), and the lowest degree interaction terms will be of degree two, such as (`X1`$^2$), (`X1`)(`X2`), etc.

The algorithm to find out the relevant interactions using $K$-fold cross validation is as follows. Most of the algorithm is already coded for you. You need to code only part 2 as indicated below.

1. Start with considering interactions of degree `d` = 2:

2. **Find out the 10-fold cross validation RMSE if an interaction of degree `d` is added to the model** *(You need to code only this part)*.

3. Repeat step (2) for all possible interactions of degree `d`.

4. Include the interaction of degree `d` in the model that leads to the highest reduction in the 5-fold cross validation error as compared to the previous model *(forward stepwise selection based on K-fold cross validation)*

5. Repeat steps 2-4 until no more reduction is possible in the 10-fold cross validation RMSE.

6. If `d` = 21, or if no term of a particular degree could reduce the $K$-fold cross-validation error further, then stop, otherwise increment `d` by one, and repeat steps 2-5.

The above algorithm is coded below. The algorithm calls a function `KFoldCV` to compute the 10-fold cross validation RMSE given the interaction terms already selected in the model as `selected_interactions`, and the interaction term to be tested as `interaction_being_tested`. The function must return the 10-fold cross validation RMSE if `interaction_being_tested` is included to the model consisting of `X1`, `X3`, `X4`, `X5`, `X6`, `X7`, and `X8` in addition to the already added interactions in `selected_interactions`. The model equation for which you need to find the $10$-fold cross validation RMSE will be`'price~X1+X3+X4+X5+X6+X7+X8'+selected_interactions+interaction_being_tested`

You need to do the following:

1. Fill out the function `KFoldCV`.

2. Execute the code to obtain the relevant interactions in `selected_interactions`. Print out the object `selected_interactions`.

3. Fit the model with the seven predictors, the `selected_interactions` and compute the RMSE on test data.

*(15 + 2 + 3 = 20 points)*

```{r}
#| eval: false
# Creating a dataframe that will consist of all combinations of polynomial transformations of the
# predictors to be considered for interactions
library(AlgDesign)
degree <- gen.factorial(4, 7,center = FALSE,
  varNames=c("X1","X3", "X4","X5","X6", "X7","X8"))
degree <- degree-1
degree$sum_degree <- apply(degree,1,sum)
degree$count_zeros <- apply(degree,1,FUN = function(x) sum(x[1:7]==0))
degree <- degree[order(degree$sum_degree,-degree$count_zeros ),]
degree <- degree[-1,]
degree <- degree[,-9]
head(degree)
```

```{r}
#| eval: false
k=10
fold_size = round(nrow(ENB_train)/k)

# Fill out this function - that is all you need to do to make the code work!

# The function must return the mean 10-fold cross validation RMSE for the model
# that has the 7 individual predictors - 'X1', 'X3', 'X4', 'X5', 'X6', 'X7', and 'X8',
# the 'selected_interactions', and the 'interaction_being_tested'

KFoldCV <- function(selected_interactions, interaction_being_tested)
{
  #
  # for(i in 1:k)
  # {

    # formulation <- paste0("Y1~.",selected_interactions,interaction_being_tested)
    # model <- lm(noquote(formulation), data = train_set)

  # }
  # return()
}
```


```{r}
#| eval: false
# This code implements the algorithm of systematically considering interactions of degree 2 and going upto the interaction of degree 21 For a given degree 'd' the interactions are selected greedily based on highest reduction in the 10-fold cross validation RMSE. Once no more reduction in the 10-fold cross validation RMSE is possible using interactions of degree 'd', interaction terms of the next higher degree 'd+1' are considered. If there is no interaction of order 'd' that reduces the 10-fold cross validation RMSE, the procedure is stopped.

# 10-fold cross validation RMSE of the initial model with the 7 predictors of degree one
cv_previous_model = KFoldCV(selected_interactions = '', interaction_being_tested = '')
pred_names<-colnames(degree)
interaction_being_tested = '+'
selected_interactions = ''

# Considering interactions of degree 'd' = 2 to 21
for(d in 2:21)
{
  d_terms = 0

  # Initializing objects to store the interactions of degree 'd' that reduce the 10-fold cross validation RMSEs as compared to the previous model
  interactions_that_reduce_KfoldCV<-c()
  cv_degree <- c()

  # Selecting interaction terms of degree = 'd'
  degree_set = degree[degree$sum_degree==d,]

  # Continue adding interactions of degree 'd' in the model until no interactions reduce the 5-fold cross-validation RMSE
  while(TRUE)
  {

    #Iterating over all possible interactions of degree 'd'
    for(j in 1:nrow(degree_set))
    {

      # Creating the formula expression for the interaction term to be tested
      row = degree_set[j,1:7]
      for(m in 1:7)
      {
        if(row[m]>1)
        {
          interaction_being_tested <- paste0(interaction_being_tested, "I(", pred_names[m],"^",row[m],")*")
        }else if(row[m]==1)
        {
          interaction_being_tested <- paste0(interaction_being_tested, pred_names[m],"*")
        }
      }
      interaction_being_tested<-substr(interaction_being_tested,1,nchar(interaction_being_tested)-1)

      # Call the function 'KFoldCV' to find out the 10-fold cross validation error on adding the interaction term being tested to the model
      cv <- KFoldCV(selected_interactions, interaction_being_tested)

      # If the interaction term being tested reduces the 10-fold cross validation RMSE as compared to the previous model, then consider adding it to the model
      if(cv<cv_previous_model)
      {
        interactions_that_reduce_KfoldCV<-c(interactions_that_reduce_KfoldCV, interaction_being_tested)
        cv_degree<-c(cv_degree, cv)
      }
      interaction_being_tested = '+'
    }
    cv_data = data.frame(interaction = interactions_that_reduce_KfoldCV, cv = cv_degree)

    # Break the loop if no interaction of degree 'd' reduces the 5-fold cross validation RMSE as compared to the previous model
    if(nrow(cv_data)==0)
    {
      break
    }else{d_terms<-1}

    # Sort the interaction terms that reduce the 10-fold cross valdiation RMSE based on their respective 10-fold cross validation RMSE
    cv_data <- cv_data[order(cv_data$cv),]

    # Select the interaction that corresponds to the least 10-fold cross validation RMSE
    selected_interactions = paste0(selected_interactions, cv_data[1,1])
    cv_previous_model = cv_data[1,2]
    interactions_that_reduce_KfoldCV<-c()
    cv_degree <- c()


    # Print the progress after each model update, i.e., after an interaction term is selected
    print(paste0("Degree of interactions being considered:",d, ", 5-fold CV RMSE:", cv_previous_model))
  }
  if(d_terms==0){break}
}
```