---
title: 'GEOGM0053: Midterm'
author: 'LucWilson'
date: '2024-03-04'
output:
html_document:
code_folding: hide
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk loads required libraries:*</p>
```{r, warning = FALSE, message = FALSE}
# loading required packages
library(tidyverse)
library(gridExtra) # for plotting histograms in a grid
library(car) # for vif()
library(caret) # for model training
library(randomForest) # for random forest
library(rpart) # for decision tree
library(rpart.plot) # for plotting decision trees
library(kableExtra) # for editing kables
library(vip) # for kknn variable importance
```
<hr style='border: 1px solid #333;'>
# Regression: What are the most important factors governing the popularity of a dog breed?
<hr style='border: 1px solid #333;'>
### Q1.a
<p class='text-right' style='background-color: #f5fbff'>*Code chunk reads in data:*</p>
```{r, warning = FALSE, message = FALSE}
# reading in the data (using tidyverse read_csv over base read.csv)
dogs = read_csv('midterm-data.csv')
head(dogs)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes train and test data sets:*</p>
```{r, message=FALSE}
# splitting the data into training and testing sets
set.seed(1)
train = dogs %>%
sample_frac(0.6) # small data set so using 60-40 split so that test data isn't too small
test = anti_join(dogs, train)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes histograms to examine feature normality:*</p>
```{r, fig.align='center'}
# visually examining the normality of variables with histograms (assumption of linear regression)
# getting names of numeric columns
numeric_cols = c(dogs %>%
select_if(is.numeric) %>%
colnames())
# creating tibble with selected columns
numeric_dogs = dogs[numeric_cols]
# creating an empty list to store plots for grid.arrange()
histograms = list()
# setting plot size
plot_size = 3
# looping through columns to plot histograms in a grid
for (col in numeric_cols) {
p = ggplot(data = numeric_dogs, aes(x = .data[[col]])) +
geom_histogram(fill = 'lightblue',
color = 'black', # make bins clear
bins = 5 # lots of features are categorical 1-5
) +
ggtitle(col) + # making the titles the column names
theme_minimal() +
theme(plot.margin = ggplot2::margin(5, 5, 5, 5, 'pt'), # spacing the plots
plot.title = element_text(size = 9, # title sizes
hjust = 0.5 # centering titles
)
) +
xlab(NULL) # hiding x axis labels (same as titles)
# appending list with plots
histograms[[length(histograms) + 1]] = p
}
# Arrange plots in a grid and print
grid.arrange(grobs = histograms, # using list created above
ncol = 4, # so its 4x4
width = plot_size * 3, # setting plotting proportions
height = plot_size * 5
)
```
'editors_choice' is especially non-normal (binary 0 or 1), which violates the assumptions of linear regression, so it will be omitted.
Other factors that are especially non-normal: 'affection' (could be log-transformed) and 'good_with_young_children':
- since these features are categorical within the range 1-5, it is not appropriate to apply a transformation. They will be included in the initial model to evaluate their importance, but their non-normality means they significantly violate the assumptions of linear regression.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk constructs initial linear model:*</p>
```{r}
# making initial linear model
# using whole data set minus: 'editors_choice', 'coat_type', 'coat_length', 'breed_clean'
lm_1a1 = lm(popularity ~
affection +
good_with_young_children +
good_with_other_dogs +
shedding +
grooming_needs +
drool +
openness_to_strangers +
playfulness +
protectiveness +
adaptability +
trainability_level +
energy +
barking +
mental_stimulation_needs,
data = dogs)
# printing model summary
summary(lm_1a1)
```
<hr style='border: 1px solid #333;'>
### 1.b.
Looking at the model (lm_1a1) summary:
- The only variables significant to 95%: 'grooming_needs', 'playfulness', 'protectiveness', 'barking'
- The variables significant to 90% (but not 95%): 'drool', 'openness_to_strangers'
Using the whole data set for identifying useful features:
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes next linear model:*</p>
```{r}
# further investigating linear model with only variables significant to +90%
lm_1a2 = lm(popularity ~
grooming_needs +
openness_to_strangers +
playfulness +
protectiveness +
barking +
drool,
data = dogs)
# printing model summary
summary(lm_1a2)
# printing multicollinearity
vif(lm_1a2)
```
- 'openness_to_strangers' no longer significant so it is being removed.
- 'drool' is only significant to 90%, so it is being removed too.
The next model will use the remaining features.
Now there are fewer terms, looking to see if there are any useful interaction terms:
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes next linear model with interaction terms:*</p>
```{r}
# examining interactions now
lm_1a3 = lm(popularity ~
grooming_needs *
playfulness *
protectiveness *
barking,
data = dogs)
# printing model summary
summary(lm_1a3)
```
'grooming_needs:protectiveness:barking' and 'grooming_needs:barking' are significant to 95% so investigating this further.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes next linear model, testing the significant interaction terms:*</p>
```{r, warning=FALSE, message=FALSE}
# creating new linear model with the additional interaction terms
lm_1a4 = lm(popularity ~
grooming_needs +
playfulness +
protectiveness +
barking +
grooming_needs:protectiveness:barking +
grooming_needs:barking,
data = dogs)
# printing model summary
summary(lm_1a4)
# checking multicollinearity
vif(lm_1a4)
```
- Adjusted R squared has increased (from 0.1844 in lm_1a2 to 0.2425 in lm_1a4).
- The interaction terms are highly significant.
As expected, collinearity has increased. A common rule of thumb treats VIF values above 5-10 as unacceptable.
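For reference, the VIF of predictor $j$ is derived from the $R_j^2$ obtained by regressing that predictor on all the other predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

so a high VIF means a predictor is largely explained by the others.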
The only way to lower the VIF of the interaction terms is to have them without the original terms. Investigating to see how this model would perform:
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes next linear model:*</p>
```{r, message=FALSE}
# creating new linear model without base terms for interaction terms to test if this is better
lm_1a5 = lm(popularity ~
playfulness +
grooming_needs:barking +
grooming_needs:protectiveness:barking,
data = dogs)
# printing model summary
summary(lm_1a5)
# checking multicollinearity
vif(lm_1a5)
```
- The adjusted R squared drops significantly.
- Keeping the multicollinearity (VIF scores) acceptable makes the model perform worse.
- Therefore removing the interaction terms.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes final linear model:*</p>
```{r}
# regenerating lm_1a2 without 'drool' and 'openness_to_strangers'
lm_1a6 = lm(popularity ~
grooming_needs +
playfulness +
protectiveness +
barking,
data = dogs)
# printing model summary
summary(lm_1a6)
# checking multicollinearity
vif(lm_1a6)
```
##### All features are now significant (95%), and non-collinear (VIF<1.1).
##### Therefore the most useful features to predict the popularity of a dog breed are:
- grooming_needs
- playfulness
- protectiveness
- barking
<hr style='border: 1px solid #333;'>
### 1.c.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk fits linear model to training data:*</p>
```{r}
# fitting to the training data set
lm_1c = lm(popularity ~ grooming_needs +
playfulness +
protectiveness +
barking,
data = train)
# printing model summary
summary(lm_1c)
# checking multicollinearity
vif(lm_1c)
```
- Significance levels have changed noticeably: a symptom of a small, variable data set.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk gets out-of-sample predictions:*</p>
```{r}
# applying lm_1c to test data for cross validation:
test_predictions = predict(lm_1c, newdata = test)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk makes function to calculate mean squared error (MSE):*</p>
```{r}
# defining MSE function
MSE = function(actual, predicted) {
mse = mean((actual - predicted)^2)
return(mse)
}
```
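This implements the standard definition, for $n$ observations with actual values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$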
<p class='text-right' style='background-color: #f5fbff'>*Code chunk calculates, and prints, the out-of-sample MSE, R squared, and adjusted R squared for the linear model:*</p>
```{r}
# using MSE function to calculate MSE of lm_1c
mse_lm_1c = MSE(actual = test$popularity,
predicted = test_predictions)
cat('The out-of-sample mean squared error (MSE) is: ', mse_lm_1c, '\n')
# getting out-of-sample R squared for lm_1c
out_of_samp_Rsq = 1 - sum((test$popularity -
test_predictions)^2) /
sum((test$popularity -
mean(test$popularity))^2)
# adjusting R squared for data size:
n = nrow(test) # number of rows
k = 4 # number of features
# getting adjusted R squared (more useful for comparing models with different number of features)
out_of_samp_adj_Rsq = 1 - ((1 - out_of_samp_Rsq) * (n - 1) / (n - k - 1))
# printing answers
# (question specifies R squared but I will use adjusted R squared for model comparisons so printing both)
cat('The out-of-sample R squared is:', out_of_samp_Rsq, 'and the adjusted out-of-sample R squared is: ',out_of_samp_adj_Rsq, '\n')
```
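The adjustment used above is the standard one, with $n$ test observations and $k$ predictors:

$$\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}$$

which penalises $R^2$ for each extra feature, making models with different feature counts comparable.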
<hr style='border: 1px solid #333;'>
### 1.d.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk adds 'is_terrier' to the data:*</p>
```{r}
# creating 'is_terrier' variable (one-hot encoded)
dogs$is_terrier = as.integer(grepl('Terrier', dogs$breed_clean))
# doing for test and train too
train$is_terrier = as.integer(grepl('Terrier', train$breed_clean))
test$is_terrier = as.integer(grepl('Terrier', test$breed_clean))
```
The initial check for bias is a visual examination of the residuals, looking to see whether the residuals of 'is_terrier' points behave differently:
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates colour-coded plot of residuals, by 'is_terrier' feature:*</p>
```{r, fig.align='center'}
# getting residuals for lm_1c
resid_lm_1c = residuals(lm_1c)
# making data frame for plotting
resid_lm_1c_1d = data.frame(residuals = resid_lm_1c,
is_terrier = factor(train$is_terrier), # categorical
popularity = train$popularity)
# Creating a plot of residuals colour-coded by is_terrier
ggplot(resid_lm_1c_1d, aes(x = popularity, y = residuals, color = is_terrier)) +
geom_point() +
geom_hline(yintercept = 0, # adding line where residuals should centre around
linetype = 'dashed',
color = 'black') +
labs(x = 'Popularity',
y = 'Residuals',
title = 'Plot of Residuals (colour-coded by is_terrier)') +
theme_minimal() +
scale_x_continuous(breaks = seq(min(as.numeric(resid_lm_1c_1d$popularity)),
max(as.numeric(resid_lm_1c_1d$popularity)),
length.out = 8),
labels = scales::number_format(accuracy = 0.01)) + # rounding x-axis labels to 2dp
scale_color_manual(values = c("orange", "skyblue")) +
theme(plot.title = element_text(hjust = 0.5)) # centering title
```
- There is obvious bias in the residuals (over predicting high popularities and under predicting low popularities)
- BUT this bias does not appear to be connected to 'is_terrier'.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk regenerates the linear model with 'is_terrier' added:*</p>
```{r}
# no obvious bias with is_terrier in the model.
# adding is_terrier to the model to see any changes
lm_1d = lm(popularity ~ grooming_needs +
playfulness +
protectiveness +
barking +
is_terrier,
data = train)
summary(lm_1d)
vif(lm_1d)
# r2 has dropped slightly
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates out-of-sample MSE and R squared of the new linear model:*</p>
```{r}
# checking MSE
test_predictions_1d = predict(lm_1d, newdata = test)
mse_lm_1d = MSE(actual = test$popularity,
predicted = test_predictions_1d)
cat('The out-of-sample mean squared error (MSE) is:', mse_lm_1d, '\n')
# getting out of sample R squared
out_of_samp_Rsq_1d = 1 - sum((test$popularity -
test_predictions_1d)^2) /
sum((test$popularity -
mean(test$popularity))^2)
# adjusting R squared for data size:
n = nrow(test) # number of rows
k = 5 # number of features
out_of_samp_adj_Rsq_1d = 1 - ((1 - out_of_samp_Rsq_1d) * (n - 1) / (n - k - 1))
cat('The out-of-sample adjusted R squared is:', out_of_samp_adj_Rsq_1d, '\n')
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates a results table to compare model performances with and without 'is_terrier' feature:*</p>
```{r}
# making kable to compare with and without is_terrier models MSE and R squared
# making dataframe to store model results
regression_results = data.frame(
Model = c('lm_1c', 'lm_1d'),
OutofSample_MSE = c(mse_lm_1c, mse_lm_1d),
OutofSample_Adj_R_squared = c(out_of_samp_adj_Rsq, out_of_samp_adj_Rsq_1d)
)
# rounding values to 4dp
regression_results$OutofSample_MSE = sprintf('%.4f', regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = sprintf('%.4f', regression_results$OutofSample_Adj_R_squared)
# making kable
kable(regression_results,
caption = NULL,
col.names = c('Model', 'Out-of-Sample MSE', 'Out-of-Sample Adjusted R squared'),
align = 'c',
format = 'html',
table.attr = 'style="width:60%;margin:auto;"' # centering kable
)%>%
kable_styling(bootstrap_options = c('bordered'))
```
There were no visual differences between the predictions for 'is_terrier' dogs and the other data points, AND there was no significant change in model MSE. There was a slight change in out-of-sample adjusted R squared; however, the values are close to 0 (<0.11), so the change is not regarded as significant.
##### The conclusion is therefore that there is no model bias against terriers.
- Conceptually, there could have been a difference if the popularity of terrier breeds were determined differently from that of other breeds.
- For example: if terriers had a reputation for being dangerous, the playfulness variable might act differently for them.
<hr style='border: 1px solid #333;'>
### 2.a.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk subsets data to only the identified 'useful' features:*</p>
```{r}
# sub-setting specific columns
cols_2a = c('popularity',
'grooming_needs',
'playfulness',
'protectiveness',
'barking')
dogs_2a = dogs[, cols_2a]
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk regenerates test and train data with the subsetted data set:*</p>
```{r, message=FALSE}
# making test and training data of dogs_2a df
set.seed(1)
train_2a = dogs_2a %>%
sample_frac(0.6)
test_2a = anti_join(dogs_2a, train_2a)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk sets a control parameter for model cross validation:*</p>
```{r}
# setting control parameters for cross validation
ctrl = trainControl(method = 'cv', number = 5, verboseIter = FALSE)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk fits random forest model:*</p>
```{r}
# fitting random forest (training data)
rf_2a = train(popularity ~ grooming_needs+
playfulness+
protectiveness+
barking,
data = train_2a,
method = 'rf',
trControl = ctrl,
ntree = 1000)
rf_2a
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates out-of-sample MSE for random forest model:*</p>
```{r}
# getting predictions from rf_2a
predictions_2a = predict(rf_2a, newdata = test_2a)
# getting the MSE of rf_2a
rf_2a_MSE = MSE(actual = test_2a$popularity,
predicted = predictions_2a)
# getting R squared
rsquared_2a <- 1 - sum((test_2a$popularity - predictions_2a)^2) / sum((test_2a$popularity - mean(test_2a$popularity))^2)
cat('The out-of-sample R squared is:', rsquared_2a, '\n')
# adjusting R squared for data size:
n = nrow(test_2a) # number of rows
k = 4 # number of features
out_of_samp_adj_Rsq_2a = 1 - ((1 - rsquared_2a) * (n - 1) / (n - k - 1))
cat('The out-of-sample adjusted R squared is:', out_of_samp_adj_Rsq_2a, '\n')
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk appends result table to include the random forest model:*</p>
```{r}
# appending results data frame
regression_results = rbind(regression_results, c('rf_2a', rf_2a_MSE, out_of_samp_adj_Rsq_2a))
# resetting as numeric
regression_results$OutofSample_MSE = as.numeric(regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = as.numeric(regression_results$OutofSample_Adj_R_squared)
# setting to 4dp
regression_results$OutofSample_MSE = sprintf('%.4f', regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = sprintf('%.4f', regression_results$OutofSample_Adj_R_squared)
# making updated kable
kable(regression_results,
caption = NULL,
col.names = c('Model', 'Out-of-Sample MSE', 'Out-of-Sample Adjusted R squared'),
align = 'c',
format = 'html',
table.attr = 'style="width:60%;margin:auto;"' ) %>% # centring kable
kable_styling(bootstrap_options = c('bordered'))
```
<hr style='border: 1px solid #333;'>
### 2.b.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk prints the importance of the variables in the random forest model:*</p>
```{r, message=FALSE}
# checking variable importance
var_importance = varImp(rf_2a)
# Print variable importance
print(var_importance)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk prints the importance of the variables in the linear model:*</p>
```{r}
# printing for linear model too, for comparison
var_importance_lm_1c = varImp(lm_1c)
print(var_importance_lm_1c)
```
The variable importance is not the same between the linear model (lm_1c) and the random forest model (rf_2a). The 'best' predictor for the random forest model was 'barking', whereas the 'best' predictor for the linear model was 'playfulness'.
The two models did agree that 'grooming_needs' was the 'worst' predictor of those used.
<hr style='border: 1px solid #333;'>
### 2.c.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk regenerates training and test data with 'is_terrier' added:*</p>
```{r, message=FALSE}
# before recreating rf_2a with is_terrier to compare out of sample MSEs:
# first adding is_terrier to training data
cols_2c = c('popularity',
'grooming_needs',
'playfulness',
'protectiveness',
'barking',
'is_terrier')
dogs_2c = dogs[, cols_2c]
# using the same seed
set.seed(1)
train_2c = dogs_2c %>%
sample_frac(0.6) # small data set so using 60-40 split so that test data isn't too small
test_2c = anti_join(dogs_2c, train_2c)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk regenerates random forest model with additional 'is_terrier' term:*</p>
```{r}
# recreating rf_2a with is_terrier to compare out of sample MSEs
rf_2c = train(popularity ~ grooming_needs+
playfulness+
protectiveness+
barking +
is_terrier,
data = train_2c,
method = 'rf',
trControl = ctrl,
ntree = 1000)
print(rf_2c)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates out-of-sample MSE and R squared for new random forest model:*</p>
```{r}
# getting out-of-sample MSE for rf_2c
predictions_2c = predict(rf_2c, newdata = test_2c)
mse_rf_2c = MSE(actual = test_2c$popularity,
predicted = predictions_2c)
cat('The out-of-sample mean squared error (MSE) is:', mse_rf_2c, '\n')
# getting R squared
rsquared_2c <- 1 - sum((test_2c$popularity - predictions_2c)^2) / sum((test_2c$popularity - mean(test_2c$popularity))^2)
cat('The out-of-sample R squared is:', rsquared_2c, '\n')
# adjusting R squared for data size:
n = nrow(test_2c) # number of rows
k = 5 # number of features
out_of_samp_adj_Rsq_2c = 1 - ((1 - rsquared_2c) * (n - 1) / (n - k - 1))
cat('The out-of-sample adjusted R squared is:', out_of_samp_adj_Rsq_2c, '\n')
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk appends results table to include the new random forest model:*</p>
```{r, warning=FALSE}
# appending results data frame
regression_results = rbind(regression_results, c('rf_2c', mse_rf_2c, out_of_samp_adj_Rsq_2c))
# resetting as numeric
regression_results$OutofSample_MSE = as.numeric(regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = as.numeric(regression_results$OutofSample_Adj_R_squared)
# Format the numeric columns
regression_results$OutofSample_MSE = sprintf('%.4f', regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = sprintf('%.4f', regression_results$OutofSample_Adj_R_squared)
# making updated kable
kable(regression_results,
caption = NULL,
col.names = c('Model', 'Out-of-Sample MSE', 'Out-of-Sample Adjusted R squared'),
align = 'c',
format = 'html',
table.attr = 'style="width:60%;margin:auto;"' # centring kable
)%>%
kable_styling(bootstrap_options = c('bordered'))
```
The changes in MSE between rf_2a and rf_2c are insignificant. The addition of 'is_terrier' did not improve the model performance.
There is no indication that the random forest model was biased against terriers. To further validate this, I am bootstrapping both models to compare their performance distributions and determine whether one model tends to perform better than the other.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk regenerates random forest models using bootstrapping (instead of cross validation):*</p>
```{r}
# using bootstrapping to find probability that the change in model performance was random
# setting bootstrapping control
ctrl_boot = trainControl(method = 'boot', number = 500)
rf_2a_boot = train(popularity ~
grooming_needs+
playfulness+
protectiveness+
barking,
data = train_2a,
method = 'rf',
trControl = ctrl_boot,
ntree = 1000)
rf_2c_boot = train(popularity ~
grooming_needs+
playfulness+
protectiveness+
barking +
is_terrier,
data = train_2c,
method = 'rf',
trControl = ctrl_boot,
ntree = 1000)
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk stores the resampled MSEs for both models:*</p>
```{r}
# getting MSEs of models
boot_MSE_rf_2a = (rf_2a_boot$resample$RMSE)^2
boot_MSE_rf_2c = (rf_2c_boot$resample$RMSE)^2
# making data frame of accuracies from both models
boot_accuracies_df = data.frame(Model = rep(c('rf_2a', 'rf_2c'), each = length(boot_MSE_rf_2a)),
MSE = c(boot_MSE_rf_2a, boot_MSE_rf_2c))
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk creates histogram to compare the distribution of the resampled MSEs:*</p>
```{r, fig.align='center'}
# plotting both histograms together
ggplot(boot_accuracies_df, aes(x = MSE, fill = Model)) +
geom_histogram(binwidth = 0.005,
position = 'identity',
alpha = 0.6,
color = 'black') +
  labs(title = 'Comparison of Model MSEs',
x = 'MSE',
y = 'Frequency') +
scale_fill_manual(values = c('rf_2a' = 'orange',
'rf_2c' = 'skyblue')) + # Adjust colors as needed
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
```
The figure shows no difference in MSEs between the random forest model trained with 'is_terrier' (rf_2c) and the model trained without (rf_2a). So I am confident the change in model performance between rf_2a (without 'is_terrier') and rf_2c (with 'is_terrier') occurred by random chance.
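As an additional check, a two-sample test on the bootstrap MSE distributions could formalise the visual comparison. A minimal sketch, assuming the `boot_MSE_rf_2a` and `boot_MSE_rf_2c` vectors created above:

```r
# sketch: Welch two-sample t-test on the bootstrap MSE distributions
# (a significant p-value would indicate a systematic performance difference)
t.test(boot_MSE_rf_2a, boot_MSE_rf_2c)
```

Note that bootstrap resamples are not fully independent of one another, so the resulting p-value should be read as indicative only.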
##### Therefore the original random forest model was not biased against terriers.
<hr style='border: 1px solid #333;'>
### 3.a.
Conceptually, a K-Nearest Neighbours (KNN) model might be inappropriate for the research question:
- 'What are the most important factors governing the popularity of a dog breed?'
because we are interested in learning about the features. KNN models only use a simplistic 'distance' measure to make predictions. This is significantly less interpretable than a random forest, where variables can be ranked by importance, or even a decision tree, where variable importance can be inferred from the depth at which a variable appears.
Furthermore, the distances here would be calculated on variables that take only the categorical values 1-5. Practically, this means the distances calculated by the KNN model would frequently be identical (for a large data set) and change in discrete steps.
Additionally, KNN models can perform poorly with high-dimensional data. In this instance there were 13 numeric features initially (12 without 'editors_choice'), which would degrade a KNN model's performance; however, only 4 features were selected in 1.b, which mitigates this concern.
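To illustrate the discreteness: with the standard Euclidean distance

$$d(x, x') = \sqrt{\sum_{j=1}^{p}\left(x_j - x'_j\right)^2}$$

each per-feature squared difference on a 1-5 scale can only be 0, 1, 4, 9, or 16, so many pairs of breeds sit at exactly the same distance and nearest-neighbour ties are common.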
<hr style='border: 1px solid #333;'>
### 3.b.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk constructs a KNN model (K=10):*</p>
```{r}
# constructing KNN model (k=10) on training data
knn_3b = train(popularity ~ grooming_needs+
playfulness+
protectiveness+
barking,
data = train,
method = 'knn',
trControl = ctrl,
preProcess = c('center', # centering (subtracting the mean)
'scale'), # scaling (dividing every point by the standard deviation)
tuneGrid = expand.grid(k = 10)) # setting k = 10
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates out-of-sample MSE for KNN model:*</p>
```{r}
# getting out of sample predictions
predictions_3b = predict(knn_3b, newdata = test)
# getting out of sample MSE
mse_knn_3b = MSE(actual = test$popularity,
predicted = predictions_3b)
# getting R squared
rsquared_3b <- 1 - sum((test$popularity - predictions_3b)^2) / sum((test$popularity - mean(test$popularity))^2)
cat('The out-of-sample R squared is:', rsquared_3b, '\n')
# adjusting R squared for data size:
n = nrow(test) # number of rows
k = 4 # number of features
out_of_samp_adj_Rsq_3b = 1 - ((1 - rsquared_3b) * (n - 1) / (n - k - 1))
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk appends results table with KNN model out-of-sample MSE:*</p>
```{r}
# appending results data frame
regression_results = rbind(regression_results, c('knn_3b', mse_knn_3b, out_of_samp_adj_Rsq_3b))
# resetting as numeric
regression_results$OutofSample_MSE = as.numeric(regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = as.numeric(regression_results$OutofSample_Adj_R_squared)
# Format the numeric columns
regression_results$OutofSample_MSE = sprintf('%.4f', regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = sprintf('%.4f', regression_results$OutofSample_Adj_R_squared)
# making updated kable
kable(regression_results,
caption = NULL,
col.names = c('Model', 'Out-of-Sample MSE', 'Out-of-Sample Adjusted R squared'),
align = 'c',
format = 'html',
table.attr = 'style="width:60%;margin:auto;"' # centring kable
)%>%
kable_styling(bootstrap_options = c('bordered'))
```
<hr style='border: 1px solid #333;'>
### 3.c.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk tests different k values to find improved range:*</p>
```{r}
# visually examining MSEs with elbow curve to find better k value
# setting k values for testing
k_values = c(2, 3, 4, 5, 7, 8, 9, 10, 15, 20, 25, 30, 50)
# making dataframe to store results
knn_3c_results = data.frame(k = numeric(length(k_values)), MSE = numeric(length(k_values)))
# creating loop to store MSE values for each k
for (i in seq_along(k_values)) {
k = k_values[i]
# Train the KNN model
model_3c = train(popularity ~ grooming_needs +
playfulness +
protectiveness +
barking,
data = train,
method = 'knn',
trControl = ctrl,
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(k = k))
# Make predictions on the test set
predictions_3c = predict(model_3c, newdata = test)
# Calculate Mean Squared Error (MSE)
mse = MSE(actual = test$popularity,
predicted = predictions_3c)
# Store results in the data frame
knn_3c_results[i, ] = c(k, mse)
}
```
<p class='text-right' style='background-color: #f5fbff'>*Code chunk plots 'elbow-curve' for KNN model:*</p>
```{r, fig.align='center'}
# plotting elbow curve of MSE vs K
ggplot(knn_3c_results, aes(x = k, y = MSE)) +
geom_point() + # Add points for the scatter plot
geom_smooth(method = 'loess', se = FALSE, span = 0.8) +
theme_minimal() +
scale_y_reverse() # reversing y scale for aesthetics
```
Visually, k = 20 appears the best, so I will use a range around this value.
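The visual pick can also be cross-checked against the stored results (a quick sketch using the `knn_3c_results` data frame built above):

```{r}
# k with the lowest out-of-sample MSE among the values tested
knn_3c_results[which.min(knn_3c_results$MSE), ]
```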
<p class='text-right' style='background-color: #f5fbff'>*Code chunk finds improved KNN model parameters by testing different weight functions and distance values within the improved k value range:*</p>
```{r}
# Training KNN model on different weight functions around the optimal k value
# using whole data because of caret's 5 fold cross validation
weights_model_3c = train(
popularity ~ grooming_needs +
playfulness +
protectiveness +
barking,
data = dogs, # using whole data set because caret cross validates
method = 'kknn',
trControl = ctrl,
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(kmax = 15:25, # testing around k = 20
distance = 15:30,
kernel = c('gaussian', # different weighting types in kknn
'triangular',
'rectangular',
'epanechnikov',
'cos',
'optimal'))
)
# optimal:
# Fitting kmax = 20, distance = 17, kernel = gaussian on full training set
```
Optimal: Fitting kmax = 20, distance = 17, kernel = gaussian on full training set
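Rather than copying the console message, the winning combination can be read from the fitted object directly (a sketch: `caret` stores the chosen parameters in `bestTune` and the per-combination metrics in `results`):

```{r}
# best tuning parameters chosen by cross validation
weights_model_3c$bestTune
# cross-validated metrics for that winning combination
merge(weights_model_3c$bestTune, weights_model_3c$results)
```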
<p class='text-right' style='background-color: #f5fbff'>*Code chunk generates KNN model using improved K, distance, and weight function, and cross validation control:*</p>
```{r}
# constructing final KNN model using optimised parameters
final_knn_3c = train(
popularity ~ grooming_needs +
playfulness +
protectiveness +
barking,
data = dogs,
method = 'kknn',
trControl = ctrl,
preProcess = c('center', 'scale'),
tuneGrid = expand.grid(kmax = 20, # optimal parameters from tuning above
distance = 17,
kernel = 'gaussian')
)
# printing MSE and R squared
cat('final_knn_3c MSE:', final_knn_3c$results$RMSE^2, '\n')
cat('final_knn_3c R squared:', final_knn_3c$results$Rsquared, '\n')
# adjusting R squared for data size:
n = nrow(dogs) # number of rows
k = 4 # number of features
adj_Rsq_final_knn_3c = 1 - ((1 - final_knn_3c$results$Rsquared) * (n - 1) / (n - k - 1))
cat('final_knn_3c adjusted R squared:', adj_Rsq_final_knn_3c, '\n')
```
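The adjusted R squared correction applied above generalises; a small helper (an illustrative sketch reusing the same formula) avoids repeating the arithmetic for each model:

```{r}
# adjusted R squared: penalises R squared for the number of predictors
# r_squared: unadjusted R squared; n: observations; k: predictors
adj_r_squared = function(r_squared, n, k) {
  1 - ((1 - r_squared) * (n - 1) / (n - k - 1))
}
# reproduces the value computed manually above
adj_r_squared(final_knn_3c$results$Rsquared, n = nrow(dogs), k = 4)
```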
<p class='text-right' style='background-color: #f5fbff'>*Code chunk appends results table with improved KNN model out-of-sample MSE:*</p>
```{r, warning=FALSE}
# appending results data frame
regression_results = rbind(regression_results, c('final_knn_3c', final_knn_3c$results$RMSE^2, adj_Rsq_final_knn_3c))
# resetting as numeric
regression_results$OutofSample_MSE = as.numeric(regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = as.numeric(regression_results$OutofSample_Adj_R_squared)
# Format the numeric columns
regression_results$OutofSample_MSE = sprintf('%.4f', regression_results$OutofSample_MSE)
regression_results$OutofSample_Adj_R_squared = sprintf('%.4f', regression_results$OutofSample_Adj_R_squared)
# making updated kable
kable(regression_results,
caption = NULL,
col.names = c('Model', 'Out-of-Sample MSE', 'Out-of-Sample Adjusted R squared'),
align = 'c',
format = 'html',
table.attr = 'style="width:60%;margin:auto;"' # centring kable
)%>%
kable_styling(bootstrap_options = c('bordered'))
```
<hr style='border: 1px solid #333;'>
### 4.
final_knn_3c performed the best: it has the highest adjusted R squared and the lowest out-of-sample MSE. So I will examine its variable importances:
```{r}
# retrieving the variable importances from final_knn_3c with permutations
fitted_final_knn_3c <- final_knn_3c$finalModel
# making prediction function for pred_wrapper argument
predict_function <- function(model, newdata) {
predict(model, newdata = newdata)
}
# Compute variable importances using the vip package
importance <- vip::vi(fitted_final_knn_3c,
method = "permute",
train = dogs,
target = dogs$popularity,
metric = "rmse", # using rmse for metric
smaller_is_better = TRUE, # lower rmse is better
pred_wrapper = predict_function)
# Print the variable importances
print(importance)
```
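A quick bar chart of the permutation importances makes the ranking easier to see (a sketch assuming `importance` holds the `Variable`/`Importance` tibble returned by `vip::vi()` above):

```{r, fig.align='center'}
# bar chart of permutation importances, largest at the top
ggplot(importance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = 'Permutation importance (increase in RMSE)') +
  theme_minimal()
```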
'playfulness' is the most important feature here, with 'grooming_needs' second, 'barking' third, and 'protectiveness' fourth.
Given the model scores, no, I do not believe the model is good at predicting the popularity of dogs. The adjusted R squared values for all the models were low (< 0.2); final_knn_3c was simply the best of the options tested.
I think this is due to the high variability of the data, its non-continuous nature, and the small sample size.
<hr style='border: 1px solid #333;'>
# Classification: 'How can you tell if a dog is an 'Editor's choice?'
<hr style='border: 1px solid #333;'>
### 1.a.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk fits decision tree model:*</p>
```{r, warning=FALSE}
# making decision tree with minimum leaf size of 10
# making new ctrl object with arguments for decision trees
ctrl_1a = trainControl(method = 'cv',
number = 5,
classProbs = TRUE, # to calculate class probabilities
summaryFunction = twoClassSummary,
# should weight true positive and negatives equally
verboseIter = FALSE,
savePredictions = 'final' # storing performance for each fold for examining
)
# setting editors_choice as a factor for classification
dogs$factor_editors_choice = as.factor(dogs$editors_choice)
dogs$factor_editors_choice = factor(make.names(as.character(dogs$factor_editors_choice)))
# using the whole data because of cross validation in control method
tree_1a_c = caret::train(
x = dogs[, c('grooming_needs',
'playfulness',
'protectiveness',
'barking',
'popularity')],
y = dogs$factor_editors_choice,
method = 'rpart',
trControl = ctrl_1a,
tuneLength = 10, # sets how many different values for the main parameter
control = rpart::rpart.control(minbucket = 10) # minimum leaf size
)
# rpart uses ROC instead of accuracy to optimise
# setting no other constraints (eg. tuneGrid, cp range etc) to let caret train find optimal
tree_1a_c
tree_1a_c$finalModel
```
<hr style='border: 1px solid #333;'>
### 1.b.
<p class='text-right' style='background-color: #f5fbff'>*Code chunk prints decision tree:*</p>
```{r, fig.align='center'}
rpart.plot(tree_1a_c$finalModel)
```
In plain language, here is the decision tree route to the terminal node which predicts a dog is an editor's choice:
The data is first split on whether the popularity score is less than 0.87. If the popularity score is greater than or equal to 0.87, the data is split again on whether the grooming score is greater than or equal to 3. If that condition is not met, the data is split once more on whether the popularity score is greater than or equal to 0.92. If that condition is also not met, the dogs are predicted to be editor's choice, with 73% of the dogs in that node actually being editor's choice.
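The same route can be printed programmatically: the `rpart.plot` package provides `rpart.rules()`, which lists every path to a terminal node as a readable rule (a sketch against the fitted model above):

```{r}
# printing each root-to-leaf path of the tree as a plain-language rule
rpart.plot::rpart.rules(tree_1a_c$finalModel, roundint = FALSE)
```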