Projects/faces.Rmd at master · nguyenmmegan/Projects · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "faces"
author: "Megan Nguyen"
date: "8/30/2018"
output: html_document
---

```{r}
library(tidyverse)
library(tree)
library(randomForest)
library(gbm)
library(ROCR)
library(e1071)
library(imager)
```

Fundamentals of the bootstrap

In the first part of this problem we will explore the fact that approximately 1/3 of the observations in a bootstrap
sample are out-of-bag.

a) Given a sample of size n, what is the probability that any observation j is not in in a bootstrap sample? Express
your answer as a function of n.
$$(1 - \frac{1}{n})^n$$

b) Compute the above probability for n = 1000
$$(1 - \frac{1}{1000})^1000$$
```{r}
#Probability for n=1000
(1-(1/1000))^1000
```

c) We verify that your calculation is reasonable by resampling the numbers 1 to 1000 with replace and printing the
number of missing observations.

```{r}
#Probabilities for n = 1:1000
prob <- c()
for(i in 1:1000){
  p = (1-(1/i))^i
  prob <- c(prob,p)
}
length(prob)
unique <- unique(prob)
length(unique)
unique
```
Each value is unique.  Almost all of the values are around 0.367 which is equal to the same probability calculated for n = 1000.

Here we’ll use the bootstrap to compute uncertainty about a parameter of interest.

d) By November 18, 2017, an NBA basketball player, Robert Covington, had made 50 out of 101 three point shot
attempts. The regular season began on October 17 and ended on April 11. At that time, his three point field
goal percentage, 0.495, was the best in the league and would have ranked in the 10 ten all time for single season
three point shooting if he were to keep it up. Use bootstrap resampling on a sequence of 50 1’s (makes) and 51
0’s (misses). For each bootstrap sample compute and save the sample mean (e.g. bootstrap FG% for the player).
Use 1000 bootstrap samples to plot a histogram of those values. Compute the 95% bootstrap confidence interval
for Robert Covington’s “true” FG% using the quantile function in R. Print the endpoints of this interval.
Why would you expect that his end-of-season field goal percentage was lower than his percentage on 11/18?
Your answer should reference a well-known statistical phenomenon.

```{r}
#Sequence
fg <- c(rep.int(1, 50), rep.int(0, 51))
sum(fg == 1)/length(fg)
#Mean
x <- c()
for(i in c(1:1000))
{ boot.sample <- sample(fg, size=1000, replace=TRUE)
 x <- c(x,mean(boot.sample))
}
#Histogram
hist(x, breaks = 30, col = 'seagreen1', main = "Histogram of Bootstrap Field Goal %", xlab = "Field Goal %")
#Plot endpoints of the 95% CI in the histogram.
abline(v=quantile(x, c(0.025,0.975))[1], col = 'red', lwd = 2.5)
abline(v = quantile(x, c(0.025,0.975))[2], col = 'red', lwd = 2.5)
abline(v = median(x), col = 'blue', lwd = 2.5)
quantile(x, c(0.025,0.975))
```
We would expect the average to go down due to the law of large numbers.

Eigenfaces

We will use PCA to explore variation in images of faces. Load the data saved in faces_array.RData. This will load a 100 x 100 x 1000 array of data. An array is a generalization of a matrix to
more than 2 dimensions. In this example, the first two dimensions index the pixels in a 100 x 100 black and white
image of a face. The last dimension is the index for one of 1000 face images. The faces used in this example are from
1000 images scraped from the internet. See https://cyberextruder.com/face-matching-data-set-download/ for more
info.
```{r}
load("/Users/megannguyen/Downloads/faces_array (1).RData")
```
Although it is natural to think about a stack of 1000 matrices representing each of the face images, to run PCA we
need to input a single matrix. To do this, we’ll convert each 100 x 100 matrix to a single vector of length 100*100 =
10000. When you call as.numeric on a matrix, it stacks each of the columns in the matrix into one large vector.
Thus, we can think of our data as 1000 observations of a 10000 variables (one variable per pixel). Run the following
code to get a matrix of face observations.
```{r}
face_mat <- sapply(1:1000, function(i) as.numeric(faces_array[, , i])) %>% t
```
When we want to visualization an image, we need to take the 10000 dimensional vector and reconstruct it as a matrix.
The code plot_face takes a single 10000 dimensional vector (e.g. a column of face_mat), converts it back to a
matrix, and plots the resulting image. You can test this functionality by printing a random face from the dataset:
plot_face(face_mat[sample(1000, 1), ]).
```{r}
plot_face <- function(image_vector) {
plot(as.cimg(t(matrix(image_vector, ncol=100))), axes=FALSE, asp=1)
}
```
a) Find the “average” face in this dataset by averaging all of the columns in face_mat. Plot the average face by
calling plot_face on the average.
```{r}
#Average eigenface
avg <- colMeans(face_mat)
plot_face(avg)
```
b) Run PCA on face_mat setting center=TRUE and scale=FALSE. In class we mentioned that in general it is best
if scale=TRUE because it puts all variables on the same scale and we don’t have to worry about the units of the
variables (remember, the scale of the variables affects our results). In general, this is good practice, especially
when the predictor variables are of mixed types. Here, each variable represents a single pixel intensity (in black
& white) and so all variables already have the same units and same scale (minimum of 0 and maximum of 255).
In this case, setting scale=FALSE actually seems to give slightly better results. Plot the PVE and cumulative
PVE from the PCA. How many PCs do you need to explain at least 50% of the total variation in the face
images?
```{r}
#PCA
pr <- prcomp(face_mat, center = TRUE, scale = FALSE)
pr
```

```{r}
#PCA Variance
pr$sdev
pr.var <- pr$sdev ^2
pr.var
```

```{r}
#PVE
pve <- pr.var/sum(pr.var)
pve
#Cumulative PVE
cum_pve <- cumsum(pve)
cum_pve
```

```{r}
#PVE Plot
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained", ylim = c(0,1), type = "b")
```

```{r}
#Cumulative PVE Plot
plot(cum_pve, xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained", ylim = c(0,1), type = "b")
abline(a = 0.5, b = 0)

```
You need atleast 4 PC's to explain atleast 50% of the variation

c) Plot the first 16 principle component directions as faces using the plot_face function (these are the columns
of the rotation matrix). Early researchers termed these “eigenfaces” since they are eigenvectors of the
matrix of faces. The code below will adjust the margins of you plot and specifies a layout for the 16 images.
par(mfrow=c(4,4)) specifies a grid of 4 x 4 images. Each time you call plot_face it will plot the next face in
one of the new grid cells. All you need to do is call plot_face 16 times (please use a for loop). Note that these
images describe “directions” of maximum variability in the face images. You should interpret light and dark
regions in the eigenfaces as regions of high contrast, e.g. your interpretation should not change if you inverted
black and white in the images.
```{r}
#Plot eigenfaces
par(mar=c(1,1,1,1))
par(mfrow=c(4,4))
for(i in 1:16){
  plot_face(face_mat[i, ])
}

```

d) In this part, we will examine faces that have the highest and lowest values for specific PCs. Plot the faces
with the 5 largest values on PC1 and the 5 smallest values for PC1. Based on the example faces, and the first
eigenface from the previous part and the 10 example images, what aspect of variability in the face images is
captured by the first component.
```{r}
#Find matrix indices with smallest and largest PC1 values
pc1 <- pr$x[ ,1]
small1 <- sort(pc1)[1:5]
large1 <- sort(pc1, decreasing = TRUE)[1:5]
small1_ind <- which(pc1 %in% small1)
large1_ind <- which(pc1 %in% large1)

#Plot eigenfaces
par(mfrow=c(2,5))
for(i in c(small1_ind,large1_ind)){
  plot_face(face_mat[i, ])
}


```
PC1 captures the variability of high contrast of the background.

e) Repeat part d) but now display example faces with the largest and smallest values on principal component 5.
Again, discuss what aspect of variability in the face images is best captured by this principal component. Based
on your results, which principal component, (1 or 5) would be more useful as a feature in a face recognition
model (e.g. a model which predicts the identity of the individual in an image)
```{r}
#Find matrix indices with smallest and largest PC5 values
pc5 <- pr$x[ ,5]
small5 <- sort(pc5)[1:5]
large5 <- sort(pc5, decreasing = TRUE)[1:5]
small5_ind <- which(pc5 %in% small5)
large5_ind <- which(pc5 %in% large5)

#Plot faces
par(mfrow=c(2,5))
for(i in c(small5_ind,large5_ind)){
  plot_face(face_mat[i, ])
}

```
PC5 captures the variability of contrast of hair and the border.

3. Predicting insurance policy purchases

This question involves the use of the “Caravan” data set, which contains 5822 real customer records. Each record
consists of 86 variables, containing sociodemographic data (variables 1-43) and product ownership (variables 44-86),
grouped by zip code. In this problem we will focus on predicted the variable “Purchase” which indicates whether
the customer purchased a caravan insurance policy. For more information see http://www.liacs.nl/~putten/library/
cc2000/data.html.

a) When you load the “ISLR” library, the variable Caravan is automatically loaded into your environment. Split
Carvan into a training set consisting of the first 1000 observations and a test set consisting of the remaining
observations.
```{r}
#Training and testing data sets
library(ISLR)
cara_train <- Caravan[c(1:1000), ]
cara_test <- Caravan[-c(1:1000), ]
```

b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors.
Use the gbm to fit a 1,000 tree boosted model and set the shrinkage value of 0.01. Which predictors appear to
be the most important (Hint: use the summary function)?
```{r}
#Fit boosting model
cara_gbm <- gbm(ifelse(Purchase=="Yes",1,0) ~ ., data = cara_train, n.trees = 1000, shrinkage = 0.01)
summary(cara_gbm)
```
PPERSAUT, MKOOPKLA, MOPLHOOG, PBRAND, and MBERMIDD have the greatest influences.

c) Now fit a random forest model to the same training set from the previous problem. Set importance=TRUE but
use the default parameter values for all other inputs to the randomForest function. Print the random forest
object returned by the random forest function. What is the out-of-bag estimate of error? How many variables
were subsampled at each split in the trees? How many trees were used to fit the data? Look at the variable
importance. Is the order of important variables similar for both boosting and random forest models?
```{r}
#Fit random forest model
cara_bag <- randomForest(Purchase ~ ., data = cara_train, importance = TRUE)
cara_bag
imp <- head(sort(cara_bag$importance[,2], decreasing = TRUE))
imp
```
The out of bag estimate of error is 6.1%.  9 variables were subsamples at each split. 500 trees were used to fit the data.  The order of the important variables are not the same.  MOPLHOOG and PPERSAUT are both in the top five importance variables, while all other variables differ for the boosting and random forest models.

d) Use both models to predict the response on the test data. Predict that a person will make a purchase if the
estimated probability of purchase is greater than 20 %. Print the confusion matrix for both the boosting and
random forest models. In the random forest model, what fraction of the people predicted to make a purchase
do in fact make one?
```{r}
#Prediction for boosting model
boost_pred <- predict(cara_gbm, newdata = cara_test, n.trees = 1000, type = "response")
boost_pred

#0.2 threshold
boost_pred_test = cara_test %>%
  mutate(pred_purchase = as.factor(ifelse(boost_pred > 0.2, "Yes", "No")))
boost_pred_test$pred_purchase

#Boosting confusion matrix
true_purchase <- cara_test$Purchase
boost_pred_purchase <- boost_pred_test$pred_purchase
error_boost <- table(boost_pred_purchase, true_purchase)
error_boost
```
```{r}
#Prediction for random forest model
bag_pred <- predict(cara_bag, newdata = cara_test, n.trees = 1000, type = "response")
bag_pred

#Random forest confusion matrix
true_purchase <- cara_test$Purchase
error_bag <- table(bag_pred, true_purchase)
error_bag
```
In the random forest model, 7/32 people predicted to make a purchase in fact made a purchase.

4. An SVMs prediction of drug use

In this problem we return to an analysis of the drug use dataset. Load the drug use data using read_csv:
```{r}
drug_use <- read_csv('/Users/megannguyen/Downloads/drug (1).csv',
col_names = c('ID','Age','Gender','Education','Country','Ethnicity',
'Nscore','Escore','Oscore','Ascore','Cscore','Impulsive',
'SS','Alcohol','Amphet','Amyl','Benzos','Caff','Cannabis',
'Choc','Coke','Crack','Ecstasy','Heroin','Ketamine','Legalh','LSD',
'Meth', 'Mushrooms', 'Nicotine', 'Semer','VSA'))
```

a) Split the data into training and test data. Use a random sample of 1500 observations for the training data and
the rest as test data. Use a support vector machine to predict recent_cannabis_use using only the subset of
predictors between Age and SS variables as on the midterm. Use a “radial” kernel and a cost of 1. Generate
and print the confusion matrix of the predictions against the test data.
```{r}
#Create new factor recent_cannabis_use
drug_use <- drug_use %>%
  mutate(recent_cannabis_use = factor(ifelse(Cannabis > "CL2", "Yes", "No"), levels = c("No", "Yes")))

#Training and test data sets
train <- sample(1:nrow(drug_use), 1500)
drug_train <- drug_use[train, ]
drug_test <- drug_use[-train, ]

true_drug <- drug_test$recent_cannabis_use

#Subset
drug_train_subset <- drug_train %>% select(Age:SS, recent_cannabis_use)

#Support vector machine
drug_svm <- svm(recent_cannabis_use ~ ., data = drug_train_subset, kernel = "radial", cost = 1, scale = TRUE)
drug_svm_pred <- predict(drug_svm, newdata = drug_test)

#Confusion matrix
error_drug <- table(drug_svm_pred, true_drug)
error_drug
```

b) Use the tune function to perform cross validation over the set of cost parameters: cost=c(0.001, 0.01, 0.1,
1,10,100). What is the optimal cost and corresponding cross validated training error for this model? Print the
confusion matrix for the best model. The best model can be found in the best.model variable returned by
tune.
```{r}
#Tune function for CV
tune_drug <- tune(svm, recent_cannabis_use ~ ., data = drug_train_subset, kernel = "radial", ranges = list(cost=c(0.001, 0.01, 0.1, 1, 10, 100)))
summary(tune_drug)
```
The optimal cost is 0.1, and its corresponding cross validation training error is 0.1780000.
```{r}
#Subset
drug_test_subset <- drug_test %>% select(Age:SS, recent_cannabis_use)

#Best model
best_model <- tune_drug$best.model

#Confusion matrix
table(true = drug_test_subset$recent_cannabis_use, pred = predict(best_model, newdata = drug_test_subset))
```

5. Logistic regression with polynomial features
a) In class, we have used polynomial linear regression several times as an example for model complexity and the
bias variance tradeoff. We can also introduce polynomial logistic regression models to derive more sophisticated
classification functions by introducing additional features. Use read_csv to load nonlinear.csv and plot the
data. Plot each point colored according to its class, Y.
```{r}
nl <- read_csv("/Users/megannguyen/Downloads/nonlinear (1).csv")

#Rainbow colors
rainbow_colors <- rainbow(2)
plot_colors <- ifelse(nl$Y == 0, rainbow_colors[1], rainbow_colors[2])

#Plot color according to class Y
nl_new <- nl[-4]
par(mfrow = c(1,2))
plot(Z~X1, data = nl_new, col = plot_colors)
plot(Z~X2, data = nl_new, col = plot_colors)
#Red is 0, blue is 1
```

b) Fit a logistic regression model of Y on X1 and X2. Visualizing the decision boundary. The decision boundary
can be visualized by making predictions of class labels over finely sampled grid points that cover your region
(sample space) of interest. The following code will create grid points over the sample space as below:
For each point in gr, predict a class label using the logistic regression model. You should classify based on the
probability being greater or less than 1/2. Plot predictions at each point on the grid, again colored by class label.
```{r}
# grid of points over sample space
gr <- expand.grid(X1=seq(-5, 5, by=0.1), # sample points in X1
X2=seq(-5, 5, by=0.1)) # sample points in X2
```

```{r}
#Fit logistic regression model
nl_glm <- glm(Y ~ X1 + X2, data = nl, family = binomial)
summary(nl_glm)

#Predictions
nl_pred <- predict(nl_glm, newdata = gr, type = "response")
nl_pred <- data.frame(nl_pred)

#0.5 threshold
nl_pred_class <- nl_pred %>%
  mutate(pred_y = as.factor(ifelse(nl_pred <= 0.5, "False", "True")))

#Plot predictions
colors <- ifelse(nl_pred_class$pred_y == "False", rainbow_colors[1], rainbow_colors[2])
plot(nl_pred_class$nl_pred, col = colors) #Decision boundary
```

c) Fit a model involving 2nd degree polynomial of X1 and X2 with interaction terms. You should use the poly()
function. Inspect result of the fit using summary(). Plot the resulting decision boundary. Explain the reason
for any strange behvaior.
```{r}
#Fit 2nd degree polynomial model with interaction terms
poly_model2 <- glm(Y ~ poly(X1,degree = 2) + poly(X2, degree = 2) + X1*X2, data = nl, family = binomial)
summary(poly_model2)
pred <- data.frame(predict(poly_model2, gr, type = "response"))
pred <- mutate(pred, Y = as.numeric(pred >= 0.5))

Y <- pred$Y

#Plot
library(ggplot2)
ggplot(gr, aes(x = X1, y = X2))+
geom_point(aes(col = Y == 1))
```
The variables that are class 0, or false, have values of around X2 = 0 and values of X1 roughly between 0 and 2.5.


d) Using the same procedure, fit a logistic regression model with 5-th degree polynomials without any interaction
terms. Inspect result of the fit using summary(). Plot the resulting decision boundary and discuss the result.
```{r}
#Fit 5th degree polynomial model without interaction terms
poly_model5 <- glm(Y ~poly(X1, degree = 5) + poly(X2, degree = 5), data = nl, family = binomial)
summary(poly_model5)
pred <- data.frame(predict(poly_model5, gr, type = "response"))
pred <- mutate(pred, Y = as.numeric(pred >= 0.5))
Y <- pred$Y

#Plot
ggplot(gr, aes(x = X1, y = X2)) +
  geom_point(aes(col = Y == 1))
```
Compared to part 5c, this model exhibits a stranger behavior, possibly due to the higher complexity.


e) Qualitatively, compare the relative magnitudes of coefficients of in the two polynomial models and the linear
model. What do you notice? Your answer should mention bias, variance and/or overfitting.

In the 2nd degree model, the magnitudes are relatively smaller than the 5th degree model.  We notice that the standard error of the 5th degree model is much larger than the other model, and looking at their p-values, all of the terms are insignificant.  The linear model has the smallest coefficient magnitudes, standard errors, and all of its terms are significant.  Although the terms are significant for the linear model, it has the highest AIC value, meaning that it is too general- it has a low variance, but a high bias.  The linear model is not a good fit.  The 5th degree model has the second highest AIC.  It has a low bias, but a very high variance, meaning this model is overfitted.  The 2nd degree model has the lowest AIC, making the best fit model out of the three.