-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathMP2_291.Rmd
More file actions
159 lines (118 loc) · 4.76 KB
/
MP2_291.Rmd
File metadata and controls
159 lines (118 loc) · 4.76 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
---
title: "Mini R Project 2"
author: "Natalia Iannucci and Sunni Raleigh"
output: html_document
---
##loading the dataset
```{r}
data <- read.table("Shortleaf.txt", header = T, sep = '\t')
attach(data)
```
##Initial study to observe the nature of the relationship between the volume of shortleaf pines and their diameters
__Graphical Method:__ Scatter Plot
```{r}
plot(Diameter, Volume, pch = 19, cex = 1.5, col = "blue", main = "Scatter plot between volume and diamater of shortleaf pines")
```
The scatterplot shows a positive trend, however; it appears to be slightly non-linear.
__Numerical Method:__ Pearson Linear Correlation Coefficient
```{r}
r <- cor(Diameter, Volume)
r
```
The r value of 0.9447509 suggests that there is a strong positive linear correlation between tree diameter and volume.
__Based on the r value, fitting a linear relationship might be reasonable. However, the non-linear trend in the scatter plot suggests that a linear relationship may not be best, so we will fit a linear relationship and then assess whether or not it is a good fit. __
##Fit a least square regression line to predict the volume based on the diameter.
```{r}
regModel <- lm(Volume~Diameter)
regModel
```
$$y_i = -41.568 + 6.837x_i + e_i$$ for $i=1,2,\cdots,70$ where y is the tree volume, x is the tree diameter, and e is the random error with the assumptions $e_i \stackrel{iid}\sim N(0, \sigma^2)$.
```{r}
plot(Diameter, Volume, pch = 16, main = "Fitted Regression Line")
abline(regModel, col= "blue", lwd = 2)
```
##Check Conditions:
###__Find Residuals and Predicted Values:__
```{r}
# find residuals
residuals <- resid(regModel)
residuals
# find predicted values
predValues <- predict(regModel)
predValues
```
###__Zero Mean:__
```{r}
m <- mean(residuals)
m
# the zero mean condition holds.
```
The mean of the residuals is very close to 0 and and the zero mean condition holds
###__Linearity and Constant Variance:__
```{r}
# plot the predicted values against the residuals
plot(predValues, residuals)
abline(0,0, col="red")
```
The condition of constant variance and linearity are not met, as the variance fans out and the scatterplot of residuals shows a clear nonlinear trend. This suggests that we need to perform a transformation.
###__Normality:__
```{r}
# Histogram
hist(residuals)
# Q-Q Plot
qqnorm(residuals, pch = 19, col= "blue")
qqline(residuals, col="red")
## Numerical Method
shapiro.test(residuals)
```
At a 5% level, the Shapiro-Wilk test suggests the residuals do not follow a normal distribution, as the p-value is less than 0.05. However, the histogram of residuals appears to be right-skewed, and the points on the normal Q-Q plot are not linear. Numerically and graphically normality does not hold. Therefore, we have chosen to do a transformation.
---
## Remedial Measures
For our transformation we have decided to transform Y because constant variance is one of the one of the conditions violated in our first model. $$ transformation: y = \sqrt{y}$$
```{r}
sqrt_Y = sqrt(Volume)
r = cor(sqrt_Y, Diameter)
r
plot(Diameter, sqrt_Y, pch = 16, ylab = "Volume", xlab = "Diameter", main = "scatter plot between volume and diamater")
```
The scatterplot drawn between the transformed Y and X reflects a linear relationship, with a r value of 0.987 suggesting a strong linear relationship between sqrt(Y) and X modeled by the following:
```{r}
newRegModel <- lm(sqrt_Y~Diameter)
newRegModel
```
$$\sqrt{y_i} = -1.1366 + 0.5832x_i + e_i$$ for $i=1,2,\cdots,70$ where y is the tree volume, x is the tree diameter, and e is the random error with the assumptions $e_i \stackrel{iid}\sim N(0, \sigma^2)$.
## Plot new model
```{r}
plot(Diameter, sqrt_Y, pch = 16, main = "Fitted Regression Line")
abline(newRegModel, col= "blue", lwd = 2)
```
##Check conditions for new model:
###__Find new Residuals and new Predicted Values:__
```{r}
# find residuals
newResiduals <- resid(newRegModel)
newResiduals
# find predicted values
newPredValues <- predict(newRegModel)
newPredValues
```
###__Zero Mean:__
```{r}
m <- mean(newResiduals)
m
```
###__Linearity and Constant Variance:__
```{r}
# plot the predicted values against the residuals
plot(newPredValues, newResiduals)
abline(0,0, col="red")
```
The variance of the residuals is fairly constant and there is a slight linear trend. This shows considerable improvement from our original model. Constant variance and linearity hold.
## Check normality of new residual values:
```{r}
hist(newResiduals)
qqnorm(newResiduals, pch = 19, col= "blue")
qqline(newResiduals, col="red")
shapiro.test(newResiduals)
```
While the distribution of the new residuals is slightly skewed left, the points on the Q-Q line are linear and the results from the Shapiro-Wilk test at a 5% level suggest that the points follow a normal distribution therefore the condition for normality holds.