-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathMP1.Rmd
More file actions
91 lines (68 loc) · 3.29 KB
/
MP1.Rmd
File metadata and controls
91 lines (68 loc) · 3.29 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
---
title: "Mini R Project 1"
author: "Sunni Raleigh and Natalia Iannucci"
output: html_document
---
###Loading the dataset
```{r}
d <- read.table("Example 1_3 Health.txt", header = T, sep = '\t')
attach(d)
```
###1. Perform an initial study to determine whether a simple linear regression model is reasonable to predict the number of doctors based on the number of hospitals in the city.
####Initial study to observe the nature of the relationship between the number of hospitals and number of doctors in a city
__Graphical Method:__ Scatter Plot
```{r}
plot(NumMDs, NumHospitals, pch = 19, cex = 1.5, col = "blue", main = "Scatter plot between number of doctors in \na a city and the number of hospitals in a city")
```
The scatter plot shows a postive linear trend.
__Numerical Method:__ Pearson Linear Correlation Coefficient
```{r}
r <- cor(NumMDs, NumHospitals)
r
```
The r value of 0.9084686 indicates there is a strong positive linear correlation between the number of doctors in a city and the number of hospitals in a city.
__Based on the scatter plot and the r value, fitting a linear relationship is reasonable.__
###2. Fit a least square regression line to predict the number of doctors based on the number of hospitals in a city.
```{r}
regMode1 <- lm(NumMDs ~ NumHospitals)
regMode1
```
$y_i = -385.1 - 282x_i + e_i$ for $i=1,2,\cdots,83$ where y is the number of doctors in a city, x is the number of hospitals in a city, and e is the random error with the assumptions $e_i \stackrel{iid}\sim N(0, \sigma^2)$.
###3. Are you satisfied with the fit of this simple linear model?
Check linearity and constant variance conditions:
```{r}
# find residuals
residuals <- resid(regMode1)
residuals
# find predicted values
predValues <- predict(regMode1)
predValues
# plot the predicted values against the residuals
plot(predValues, residuals)
abline(0,0, col="red")
```
Everywhere in the residual plot, there are negative and positive residuals; however, they are not equally symmetrical about the line e = 0, as the variance increases as the predicted values increase. Therefore, there is a linear patter with non-constant variance. So, the condition of constant variance is NOT met.
Check normality of residuals:
```{r}
hist(residuals)
qqnorm(residuals, pch = 19, col= "blue")
qqline(residuals, col="red")
```
The points in the qq plot only lie approximately on the straight line in the middle part of the graph, between the theoretical quantiles -1 and 1. Therefore, it is unclear if the residuals are normal. Because we can not draw any strong conclusions from this, we will perform a statistical test to determine normality.
Numerical method to determine normality:
```{r}
shapiro.test(residuals)
```
$H_0$: the data follows a normal distrubution
$H_a$: the data does not follow a normal distrubution
alpha level: 0.05
p-value: 0.0002739
the p-value is less than 0.05:
Therefore, reject the null hypothesis at the 5% significant level. So, we can NOT confirm that the data follows a normal distribution.
Check zero mean condition:
```{r}
m <- mean(residuals)
m
```
Our regression model holds the codition of zero mean.
While the linearity and zero mean conditions are met, the constant variance and normality of residuals conditions are not met. Therefore, we are NOT satisfied with the fit of this simple linear model.