intro-stat-ds/13-putting-together.Rmd at master · NUstat/intro-stat-ds · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
# Putting it all together {#putting-together}

```{r setup-putting-together, include=FALSE, purl=FALSE}
chap <- 13
lc <- 0
rq <- 0
# **`r paste0("(LC", chap, ".", (lc <- lc + 1), ")")`**
# **`r paste0("(RQ", chap, ".", (rq <- rq + 1), ")")`**
knitr::opts_chunk$set(
  tidy = FALSE,
  out.width = '\\textwidth',
  fig.height = 4,
  fig.align='center',
  warning = FALSE
  )
options(scipen = 5, digits = 3)
# Set random number generator see value for replicable pseudorandomness.
set.seed(2018)
```


Over the course of this book, you have been introduced to methods for data cleaning, visualization, and analysis (Part I) and the underlying theory of estimation, generalizability, and causal inference (Parts II & III). In this chapter, we put all of these methods and theory together, illustrating how descriptive and inferential statistics can be used with real data.

### Needed Packages {#pkgs-13 .unnumbered}

Let’s load all the packages needed for this chapter (this assumes you’ve already installed them). If needed, read Section \@ref(packages) for information on how to install and load R packages.

```{r, message = FALSE, warning = FALSE}
library(tidyverse)
```

```{r message=FALSE, warning=FALSE, echo=FALSE}
# Packages needed internally, but not in text.
library(readr)
library(knitr)
library(kableExtra)
library(patchwork)
library(scales)
```


## A general process for using statistics

In general, there are two types of analyses that statistics are used for: Exploratory and Confirmatory.

**Exploratory research** typically doesn’t start with a clear question – it starts with data. For example, maybe you discover that it is possible to download data regarding all of the movies found in IMDB. In this case, just as in the beginning of this book, you explore the data visually and using summary statistics, maybe even comparing groups, variables, and relationships between them. Only after you’ve explored do you have a clear sense of some questions that you might want to ask, like: do action films gross more, on average, than comedies? Has this trend changed over time? In many cases, these questions can be answered just using **descriptive statistics**.

<!-- Figure: Descriptive research
Explore/ Estimate -> Question -> Hypothesize ->
-->

**Confirmatory research** starts with a clear question. This research typically builds on previous, exploratory research, and may involve the collection of data – e.g., via a survey or an experiment. In confirmatory research, after clearly defining the questions, you need to determine how data will be used to answer to these questions. For example, perhaps you want to know if allowing students to use laptops in class increases (or reduces) learning. To answer this, you randomly assign students to use laptops or not in class, and at the end of the semester you compare grades between the two groups. Determining if there is a difference here requires **inferential statistics** – e.g., p-values and hypothesis tests.

<!-- Figure: Confirmatory research -->
<!-- Question -> Model ->  Null hypothesis -> Estimate -> P-values -> Hypothesis Tests  -->

In either case, you need to pay attention to issues regarding **causality** and **generalizability**. When you use **descriptive statistics**, you are limiting your generalizations to the sample, not making any claims beyond the data you have in front of you. When you use **inferential statistics**, you are implicitly generalizing to other samples *like* the one that you have – i.e., gathered in the same way. But even then, this population needs to be clearly defined, indicating where the results generalize and where they do not. Similarly, if your question involves comparisons between groups or relationships between variables, you need to determine if you can (not just if you want to) infer causality from your data.

<!-- Figure: General process -->

<!-- Estimation - > Inferences -> Interpretation (Causality? Generalization?) -->

In the remainder of this chapter, we provide three examples illustrating **confirmatory research questions**. These complement the data analyses provided in the first part of the book that focus on exploratory research.

## Example: Treatment effect {#example-TE}
The Tennessee STAR experiment was conducted in the 1980s in order to determine if class size effects student learning. In each school in the experiment, Kindergarten students were randomized to either be in a small classroom (13-17 students) or a regular sized classroom (22-27 students). This design was replicated in about 80 schools throughout the state. For information on the study, see [here](https://en.wikipedia.org/wiki/Class-size_reduction).

For this example, we focus on data from a *single school*. In this school, there were 137 Kindergarten students that were randomly divided into 7 classes: 3 “small” classes and 4 “regular” classes. Teachers were also randomly assigned to these classrooms. At the end of the year, students were tested and achievement scores obtained. Let's load this data and take a look at it.

```{r, eval = FALSE, echo = FALSE}
## for internal use only - to show where star_data came from
library(haven)
star_student <- read_sav("data/star_student.sav")

#extract the school with the largest sample size for power reasons
#select only variables we want students to see for analysis
star_one_school <- star_student %>%
  filter(n_per_sc == max(n_per_sc)) %>%
  select(school_id, class_id, reading, math, class_si, small = class_ty)

write_csv(star_one_school, "data/star_data.csv")
```

```{r star-data, echo=FALSE, message=FALSE}
if(!file.exists("data/star_data.csv")){
  star_data <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTluVgdD7hdPLkKPMpN3OZjpBdZ1HloCGEQ1abcgfcWJHGYQppUGyXsGaPz74XQRwaMobDFLZgN3caU/pub?gid=0&single=true&output=csv" %>%
    read_csv(na = "")
    write_csv(star_data, "data/star_data.csv")
} else {
  star_data <- read_csv("data/star_data.csv")
}
```

```{r, message = FALSE}
star_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTluVgdD7hdPLkKPMpN3OZjpBdZ1HloCGEQ1abcgfcWJHGYQppUGyXsGaPz74XQRwaMobDFLZgN3caU/pub?gid=0&single=true&output=csv")
glimpse(star_data)
```

This dataset includes 6 variables:

* `school_id` - school ID, which is the same for every student in the dataset, since we're only looking at one school
* `class_id` - indicates which of the 7 classes the student belongs to
* `reading` - end-of-year reading achievement score
* `math` - end-of-year math achievement score
* `class_si` - number of students in the classroom
* `small` - indicator variable for whether or not the class is "small" or "regular" (1 = small, 0 = regular)

**Question**: Did the type of class (small, regular) affect student math achievement, on average, in this school?

**Model**: To answer this, we can use a regression model, treating `small` as a dummy variable:

$$\widehat{math} = b_0 + b_1*small$$

**Null Hypothesis**: We can use a **stochastic proof by contradiction** in order to prove that there is a difference. To do so, we begin by assuming that there is **no difference** (our null) and then look to see if our data shows a difference large enough to contradict this.

$$H_0: \beta_1 = 0$$
$$H_A: \beta_1 \neq 0$$

Notice here that the *population parameter* we are interested in is $\beta_1$, which we estimate by the *sample statistic* $b_1$ in our model. Because the treatment of having small class sizes is fairly costly (as it would require hiring and paying more teachers), we want to make sure we have pretty strong evidence that it is effective before recommending that it should be adopted as a policy. Therefore, we design our test to have a Type I error of $\alpha = 0.01$. That is, we are only willing to tolerate a 1% probability of falsely finding the treatment to be effective when in fact it is not, so we will only reject the null hypothesis if $p < 0.01.$

**Estimate**: Our estimated model is found in the table below.

```{r}
star_model <- lm(math ~ small, data = star_data)
summary(star_model)
```
In this output, we can see that on average, students in small classes scored 19.46 points more than students in the regular-sized classes. We can use the values from the model output to compute the statistic $$test\_stat = \frac{Estimate - Null \ \ value}{SE(Estimate)} = \frac{b_1 - 0}{SE(b_1)} = \frac{19.46}{7.82} = 2.49,$$ which will help us to determine the probability that we would see a value this large if in fact there was no effect of class size. The p-value of $0.014$ reported in the regression output indicates that we would see an effect this large if there was actually no effect in about 1.4% of samples. Note this is the same p-value we would obtain by using the `pt()` function, our `t value` (i.e. $test\_stat$), and the appropriate `degrees of freedom`: `pt(2.49, df = (137 - 2), lower.tail = FALSE)*2 = 0.014`.

**Conclusion**: Since the p-value is 1.4%, which is greater than 1% ($p > \alpha$), we fail to reject our null hypothesis and therefore conclude that there is **insufficient evidence** that class-size reduction increased learning. Note that because this school was not randomly selected from a population, it is **difficult to generalize** the results beyond this school.

## Example: Estimate a proportion

The General Social Survey is an annual probability survey of American adults. Randomly selected adults are contacted via telephone and asked questions regarding their attitudes and experiences. In 2002, one question asked,

“People differ in their ideas about what it takes for a young person to become an adult these days. How important is it for them to be no longer living in their parents' household?”

The GSS showed that 29% of those asked felt that it was “Extremely important” for adults to no longer live with their parents. You can see this data [here](https://gssdataexplorer.norc.org/variables/2896/vshow).

In 2008, there was an economic downturn and increasing numbers of young adults returned to living with their parents. This led a researcher interested in understanding changing attitudes to ask: Are the attitudes among Beta University students in 2019 different than these trends in 2002? In order to answer this, the researcher conducted a survey of 100 students at the university. Let's take a look at this data.

```{r, message = FALSE, echo = FALSE}
beta_U_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vTfEw5anFVvZ6-lFXxhmeBdTJDkcJvJz-gHQxVwBARIMOouvpd6QwxJB190k3oxDCSjB6PqiEG0exey/pub?gid=0&single=true&output=csv")

glimpse(beta_U_data)
```

The variable `important` has a value of 1 if the student answered "extremely important" and a value of 0 otherwise.

**Question**: Do Beta University students in 2019 think it is as important for a young person to not live with their parents as people did in 2002?

**Null Hypothesis**: We wish to know if the average percent of students in 2019 is different from that in 2002. To answer this, we conduct a **stochastic proof by contradiction**, in which we assume that the percent of students in 2019 would be 29% just as in 2002. Our null hypothesis is thus that the *proportion* is 0.29.

$$H_0: \pi = 0.29$$
$$H_A: \pi \neq 0.29$$

Given the cost of collecting this data and associated trade-offs, we decided to set our Type I error at 10%.

**Estimate**: Let's use `prop.test()` to obtain our estimate. Recall that `prop.test()` requires an argument `x` that gives the counts of "successes," which in this case means the number of people who responded "extremely important" (i.e. `important == 1` in the data), and an argument `n` for the total number of "trials," which in this case means the total number of people surveyed (i.e. `n = 100`).

```{r, echo = FALSE, eval = FALSE}
set.seed(43)
beta_U_data <- data.frame(
  important = rbernoulli(100, p = .35)) %>%
  mutate(important = case_when(important == TRUE ~ 1,
                               important == FALSE ~ 0))

write_csv(beta_U_data, "beta_U_data.csv")
```

```{r beta-U-data, echo=FALSE, message=FALSE}
if(!file.exists("data/beta_U_data.csv")){
  beta_U_data <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTfEw5anFVvZ6-lFXxhmeBdTJDkcJvJz-gHQxVwBARIMOouvpd6QwxJB190k3oxDCSjB6PqiEG0exey/pub?gid=0&single=true&output=csv" %>%
    read_csv(na = "")
    write_csv(beta_U_data, "data/beta_U_data.csv")
} else {
  beta_U_data <- read_csv("data/beta_U_data.csv")
}
```

```{r}
beta_U_summary <- beta_U_data %>%
  summarize(important = sum(important),
            n = n())
beta_U_summary
```

The default for `prop.test()` assumes the null hypothesis is $\pi = 0.5$, so we need to override that here by specifying `p = 0.29`.
```{r}
prop.test(x = beta_U_summary$important, n = beta_U_summary$n, p = 0.29)
```

```{r, eval = FALSE, echo = FALSE}
#regression framework
univ_model <- lm(important ~ 1, data = beta_U_data)
summary(univ_model)
```

<!-- Importantly, this table automatically reports a p-value for the null hypothesis that the average outcome is 0% (i.e. $\beta_0 = 0$). However, this is not the null hypothesis we are interested in. Instead, we will need to conduct our analysis manually, using -->
<!-- 	$$test\_stat = \frac{b_0 - 0.29}{SE(b_0)} = \frac{0.32 - 0.29}{0.0469} = 0.64$$ -->
<!-- Since we have a sample of n = 100, we can compare this value to a t-distribution with $n – 1 = 99$ degrees of freedom (which is equivalent to the standard normal distribution). `2*pt(-0.64, df = 99)` = `r 2*pt(-0.64, df = 99)` -->

We see that the estimate from our sample is $\hat{\pi} = 0.32$. This results in a p-value of $0.6$. That is, if in fact 29% of Beta University undergraduates felt strongly that young adults needed to not live at home, then we would observe a value this different (32%) in our sample in about 60% of random samples collected in the same way.

**Conclusion**: Since our p-value is greater than the stated 10% Type I error, we would *not* reject our null hypothesis. We can thus conclude that there is **not enough evidence** to suggest that the percent of students that feel that young adults should not live at home is different at Beta University in 2019 than in the general public in 2002. Note that because the confidence interval `[0.232, 0.422]` contains the hypothesized value of 0.29, this is also an indication that there is insufficient evidence to overturn the null hypothesis. Importantly, this result only generalizes to all Beta University students if the sample was collected randomly. If it was collected based upon convenience, these results might not generalize. More information would be required to determine this.

## Example: Estimate the relationship between two variables

Let’s return to the Lego dataset we explored in the beginning of this course. One of the authors of this book is both a huge Harry Potter fan and a Lego connoisseur. After buying a few of these sets, he began to wonder about the relationship between the number of minifigures in these sets and the price of the set. To answer this, he returned to the `legosets` data, focusing only on the subset relevant to this question.

**Question**: What is the relationship between the number of minifigures in a Harry Potter lego set and the price of the set?

Let's load in the data, create the relevant Harry Potter subset, and `skim` the relevant variables. Recall that `USD_MSRP` is our price variable in US dollars.

```{r lego-data, echo=FALSE, message=FALSE}
if(!file.exists("data/legosets.csv")){
  legosets <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vSwB0bCE4w7qrep4lIC-QVyF307mxFNUYQ08LZqiFsC_ks1bt_ZEJqiTEo7SaCl6g4TQ8gig2ZIfQJu/pub?gid=0&single=true&output=csv" %>%
    read_csv(na = "")
    write_csv(legosets, "data/legosets.csv")
} else {
  legosets <- read_csv("data/legosets.csv")
}

legosets_HP <- legosets %>%
  filter(Theme == "Harry Potter")
```

```{r, message = FALSE, eval = FALSE}
legosets <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vSwB0bCE4w7qrep4lIC-QVyF307mxFNUYQ08LZqiFsC_ks1bt_ZEJqiTEo7SaCl6g4TQ8gig2ZIfQJu/pub?gid=0&single=true&output=csv")

legosets_HP <- legosets %>%
  filter(Theme == "Harry Potter")

skim_with(numeric = list(hist = NULL))
legosets_HP %>%
  select(USD_MSRP, Minifigures) %>%
  skim()
```

```
Skim summary statistics
 n obs: 53
 n variables: 2

── Variable type:integer ──────────────────────────────────────────────────────────
    variable missing complete  n mean   sd p0 p25 p50 p75 p100
 Minifigures       2       51 53 3.71 2.62  1   2   3   4   12

── Variable type:numeric ──────────────────────────────────────────────────────────
 variable missing complete  n  mean    sd p0 p25 p50   p75   p100
 USD_MSRP       0       53 53 34.36 33.86  0  10  20 49.99 149.99
```

**Model**: A question about ‘relationships’ in statistics typically means that we are interested in the slope in a regression, i.e.,

$$\widehat{USD\_MSRP} = b_0 + b_1*Minifigures$$

This model can be estimated in R using the following syntax:
```{r}
lego_model <- lm(USD_MSRP ~ Minifigures, data = legosets_HP)
```

Note that we do not necessarily need to use a hypothesis test here to make a *decision* about anything; rather, we are interested in simply *estimating* the magnitude of the relationship between minifigures and price in the population of Harry Potter legosets. Therefore, we can simply construct a confidence interval, rather than conducting a hypothesis test.

<!-- **Null hypothesis**: Recalling that we seek to prove that there is a relationship, we invoke a **stochastic proof by contradiction** by assuming that there is **no relationship** between the number of minifigures and price. In the model, this means assuming that the slope $\beta_1 = 0$.  -->

<!-- We decide to use a small Type I error (1%) for this test. -->

**Estimate**: We can obtain our estimate $b_1$ from our data using `summary(lego_model)`, which provides us an estimate of this relationship.

```{r}
summary(lego_model)
```

Here the slope $b_1 = 10.54$; this means that for every additional minifigure in a Harry Potter lego set, the average price increases by \$10.54. How confident are we in this estimate? Is it precise? We can use the function `confint()` to construct a 95% confidence interval.

```{r}
confint(lego_model)
```

**Conclusion**: We are 95% confident that the true relationship between minifugures and price, $\beta_1$, is captured in the interval [8.38, 12.70]. Notably, this relationship cannot be generalized to all lego sets – only Harry Potter legosets.

<!-- Is this strong enough evidence to reject our null hypothesis? Since the null hypothesis is that $\beta_1 = 0$, we can use the built-in results from the model, which indicates that the p-value = `4e-13`. Note this is written in scientific notation $4*10^{-13} = 0.0000000000004$. Thus, if in fact there was no relationship between the number of minifigures and price, we would expect to find a slope this different in only a tiny percentage of samples.

**Conclusion**: Since our p-value is less than our pre-set Type I error (1%), we can reject the null hypothesis, thus concluding that there is a relationship between the number of minifigures and the price for Harry Potter lego sets. Notably, this relationship cannot be generalized to all lego sets – only Harry Potter legosets.  -->


## Final thoughts

As we wrap up this course, take a moment to reflect on what you have learned. You now know how to explore, describe and visualize data. You also know some of the basic principles of statistical theory – the role of randomization and chance, the idea that your data is one version of hypothetically many other versions you could have, the fundamentals of how to use data to test and prove claims. And you know how to apply these principles to real data, to understand uncertainty, to determine when patterns are common or rare, and to test hypotheses.