---
bibliography: references.bib
html:
  code-link: true
---
{{< include emoji_script.md >}}
# Int`R`oduction to statistical modelling {#sec-SLR}
```{r}
#| include: false
library(tidyverse)
```
### Chapter overview {.unnumbered}
This chapter will take you from statistical tests to statistical modelling. By the end of this chapter, you will be able to:
- Use the `lm()` function to fit linear regression models with:
- a single numeric predictor variable
- a single (binary) categorical predictor variable.
- Understand the relationship between:
- correlation tests and linear regression models with a single numeric predictor
- *t*-tests and linear regression models with a single binary categorical predictor.
- Interpret the summary output of simple linear regression models.
- Explain what the intercept of a simple linear regression model corresponds to.
- Visualise the predictions of simple linear regression models.
- Understand why "association does not imply causation".
- Check the most important assumptions of linear regression models.
## Correlations as regression over a numeric variable {#sec-Correlationsregression}
This section explains how the principle of correlation, which we covered in @sec-Correlations, is integral to linear regression modelling. We begin with a toy example to familiarise ourselves with the concept of statistical modelling. It is important to take the time to genuinely understand how simple linear regression works before moving on to more complex, real-world research questions.
### A perfect prediction
In the following, we will fit a simple linear regression model to predict the percentage of correct answers that a student obtained in a multiple-choice test based on the number of questions that they correctly answered in this same test. Clearly, this is a purely hypothetical example because, if we know the number of questions that a student correctly answered and how many questions there were in the test, we can easily calculate the percentage of questions that the student answered correctly. If a test has 10 questions and a student answered 8 of them correctly, we can calculate the percentage of questions that they successfully answered like this:
```{r}
8 / 10 * 100
```
And we can simplify this operation to a single multiplication:
```{r}
8 * 10
```
We will fit our first simple linear regression model on a simulated dataset. It consists of 100 test results expressed:
a. as the number of correctly answered questions (`N.correct`), and
b. as a percentage (`Accuracy`).
The dataset contains 100 rows corresponding to the results of 100 test-takers. The first six rows of this simulated dataset (`test.results`) are displayed below.
```{r}
#| code-fold: true
#| code-summary: "Show `R` code to simulate this toy dataset."
#| echo: !expr "knitr::is_html_output()"
# First, we set a seed to ensure that the outcome of our randomly generated number series is always exactly the same:
set.seed(42)
# Second, we simulate the number of questions that 100 learners answered correctly using the `sample()` function that generates random numbers, specifying that learners got between 1 and 10 questions right:
N.correct <- sample(1:10, 100, replace = TRUE)
# Third, we convert the number of correctly answered questions into the percentage of questions that these learners answered correctly:
Accuracy <- N.correct / 10 * 100
# Finally, we put these two variables together in a dataframe:
test.results <- data.frame(N.correct, Accuracy)
```
```{r}
head(test.results)
```
We can now plot the two variables of our simulated dataset as a scatter plot to visualise the correlation between the number of questions learners answered correctly (`N.correct`) and the percentage of questions they answered accurately (`Accuracy`). As expected, these two variables are perfectly correlated: every data point rests exactly on the regression line.
```{r}
#| code-fold: true
#| code-summary: "Show code to generate the scatter plot."
#| echo: !expr "knitr::is_html_output()"
#| label: "fig-RTScatterPLot"
#| fig-cap: "The correlation between learners' test results expressed as the number of correct answers and the percentage of questions they answered correctly"
#| fig-alt: "Scatter plot of learners' test results expressed as the number of correct answers they got on the x-axis and as the percentage of correctly answered questions on the y-axis. The data points form a perfect linear relationship. A straight blue regression line runs through all the data points."
#| fig-width: 4.5
#| out-width: 50%
#| fig-asp: 0.9
library(tidyverse)
ggplot(data = test.results,
aes(x = N.correct, y = Accuracy)) +
geom_smooth(method = "lm",
se = FALSE) +
geom_point(alpha = 0.5,
size = 2) +
labs(x = "Number of correct answers",
y = "% of questions correctly answered") +
scale_x_continuous(limits = c(0, 10.5), expand = c(0,0), breaks = c(0,2,4,6,8,10)) +
scale_y_continuous(limits = c(0, 105), expand = c(0,0), breaks = c(0, 20, 40, 60, 80, 100)) +
theme_bw()
```
This is confirmed by a correlation test (see below), which shows:
a. a correlation coefficient (Pearson's *r*) of exactly `1`,
b. an extremely small *p*-value (the smallest that `R` can display: `< 2.2e-16`), and
c. the narrowest 95% confidence interval possible `[1, 1]`.
```{r}
cor.test(~ N.correct + Accuracy,
data = test.results)
```
Now, we use the base `R` function `lm()` to fit a simple linear model to predict the percentage of questions that learners correctly answered based on the number of correct answers that they gave. Like the significance test functions that we saw in @sec-Inferential, `lm()` takes a **formula** as its first argument. We want to predict the proportion of questions that learners correctly answered as a percentage (`Accuracy`) so this variable comes *before* the tilde (`~`). In statistical modelling, the variable that we want to predict is referred to as the **outcome variable**.[^12_simplelinearregression-1] We want to use the number of correct answers that the learners gave to make our prediction, so we place the variable `N.correct` *after* the tilde. Variables used to predict the outcome are called **predictors**.[^12_simplelinearregression-2]
[^12_simplelinearregression-1]: This variable is sometimes referred to as the **dependent variable** (see @sec-IntSummary).
[^12_simplelinearregression-2]: These variables are also sometimes referred to as **independent variables** (see @sec-IntSummary).
```{r}
test.model <- lm(formula = Accuracy ~ N.correct,
data = test.results)
```
We have saved our model to our local environment as an `R` object called `test.model`. We can now use the `summary()` function to examine the model:
```{r}
#| message: true
summary(test.model)
```
In addition to this model summary, the code above outputs a warning about an "essentially perfect fit" in the Console. We will come to this warning at the end of this section, but for now let's start by looking at the **coefficients** of our model summary. Every model has an **intercept** coefficient estimate `(Intercept)`, which is the value that the model predicts for the outcome variable when the predictor variables are zero. In other words, it is where the regression line would cross the *y*-axis if the line were extended to go all the way to zero as in @fig-RTScatterPLotIntercept.
```{r}
#| code-fold: true
#| code-summary: "Show code to generate plot."
#| label: "fig-RTScatterPLotIntercept"
#| fig-cap: "The correlation between learners' test results expressed as the number of correct answers and the percentage of questions answered correctly with an extended regression line"
#| fig-alt: "This is the same plot as above. The only difference is that a dotted blue line extends the fit blue line to zero."
#| echo: !expr "knitr::is_html_output()"
#| fig-width: 4.5
#| out-width: 50%
#| fig-asp: 0.9
ggplot(data = test.results,
aes(x = N.correct, y = Accuracy)) +
geom_abline(slope = 10,
intercept = c(0,0),
linetype = "dotted",
colour = "blue") +
geom_smooth(method = "lm",
se = FALSE) +
geom_point(alpha = 0.5,
size = 2) +
labs(x = "Number of correct answers",
y = "% of questions correctly answered") +
scale_x_continuous(limits = c(0, 10.5), expand = c(0,0), breaks = c(0,2,4,6,8,10)) +
scale_y_continuous(limits = c(0, 105), expand = c(0,0), breaks = c(0, 20, 40, 60, 80, 100)) +
theme_bw()
```
In our case, we only have one predictor, so the intercept coefficient that our model estimates (`5.151e-15`) corresponds to the predicted percentage of questions answered correctly (outcome) when the number of correct answers (predictor) is equal to zero. Instinctively, we know that this value should be zero because 0 points in a test = 0% correct in the test. And, indeed, our model predicts an intercept extremely close to zero (`5.151e-15`). Using the `format()` function, we can convert this coefficient estimate from **scientific notation** to standard notation:
```{r}
format(5.151e-15, scientific = FALSE)
```
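Equivalently, we can ask the model directly for its prediction when `N.correct` is zero using the `predict()` function (a quick sketch using `test.model` from above):
```{r}
# The model's prediction for a learner with zero correct answers
# is simply the intercept estimate:
predict(test.model, newdata = data.frame(N.correct = 0))
```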
The second coefficient estimate in our model summary, `N.correct`, is displayed as `1.000e+01`, which is equal to:
```{r}
format(1.000e+01, scientific = FALSE)
```
This coefficient estimate means that, for every correct answer, our prediction for the percentage of questions answered correctly increases by 10 percentage points. For example, if a student answered 9 questions correctly, our model predicts that the percentage of correctly answered questions is equal to the intercept coefficient of (nearly) zero plus 9 multiplied by 10, which is equal to 90%:
```{r}
5.151e-15 + 9 * 10
```
In our model summary, the coefficient estimate for `N.correct` is associated with a *p*-value of `<2e-16`. This corresponds to the *p*-value of our `cor.test()` on the same data and is the smallest value that `R` will display. In other words, it is extremely small. This *p*-value indicates that we can safely reject the null hypothesis that this coefficient is zero under the assumptions of a linear regression model. In other words, we can reject the null hypothesis that the number of correct answers is *not* a useful predictor of the percentage of correctly answered questions under the assumptions of a linear regression model (more on these assumptions in @sec-Assumptions).
The penultimate line of the model summary is also important. It features two ***R*****-squared (*R*^2^) values**. Like the absolute values of correlation coefficients, these can range between 0 and 1. *R*^2^=0 means that the model accounts for 0% of the variance in the outcome variable. *R*^2^=1 means that the model accounts for 100% of the variance, i.e. that it can perfectly predict the values of the outcome variable from the predictor variable(s). As our simulated `test.results` dataset is based on a perfect correlation, our model is able to perfectly predict the outcome variable, hence both our *R*^2^ values equal `1`. In this chapter and @sec-MLR, however, we will focus on the **adjusted *R*-squared value** because it accounts for the fact that the more predictors we include in our model, the easier it is to predict the outcome variable.
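Should you ever need these *R*^2^ values programmatically rather than reading them off the printed summary, both can be extracted from the summary object (a minimal sketch using `test.model` from above):
```{r}
# Extract the R-squared values from the model summary:
summary(test.model)$r.squared      # multiple R-squared
summary(test.model)$adj.r.squared  # adjusted R-squared
```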
Finally, you may have also noticed that the `lm()` function returned a warning message when we fitted this toy model:
```
Warning message:
In summary.lm(test.model) :
essentially perfect fit: summary may be unreliable
```
With this message, the authors of the `lm()` function are warning us that our model can make almost perfect predictions. Given that in real-life research this is extremely unlikely, an "essentially perfect" prediction is usually a sign that we may have made an error of some kind. Here, however, we can safely ignore this warning because we know that the number of correct answers can, indeed, be converted to percentages of correctly answered questions with perfect accuracy (hence our *R*^2^ of `1` or 100%).
### A real-life prediction
In the remaining sections of the chapter, we fit simple regression models to real data from @DabrowskaExperienceAptitudeIndividual2019.
::: {.callout-warning collapse="false"}
#### Prerequisites {.unnumbered}
Our starting point for this chapter is the wrangled combined dataset that we created and saved in @sec-DataWrangling. Follow the instructions in @sec-filter to create this `R` object. Alternatively, you can download `Dabrowska2019.zip` from [the textbook's GitHub repository](https://github.com/elenlefoll/RstatsTextbook/raw/69d1e31be7394f2b612825f031ebffeb75886390/Dabrowska2019.zip){.uri}. To launch the project correctly, first unzip the file and then double-click on the `Dabrowska2019.Rproj` file.
```{r}
library(here)
Dabrowska.data <- readRDS(file = here("data", "processed", "combined_L1_L2_data.rds"))
```
Before you get started, check that you have correctly imported the data by examining the output of `View(Dabrowska.data)` and `str(Dabrowska.data)`. In addition, run the following lines of code to load the {tidyverse} packages and create "clean" versions of the L1 and L2 datasets as separate `R` objects:
```{r}
library(tidyverse)
L1.data <- Dabrowska.data |>
filter(Group == "L1")
L2.data <- Dabrowska.data |>
filter(Group == "L2")
```
Once you are satisfied that the data are correctly loaded, you are ready to start modelling! `r emoji("rocket")`
:::
Recall that, in @sec-Correlations, we saw that there is a **positive correlation** between the total number of years that English L1 speakers spent in formal education (`EduTotal`) and their English grammar comprehension test scores (`Grammar`). @fig-EducationYearsVocab suggests that this positive correlation also holds for the association between the number of years participants spent in formal education and their receptive vocabulary test scores (`Vocab`).
```{r}
#| code-fold: true
#| code-summary: "See code to generate plot."
#| echo: !expr "knitr::is_html_output()"
#| label: "fig-EducationYearsVocab"
#| fig-cap: "Relationship between years spent in formal education and vocabulary test scores among L1 speakers"
#| fig-alt: "Scatter plot of vocabulary scores on the y-axis, and years of education on the x-axis. A blue regression line slopes upward, indicating a positive relationship: higher formal education tends to be associated with higher vocabulary scores."
#| fig-width: 4.5
#| out-width: 50%
#| fig-asp: 0.9
L1.data |>
ggplot(mapping = aes(x = EduTotal,
y = Vocab)) +
geom_point() +
geom_smooth(method = "lm",
se = FALSE) +
theme_bw()
```
On average, the longer L1 speakers were in formal education, the better they performed on the vocabulary test. The correlation is larger than for `Grammar` scores, and it is statistically significant at an $\alpha$-level of 0.05 (*r* = 0.43, 95% CI \[0.24, 0.58\]):
```{r}
cor.test(formula = ~ Vocab + EduTotal,
data = L1.data)
```
The straight (i.e. linear) regression line going through our scatter plot in @fig-EducationYearsVocab does not go through all the data points like it did in @fig-RTScatterPLot. This is because, if we know how long an L1 participant was in formal education, we cannot *perfectly* predict how well they will perform on the `Vocab` test, even though we do know that, on average, time spent in formal education correlates positively with `Vocab` test scores. Indeed, @fig-EducationYearsVocab clearly shows that some of the individuals who attended formal education for the least amount of time scored very low on the `Vocab` test, whilst others scored very high. This is not surprising, as we can expect many other factors to play a role in L1 vocabulary knowledge.
We now fit a simple linear regression model using the `lm()` function (which stands for **linear model**) to the L1 data from @DabrowskaExperienceAptitudeIndividual2019 with the aim of predicting `Vocab` scores (our outcome variable) based on the number of years that they spent in formal education (`EduTotal`; our predictor variable):
```{r}
model1 <- lm(formula = Vocab ~ EduTotal,
data = L1.data)
```
We examine the model using the `summary()` function:
```{r}
summary(model1)
```
This time, we'll begin our interpretation of the model summary from the top:
- The first line is a reminder of the model **formula** and of the **data** on which we fitted our linear model (`lm`).
- The second paragraph provides information about the **residuals**. They represent the difference between participants' actual `Vocab` scores (the observed values) and the model's predictions (the predicted values). Hence, they correspond to the variance in `Vocab` scores that is left unaccounted for by the model. The closer to zero the residuals are, the better the model fits the data. However, no real-life model is perfect so we expect there to be some "left-over" variance.
- `Min`, `1Q`, `Median`, `3Q`, and `Max` are the minimum, first quartile, median, third quartile, and maximum residuals, respectively. These are the descriptive statistics of a distribution that are typically displayed in a boxplot (see @sec-IQR). These statistics give us a sense of the spread of the residuals. Whilst these values can be informative, it's best to plot residuals to get a sense of their distribution (see @sec-Residuals). Ideally, the residuals should be normally distributed and centred around zero (see @sec-AssumptionsLR).
- The **intercept coefficient estimate** of `20.516` is the model's prediction for the vocabulary score of participants who spent zero years (!) in formal education (`EduTotal` = 0). It is the model's baseline or reference `Vocab` score.
- The **EduTotal coefficient estimate** of `3.546` means that, for each year of formal education, our model predicts that `Vocab` test scores will increase by an additional 3.546 units on top of the baseline score of `20.516`, provided that all other variables remain the same. Hence, if an individual spent 14 years in formal education, their predicted `Vocab` score is:
```{r}
20.5159 + 14*3.5460
```
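Rather than doing this arithmetic by hand, we can also obtain the same prediction with the `predict()` function (a quick sketch):
```{r}
# The model's predicted Vocab score for 14 years of formal education:
predict(model1, newdata = data.frame(EduTotal = 14))
```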
- The ***p-*****value** associated with the `EduTotal` coefficient is very small (`2.72e-05`).
```{r}
format(2.72e-05, scientific = FALSE)
```
It suggests that the predictor `EduTotal` makes a statistically significant contribution to our model predicting `Vocab` scores among L1 speakers (for more on *p*-values, see @sec-PValues).
- The **adjusted *R*-squared (*R*^2^)** is `0.1729`, which means that our model accounts for ca. 17% of the total variance in `Vocab` scores. That may not seem a lot, but it is worth recalling that our model only includes one predictor variable. We can reasonably assume that other predictors will help us to predict L1 speakers' vocabulary knowledge more accurately (e.g. their age, how much they read, perhaps their profession, etc.).
::: callout-note
# It's all linear regression! `r emoji("exploding_head")`
Did you notice that the ***t*****-value** and the ***p*****-value** associated with the `EduTotal` coefficient in our linear regression model are exactly the same as those that we obtained earlier using the `cor.test()` function? This is because linear regression with a single numeric predictor is -- under the hood -- exactly the same as a Pearson's correlation test!
Moreover, we said that ***R***^**2**^ stands for the **squared coefficient of correlation**, which is why, if we take the square root of our unadjusted *R*^2^ coefficient, we get 0.43, which corresponds to the **correlation coefficient** (Pearson's *r*) that we obtained from the `cor.test()` function:
```{r}
sqrt(0.1822)
```
Having gone through @sec-Inferential, you may now ask yourself: why do we need correlation tests if simple linear regression does the same thing? Honestly, we don't. @sec-Inferential introduced correlation tests and *t*-tests because they are widely used in the language sciences, so it is important that you understand them, but the truth is: you can achieve much more by taking a modelling rather than a testing approach to analysing data. Read on to find out more!
:::
### Predicted values and residuals {#sec-PredictedResiduals}
@fig-EduPredictedVocab visualises the `Vocab` scores that our first model (`model1`) predicts as a function of the number of years that L1 speakers have spent in formal education. We can access our model's predictions by applying the `predict()` function to our model object (`model1`). By definition, regression models predict perfect linear associations. As shown in @fig-EduPredictedVocab, here, our model predicts a perfect, positive linear association between our predictor variable (`EduTotal`) and our outcome variable (`Vocab`).
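For instance, a quick sketch to inspect the model's predictions for the first six participants:
```{r}
# The model's fitted (predicted) Vocab scores for the first six L1 participants:
head(predict(model1))
```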
```{r}
#| code-fold: true
#| code-summary: "See code to generate plot."
#| echo: !expr "knitr::is_html_output()"
#| label: "fig-EduPredictedVocab"
#| fig-cap: "Relationship between years spent in formal education and predicted vocabulary test scores"
#| fig-alt: "A scatter plot showing the relationship between years in formal education and predicted vocabulary scores. The points form an upward linear trend and their size corresponds to the number of people in the L1 dataset who reported being in formal education for that many years. Most people reported spending between 11 and 14 years in formal education but some as many as 21 years."
#| fig-width: 4.5
#| out-width: 50%
#| fig-asp: 0.9
ggplot(L1.data,
aes(x = EduTotal,
y = predict(model1))) +
geom_abline(slope = 3.55,
intercept = 20.51,
linetype = "dotted",
colour = "blue") +
geom_count() +
scale_size_area(guide = NULL) +
labs(y='Predicted Vocab scores',
x='Years in formal education') +
theme_bw()
```
You may have noticed that there are fewer dots in @fig-EduPredictedVocab than there are data points in the dataset that we used to fit the model (N = 90). This is because, if more than one L1 participant reported having been in formal education for the same length of time, the model predicts the same `Vocab` score for all of these participants, which means that their dots overlap on the plot of predicted values. To remind us that a single dot may represent more than one participant, @fig-EduPredictedVocab uses the `geom_count()` function to map the size of each dot onto the number of participants that it represents.
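We can verify this by counting how many L1 participants reported each number of years in formal education (a quick sketch):
```{r}
# Number of L1 participants reporting each number of years in formal education:
table(L1.data$EduTotal)
```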
This is all very well, but as we learnt from @fig-EducationYearsVocab, in reality, our predictor variable `EduTotal` does not correlate perfectly with `Vocab` scores — far from it! Let us now compare the **predicted values** of our model with the **observed values** from the data. How well does our model fit the data? To help us answer this question, @fig-ActualPredictedVocab visualises the relationship between L1 participants' actual `Vocab` scores (*x*-axis) and the scores that the model predicted for these participants (*y*-axis).
```{r}
#| code-fold: true
#| code-summary: "Show code to generate annotated plot."
#| echo: !expr "knitr::is_html_output()"
#| label: "fig-ActualPredictedVocab"
#| fig-cap: "Relationship between actual and predicted `Vocab` test scores"
#| fig-alt: "Scatter plot with predicted Vocab scores on the x-axis and actual Vocab scores on the y-axis with a dotted blue diagonal line showing where the dots would lie if the model's prediction were perfect. The model's predictions are far from perfect; few dots fall on the dotted line. For two dots, there is a blue line drawn from the dot to the dotted line; the length of these lines represent the model's residuals for these two participants."
#| fig-width: 4.5
#| out-width: 50%
#| fig-asp: 0.9
pred_vals <- predict(model1) # Vector of predictions
xmin_pred <- min(pred_vals) # Lowest predicted score
highlight_df <- data.frame(
x = c(xmin_pred, pred_vals[68]),
y = c(L1.data$Vocab[which.min(pred_vals)],
91))
ggplot(L1.data,
aes(x = pred_vals,
y = Vocab)) +
geom_point() +
# 45° reference line (the "perfect‑prediction" line):
geom_abline(intercept = 0,
slope = 1,
linetype = "dotted",
colour = "blue") +
annotate("segment",
x = xmin_pred,
y = 29,
yend = xmin_pred,
colour = "blue",
arrow = arrow(length = unit(0.25, "cm"))) +
annotate("segment",
x = pred_vals[68],
y = 91,
yend = pred_vals[68],
colour = "blue",
arrow = arrow(length = unit(0.25, "cm"))) +
geom_point(data = highlight_df,
aes(x = x, y = y),
colour = "blue") +
scale_y_continuous(limits = c(0, 100),
expand = c(0,0)) +
scale_x_continuous(limits = c(0, 100),
expand = c(0,0)) +
labs(x = "Predicted Vocab scores",
y = "Actual Vocab scores") +
theme_bw()
```
In @fig-ActualPredictedVocab, the dotted blue line represents where the points would lie, if `model1` perfectly predicted `Vocab` scores based only on the number of years that participants spent in formal education. The dots that fall on or very close to the dotted line stand for L1 participants for whom our model accurately predicts their `Vocab` score based solely on the number of years they spent in formal education. The further the dots are from the dotted line, the worse the predictions for these speakers. The distances between each point and the dotted line in @fig-ActualPredictedVocab correspond to the model's residuals. **Residuals** are therefore the "left-over" variability in the data that our model cannot account for.
Residuals can be positive or negative, depending on whether the dots on the predicted vs. actual values plot are above or below the dotted line of perfect fit. However, what matters more are their absolute values. The larger the residuals, the worse the model fit. Returning to the question, "How good is our model fit?", we can conclude from @fig-ActualPredictedVocab that it's not great. But we already knew that from the model's fairly low *R*^2^ value and, crucially, we also know that many other factors are likely to account for some of the remaining variation in vocabulary scores.
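In `R`, these residuals can be retrieved directly from a fitted model with the `residuals()` function; a brief sketch using `model1`:
```{r}
# Residuals = observed Vocab scores minus the model's predicted scores:
summary(residuals(model1))
# The single worst prediction, in absolute terms:
max(abs(residuals(model1)))
```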
::: {#nte-Causation .callout-important title="Association does not imply causation!"}
You may be familiar with the Latin phrase:
- ***Cum hoc ergo propter hoc*** ('with this, therefore because of this')
or its English equivalent:
- ***Correlation does not imply causation***.
They both refer to the human tendency to assume that, if two things are regularly associated, one probably *causes* the other. However, this is a logical fallacy. For example, even if we had observed a very high correlation between the number of years participants were in formal education and their receptive vocabulary test scores (and had therefore reported an excellent model fit with very small residuals), this would by no means imply that, on average, spending more time in formal education *leads to* greater vocabulary knowledge. A correlation merely describes an association; it tells us nothing about any causal relationship.
It is often tempting to interpret the results of statistical tests and models causally, but this is the result of a well-known human cognitive bias [see e.g. @matuteIllusionsCausalityHow2015; @kaufmanIllusionCausalityCognitive2018]. In the language sciences and across the social sciences more generally, there are usually far too many potential factors at play to warrant any causal interpretation. What we are modelling throughout this chapter and @sec-MLR are **statistical associations**. We cannot draw any conclusions about what *caused* what from these data and models. In the context of statistical modelling, we can extend the famous saying to **association does not imply causation**.
There are three main reasons for this:
1. There may be one or more **confounding variables** that we have not accounted for in our model. For instance, we can predict with a fair degree of accuracy how much vocabulary an infant understands based on their weight. However, the weight of young children is, of course, highly correlated with their age and, by extension, the amount of language exposure they've had so far in their lives. Clearly, it is not their weight that is *causing* them to understand more words. Unfortunately, it's not always that obvious. Going back to `model1`, it is very likely that some of the younger participants in the @DabrowskaExperienceAptitudeIndividual2019 dataset were still in formal education at the time of data collection; this includes all those who reported being students. In other words, for at least some of the participants, the `EduTotal` variable confounds age and years spent in formal education.
2. Even if there is a causal effect, it may not be in the **direction** that we assume. For example, the vocabulary knowledge of adult learners of a foreign language is probably a fairly good predictor of how many foreign language classes these adult learners have attended. However, this does not mean that vocabulary knowledge *causes* adults to attend classes. If anything, it is more likely to be the other way around. Whilst it seems rather obvious in this example, in linguistics and education, complex interdependencies between predictors are very common. Consider reading ability: the more a child reads, the better they become at reading. Therefore, we might conclude that spending more time reading leads to better reading skills. However, for some children, poor reading skills may contribute to a lack of motivation and interest in reading in the first place. Clearly, determining causal relationships is tricky!
3. Finally, it is important to consider the possibility that even a statistically significant association could be entirely **spurious**. This happens more often than you might think, and it is more likely to happen with smaller datasets. Moreover, the more we test associations, the more likely we are to find statistically significant ones (see @sec-pHacking on the risks of multiple testing and Type 1 errors). When associations are very strong, we may be tempted to think that they are real, but this may not be the case. To raise awareness of this risk, [Tyler Vigen](https://tylervigen.com/about-spurious-correlations) conducted correlation tests on millions of combinations of randomly selected variables and created a website featuring thousands of examples of very large and highly significant correlations that are all entirely spurious (e.g. @fig-SpuriousCorrelation).
![A spurious correlation between air pollution in Johnstown, Pennsylvania and Master's degrees awarded in education (tylervigen.com/spurious/correlation/3139)](){#fig-SpuriousCorrelation fig-alt="A linear line chart with years as the x-axis and two variables on the y-axis. The first variable is air pollution in Johnstown, Pennsylvania and the second variable is Master's degrees awarded in education. The chart goes from 2012 to 2021, and the trends in the two variables match closely over that time. The following statistics are displayed on the graph: Pearson's r = 0.989, R-squared = 0.979, p < 0.01. The source is also labelled: tylervigen.com/spurious/correlation/3139" width="70%"}
:::
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
In this task, you will seek answers to the research question: To what extent can the results of the non-verbal IQ `Blocks` test be used to predict `Vocab` scores among L1 and L2 speakers of English based on the data from @DabrowskaExperienceAptitudeIndividual2019? As we will be answering this question within a linear regression framework, what we are really asking is: To what extent are these variables linearly associated with each other?
[**Q12.1**]{style="color:green;"} Fit a linear model to predict `Vocab` scores among L1 participants based on their `Blocks` test scores. What is the adjusted *R*^2^ value of this model?
```{r}
#| echo: false
#| label: "Q12.1"
library(checkdown)
check_question("0.07",
options = c("0.0009",
"0.006",
"0.007",
"0.07",
"0.08",
"0.7",
"0.9"),
button_label = "Check answer",
q_id = "Q12.1",
type = "radio",
right = "That's right, well done! 🎉",
wrong = "No, that's not it. Are you modelling the L1 data only?")
```
```{r}
#| code-fold: true
#| code-summary: "Show sample code to answer Q12.1."
#| eval: false
taskmodel1 <- lm(formula = Vocab ~ Blocks,
data = L1.data)
summary(taskmodel1) # Adjusted *R*-squared: 0.07232
```
[**Q12.2**]{style="color:green;"} According to your model, which of these values is the predicted `Vocab` score of an L1 speaker with a `Blocks` score of 20?
```{r}
#| echo: false
#| label: "Q12.2"
check_question("About 75",
options = c("About 75",
"About 21",
"About 55",
"About 1",
"Just over 0",
"About -54"),
button_label = "Check answer",
q_id = "Q12.2",
type = "radio",
right = "Awesome! Note that we also use the predict() function to extract the model's predicted value for any combination of predictor values, e.g. `predict(taskmodel1, data.frame(Blocks = 20))`",
wrong = "Humm, are you sure?")
check_hint("You can answer this question either by plotting the predicted values of your model and reading the value off your graph or by doing some maths using the coefficient estimates provided in the summary of your model.",
hint_title = "🦉 Hover over the owl for a first hint.",
type = "onmouseover")
check_hint("If you choose to do solve this question numerically, you will need to multiply the coefficient estimate for the Blocks variable by 20 and add this to the coefficient estimate for the intercept.",
hint_title = "🐭 Click on the mouse for a second hint.")
```
```{r}
#| code-fold: true
#| code-summary: "Show code to answer Q12.2"
#| eval: false
# a) Taking a numerical approach, you can add up the model coefficients printed in the model summary. As always, start with the intercept coefficient and add to it the Blocks coefficient multiplied by the number of points that this participant achieved (20):
summary(taskmodel1)
54.5614 + 1.0527*20
# Alternatively, we can take the coefficient estimate values directly from the model object to get an even more precise prediction like this:
taskmodel1$coefficients[1] + (taskmodel1$coefficients[2] * 20)
# b) Taking a graphical approach, we need to plot the model's predicted Vocab scores against all Blocks test scores in the dataset in order to find out what the model's prediction is for a Blocks test score of 20. The ggplot code below adds an arrow and a dotted line to help you read the predicted Vocab score from the plot:
ggplot(L1.data,
aes(x = Blocks,
y = predict(taskmodel1))) +
geom_point() +
annotate("segment",
x = 20,
y = 50,
yend = 75,
colour = "blue",
arrow = arrow(length = unit(0.25, "cm"))) +
annotate("segment",
x = 0,
xend = 19.5,
y = 75.6,
colour = "blue",
linetype = "dotted") +
labs(y='Predicted Vocab scores',
x='Blocks test scores') +
theme_minimal()
```
[**Q12.3**]{style="color:green;"} Now fit a new linear model to predict `Vocab` scores among L2 participants based on their `Blocks` test scores. Comparing the adjusted *R*^2^ value of this new L2 model to your previous L1 model, which of these statements is/are true?
```{r}
#| echo: false
#| label: "Q12.3"
check_question("Blocks test scores are a much more useful predictor of Vocab scores for L1 than L2 participants.",
options = c("Blocks test scores are a much more useful predictor of Vocab scores for L1 than L2 participants.",
"Blocks test scores are a much useful predictor of Vocab scores for L2 than L1 participants.",
"Higher Blocks test scores lead to statistically significantly higher Vocab scores among L1 participants.",
"Blocks test scores allow us to almost perfectly predict Vocab scores among L1 participants.",
"Blocks test scores allow us to almost perfectly predict Vocab scores among L2 participants."),
button_label = "Check answer",
q_id = "Q12.3",
type = "check",
right = "That's right, well done! 🎉",
wrong = "No, compare the adjusted *R*-squared values of the two models.")
check_hint("Remember that we cannot interpret linear regression models causally.",
hint_title = "🦉 Hover over the owl for a first hint.",
type = "onmouseover")
check_hint("Only one of these statements is true.",
hint_title = "🐭 Click on the mouse for a second hint.")
```
```{r}
#| code-fold: true
#| code-summary: "Show sample code to answer Q12.3."
#| eval: false
taskmodel2 <- lm(formula = Vocab ~ Blocks,
data = L2.data)
summary(taskmodel2) # Adjusted *R*-squared: 0.007177
summary(taskmodel1) # Adjusted *R*-squared: 0.07232
```
[**Q12.4**]{style="color:green;"} In your L2 model, the *p*-value for the `Blocks` predictor should be `0.228620`. True or false: This *p*-value means that there is a 23% chance of obtaining a `Blocks` coefficient estimate at least as extreme as 0.72 in a random sample of 67 L2 speakers of English when there is actually no association between the results of the `Blocks` test and those of the `Vocab` test?
```{r}
#| echo: false
#| label: "Q12.4"
check_question("True.",
options = c("True.",
"False."),
type = "radio",
q_id = "Q12.4",
button_label = "Check answer",
right = "Indeed! ✅",
wrong = "No, remember that the *p*-value is statement about the likelihood of the data given the null hypothesis of no association.")
```
:::
::::
## *t*-tests as regression over a binary variable {#sec-Ttestsregression}
Let us now see how much of the variation across all receptive English vocabulary test scores in the @DabrowskaExperienceAptitudeIndividual2019 data we can accurately predict if we only include a single **categorical binary variable** predictor in our model: whether the `Vocab` scores belong to L1 or L2 speakers of English. Hence, in the following model, we keep the same outcome variable (`Vocab`), but now we try to predict it using the `Group` variable as our single predictor.
We know that, on average, L1 speakers score better on the `Vocab` test than L2 speakers. However, as illustrated in @fig-IQL1L2, we also know that there is quite a bit of variability around the mean scores of the L1 and L2 speakers.
```{r}
#| code-fold: true
#| code-summary: "See code to generate plot."
#| echo: !expr 'knitr::is_html_output()'
#| label: "fig-IQL1L2"
#| fig-cap: "Comparison of non-verbal IQ (Blocks) test scores between L1 and L2 groups (mean values highlighted in purple)"
#| fig-alt: "Boxplot comparing non-verbal IQ scores between two groups, L1 and L2. A blue dashed trend line connects the medians, showing that the median L1 score is lower than the median L2 score. Both groups display outliers at the lower end of their distributions."
#| out-width: 70%
#| fig-width: 7
group_means <- Dabrowska.data |>
group_by(Group) |>
summarise(mean_vocab = mean(Vocab))
ggplot(data = Dabrowska.data,
aes(x = Group,
y = Vocab)) +
geom_boxplot(width = 0.5) +
stat_summary(
fun = mean, # what to plot
geom = "point", # as a point
shape = 18, # in a diamond shape
size = 4, # a little larger than the default
colour = "purple") +
geom_line(
data = group_means,
aes(x = as.numeric(Group),
y = mean_vocab),
colour = "purple",
linewidth = 1,
linetype = "dashed") +
geom_text(data = group_means,
aes(x = as.numeric(Group),
y = mean_vocab,
label = sprintf("%.2f", mean_vocab)), # print the means to two decimal points
vjust = 1.5,
colour = "purple") +
labs(x = NULL,
y = "Non-verbal IQ (Blocks) test") +
theme_bw(base_size = 14)
```
A *t*-test shows that the mean difference between L1 and L2 speakers is statistically significant (*p* \< 0.001).
```{r}
t.test(Vocab ~ Group,
data = Dabrowska.data)
library(effectsize)
cohens_d(Vocab ~ Group,
data = Dabrowska.data)
```
Moreover, Cohen's *d* is large (`0.77`) and the 95% confidence interval does not go anywhere near zero (`[0.44, 1.09]`). Hence, we can be confident that the `Group` variable will be a useful predictor of `Vocab` scores in a simple linear regression model. To fit this model, we use the same formula syntax as earlier:
```{r}
model2 <- lm(formula = Vocab ~ Group,
data = Dabrowska.data)
summary(model2)
```
Let's break down the model summary:
- The **intercept coefficient estimate** is the predicted `Vocab` score when `Group` is at its reference level. By default, in `R`, the reference level of a categorical variable is the level that comes first alphabetically. Here, it's therefore 'L1' for the English native speaker group. The model's estimated `Vocab` test score for an L1 participant is therefore `69.136` points. If you check @fig-IQL1L2, you will see that this value corresponds to the mean L1 `Vocab` score in our data.
- The **`GroupL2` coefficient estimate** is the estimated change in `Vocab` score for an L2 speaker compared to the reference level of an L1 speaker. The estimate is `-16.333`, which means that L2 speakers are predicted to score 16.333 points *lower* than L1 speakers on this test. If you add this (negative) coefficient estimate to the `Intercept`, you will find that the result corresponds to the mean L2 `Vocab` score in our data:
```{r}
69.136 - 16.333
```
- The ***p*****-value** associated with the coefficient `GroupL2` is extremely small (`4.64e-06`). It informs us that, in this model, `Group` is a statistically significant predictor of receptive English vocabulary knowledge.
- The **adjusted *R*-squared** value indicates the proportion of variation in `Vocab` scores that the model can account for when we know a participant's native-speaker status: 12.13% (`0.1213`). This value might not seem particularly impressive. However, this is not surprising given that our model attempts to predict vocabulary scores based solely on whether someone is an L1 or L2 speaker of English. With only this information, the model can only predict that all L1 speakers will achieve the mean L1 score and all L2 speakers the mean L2 score of the sample data.
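A side note on reference levels: if we wanted the intercept to correspond to the mean L2 score instead, we could change the reference level of `Group` before fitting the model. A minimal sketch, assuming that `Group` is stored as a factor:
```{r}
# Refit the model with "L2" as the reference level of Group:
releveled.data <- Dabrowska.data |>
  mutate(Group = relevel(factor(Group), ref = "L2"))
model2b <- lm(formula = Vocab ~ Group, data = releveled.data)
coef(model2b) # the intercept is now the mean L2 Vocab score
```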
::: {.callout-note collapse="false"}
# It's (still) all linear regression! `r emoji("exploding_head")`
Here, too, it is worth noticing that fitting a simple linear regression model with a single binary predictor is under the hood exactly the same thing as conducting an **independent two-sample *t*-test**. Compare the *t*-statistic of the *t*-test that we computed earlier with that of the *t*-statistic of the `GroupL2` coefficient reported in the summary of our second model. If we ignore the signs of the *t*-values, we can see that they are almost identical!
Also, notice how the *p*-value computed by the `t.test()` function and that reported for our `GroupL2` coefficient are almost identical. They both correspond to values that are extremely close to zero.
The very small differences between these values are due to the default correction that the `t.test()` function in `R` makes for unequal group variances (see @sec-Variability). If we switch off this correction, both the *t*-statistic and the *p*-value are exactly the same as those reported in the summary of our second linear model above:
```{r}
t.test(Vocab ~ Group,
data = Dabrowska.data,
var.equal = TRUE)
```
:::
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q12.5**]{style="color:green;"} Draw a boxplot showing the distribution of `Vocab` scores among male and female participants in `Dabrowska.data`. Based on your boxplot, do you expect `Gender` to be a statistically significant predictor of `Vocab` scores?
```{r}
#| echo: false
#| label: "Q12.5"
check_question("No, because there is a lot of overlap between the two distributions and the female and male median Vocab scores are very close to each other.",
options = c("No, because there is a lot of overlap between the two distributions and the female and male median Vocab scores are very close to each other.",
"No, because the female and male mean Vocab scores are very close to each other.",
"Yes, because the mean Vocab score of male participants is higher than that of female participants.",
"Yes, because there are a lot of very low Vocab scores among female participants.",
"Yes, because there are significantly more female participants in this dataset."),
type = "radio",
q_id = "Q12.5",
button_label = "Check answer",
right = "You're right.",
wrong = "No, that's not it. Remember that the middle line in a boxplot represents the median.")
```
```{r}
#| code-fold: true
#| code-summary: "Show sample code to answer Q12.5."
#| eval: false
Dabrowska.data |>
ggplot(mapping = aes(x = Gender,
y = Vocab)) +
geom_boxplot() +
labs(x = "Gender",
y = "Vocab scores") +
theme_minimal()
```
[**Q12.6**]{style="color:green;"} Fit a model with `Vocab` as the outcome variable and `Gender` as the predictor to test your intuition based on your boxplot. Is `Gender` a statistically significant predictor of `Vocab` score in this simple linear regression model?
```{r}
#| echo: false
#| label: "Q12.6"
check_question("No, the p-value associated with GenderM coefficient is higher than 0.05.",
options = c("No, the p-value associated with GenderM coefficient is higher than 0.05.",
"Yes, the p-value associated with GenderM coefficient is higher than 0.05.",
"No, the p-value associated with GenderM coefficient is lower than 0.05.",
"Yes, the p-value associated with GenderM coefficient is lower than 0.05."),
type = "radio",
q_id = "Q12.6",
button_label = "Check answer",
right = "That's right! ✅",
wrong = "Are you sure? You should have obtained a p-value of 0.957.")
check_hint("You will need to fit a linear model with `Vocab` as the outcome variable and `Gender` as the predictor.",
hint_title = "🐭 Click on the mouse for a hint.")
```
```{r}
#| code-fold: true
#| code-summary: "Show code to answer Q12.6"
#| eval: false
taskmodel3 <- lm(formula = Vocab ~ Gender,
data = Dabrowska.data)
summary(taskmodel3) # p-value associated with coefficient estimate for GenderM = 0.957
```
[**Q12.7**]{style="color:green;"} The adjusted *R*^2^ coefficient of the model is `-0.006433`. True or false: This means that male participants, on average, score 0.006433 fewer points than female participants on the `Vocab` test?
```{r}
#| echo: false
#| label: "Q12.7"
check_question("False.",
options = c("True.",
"False."),
type = "radio",
q_id = "Q12.7",
button_label = "Check answer",
right = "You are correct. This statement is completely wrong: the model's *R*-squared coefficients tell us how much of the variance in Vocab scores our model accounts for. The values of the multiple *R*-squared are always between 0 and 1: a value of 0 means that the model cannot predict anything using these predictors. A value of 1 means that it can make perfect predictions. Here, the multiple *R*-squared coefficient is almost 0 (0.0000187). In this model, the multiple *R*-squared coefficient is so low that, once adjusted, it falls slightly below zero. The adjustment made is based on the number of predictors (or, in the case of categorical predictors, the number of levels in each predictor) that are entered into the model.",
wrong = "No, that's incorrect.")
```
:::
::::
## Regressing over a categorical predictor with more than two levels {#sec-Categoricalpredictor}
The beauty of the statistical modelling approach -- as opposed to the testing approach -- is that we can include all kinds of predictors in our models using a single function, `lm()`, and `R`'s formula syntax. So far, we have seen that predictors can be numeric variables (e.g. `EduTotal`) and categorical binary variables (e.g. `Group`). In this section, we see how a categorical variable with more than two levels can be entered in a linear regression model.
To demonstrate this, we will now attempt to predict participants' `Vocab` scores based on their occupational group (`OccupGroup`) which, in this dataset, can be one of four broad categories (see @sec-geoms):
> **C**: Clerical positions\
> **I**: Occupationally inactive (i.e. unemployed, retired, or homemakers)\
> **M**: Manual jobs\
> **PS**: Professional-level jobs or studying for a degree
According to the descriptive statistics visualised in @fig-VocabByOccupation, it looks like participants' professional occupation group is unlikely to be terribly useful to predict their vocabulary knowledge. Nonetheless, we can see that -- perhaps somewhat counter-intuitively -- so-called "occupationally inactive" participants ("I") tend to score higher than the other participants.
```{r}
#| code-fold: true
#| code-summary: "See code to generate plot."
#| echo: !expr 'knitr::is_html_output()'
#| label: "fig-VocabByOccupation"
#| fig-cap: "Boxplots comparing the distribution of vocabulary test scores across the four occupational groups with mean values highlighted as diamonds"
#| fig-alt: "Boxplot showing predicted vocabulary scores across four occupational groups (C, I, M, and PS). Individual data points are overlaid as dots on each boxplot. All groups show substantial variability and some extreme outliers at the lower end of the distribution."
#| out-width: 80%
OccupGroup_means <- Dabrowska.data |>
group_by(OccupGroup) |>
summarise(mean_vocab = mean(Vocab))
Dabrowska.data |>
ggplot(mapping =
aes(y = Vocab,
x = OccupGroup,
colour = OccupGroup)) +
geom_boxplot() +
stat_summary(
fun = mean, # plotting the means
geom = "point", # as a point
shape = 18, # in a diamond shape
size = 3) +
geom_text(data = OccupGroup_means,
aes(x = as.numeric(OccupGroup),
y = mean_vocab,
label = sprintf("%.2f", mean_vocab)), # print the means to two decimal points
vjust = 1.5) +
scale_color_viridis_d(option = "H", guide = "none") +
labs(x = "Participants' occupational group", y = "English vocabulary test scores") +
theme_bw()
```
We now fit a simple linear model to predict `Vocab` scores using participants' `OccupGroup` as our predictor variable and examine the model summary:
```{r}
model3 <- lm(formula = Vocab ~ OccupGroup,
data = Dabrowska.data)
summary(model3)
```
- The estimate of the **intercept** corresponds to the reference level of the `OccupGroup` variable. Remember that the **reference level** is always the first level. We can check the order of the levels in any factor variable using the `levels()` function:
```{r}
levels(Dabrowska.data$OccupGroup)
```
By default, the levels are ordered alphabetically. In the `OccupGroup` variable, the first level is "C", which corresponds to a clerical position. This means that our third model predicts that English speakers in a clerical position will score `63.333` points on the `Vocab` test. Those who belong to the "occupationally inactive" group ("I") are predicted to score `12.393` points more than those in clerical positions, i.e.:
```{r}
63.333 + 12.393
```
Participants who have manual jobs ("M") are predicted to score 9.133 points *fewer* than those in clerical positions, and so-called "professionals" ("PS") are predicted to score slightly lower than the "C" reference level (by 2.261 points). Compare these predicted values to those observed in the data shown in @fig-VocabByOccupation.
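Using the coefficient estimates from the model summary, a quick check of these two predictions:
```{r}
63.333 - 9.133 # predicted Vocab score for the "M" group
63.333 - 2.261 # predicted Vocab score for the "PS" group
```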
- The **adjusted *R*^2^ value** for our model is relatively low (`0.07509`). This means that our model accounts for about 7.5% of the total variance in `Vocab` scores across the dataset. Given that the adjusted *R*^2^ of `model2` was `0.1213`, this indicates that `OccupGroup` is a considerably less useful predictor variable for predicting participants' `Vocab` scores than whether or not they are an L1 speaker (`Group`).
- What's more, the model summary reveals that only one of the three coefficient estimates is statistically significantly different from 0 at the $\alpha$-level of 0.05: `OccupGroupI` with a ***p*****-value** of `0.0335`. This means that we can reject the null hypothesis that there is no difference in `Vocab` scores between speakers of the occupational group "C" (the reference level) and those of the occupational group "I". For the other two occupational groups, we do not have enough evidence to reject the null hypothesis of no difference compared to the reference level "C".
We can display the predicted `Vocab` scores for each occupational group, together with a 95% confidence interval around these predicted values using the `emmeans()` function from the [{emmeans}](https://rvlenth.github.io/emmeans/index.html) package [@lenthEmmeansEstimatedMarginal2025a]:
```{r}
#install.packages("emmeans")
library(emmeans)
emmeans(model3, ~ OccupGroup)
```
```{r}
#| eval: false
#| include: false
library(marginaleffects)
estimate_means(model3, "OccupGroup")
```
These estimated means are informative, but model estimates are best visualised — ideally together with a visual depiction of the model accuracy. The [{visreg}](https://pbreheny.github.io/visreg/index.html) package [@brehenyVisualizationRegressionModels2017] provides an efficient way of visualising model predictions and residuals (see @sec-PredictedResiduals). In @fig-PredictedVocabByOccupation we visualise the model's predicted values (as blue lines) and the confidence intervals output by the `emmeans()` function (as grey bands) together with the model's residuals (as grey dots).
```{r}
#| label: "fig-PredictedVocabByOccupation"
#| fig-cap: "Distribution of predicted vocabulary test scores across four occupational groups"
#| fig-alt: "Boxplot showing predicted vocabulary scores across four occupational groups (C, I, M, and PS). Individual residual points are shown as gray dots overlaid on gray boxplots with blue median lines."
#| out-width: 80%
#install.packages("visreg")
library(visreg)
visreg(model3,
gg = TRUE) + # <1>
labs(x = "Occupational groups",
y = "Predicted Vocab scores") +
theme_bw()
```
1. With the argument `gg = TRUE` the `visreg()` function outputs a ggplot object that we can then manipulate and customise just like any other ggplot object, e.g. with `theme_bw()` (see @sec-DataViz).
```{r}
#| eval: false
#| include: false
# Alternative plot using the marginaleffects package instead of visreg.
library(marginaleffects)
library(modelbased)
estimate_means(model3, "OccupGroup") |>
plot(show_residuals = TRUE) +
labs(x = "Occupational groups",
y = "Predicted Vocab scores") +
theme_bw()
```
::: {.callout-note collapse="false"}
#### What about ANOVAs? `r emoji("thinking")` {.unnumbered}
The type of simple linear regression model that we have just covered, involving a continuous numeric outcome variable and a single categorical predictor with more than two levels, also comes under the guise of a statistical significance test, namely the **one-way ANOVA** (ANalysis Of VAriance).
If you compare the summary of the following one-way ANOVA with that of the third model (`model3`), you will spot some similarities:
```{r}
anova1 <- aov(formula = Vocab ~ OccupGroup,
data = Dabrowska.data)
summary(anova1)
```
However, it is necessary to carry out so-called post-hoc tests to get as much information out of an ANOVA as we can get from the summary of a linear regression model, which is one of the (many) reasons why a modelling approach is better than a testing one.
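For example, Tukey's Honest Significant Differences test is a commonly used post-hoc test for one-way ANOVAs; a brief sketch applying it to `anova1` from above:
```{r}
# Pairwise comparisons of mean Vocab scores between all pairs of
# occupational groups, with p-values adjusted for multiple testing:
TukeyHSD(anova1)
```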
:::
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
In this task, you will explore whether L2 participants' native language is a useful predictor when trying to model their English `Vocab` scores.
[**Q12.8**]{style="color:green;"} Fit a linear model to model L2 participants' `Vocab` score based on their native language (`NativeLg`). According to the model's adjusted *R*^2^ coefficient, how much of the variance in `Vocab` scores among L2 participants does this model account for?
```{r}
#| echo: false
#| label: "Q12.8"
check_question("About 13%",
options = c("Practically 0%",
"About 5%",
"About 13%",
"About 26%",
"About 49%"),
type = "radio",
q_id = "Q12.8",
button_label = "Check answer",
right = "That's right! ✅",
wrong = "No, this is not the correct value.")
```
```{r}
#| code-fold: true
#| code-summary: "Show sample code to answer Q12.8."
#| eval: false
taskmodel4 <- lm(formula = Vocab ~ NativeLg,
                 data = L2.data)
summary(taskmodel4) # Adjusted *R*-squared: 0.1331
```
[**Q12.9**]{style="color:green;"} In this model, there is a greater difference between the multiple *R*^2^ and the adjusted *R*^2^ coefficients than in all previous models in this chapter. Why might that be?
```{r}
#| echo: false
#| label: "Q12.9"
check_question("Because the NativeLg variable includes 10 different levels.",
options = c("Because knowing a participant's native language is a particularly useful predictor of Vocab scores.",
"Because the NativeLg variable includes 10 different levels.",
"Because a participant's native language is inherently connected to their English vocabulary knowledge.",
"Because the model's p-value is only just below 0.05."),
type = "check",
q_id = "Q12.9",
button_label = "Check answer",
right = "That's right, given that the dataset only includes 67 L2 participants, 10 different native languages is a lot! There is a strong risk that the model may overfit the data. The adjusted *R*-squared coefficient accounts for this by drastically reducing its estimation of the amount of variance in Vocab scores that can genuinely be accounted for by a model that only includes participants' native language.",
wrong = "No, that's not the reason.")
check_hint("Only one of these reasons is correct.", hint_title = "🐭 Click on the mouse for a hint.")
```
[**Q12.10**]{style="color:green;"} One way to reduce the number of levels in the `NativeLg` variable is to model `Vocab` scores based on L2 participants' native language family, instead. Fit a model that attempts to predict `Vocab` scores among L2 participants based on the `NativeLgFamily` variable (the creation of this variable was a [Your turn!]{style="color:green;"} task in @sec-casewhen). Based on your comparison of the two models, which of the following statements is/are true?
```{r}
#| echo: false
#| label: "Q12.10"
check_question(c("When adjusted for the number of predictors, the language family model is worse at predicting Vocab scores.",
"The more predictors there are, the easier it is to predict the Vocab scores of the sample. This is why the language family model has a lower multiple *R*-squared coefficient"),
options = c("When adjusted for the number of predictors, the language family model is better at predicting Vocab scores.",
"When adjusted for the number of predictors, the language family model is worse at predicting Vocab scores.",
"The more predictors there are, the easier it is to predict the Vocab scores of the sample. This is why the language family model has a lower multiple *R*-squared coefficient.",
"The language family model is statistically less significant than the other model because it features fewer predictors."),
type = "check",
q_id = "Q12.10",
button_label = "Check answer",
right = "Yes, well done! 🎉",
wrong = "No, not quite.")
```
```{r}
#| code-fold: true
#| code-summary: "Show code to answer Q.12.10"
#| eval: false
taskmodel5 <- lm(formula = Vocab ~ NativeLgFamily,
                 data = L2.data)
summary(taskmodel5)
```
[**Q12.11**]{style="color:green;"} According to your `NativeLgFamily` model, what is the predicted `Vocab` score of an L2 participant with a Baltic L1?