---
engine: knitr
bibliography: references.bib
code-annotations: hover
format:
html:
df-print: default
---
{{< include emoji_script.md >}}
# Va`R`iables and functions {#sec-VaRiablesAndFunctions}
```{r include=FALSE}
library(here)
library(tidyverse)
L1.data <- read.csv(file = here("data", "L1_data.csv"))
L2.data <- read.csv(file = here("data", "L2_data.csv"))
```
#### **Chapter overview** {.unnumbered}
In this chapter, you will learn how to:
- Use base `R` functions to inspect a dataset
- Inspect and access individual variables from a dataset
- Access individual data points from a dataset
- Use simple base `R` functions to describe variables
- Look up and change the default arguments of functions
- Combine functions using two methods
::: {.callout-warning collapse="false"}
### Prerequisites
In this chapter and the following chapters, all analyses are based on data from @DabrowskaExperienceAptitudeIndividual2019. You will only be able to reproduce the analyses and answer the quiz questions if you have created an RProject and saved the data within the project directory. Detailed instructions to do so can be found from @sec-RProject to @sec-ImportingDataCSV.
Alternatively, you can download `Dabrowska2019.zip` from [the textbook's GitHub repository](https://github.com/elenlefoll/RstatsTextbook/raw/69d1e31be7394f2b612825f031ebffeb75886390/Dabrowska2019.zip){.uri}. To launch the project correctly, first unzip the file and then double-click on the `Dabrowska2019.Rproj` file.
Before we get started, import both the L1 and the L2 datasets to your local `R` environment:
```{r}
#| eval: false
library(here)
L1.data <- read.csv(file = here("data", "L1_data.csv"))
L2.data <- read.csv(file = here("data", "L2_data.csv"))
```
Check the Environment pane in RStudio to ensure that everything has gone to plan (see @fig-DataLoaded).
{#fig-DataLoaded fig-alt="The environment pane contains two rows under the label Data: the L1.data object with 90 obs. of 31 variables and the L2.data object with 67 obs. of 45 variables" width="80%"}
:::
## Inspecting a dataset in `R` {#sec-InspectingData}
In @sec-ImportingErrors, we saw that we can use the `View()` function to display tabular data in a format that resembles that of a spreadsheet programme (see @fig-ViewData1).
The two datasets from @DabrowskaExperienceAptitudeIndividual2019 are both long and wide so you will need to scroll in both directions to view all the data. *RStudio* also provides a filter option and a search tool (see @fig-ViewData1). Note that both of these tools can only be used to visually inspect the data. You cannot alter the dataset in any way using these tools. And that's a good thing because any changes that we make should be documented in code (as we will learn in @sec-DataWrangling).
```{r eval=FALSE}
View(L1.data)
```
{#fig-ViewData1 fig-alt="RStudio tab showing the first 22 rows and 7 columns of the L1.dataset. The filter and search buttons at the top of the tab are circled in red." width="80%"}
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q7.1**]{style="color:green;"} The `View()` function is more user-friendly than attempting to examine the full table in the Console. Try to display the full L2 dataset in the Console by using the command `L2.data`, which is shorthand for `print(L2.data)`. What happens?
```{r}
#| echo: false
#| label: "Q7.1"
library(checkdown)
check_question("R only displays the first 22 rows and the columns are not aligned because the Console window is not wide enough.",
options = c("R only displays the first 22 rows and the columns are not aligned because the Console window is not wide enough.",
"R produces an error message because there are too many rows and the Console window is not long enough.",
"The R Console prints the data in a randomly jumbled way.",
"The R Console prints out all the data, but not the column headers."),
type = "radio",
button_label = "Check answer",
q_id = "Q7.1",
random_answer_order = TRUE,
right = "That's right. The printed message \"[ reached 'max' / getOption(\"max.print\") -- omitted 45 rows ]\" is not an error, but merely a message informing us that not all the rows could be printed.",
wrong = "No, that's not what is happening. Try scrolling up within the Console window to better understand what is going on. Can you find the column headers? Check the hint below if you're stuck.")
check_hint("What happens when you increase or decrease the width of the Console pane in *RStudio* and then try to print the table again?",
hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::
In practice, it is often useful to print subsets of a dataset in the Console to quickly check the sanity of the data. To do so, we can use the function `head()`, which prints the first six rows of a tabular dataset.
```{r}
#| eval: !expr 'knitr::is_html_output()'
head(L1.data)
```
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q7.2**]{style="color:green;"} Six is the default number of rows printed by the `head()` function. Have a look at the function's help file using the command `?head` to find out how to change this default setting. How would you get `R` to print the first 10 lines of `L2.data`?
```{r}
#| echo: false
#| label: "Q7.2"
check_question(c("head(L2.data, n = 10)", "head(L2.data, n = 10L)", "head(L2.data, 10)"),
options = c("head(L2.data, n = 10)",
"head(L2.data n = 10L)",
"head(L2.data, n = 10L)",
"head(L2.data, L = 10)",
"head(L2.data, rows = 10)",
"head(L2.data, 10)"),
type = "radio",
button_label = "Check answer",
q_id = "Q7.2",
random_answer_order = TRUE,
right = "That's right! In fact, three of the above options will work. Try them out to work out which three they are!",
wrong = "Hmm, have you tried whether this really does display the first 10 rows of the dataset? Is the name of the argument that controls the number of rows to be printed correct? Are the arguments within the command separated by a comma?")
check_hint("You can find the correct command argument in the help file of the `head()` function. Remember that you can access a function's help file using the commands `?` or `help`.", hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::
## Working with variables {#sec-Variables}
### Types of variables
In statistics, we differentiate between **numeric** (or **quantitative**) (see @fig-HorstNumeric) and **categorical** (or **qualitative**) (see @fig-HorstCategorical) variables. Each variable type can be subdivided into different subtypes. It is very important to understand the differences between these types of data as we frequently have to use different statistics and visualisations depending on the type(s) of variable(s) that we are dealing with.
Some numeric variables are **continuous**: they contain measured data that, at least theoretically, can have an infinite number of values within a range (e.g. time). In practice, however, the number of possible values depends on the precision of the measurement (e.g. are we measuring time in years, as in the age of adults, or in milliseconds, as in participants' reaction times in a linguistic experiment?). Numeric variables for which only a defined set of values is possible are called **discrete** variables (e.g. number of occurrences of a word in a corpus). Most often, discrete numeric variables represent counts of something.
)](images/AHorst_NumericVariables.png){#fig-HorstNumeric fig-alt="Cartoon comparison of continuous versus discrete data. On the left: \"Continuous - measured data, can have infinite values within possible range.\" Below is an illustration of a chick, with text \"I am 3.1\" tall, I weigh 34.16 grams.\" On the right: \"Discrete - observations can only exist at limited values, often counts.\" Below is an illustration of an octopus with text \"I have 8 legs and 4 spots!\"" width="50%"}
Categorical variables can be **nominal** or **ordinal**. Nominal variables contain unordered categorical values (e.g. participants' mother tongue or nationality), whereas ordinal variables have categorical values that can be ordered meaningfully (e.g. participants' proficiency in a specific language where the values *beginner*, *intermediate* and *advanced* or *A1*, *A2*, *B1*, *B2*, *C1* and *C2* have a meaningful order). However, the difference between adjacent categories (or levels) of an ordinal variable is not necessarily equal. **Binary** variables are a special case of nominal variables: they have only two mutually exclusive outcomes (e.g. *true* or *false* in a quiz question).
)](images/AHorst_CategoricalVariables.png){#fig-HorstCategorical fig-alt="Visual representations of nominal, ordinal, and binary variables. Left: Nominal (unordered descriptions) with illustrations below of a turtle, snail, and butterfly. Center: Ordinal (ordered descriptions) with illustrations below of three bees - one looks unhappy (saying \"I am unhappy\"), one looks ok (saying \"I am OK\"), and one looks very happy (saying \"I am awesome!\"). Right: Binary (only 2 mutually exclusive outcomes), with below a T-rex saying \"I am extinct\" and a shark saying \"HA.\"" width="80%"}
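To make these distinctions more concrete, the following sketch shows one way in which each type of variable might be stored in `R`. All object names and values are invented for illustration and are not taken from the study data; the use of an ordered `factor()` for the ordinal variable is just one possible choice.

```{r}
# Toy values invented for illustration (not from the study data)
reaction.time <- c(512.4, 498.7, 603.1)             # continuous: measured, decimals possible
word.count <- c(3L, 0L, 12L)                        # discrete: whole-number counts
native.language <- c("Polish", "German", "Polish")  # nominal: unordered categories
proficiency <- factor(c("B1", "C1", "A2"),
                      levels = c("A1", "A2", "B1", "B2", "C1", "C2"),
                      ordered = TRUE)               # ordinal: categories with a meaningful order
answered.correctly <- c(TRUE, FALSE, TRUE)          # binary: two mutually exclusive outcomes

# Check how R stores each of these objects
class(reaction.time)
class(word.count)
class(native.language)
class(proficiency)
class(answered.correctly)

# Because the levels are ordered, they can be compared: is B1 higher than A2?
proficiency[1] > proficiency[3]
```

Note that `R` does not know which statistical type a variable has: it is up to us to choose a suitable way of storing it.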
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q7.3**]{style="color:green;"} Which type of variable is stored in the `Occupation` column in `L1.data`?
```{r}
#| echo: false
#| label: "Q7.3"
check_question("Nominal",
options = c("Nominal",
"Binary",
"Continuous",
"Discrete",
"Other"),
button_label = "Check answer",
q_id = "Q7.3",
random_answer_order = TRUE,
type = "radio",
right = "Correct. These job categories cannot be meaningfully ordered.",
wrong = "No, check the definitions above.")
```
[**Q7.4**]{style="color:green;"} Which type of variable is stored in the `Gender` column in `L1.data`?
```{r}
#| echo: false
#| label: "Q7.4"
check_question("Binary",
options = c("Nominal",
"Binary",
"Continuous",
"Discrete",
"Other"),
button_label = "Check answer",
q_id = "Q7.4",
random_answer_order = TRUE,
type = "radio",
right = "That's right. Even though we know that human gender is not actually binary, in this dataset, the variable Gender is binary because all of the participants either identified as female (F) or male (M).",
wrong = "No, check the definitions above.")
```
[**Q7.5**]{style="color:green;"} Which type of variable is stored in the column `VocabR` in `L1.data`?
```{r}
#| echo: false
#| label: "Q7.5"
check_question("Discrete",
options = c("Nominal",
"Binary",
"Continuous",
"Discrete",
"Other"),
button_label = "Check answer",
q_id = "Q7.5",
random_answer_order = TRUE,
type = "radio",
right = "That's right. These are participants' scores on a vocabulary test. The more items they got right, the more points they got.",
wrong = "No, re-read the definitions above.")
```
:::
::::
### Inspecting variables in `R`
In **tidy data** tabular formats (see Chapter 8), each row corresponds to one observation and each column to a variable. Each cell, therefore, corresponds to a single data point, which is the value of a specific variable (column) for a specific observation (row). As we will see in the following chapters, this data structure allows for efficient and intuitive data manipulation, analysis, and visualisation.
The `names()` function returns the names of all of the columns of a data frame. Given that the datasets from @DabrowskaExperienceAptitudeIndividual2019 are 'tidy', this means that `names(L1.data)` returns a list of all the column names in the L1 dataset.
```{r}
names(L1.data)
```
### `R` data types {#sec-DataTypes}
A useful way to get a quick and informative overview of a large dataset is to use the function `str()`, which was mentioned in @sec-ImportingErrors. It returns the "internal structure" of any `R` object. It is particularly useful for large tables with many columns.
```{r}
str(L1.data)
```
At the top of its output, the command `str(L1.data)` first informs us that `L1.data` is a **data frame** object, consisting of 90 observations (i.e. rows) and 31 variables (i.e. columns). Then, it returns a list of all of the variables included in this data frame. Each line starts with a `$` sign and corresponds to one column. First, the name of the column (e.g. `Occupation`) is printed, followed by the column's `R` data type (e.g. `chr` for a character string vector), and then its values for the first few rows of the table (e.g. we can see that the first participant in this dataset was a "Student" and the second a "Student/Support Worker").
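If you would like to see how `str()` and `names()` behave on something much smaller, here is a minimal sketch using a made-up data frame. The object `toy.data` and all of its values are hypothetical and serve only to illustrate the output format.

```{r}
# A tiny, invented data frame with three rows and three columns
toy.data <- data.frame(
  Participant = c("P1", "P2", "P3"),  # character string vector (chr)
  Age = c(21L, 38L, 55L),             # integer vector (int)
  Score = c(12.5, 9, 14.25),          # numeric vector with decimals (num)
  stringsAsFactors = FALSE            # keep text as character strings in all R versions
)

str(toy.data)    # one line per column: name, data type, first values
names(toy.data)  # column names only
nrow(toy.data)   # number of observations (rows)
ncol(toy.data)   # number of variables (columns)
```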
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
Compare the outputs of the `str()` and `head()` functions in the Console with that of the `View()` function to understand the different ways in which the same dataset can be examined in RStudio.
[**Q7.6**]{style="color:green;"} Use the `str()` function to examine the internal structure of the L2 dataset. How many columns are there in the L2 dataset?
```{r}
#| echo: false
#| label: "Q7.6"
check_question("45",
button_label = "Check answer",
q_id = "Q7.6",
random_answer_order = TRUE,
right = "That's right, well done!",
wrong = "No, check the first line of the output of `str(L2.data)`.")
```
[**Q7.7**]{style="color:green;"} Which of these columns can be found in the L2 dataset, but not the L1 one?
```{r}
#| echo: false
#| label: "Q7.7"
check_question(c("NativeLg", "FirstExp", "Arrival", "EngWork"),
options = c("NativeLg", "FirstExp", "Arrival", "EngWork", "EduYrs", "OtherLgs"),
type = "check",
button_label = "Check answer",
q_id = "Q7.7",
random_answer_order = TRUE,
right = "That's right! And there are more columns that are specific to the L2 dataset but which are not listed here.",
wrong = "Not quite. Compare the outputs of `names(L1.data)` and `names(L2.data)` again.")
check_hint("Try using the `str()` or the `names()` functions to compare the two datasets.", hint_title = "🐭 Click on the mouse for a hint.")
```
[**Q7.8**]{style="color:green;"} Which type of `R` object is the variable `Arrival` stored as?
```{r}
#| echo: false
#| label: "Q7.8"
check_question("integer",
options = c("integer",
"string character",
"digit",
"index",
"interest",
"intelligence"),
type = "check",
button_label = "Check answer",
q_id = "Q7.8",
random_answer_order = TRUE,
right = "That's right! `int` is a subset of the numeric data type. It stands for 'integer' which is a whole number without a decimal place.",
wrong = "No, that's not it.")
check_hint("You can find the answer in the output of the command `str(L2.data)`", hint_title = "🐭 Click on the mouse for a hint.")
```
[**Q7.9**]{style="color:green;"} How old was the third participant listed in the L2 dataset when they first moved to an English-speaking country?
```{r}
#| echo: false
#| label: "Q7.9"
check_question("19",
button_label = "Check answer",
q_id = "Q7.9",
right = "Yes, well done!",
wrong = "No. Have a look at the third number printed after `$ Arrival` in the output of the command `str(L2.data)`")
check_hint("You can find the answer in the output of the command `str(L2.data)`", hint_title = "🐭 Click on the mouse for a hint.")
```
[**Q7.10**]{style="color:green;"} In both datasets, the column `Participant` contains anonymised participant IDs. Why is the variable `Participant` stored as a character string vector in `L1.data`, but as an integer vector in `L2.data`?
```{r}
#| echo: false
#| label: "Q7.10"
check_question("Because some of the L1 participants' IDs contain letters as well as numbers.",
options = c("Because some of the L1 participants' IDs contain letters as well as numbers.",
"Because there are more L1 participants than L2 participants.",
"Because the L1 participants' IDs only contain whole numbers with no decimal points.",
"Because the L1 participants' IDs are written out as words rather than digits."),
type = "check",
button_label = "Check answer",
q_id = "Q7.10",
random_answer_order = TRUE,
right = "That's right! 🎉",
wrong = "No, that's not it. Have you had a look at the full set of participant IDs in the L1 dataset?")
check_hint("Print the full list of values in both columns using the commands `L1.data$Participant` and `L2.data$Participant` to work it out.",
hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::
### Accessing individual columns in `R` {#sec-DollarSign}
We can call up individual columns within a data frame using the `$` operator. This displays all of the participants' values for this one variable. As shown below, this works for any type of data.
```{r}
L1.data$Gender
L1.data$Age
```
Before doing any data analysis, it is crucial to carefully visually examine the data to spot any problems. Ask yourself:
- Do the values look plausible?
- Are there any missing values?
Looking at the `Gender` and `Age` variables, we can see that the L1 participants declared being either 'male' (`"M"`) or 'female' (`"F"`). We note that the youngest were 17 years old, which seems reasonable and we also check that no participant was improbably old. A single improbable value is likely to be the result of a data entry error, e.g. a participant or researcher accidentally entered `188` as an age, instead of `18`. If you spot a whole string of improbable or outright impossible values (e.g. `C`, `I` and `PS` as age values!), something has likely gone wrong during the data import process (see @sec-ImportingErrors).
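For longer variables, visual inspection can be complemented with a few quick checks in code. The following is a minimal sketch on an invented vector, `toy.age`, which contains one data entry error and one missing value; the cut-off of 100 years is an arbitrary assumption chosen for this illustration.

```{r}
# Invented ages: 188 is an implausible value and NA is a missing value
toy.age <- c(18, 25, 188, NA, 42)

range(toy.age, na.rm = TRUE)  # smallest and largest values, ignoring missing values
sum(is.na(toy.age))           # number of missing values
which(toy.age > 100)          # position(s) of implausibly high values
```

Checks like these do not replace looking at the data, but they make it harder to overlook a single odd value in a long column.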
Just like we can save individual numbers and words as `R` objects to our `R` environment, we can also save individual columns as individual `R` objects. As we saw in @sec-WorkingRObjects, in this case, the values of the variable are not printed in the Console, but rather saved to our `R` environment.
```{r}
L1.Occupation <- L1.data$Occupation
```
If we want to display the content of this variable, we must print our new `R` object by calling it up with its name, e.g. `L1.Occupation`. Try it out! As listing all of the L1 participants' jobs makes for a very long list, below, we only display the first six values using the `head()` function.
```{r}
head(L1.Occupation)
```
## Accessing individual data points in `R` {#sec-SquareBrackets}
We can also access individual data points from a variable using the index operator: the square brackets (`[]`). For example, we can access the `Occupation` value for the fourth L1 participant by specifying that we only want the fourth element of the `R` object `L1.Occupation`.
```{r}
L1.Occupation[4]
```
We can also do this from the `L1.data` data frame object directly. To this end, we use a combination of the `$` and the `[]` operators.
```{r}
L1.data$Occupation[4]
```
We can access a consecutive range of data points using the `:` operator.
```{r}
L1.data$Occupation[10:15]
```
Or, if they are not consecutive, we can list the numbers of the values that we are interested in using the combine function (`c()`), with commas separating the index values.
```{r}
L1.data$Occupation[c(11,13,29,90)]
```
It is also possible to access data points from a tabular `R` object by specifying both the number of the row and the number of the column of the relevant data point(s) using the following pattern: `[row,column]`.
For example, given that we know that `Occupation` is stored in the fourth column of `L1.data`, we can find out the occupation of the L1 participant in the 60^th^ row of the dataset like this:
```{r}
L1.data[60,4]
```
All of these approaches can be combined. For example, below we access the values of the second, third, and fourth columns for the 11^th^, 13^th^, 29^th^, and 90^th^ L1 participants.
```{r}
L1.data[c(11,13,29,90),2:4]
```
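The indexing patterns introduced in this section can be tried out on a small, made-up data frame, where it is easy to check by eye that the right values are returned. The object `toy.table` and its contents are hypothetical.

```{r}
# A small, invented data frame with four rows and three columns
toy.table <- data.frame(
  Participant = c("P1", "P2", "P3", "P4"),
  Age = c(21, 38, 55, 26),
  Occupation = c("Student", "Nurse", "Retired", "Teacher"),
  stringsAsFactors = FALSE
)

toy.table$Occupation[2]   # second value of the Occupation column
toy.table[2, 3]           # the same data point: row 2, column 3
toy.table[c(1, 4), 2:3]   # rows 1 and 4, columns 2 to 3
toy.table[3, ]            # leaving the column index empty returns the entire row
```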
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
The following two quiz questions focus on the `NativeLg` variable from the L2 dataset (`L2.data`).
[**Q7.11**]{style="color:green;"} Use the index operators to find out the native language of the 26^th^ L2 participant.
```{r}
#| echo: false
#| label: "Q7.11"
check_question("Polish",
options = c("Cantonese", "Chinese", "German", "Italian", "Greek", "Lithuanian", "Mandarin", "Polish", "Russian", "Spanish"),
button_label = "Check answer",
q_id = "Q7.11",
right = "Brawo! 🇵🇱",
wrong = "No, that's not it. Check the hint if you need some help.")
check_hint("Use the command `L2.data$NativeLg[26]` to find out which it is.",
hint_title = "🐭 Click on the mouse for a hint.")
```
[**Q7.12**]{style="color:green;"} Which command(s) can you use to display only the Gender, Occupation, Native language, and Age of the last participant listed in the L2 dataset?
```{r}
#| echo: false
#| label: "Q7.12"
check_question(c("L2.data[67,c(2,3,5,9)]",
"L2.data[67,c(2:3,5,9)]",
"L2.data[67 , c(2,3,5,9)]"),
options = c("L2.data[67,c(2,3,5,9)]", "L2.data[67 , c(2,3,5,9)]", "L2.data[67,c(2:3,5,9)]", "L2.data[67:c(2,3,5,9)]", "L2.data[-1:c(2,3,5,9)]", "L2.data[90,c(2:3,5,9)]"),
button_label = "Check answer",
q_id = "Q7.12",
type = "check",
random_answer_order = TRUE,
right = "Great job!",
wrong = "Not quite. Try out these options to find out for yourself which ones work and which don't.")
check_hint("Using `-1` as an index to access the last row is a neat trick in Python, but it sadly doesn't work in `R`.", hint_title = "🦉 Hover over the owl for a hint", type = "onmouseover")
check_hint("Three of these will work. You'll need to find all three!",
hint_title = "🐭 Click on the mouse for a second hint.")
```
:::
::::
## Using built-in `R` functions {#sec-RFunctions}
We know from our examination of the L1 dataset from @DabrowskaExperienceAptitudeIndividual2019 that it includes 90 English native speaker participants. To find out their mean average age, we could add up all of their ages and divide the sum by 90 (see @sec-CentralTendency for more ways to report the central tendency of a variable).
```{r}
(21 + 38 + 55 + 26 + 55 + 58 + 31 + 58 + 42 + 59 + 32 + 27 + 60 + 51 + 32 + 29 + 41 + 57 + 60 + 18 + 41 + 60 + 21 + 25 + 26 + 60 + 57 + 60 + 52 + 25 + 23 + 42 + 59 + 30 + 21 + 21 + 60 + 51 + 62 + 65 + 19 + 65 + 29 + 38 + 37 + 42 + 20 + 32 + 29 + 29 + 27 + 28 + 29 + 25 + 33 + 25 + 25 + 25 + 52 + 25 + 53 + 22 + 65 + 60 + 61 + 65 + 65 + 61 + 30 + 30 + 32 + 30 + 39 + 29 + 55 + 18 + 32 + 31 + 20 + 38 + 44 + 18 + 17 + 17 + 17 + 17 + 17 + 17 + 17 + 17) / 90
```
Of course, we would much rather not write all of this out, especially as we are very likely to make errors in the process. Instead, we can use the base `R` function `sum()` to add up all of the L1 participants' ages and divide that by 90.
```{r}
sum(L1.data$Age) / 90
```
This already looks much better, but it's still less than ideal: What if we decided to exclude some participants (e.g. because they did not complete all of the experimental tasks)? Or decided to add data from more participants? In both these cases, 90 will no longer be the correct denominator to calculate their average age! That's why it is better to work out the denominator by computing the total number of values in the variable of interest. To this end, we can use the `length()` function, which returns the number of values in any given vector.
```{r}
length(L1.data$Age)
```
We can then combine the `sum()` and the `length()` functions to calculate the participants' average age.
```{r}
sum(L1.data$Age) / length(L1.data$Age)
```
Base `R` includes lots of useful functions to do statistics, which means that it includes a built-in function to calculate mean average values. It is called `mean()` and simplifies the procedure considerably:
```{r}
mean(L1.data$Age)
```
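To see why a hard-coded denominator is risky, consider this short sketch with five invented ages (the objects `toy.age` and `toy.age.subset` are hypothetical). When one value is excluded, `length()` and `mean()` adapt automatically, whereas a denominator typed in by hand does not.

```{r}
# Five invented ages that add up to 180
toy.age <- c(21, 38, 55, 26, 40)

sum(toy.age) / length(toy.age)  # 180 divided by 5
mean(toy.age)                   # same result

# Keep only the first four values, e.g. because the fifth participant is excluded
toy.age.subset <- toy.age[1:4]

sum(toy.age.subset) / 5                       # wrong: the denominator is out of date
sum(toy.age.subset) / length(toy.age.subset)  # correct: 140 divided by 4
mean(toy.age.subset)                          # correct
```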
If we save the values of a variable to our `R` session environment, we do not need to use the name of the dataset and the `$` sign to calculate its mean. Instead, we can directly apply the `mean()` function to the stored `R` object:
```{r}
L1.Age <- L1.data$Age # <1>
mean(L1.Age) # <2>
```
1. Saving the values of the Age variable to a new `R` object called `L1.Age`
2. Applying the `mean()` function to this new `R` object
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q7.13**]{style="color:green;"} How does the average age of the L2 participants in @DabrowskaExperienceAptitudeIndividual2019 compare to that of the L1 participants?
```{r}
#| echo: false
#| label: "Q7.13"
check_question("On average, the L2 participants are younger than the L1 participants.",
options = c("On average, the L2 participants are younger than the L1 participants.",
"On average, the L2 participants are older than the L1 participants.",
"On average, the L2 participants are the same age as the L1 participants.",
"Age is not comparable across two different datasets."),
type = "radio",
button_label = "Check answer",
q_id = "Q7.13",
random_answer_order = TRUE,
right = "That's right!",
wrong = "No, this is incorrect. Compare the results of the commands `mean(L2.data$Age)` and `mean(L1.data$Age)`.")
```
:::
::::
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
For this [**task**]{style="color:green;"}, you first need to check that you have saved the following two variables from the L1 dataset to your `R` environment.
```{r}
#| eval: false
L1.Age <- L1.data$Age
L1.Occupation <- L1.data$Occupation
```
[**Q7.14**]{style="color:green;"} Below is a list of useful base `R` functions. Try them out with the variable `L1.Age`. What does each function do? Make a note by writing a comment next to each command (see @sec-Comments). The first one has been done for you.
```{r eval = FALSE}
mean(L1.Age) # The mean() function returns the mean average of a set of numbers.
min()
max()
sort()
length()
mode()
class()
table()
summary()
```
[**Q7.15**]{style="color:green;"} `Age` is a numeric variable. What happens if you try these same functions with a character string variable? Find out by trying them out with the variable `L1.Occupation` which contains words rather than numbers.
:::
::::
:::: {.content-visible when-profile="OER"}
::: {.callout-note collapse="true"}
#### Click here for the solutions to [**Q7.14**]{style="color:green;"}—[**Q7.15**]{style="color:green;"}.
As you will have seen, often the clue is in the name of the function - but not always! 😉
Hover your mouse over the numbers on the right for the solutions to appear.
```{r}
#| eval: false
mean(L1.Age) # <1>
mean(L1.Occupation) # <2>
min(L1.Age) # <3>
min(L1.Occupation) # <4>
max(L1.Age) # <5>
max(L1.Occupation) # <6>
sort(L1.Age) # <7>
sort(L1.Occupation) # <8>
length(L1.Age) # <9>
length(L1.Occupation) # <10>
mode(L1.Age) # <11>
mode(L1.Occupation) # <12>
class(L1.Age) # <13>
class(L1.Occupation) # <14>
table(L1.Age) # <15>
table(L1.Occupation) # <16>
summary(L1.Age) # <17>
summary(L1.Occupation) # <18>
```
1. The `mean()` function returns the mean average of a set of numbers.
2. It does not make sense to calculate a mean average value of a set of words, therefore `R` returns an `NA` (not available) and a warning in red explaining that the `mean()` function expects a numeric or logical argument.
3. For a numeric variable, `min()` returns the lowest numeric value.
4. For a string variable, `min()` returns the first value sorted alphabetically.
5. For a numeric variable, `max()` returns the highest numeric value.
6. For a string variable, `max()` returns the last value sorted alphabetically.
7. For a numeric variable, `sort()` returns all of the values of the variable ordered from the smallest to the largest.
8. For a string variable, `sort()` returns all of the values of the variable in alphabetical order.
9. The function `length()` returns the number of values in the variable.
10. The function `length()` returns the number of values in the variable.
11. The function `mode()` returns the `R` data type that the variable is stored as.
12. The function `mode()` returns the `R` data type that the variable is stored as.
13. The function `class()` returns the `R` object class that the variable is stored as.
14. The function `class()` returns the `R` object class that the variable is stored as.
15. For a numeric variable, the function `table()` outputs a table that tallies the number of occurrences of each unique value in a set of values and sorts them in ascending order.
16. For a string variable, the function `table()` outputs a table that tallies the number of occurrences of each unique value in a set of values and sorts them alphabetically.
17. For a numeric variable, the function `summary()` outputs six values that, together, summarise the set of values contained in this variable: the minimum and maximum values, the first and third quartiles, and the mean and median (see Chapter 8).
18. For a string variable, the `summary()` function only outputs the length of the string vector, its object class and data mode.
:::
::::
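To see how `mode()` and `class()` can differ, and how `min()` and `table()` treat a string variable, here is a minimal sketch using two made-up vectors. The objects `toy.ages` and `toy.jobs` are hypothetical and are not part of the L1 dataset:

```{r}
# Hypothetical toy vectors (not taken from the L1 dataset)
toy.ages <- c(23L, 31L, 47L, 52L)   # the suffix L stores whole numbers as integers
toy.jobs <- c("Nurse", "Student", "Chef", "Student")

mode(toy.ages)    # data type: "numeric"
class(toy.ages)   # object class: "integer"
mode(toy.jobs)    # data type: "character"
class(toy.jobs)   # object class: "character"

min(toy.jobs)     # first value in alphabetical order: "Chef"
table(toy.jobs)   # tallies each unique value, sorted alphabetically
```

For the character vector, both functions return the same label, whereas for the integer vector `mode()` reports the broader data type and `class()` the more specific object class.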
### Function arguments
All of the functions that we have looked at in this chapter so far work with just a single argument: either a vector of values (e.g. a variable from our dataset as in `mean(L1.data$Age)`) or an entire tabular dataset as in `str(L1.data)`. When we looked at the `head()` function, we saw that, by default, it displays the first six rows, but that we can change this by specifying a second argument in the function. In `R`, arguments within a function are always separated by a comma.
```{r}
head(L1.Age, n = 6)
```
The names of the arguments can be specified, but they don't have to be if the arguments are listed in the order given in the documentation. You can check the "Usage" section of a function's help file (e.g. with `help(head)` or `?head`) to find out the order of a function's arguments. Run the following commands and compare their output:
```{r eval = FALSE}
head(x = L1.Age, n = 6)
head(L1.Age, 6)
head(n = 6, x = L1.Age)
head(6, L1.Age)
```
Whilst the first three return exactly the same output, the fourth returns an error because the argument names are not specified and are not in the order specified in the function's help file. To avoid making errors and confusing your collaborators and/or future self, it's best to explicitly name all the arguments except the most obvious ones.
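The same logic can be checked on any small vector. As a minimal sketch, the made-up vector `toy.ages` below (hypothetical values, not from the L1 dataset) shows that named arguments can be given in any order, whereas unnamed arguments are matched by their position:

```{r}
# Hypothetical toy vector (not taken from the L1 dataset)
toy.ages <- c(23, 31, 47, 52, 38, 29, 64, 19)

by.name       <- head(x = toy.ages, n = 3)   # both arguments named
by.position   <- head(toy.ages, 3)           # unnamed, in the documented order
swapped.names <- head(n = 3, x = toy.ages)   # named, so the order does not matter

identical(by.name, by.position)     # TRUE
identical(by.name, swapped.names)   # TRUE
```

Using `identical()` to compare the results is simply a convenient way of confirming that the three calls return the same values without having to compare the outputs by eye.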
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
Look at the following two lines of code and their (abbreviated) outputs.
```{r include=FALSE}
library(here)
library(tidyverse)
L1.data <- read.csv(file = here("data", "L1_data.csv"))
L2.data <- read.csv(file = here("data", "L2_data.csv"))
```
```{r}
#| eval: false
L1.data$Vocab
```
```{r}
#| echo: false
head(L1.data$Vocab)
```
```{r}
#| eval: false
round(L1.data$Vocab)
```
```{r}
#| echo: false
head(round(L1.data$Vocab))
```
[**Q7.16**]{style="color:green;"} Based on your observations, what does the `round()` function do?
```{r}
#| echo: false
#| label: "Q7.16"
check_question("The round() function rounds off numbers to the nearest whole number.",
options = c("The round() function displays just the first two digits of any number.",
"The round() function rounds off numbers to the nearest whole number.",
"The round() function is designed to save screen space for smaller displays.",
"The round() function displays fewer values for ease of reading."),
type = "radio",
button_label = "Check answer",
q_id = "Q7.16",
random_answer_order = TRUE,
right = "That's right.",
wrong = "Are you sure? Try out the `round()` function with other numbers to find out more.")
```
[**Q7.17**]{style="color:green;"} Check out the 'Usage' section of the help file on the `round()` function to find out how to round the `Vocab` values in the L1 dataset to two decimal places. How can this be achieved?
```{r}
#| echo: false
#| label: "Q7.17"
check_question(c("round(L1.data$Vocab, 2)",
"round(L1.data$Vocab, digits = 2)"),
options = c("round(L1.data$Vocab, 2)",
"round(L1.data$Vocab, digits = 2)",
"round(L1.data$Vocab: digits = 2)",
"round(L1.data$Vocab, digits = -4)"),
type = "check",
button_label = "Check answer",
q_id = "Q7.17",
random_answer_order = TRUE,
right = "That's right.",
wrong = "Not quite. Try out these commands to find out which ones work as expected.")
check_hint("Two options are correct because, as long as the arguments are listed in the correct order, we do not have to specify the names of the arguments.",
hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::
## Combining functions in `R`
Combining functions is where the real fun starts with programming! In @sec-RFunctions, we already combined two functions using a mathematical operator (`/`). Now, we are going to compute the L1 participants' average age to two decimal places. To do this, we need to combine the `mean()` function and the `round()` function. We can do this in two steps:
```{r}
L1.mean.age <- mean(L1.Age) # <1>
round(L1.mean.age, digits = 2) # <2>
```
1. In step 1, we compute the mean value and save it as an `R` object.
2. In step 2, we pass this object through the `round()` function with the argument `digits = 2`.
:::: {.content-visible when-format="html"}
::: callout-note
Hover over the numbers to the right of the code to see the code annotation.
:::
::::
In principle, there is nothing wrong with this method, but it often requires lots of intermediary `R` objects, which can get rather tiresome and can lead to human errors, as you can end up calling the wrong object. In the following, we will look at two further ways to combine functions in `R`: nesting and piping.
### Nested functions {#sec-Nesting}
The first method involves lots of brackets (also known as 'parentheses'). This is because in nested functions, one function is placed inside another function. The inner function is evaluated first, and its result is passed to the next outer function. In the following example, the `mean()` function is nested inside the `round()` function. The `mean()` function calculates the mean of `L1.Age`, and the result is passed to the `round()` function, which rounds the result to the nearest integer:
```{r}
round(mean(L1.Age))
```
We can also pass additional arguments to any of the functions, but we must make sure to place the arguments within the correct set of brackets. In the following example, the argument `digits = 2` belongs to the outer function `round()`; hence it must be placed within the outer set of brackets:
```{r}
round(mean(L1.Age), digits = 2)
```
In theory, we can nest as many functions as we like, but things can get quite chaotic after more than a couple of functions. We need to make sure that we can trace back which arguments and which brackets belong to which function (see @fig-NestedFunctions).
{#fig-NestedFunctions fig-alt="The pseudo code reads like this a) function(argument 1, argument 2), b) functionB(functionA(argument 1a, argument 2a), argument1b, argument2b), c) functionC(functionB(functionA(argument 1a, argument 2a), argument1b, argument2b), argument1c, argument2c)." width="80%"}
::: callout-tip
### Time to think!
Consider the three lines of code below. Without running them, can you tell which of the three lines of code will output the square root of the L1 participants' average age to two decimal places?
```{r eval = FALSE}
round(sqrt(mean(L1.Age) digits = 2)) # <1>
sqrt(round(mean(L1.Age), digits = 2)) # <2>
round(sqrt(mean(L1.Age)), digits = 2) # <3>
```
1. This code will return an "unexpected symbol" error because it is missing a comma before the argument `digits = 2`.
2. This second line of code actually outputs `6.126989`, which has more than two decimal places! This is because `R` interprets the functions from the inside out: first, it calculates the mean value, then it rounds that off to two decimal places, and only then does it compute the square root of that rounded off value.
3. This third attempt, in contrast, does the rounding operation as the last step. Note that, in the two lines of code that do not produce an error, the brackets around the argument `digits = 2` are also located in different places.
It is very easy to make bracketing errors when writing code and especially so when nesting functions (see @fig-NestedFunctions) so watch your commas and brackets (see also @sec-Errors)!
:::
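The effect of the order in which nested functions are evaluated can also be explored with a single number. In this minimal sketch, `toy.mean` is a hypothetical value standing in for a mean age; it is not computed from the L1 dataset:

```{r}
# Hypothetical value standing in for a mean age
toy.mean <- 36.66667

# Inner function first: round to two decimal places, then take the square root
sqrt.of.rounded <- sqrt(round(toy.mean, digits = 2))

# Inner function first: take the square root, then round to two decimal places
rounded.sqrt <- round(sqrt(toy.mean), digits = 2)

sqrt.of.rounded   # has more than two decimal places
rounded.sqrt      # has two decimal places
```

Only the second version rounds as the very last step, which is why only its output is limited to two decimal places.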
### Piped functions {#sec-Piping}
If you found all these brackets overwhelming: fear not! There is a second method for combining functions in `R`, which is often more convenient and almost always easier to decipher. It involves the pipe operator, which in `R` is `|>`.[^7_variablesfunctions-1]
[^7_variablesfunctions-1]: This is the **native R pipe** operator, which was introduced in May 2021 with `R` version 4.1.0. As a result, you will not find it in code written in earlier versions of `R`. Previously, piping required an additional `R` library, the {magrittr} library. The {magrittr} pipe looks like this: `%>%`. At first sight, the two pipes appear to work in the same way, but there are some important differences. If you are familiar with the {magrittr} pipe and want to understand how it differs from the native R pipe, I recommend this excellent blog post by Isabella Velásquez: <https://ivelasq.rbind.io/blog/understanding-the-r-pipe/>.
The `|>` operator passes the output of one function on to the first argument of the next function. This allows us to chain multiple functions together in a much more intuitive way, e.g.:
```{r}
L1.Age |>
  mean() |>
  round()
```
In the example above, the object `L1.Age` is passed on to the first argument of the `mean()` function. This calculates the mean of `L1.Age`. Next, this result is passed to the `round()` function, which rounds the mean value to the nearest integer.
To pass additional arguments to any function in the pipeline, we add them within the brackets that belong to that function:
```{r}
L1.Age |>
  mean() |>
  round(digits = 2)
```
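To check that piping and nesting really are two ways of writing the same computation, the following minimal sketch compares both on a made-up vector. The object `toy.ages` is hypothetical (it is not part of the L1 dataset), and the code assumes `R` version 4.1.0 or later, as the native pipe is not available in earlier versions:

```{r}
# Hypothetical toy vector (not taken from the L1 dataset)
toy.ages <- c(23, 31, 47, 52, 38, 29)

# Nesting: read from the inside out
nested <- round(mean(toy.ages), digits = 2)

# Piping: read from top to bottom
piped <- toy.ages |>
  mean() |>
  round(digits = 2)

identical(nested, piped)   # TRUE
```

Both versions first compute the mean and then round it; the pipe simply lets us write the steps in the order in which they are carried out.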
Like many of the relational operators we learnt about in @sec-RelationalOperators, the `R` pipe is a combination of two symbols, the computer pipe `|` and the right angle bracket `>`. Don't worry if you're not sure where these two symbols are on your keyboard as *RStudio* has a handy shortcut for you: {{< kbd mac=Cmd-Shift-M win=Ctrl-Shift-M linux=Ctrl-Shift-M >}} (see also @fig-Pipes). I strongly recommend that you write this shortcut on a prominent post-it and learn it asap, as you will need it a lot when you are working in `R`![^7_variablesfunctions-2]
[^7_variablesfunctions-2]: If, in your version of RStudio, this shortcut produces `%>%` instead of `|>`, you have probably not activated the native `R` pipe option in your *RStudio* global options (see instructions in @sec-GlobalOptions).
{#fig-Pipes fig-alt="Collation of two images: One is a famous painting by René Magritte of a pipe with the caption \"Ceci n'est pas une pipe\" [This is not a pipe in French], and another, in the same style and colours, with the native R pipe operator and its keyboard shortcut with the caption \"Ceci est une pipe\" [This is a pipe in French]." width=80%}
:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}
[**Q7.18**]{style="color:green;"} Using the `R` pipe operator, calculate the mean age of the L2 participants and round off this value to two decimal places. What is the result?
```{r}
#| echo: false
#| label: "Q7.18"
check_question("32.72",
button_label = "Check answer",
random_answer_order = TRUE,
right = "Yes, well done!",
wrong = "No, try again.")
```
[**Q7.19**]{style="color:green;"} Unsurprisingly, in @DabrowskaExperienceAptitudeIndividual2019's study, English L1 participants, on average, scored higher in the English vocabulary test than L2 participants. Calculate the difference between L1 and L2 participants' mean `Vocab` test results and round off this difference in means to two decimal places.
```{r}
#| echo: false
#| label: "Q7.19"
check_question("16.33",
button_label = "Check answer",
q_id = "Q7.19",
random_answer_order = TRUE,
right = "That's right. Very well done!",
wrong = "That's not it. Check the code chunk below if you are unsure how to proceed.")
check_hint("There are several ways to solve this puzzle. It does not have to involve the pipe operator. Don't hesitate to work in several steps by saving an intermediary result as an `R` object.", hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::
:::: {.content-visible when-profile="OER"}
::: {.callout-note collapse="true"}
#### Click here for a detailed answer to [**Q7.19**]{style="color:green;"}
There are lots of ways to tackle this in `R`. Here is a first approach that involves the pipe operator:
```{r include=FALSE}
library(here)
library(tidyverse)
L1.data <- read.csv(file = here("data", "L1_data.csv"))
L2.data <- read.csv(file = here("data", "L2_data.csv"))
```
```{r}
(mean(L1.data$Vocab) - mean(L2.data$Vocab)) |>
  round(digits = 2)
```
Note that this approach requires a set of brackets around the first subtraction operation, otherwise only the second mean value is rounded off to two decimal places. Compare the following lines of code:
```{r}
mean(L1.data$Vocab) - mean(L2.data$Vocab)
(mean(L1.data$Vocab) - mean(L2.data$Vocab)) |>
  round(digits = 2)
mean(L1.data$Vocab) - round(mean(L2.data$Vocab), digits = 2)
```
An alternative approach would be to store the difference in means as an `R` object and, in a second line of code, pass this object to the `round()` function.
```{r}
mean.diff.vocab <- mean(L1.data$Vocab) - mean(L2.data$Vocab)
round(mean.diff.vocab, digits = 2)
```
Or, you could combine both approaches like this:
```{r}
mean.diff.vocab <- mean(L1.data$Vocab) - mean(L2.data$Vocab)
mean.diff.vocab |>
  round(digits = 2)
```
There is often more than one way to solve problems in `R`. Choose whichever way you are most comfortable with. As long as you understand what your code does (see @sec-AI), it doesn't matter if it's not particularly elegant or efficient.
:::
::::
::: {.content-visible when-profile="OER"}
### Check your progress 🌟 {.unnumbered}
Well done! You have successfully completed [`r checkdown::insert_score()` out of 19 questions]{style="color:green;"} in this chapter.
:::
::: {.content-visible when-format="pdf"}
### Check your progress {.unnumbered}
It's time to complete this chapter's [tasks and quizzes](https://elenlefoll.github.io/RstatsTextbook/quizzes/7_quiz.html)! Good luck! `r emoji("four_leaf_clover")`
:::
Are you confident that you can...?
- [ ] Inspect a data set in `R` (@sec-InspectingData)
- [ ] Recognise different types of variables (@sec-Variables)
- [ ] Access individual columns and data points in `R` (@sec-DollarSign)--(@sec-SquareBrackets)
- [ ] Use built-in `R` functions and change function arguments (@sec-RFunctions)
- [ ] Combine functions in `R` using both the nesting and the piping methods (@sec-Nesting)--(@sec-Piping)
You are now ready to do statistics in `R`! `r emoji("muscle")` In @sec-DescRiptiveStats, we begin with descriptive statistics.