RstatsTextbook/2_Data.qmd at main · elenlefoll/RstatsTextbook · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
---
engine: knitr
bibliography: references.bib
format:
  html:
    df-print: kable
    #embed-resources: true
---

```{r include=FALSE}
library(here)
library(tidyverse)
```

{{< include emoji_script.md >}}

# Data files and formats {#sec-DataFormats}

#### **Chapter overview** {.unnumbered}

This chapter first considers what data means in the context of language research, before turning to how these data are formatted and stored. You will learn about:

-   Different types of data used in language research
-   Computer data formats and file extensions
-   Sharing and accessing research data and materials
-   Working with delimiter-separated values (DSV) files
-   The pitfalls of spreadsheet programs such as Microsoft Excel, Numbers, and Google Sheets

Along the way, you will get insights into an eye-tracking study involving cute Playmobil figures and a meta-science investigation that highlights the utmost importance of data literacy for research.

## Data in the language sciences {#sec-DataLanguageSciences}

In this book, we are concerned with empirical research in the language sciences, in other words, with research that is based on the analysis of **data**. But what are data exactly? Data can be collected via surveys, measurements, or observations. To begin with, however, these collected datasets are "raw". Data only becomes **information** once we have analysed and interpreted the data in a meaningful way. Hence, just like uncooked pasta does not make a flavourful meal, we must learn to "cook" the raw data to obtain meaningful information.

What kind of data are analysed in the language sciences? To get a rough idea of the range of data types analysed in the language sciences, let us take a look at the [IRIS database](https://iris-database.org).

> IRIS is a collection of instruments, materials, stimuli, data, and data coding and analysis tools used for research into languages, including first, second, and beyond, and signed language learning, multilingualism, language education, language use, and language processing. Materials are freely accessible and searchable, easy to upload (for contributions) and download (for use). [@IRIS2011]

As such, IRIS supports Open Science and Open Scholarship (see @sec-OpenScholarship).

:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

For these quiz questions and many future tasks, we will make use of the IRIS database.

-   Connect to the [IRIS website](https://iris-database.org/) and navigate to its [Search and Download](https://iris-database.org/search) page.
-   Scroll down to the filter option 'Data Type'.
-   Click on 'Data Type' and browse through the different data types that are most commonly used in language-related research.

![Screenshot from the IRIS database search page (accessed on 17 April 2024)](images/IRISDataTypes.png){fig-alt="Data Type filter dropdown menu with the following visible categories: Oral production (168), Closed response format (157), Open response (105), Judgements (97), Written production (76), Reaction times (62), Qualitative (46)" width="250"}

[**Q2.1**]{style="color:green;"} For which kinds of studies could these different types of data have been collected? Think about both experimental and observational studies.

[**Q2.2**]{style="color:green;"} Which of these data types is most likely to be measured in milliseconds (ms)?

```{r}
#| echo: false
#| label: "Q2.2"
library(checkdown)
check_question(answer = c("Reaction times", "reaction times", "Reaction time", "Reaction Time", "Reaction Times", "reaction", "Reaction", "reaction time"),
button_label = "Check answer",
q_id = "Q2.2",
right = "Yes, well done!",
wrong = "No, check the list of data types on IRIS.")
```
:::
::::

## Types of research data {#sec-ResearchData}

Given the wide range of methods used in language research, it is no surprise that they are so many different types of research data [see e.g. @goodScopeLinguisticData2022]. Although the data types listed on the IRIS search page are very broad and the categories not clearly defined, the list illustrates the breadth of research data types typically analysed in language studies.

The first data type category, "Oral production", for instance, can equally refer to text transcriptions of language users' oral production, audio, or video files. It can also refer to either raw data or to (more or less) processed data. For example, a transcript of a conversation could have been automatically annotated for part-of-speech, meaning that every word would be marked for their word class (e.g. `This_DT is_VBZ not_RB raw_JJ text_NN data_NN ._PUNC`), or it could have been manually anonymised by adding placeholders (e.g. `Is <NAME> going out with <NAME>?`) indicating that certain words have been retracted for data protection reasons.

The second most frequent data type category, "Closed response format", includes different kinds of questionnaires and tests. Questionnaires may ask study participants to disclose personal information relevant to the research questions using single or multiple-choice questions, such as what language(s) they use at home, how long they have studied a language for, or how old they are. Tests may be designed to assess participants' language competences (e.g. in the form of a vocabulary or grammar test), as well as other aspects relevant to the research questions being investigated (e.g. short-term memory or baseline reaction times).

In this book, we will focus on the research processes that take place after the data have been collected. However, it is vital that we are aware of the conditions and context in which the data we are analysing were collected and pre-processed. It is no exaggeration to say that these steps in the research process can entirely change the results of the data analysis. Suppose we decide to compare the abilities of two groups of French L2 learners. To do this, we administered a language production test to two whole classes of secondary school students learning French as a second language using two different teaching methods. If one group had 15 minutes to complete the test and the other had up to 60 minutes, the results would not be comparable.

:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

[**Q2.3**]{style="color:green;"} Which other reasons could potentially jeopardise the comparison of test results data from two different groups of pupils?

```{r}
#| echo: false
#| label: "Q2.3"

check_question(c("The two groups having different teachers.",
                 "One group having French lessons on Tuesday mornings, the other on Friday afternoons.",
                 "One group having a higher proportion of pupils from migrant families.", "One group having a single native speaker of French, whilst the other has none.", "
                 One group having a higher proportion of pupils with reading difficulties.",
                 "One group having a classroom decorated with French flags and posters about France."),
               options = c("The two groups having different teachers.",
                 "One group having French lessons on Tuesday mornings, the other on Friday afternoons.",
                 "One group having a higher proportion of pupils from migrant families.", "One group having a single native speaker of French, whilst the other has none.", "
                 One group having a higher proportion of pupils with reading difficulties.",
                 "One group having a classroom decorated with French flags and posters about France."),
               type = "checkbox",
random_answer_order = TRUE,
q_id = "Q2.3",
button_label = "Check answer",
right = "That's right! All of these factors could potentially influence the outcome of this comparison study. <br><br>Can you think of ways to ensure that these factors are controlled for?",
wrong = "Not quite. There are more reasons.")
```
:::
::::

Whilst there are many ways to ensure that as many factors as possible are controlled for, not all *can* be controlled for. What is crucial is that all aspects of the data collection process are well documented so that all factors, whether controlled or not, can be taken into account when analysing the data.

In research, we usually distinguish between **primary data**, which are the data that you collected yourself, and **secondary data**, which are data that were collected by others. Hence if you were to carry out a new study based on data that you found on IRIS, you would be conducting a secondary data analysis. Especially when conducting secondary data analyses, it is crucial that we have enough information about the data itself, i.e. **metadata**. Metadata is crucial for finding, sharing, evaluating, and reusing datasets [@trippelMetadataResearchData2025]. For some data and projects, it makes sense to create separate metadata files that contain additional or more detailed information about the collected data. For language-related data, various metadata tools and standards exist [see e.g. @paquotCoreMetadataSchema2024; @teiconsortiumTEIP5Guidelines2025; @windhouwerComponentMetadataInfrastructure2022; @withersMetadataManagementArbil2012] and it makes sense to try to stick to these standards as far as possible to ensure greater comparability and compatibility across different datasets [for more information on how to develop a Data Management Plan for a linguistics study, see @kungDevelopingDataManagement2022].

## Data formats and file extensions {#sec-FileExtensions}

Different data types come in different data formats. For audio files, you may be familiar with the MP3 format, but this is by no means the only format in which audio files can be saved. Many other audio file formats exist, such as **Wave**form Audio File Format (WAVE) and **F**ree **L**ossless **A**udio **C**odec (FLAC).

We can usually tell in what format a file is in by looking at its file extension. The file extension is the suffix of the filename. It comes at the end of the filename and is preceded by a dot. The file extension of a WAVE file is `.wav`, whereas that of an MP3 file is `.mp3`; hence the file `recording.wav` is a WAVE file, whereas `recording.mp3` is an MP3 file.

:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

[**Q2.4**]{style="color:green;"} In which format are Microsoft Word files typically saved?

```{r}
#| echo: false
#| label: "Q2.4"

check_question(".docx", options = c(".odt", ".docx", ".docs", ".msword"), type = "radio",
random_answer_order = TRUE,
button_label = "Check answer",
q_id = "Q2.4",
right = "That's right! Though older versions of MS Word used the format `.doc`.",
wrong = "That's incorrect. Do you have a Word file on your computer whose file extension you could check?")
```

[**Q2.5**]{style="color:green;"} Which of these files are audio files?

```{r}
#| echo: false
#| label: "Q2.5"

check_question(c("dialogue001.mp3", "dialogue001.wav", "001_dialog.flac"),
               options = c("dialogue001.mp3", "dialogue001.wav", "dialog_01.csv", "001_dialog.flac", "DIALOGUE.audio"),
               type = "checkbox",
random_answer_order = TRUE,
button_label = "Check answer",
q_id = "Q2.5",
right = "That's right!",
wrong = "Not quite.")
check_hint("There are three audio files in this list.",
           hint_title = "🐭 Click on the mouse for a hint.")
```
:::
::::

Unfortunately, many modern operating systems have a tendency to hide file extensions by default. This results in the files `recording.wav` and `recording.mp3` both being displayed as `recording` in File Finder/Explorer windows (compare @fig-NoExtensions and @fig-Extensions). This is misleading and can lead to all kinds of problems.

::: {#fig-FileExtensions layout="[[50,-2,50], [100]]" layout-valign="center"}
![Displaying filenames without file extensions](images/FinderWindowNoExtensions.png){#fig-NoExtensions fig-alt="File Finder window showing 7 files with filenames that do not include a file extension."}

![Displaying filenames with their extensions](images/FinderWindowExtensions.png){#fig-Extensions fig-alt="File Finder window showing 7 files with file extensions such as .pdf, .html and .txt."}

Demonstrating the importance of seeing file extensions.
:::

To ensure that you can always see the extensions of the files on your computer in the File Explorer (on Windows) or the File Finder (on macOS), follow these instructions:

-   On Windows: <https://www.howtogeek.com/205086/beginner-how-to-make-windows-show-file-extensions/>.
-   On macOS: <https://support.apple.com/en-gb/guide/mac-help/mchlp2304/mac> (select the version of your operating system at the top of the page).

::: {#nte-Unzip .callout-note title="Archiving and compression file formats"}
When we want to share large files, it is often possible to compress them to reduce their size. File compression also allows us to reduce entire folders of files to a single compressed file, facilitating the sharing of an entire project directory, including its internal folder structure.

Different file compressing formats exist, including `.tar` and `.rar`, but the most common one is ZIP, which works natively on all operating systems. The extension for ZIP files is `.zip`.

To **decompress** (or 'unzip') a ZIP file and extract its contents on **Windows**:

1.  Double-click the `.zip` file
2.  Select 'Extract All'
3.  Select a folder
4.  Click 'Extract'.

On **Mac OS**, double-clicking a `.zip` filename in the will automatically unzip it.

If you are using a **Linux** terminal, use the command `unzip` followed by the name of the file to unzip it.
:::

## Sharing research data and materials {#sec-Sharing}

In line with the principles of Open Science (see @sec-OpenScholarship), it is important to ensure that both the materials that were used to collect research data (e.g. questionnaire items, audio, image or video stimuli, language aptitude tests, etc.) and the data themselves are made openly available to the research community, whenever legally possible and ethically responsible. Sharing materials ensures that studies can be replicated, for example with new participants or in a different language. Sharing research data also allows independent researchers to reproduce the results of studies, allowing them to verify the reported results and to conduct additional analyses that may confirm, contradict, or extend the conclusions of the original studies.

You may be wondering how linguists and language education researchers can make their research data and materials publicly available. @tbl-RepoTable lists a selection of such repositories where linguists can upload research data and materials. Some are specific to the language sciences, while others cater to all research disciplines. There are many more options and the easiest way to find suitable repositories is to search the registry of research data repositories: [re3data.org](https://www.re3data.org/). All of the examples, tasks, and exercises in this book are based on research data and materials that researchers have made available in open access on one or more of these repositories.

:::: {.content-visible when-format="html"}
::: column-margin
![[re3data.org](https://www.re3data.org/)](images/logo_re3data.png){width="150" alt-text="The re3data logo which are the letters RE to the power of three because these two letters appears three times in the full name: registry of research data repositories."}
:::
::::

```{r}
#| echo: false
#| label: tbl-RepoTable
#| tbl-cap: Non-exhaustive list of public repositories of research data and materials.
library(gt)

PublicRepos <- read.csv(here("data", "17-03-2026_repositories.csv"),
                        header = TRUE,
                        sep = ",",
                        quote = "\"",
                        encoding = "UTF-8")

PublicRepos |>
  rename(`Online since` = Online.since) |>
  gt() |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  ) |>
  fmt_url(columns = "Homepage") |>
  opt_table_font(size = 11)
```

In this chapter, we will look at a study by @schimkeFirstLanguageInfluence2018 (see @fig-SchimkeTitle), which is an example of a publication which was awarded the Open Data and the Open Materials badges (see @fig-OpenDataOpenMaterials). This means that the research materials and data associated with this study can be found in an open, online repository:

> This article has been awarded Open Materials and Open Data badges. All materials and data are publicly accessible via the IRIS Repository at <https://www.iris-database.org/iris/app/home/detail?id=york:934337>. [@schimkeFirstLanguageInfluence2018]

The authors could have chosen to upload their materials and data to any of the online repositories listed in @tbl-RepoTable but, in this case, they chose [IRIS](https://iris-database.org/).

::: {#fig-SchimkeEA2018 layout="[[70, 30]]" layout-valign="center"}
![Title page of the @schimkeFirstLanguageInfluence2018](images/SchimkeEA2018Abstract.png){#fig-SchimkeTitle fig-alt="Heading of PDF version of the article from the journal Language Learning: A Journal of Research in Language Studies. EMPIRICAL STUDY First Language Influence on SecondLanguage Offline and Online AmbiguousPronoun Resolution Sarah Schimke, Israel de la Fuente, Barbara Hemforth, and Saveria Colonnada from University Münster, University of Lille/CNRS, CNRS/University of Paris Diderot, and University of Paris 8/CNRS" width=80%}

![The Open Data and Open Materials badges](images/OpenDataOpenMaterials.jpg){#fig-OpenDataOpenMaterials fig-alt="The Open Data badge is blue and shows a simple barplot and the Open Materials is yellow and shows an open cardbox." width=60%}

An example of a publication for which both research materials and data have been published.
:::

Among other results, @schimkeFirstLanguageInfluence2018 report on two eye-tracking experiments. One of these experiments involved Spanish-speaking participants listening to ambiguous sentences in Spanish whilst looking at images of Playmobil figures (see @fig-Playmobil for an example).

::: {#nte-Eyetracking .callout-note title="How did the experiment work?"}
In this eye-tracking experiment, participants were instructed to decide whether the sentences they heard matched the Playmobil images or not. Consider the following two sentences from the experiment:

> 1.  *El barrendero se encontró con el cartero antes de que recogiera las cartas.*\
>     \[The street sweeper met the postman before he fetched the letters.\]
>
> 2.  *El barrendero se encontró con el cartero antes de que recogiera la escoba.*\
>     \[The street sweeper met the postman before he fetched the broom.\]

Up until the point at which either *las cartas* \[the letters\] or *la escoba* \[the broom\] are heard, it is unclear who is doing the fetching. From a grammatical point of view, it could be either the street sweeper or the postman.

Participants were presented with @fig-Playmobil as they were listening to either Sentence 1 or Sentence 2. At the same time, the researchers measured how long it took for the participants to look at the subject governing the verb *recogiera*. In other words, for Sentence 1, they were interested in how long it took participants to focus on the postman Playmobil figure and, in Sentence 2, on the street sweeper. Such fine measurements are made in milliseconds, i.e. in thousandths of seconds, using a special eye-tracking device.

![Cropped image CC BY-NC-SA @schimkeFirstLanguageInfluence2018  <https://doi.org/10.48316/YD90P-94kUi>.](images/Playmobil.png){#fig-Playmobil fig-alt="Photo of two Playmobil figures. On the left, a street sweeper holding a broom and, on the right, a postman with a trolley." width="60%"}
:::

:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

Imagine that you want to run an experiment similar to the one carried out in @schimkeFirstLanguageInfluence2018 (see @fig-SchimkeTitle). You can reuse the Playmobil image files created by the researchers as they helpfully uploaded them to the IRIS database.

In which file format do you think the images are archived? To find out, click [here](https://iris-database.org/search/?s_publicationAPAInlineReference=Schimke%20et%20al.%20(2018)) to go directly to the list of data and materials associated with the study. There are four entries in the IRIS database that are associated with this study. Select the "Pictorial" entry which contains the images. It allows you to download a ZIP file called `Images_online.zip`. ZIP is an archive file format that can contain one or more compressed files. Download this ZIP file and decompress ('unzip') it. You should find that it contains a folder entitled 'Images', which contains 58 pictures of different combinations of Playmobil figures that correspond to the experiment's stimulus sentences.

[**Q2.6**]{style="color:green;"} In which file format are these Playmobil image files?

```{r}
#| echo: false
#| label: "Q2.6"

check_question("BMP",
               options = c("BMP", "GIMP", "JPEG", "PNG", "GIF"),
               type = "radio",
q_id = "Q2.6",
button_label = "Check answer",
right = "That's right!",
wrong = "This is an image file format, but these Playmobil images are not saved in this format.")
```

Image files typically contain metadata that is embedded in the image files themselves. This metadata may include the dimensions of the image and its colour profile. To view this metadata, right-click on one of the image files that you have extracted from the ZIP file and select the option to get more information about the file, e.g. "Get Info" or "Properties".

[**Q2.7**]{style="color:green;"} How wide are these Playmobil images in pixel?

```{r}
#| echo: false
#| label: "Q2.7"

check_question("1024",
button_label = "Check answer",
q_id = "Q2.7",
right = "That's right, well done!",
wrong = "No. Image dimensions are usually displayed as two numbers separated by an \"x\" symbol. As these Playmobil images are all in landscape format so that the larger number corresponds to the width of the images.")
```
:::
::::

## Working with tabular data {#sec-TabularData}

The measurements made by the eye-tracking device in @schimkeFirstLanguageInfluence2018's eye-tracking experiments were stored in the form of tables. @tbl-EyeTrackingTable is an extract of a table that contains processed eye-tracking data from @schimkeFirstLanguageInfluence2018. It forms part of the study's supplementary materials and can also be downloaded from the [IRIS database](https://iris-database.org/details/lYT9m-a75tV).

In this table, each row corresponds to the data associated with one participant's eye movements while listening to a single stimulus sentence and looking at the corresponding Playmobil image (e.g. @fig-Playmobil). The extract displayed as @tbl-EyeTrackingTable only shows the data associated with the first six stimulus sentences (`items`) that participant "s1", a Spanish L2 learner, listened to. The columns `crit1`, `crit2` and `crit3` contain values derived from the measurements made using the eye-tracking device.[^2_data-1] From @tbl-EyeTrackingTable, we can also see that participant "s1" was 19 years old when they started formally learning Spanish (`AoO` stands for "age of onset of formal instruction") and that they were 20 when the experiment was conducted.

[^2_data-1]: Details of what these values mean are not relevant here but, for those of you who are curious, they correspond to the "log odds of looks" that participant made towards one or the other Playmobil figure whilst listening to the experimental stimulus sentences at three time points, called "critical regions". These critical regions include the time window between the onset of the pronoun and 480 milliseconds after the onset of the disambiguating information. @schimkeFirstLanguageInfluence2018 [p. 768-769] explain that "\[a\] positive value of the log odds indicates more looks to the subject than to the object antecedent, while a negative value indicates the reverse pattern."

```{r message=FALSE, warning=FALSE}
#| echo: false
#| label: tbl-EyeTrackingTable
#| tbl-cap: Extract of table containing eye-tracking data from Schimke et al. (2018)\\'s appendix

SchimkeData <- read.delim(file = here("data", "logoddslearnersfinal.txt"),
                          sep = "\t",
                          quote = "\"",
                          header = TRUE)

SchimkeData |>
  filter(language == "S") |>
  head() |>
  gt() |>
  opt_table_font(size = 11)
```

When working with data, tables are ubiquitous. Data stored in tables are called **tabular data**. Hence, learning to work with tabular data is a crucial data literacy skill.

In the language sciences, the results of most studies (whether experimental or corpus studies) are stored in tables. For example, when researchers conduct an online survey, the data collected by the online survey platform (e.g. [Qualtrics](https://www.qualtrics.com), [LimeSurvey](https://www.limesurvey.org/), [SoSci Survey](https://www.soscisurvey.de/)) are automatically stored in the form of one or more table(s). These can then be exported from the survey platform in various tabular file formats (e.g. `.csv`, `.json`, `.xlsx`).

In some cases, data may be collected by analogue means, e.g. by getting participants to answer a paper questionnaire or collecting school children's work on paper. However, for quantitative analysis, analogue research data are first digitalised. Then, the data are typically stored as text files in file formats such as `.txt` or `.csv`.

### Delimiter-separated values (DSV) files {#sec-DSV}

Tables can be stored in many data formats but the simplest and most widely used in linguistic research are **text files with** **delimiter-separated values (DSV)**. For sharing and archiving research data, DSV files are favoured over formats specific to propriety software such as `.xslx` (Microsoft Excel files) or `.numbers` (Apple Numbers files). This is because DSV files can be "understood" by many different programs and on all operating systems. The fact that they are simple text files means that we will also be able to reliably read them in the future, even if programs such as Excel or Numbers have evolved or have been discontinued. Reliability and compatibility are fundamental to maintaining the integrity of research data and ensuring that data can be reused, even in the distant future.

In DSV files, each value (e.g. measurement or response) is separated by a specific **separator** character. In principle, any character can be used to separate values, but the most common separators are the comma (`,`), tab (`\t`), colon (`:`), and semicolon (`;`). Below is the `.csv` file corresponding to @tbl-RepoTable.

```
Repository,Discipline,Online since,Homepage
CLARIN,Linguistics,2012,https://www.clarin.eu/
Dryad,All,2008,https://datadryad.org/
Figshare,All,2012,https://figshare.com/
HAL,All,2001,https://hal.science/
IRIS,Linguistics,2011,https://www.iris-database.org/
"Open Science Framework, OSF",All,2011,https://osf.io/
TalkBank,Linguistics,1999,https://talkbank.org/
"Tromsø Repository of Language and Linguistics, TROLLing",Linguistics,2014,https://site.uit.no/trolling/
Vivil,Clinical research,2017,https://vivli.org/
Zenodo,All,2013,https://zenodo.org/
```

As you can see, the values are separated by commas.[^2_data-2] Additionally, some of the values are enclosed in, or **delimited** by, double quotation marks (`"`). This prevents any commas that may occur within an actual field value, e.g. the comma in the field `Open Science Repository, OSF`, from being interpreted as a separator character.

[^2_data-2]: Note that the file extension `.csv` stands for "comma-separated values". Confusingly, however, DSV files are often given a `.csv` extension even when the separator character is not the comma. As a result, even though the `.tsv` extension stands for "tab-separated values", `.csv` files are frequently separated by a tab (`\t`) rather than comma. Isn't that fun? `r emoji("upside_down_face")`

Given that DSV files are text files, it is possible to open them in a free plain-text editor (e.g. [Notepad++](https://notepad-plus-plus.org/) or [BBEdit](http://www.barebones.com/products/bbedit/)) or a text-processing program (e.g. Microsoft Word or LibreOffice Writer). However, these programmes will typically display DSV files as in @fig-DSVinWord.

![The `.csv` file corresponding to @tbl-RepoTable opened in Microsoft Word](images/DSVinWord.png){#fig-DSVinWord fig-alt="A screenshot showing a CSV file opened in Microsoft Word. It looks like a long list of words separated by commas." width="70%"}

We can probably agree that what we are seeing in @fig-DSVinWord is not a very reader-friendly way to display tabular data! This is why DSV files are more often opened in spreadsheet programs (e.g. LibreOffice Calc, Google Sheets, Microsoft Excel, Numbers) than in text-editing programs. Let's find out how in the next section.

### Opening DSV files in LibreOffice Calc {#sec-DSVLibreOffice}

There are several ways to open a DSV file in LibreOffice Calc but the safest is to launch LibreOffice (see @sec-OpenSource for instructions on how to install LibreOffice) and, from the list of options under 'Create', click on 'Calc Spreadsheet' to open up a blank spreadsheet. Then, from the 'File' drop-down menu, select 'Open...' or use the keyboard shortcut {{< kbd mac=Cmd-O win=Ctrl-O linux=Ctrl-O >}} and locate the DSV file that you wish to open.

On opening a DSV file in LibreOffice Calc, we get a dialogue box with various options (see @fig-DSVImportCalc).

![Text import dialogue in LibreOfficeCalc](images/DSVImportCalc.png){#fig-DSVImportCalc fig-alt="A screenshot showing the text import dialogue in LibreOffice Calc. It shows that the checkbox for comma is checked and that the string delimiter is set to double quotation mark. All other checkboxes are unchecked." width="50%"}

To correctly import this particular DSV file, it is necessary to specify that the character encoding is UTF-8 [to find out why text encodings matter, see e.g. @CharacterEncodingsBeginners], the separator character is the comma (`,`) and that the delimiter character is the double quotation mark (`"`) (see selected options in @fig-DSVImportCalc). With these settings in LibreOffice Calc, the table is rendered as in @fig-DSVinCalc.

![CSV file opened in LibreOffice Calc](images/DSVinCalc.png){#fig-DSVinCalc fig-alt="A screenshot showing a CSV file opened in LibreOffice Calc. It is formatted as a table." width="90%"}

Note that if you open a DSV file in Excel or Google Sheets, you will not be shown such a dialogue box. Instead, these programs assume that they can guess which separator and delimiter characters your file uses. Whilst this may, at first, sound convenient, this is *not* good news: *you* should be the one in control of how your data files are interpreted, not the program! In the next section, you will learn why opening DSV files such as `.csv` and `.tsv` files in Microsoft Excel, Google Sheets, or Numbers can be very dangerous. In some cases, these programs will 'corrupt', i.e. permanently damage, your DSV files, which can lead to irreversible data loss!

The bad news is that, if you are using Windows or MacOS, it is very likely that either Excel or Numbers is your default app to open DSV files. This means that if you double click on a `.csv` and `.tsv` file in your Finder/Explorer window, the file will likely automatically open up in either Excel or Numbers. This is why it is important you do *not* double-click on such files to open them: Opening a file just *once* with these programs can lead to data loss! If this happens to you with a file that you have downloaded from a repository, your best bet is to delete your local version of the file and download a fresh version so that you can start again from scratch.

::::: {.content-visible when-profile="OER"}
:::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

In this [**task**]{style="color:green;"}, we will practice opening a DSV file in LibreOffice Calc. Our example file is a real dataset from @schimkeFirstLanguageInfluence2018. We will begin by downloading it from the public repository [IRIS](https://iris-database.org/search/?s_publicationAPAInlineReference=Schimke%20et%20al.%20(2018)).

In addition to the eye-tracking experiments, @schimkeFirstLanguageInfluence2018 conducted two further experiments in which participants completed a gap-filling task via an online survey platform. In the first of these experiments, the participants were native (L1) speakers of French, German, and Spanish. In the second, they were French- and Spanish-speaking learners (L2) of German.

In both experiments, the L1 and L2 participants were shown ambiguous sentences similar to the ones used in the eye-tracking experiment with the Playmobil images (see @nte-Eyetracking). After having read each stimulus, the participants were asked to complete a gap-fill task according to their understanding of the preceding ambiguous sentence. Participants were told "that there were no incorrect responses and that they should answer spontaneously" [@schimkeFirstLanguageInfluence2018: 755]. Below is an example questionnaire item in the three languages examined:

<div>

> 1\. *Der Briefträger ist dem Straßenfeger begegnet, bevor er schnell ein Sandwich geholt hat. \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ hat ein Sandwich geholt.*
>
> 2\. *Le facteur a rencontré le balayeur avant qu'il prenne rapidement un sandwich. \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ a pris un sandwich.*
>
> *3a. El cartero se reunió con el barrendero antes de que él recogiera velozmente un emparedado. \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ recogió un emparedado.*
>
> 3b. *El cartero se reunió con el barrendero antes de que recogiera velozmente un emparedado. \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ recogió un emparedado.*

</div>

Note that, for Spanish, there were two types of stimuli: one with an overt pronoun (as in 3a. with *él*) and one without (as in 3b. with a null pronoun), as both variants are possible in Spanish. All three examples translate as:

> -   *The postman encountered the street sweeper before he quickly fetched a sandwich. \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_ fetched a sandwich.*

To complete the gap, participants could either select 'The postman' or 'The street sweeper'.

1.  Go back to the study's page on [IRIS](https://iris-database.org/search/?s_publicationAPAInlineReference=Schimke%20et%20al.%20(2018)) and select the second entry entitled 'Other questionnaire' which, among other things, contains 'Written production data'.

Note that this database entry includes both research data and research materials: the file `sentences_offline_task.xlsx` contains the full list of questionnaire items, including both experimental and filler items, with which we could reconstruct the experiment to replicate it with a new set of participants. For now, however, we are not interested in obtaining materials to replicate the study, but rather in examining the study's original data.

This [IRIS entry](https://iris-database.org/details/lYT9m-a75tV) also contains three data files. The last file (`logoddslearnersfinal.txt`) is the DSV file that was used to create @tbl-EyeTrackingTable above.

In this [**task**]{style="color:green;"}, we are going to look at the questionnaire data corresponding to the gap-filling task experiment conducted with German L2 learners, which is contained in the data file `offlinedataLearners.txt`:

2.  Download the `offlinedataLearners.txt` file (which is the second listed) and save it on your computer (see @sec-FoldersPaths).
3.  Launch LibreOffice (see @sec-OpenSource if you have not yet installed LibreOffice) and, from the list of options under 'Create', click on 'Calc Spreadsheet' to open up a blank spreadsheet.
4.  From the 'File' drop-down menu, select 'Open...' or use the keyboard shortcut {{< kbd mac=Cmd-O win=Ctrl-O linux=Ctrl-O >}}. Find `offlinedataLearners.txt` in the folder where you saved it and click on 'Open'.
5.  A 'Text Import' dialogue box will pop up. This a DSV file, not a fixed-width file, so ensure that the option 'Separated by' is selected. If not already set by default, it is also a good idea to select 'Unicode (UTF-8)' for the 'Character set'.
6.  Experiment with the different 'Separator Options' until the preview at the bottom of the dialogue box looks like a table.
7.  Ensure that, apart from the 'Separator Options', all other options in the dialogue box are unselected and then click on 'OK'.

[**Q2.8**]{style="color:green;"} What is the separator character in the file `offlinedataLearners.txt`?

```{r}
#| echo: false
#| label: "Q2.8"

check_question("Tab", options = c("Tab", "Comma", "Semicolon", "Space", "All of them"), type = "radio",
random_answer_order = FALSE,
q_id = "Q2.8",
button_label = "Check answer",
right = "That's right!",
wrong = "That's incorrect. Try selecting the other separators options in the \"Text Import\" dialogue box until the preview of the table looks like a table.")
```

[**Q2.9**]{style="color:green;"} What is the delimiter character in the file `offlinedataLearners.txt`?

```{r}
#| echo: false
#| label: "Q2.9"

check_question("There is none.",
               options = c("\"", "\'", "There is none.", "Both \" and \'"),
               type = "radio",
random_answer_order = TRUE,
q_id = "Q2.9",
button_label = "Check answer",
right = "That's right! This particular DSV file does not use a string delimiter.",
wrong = "Not quite. Try opening the file in a plain-text editor or text-processing program to find out.")
```

[**Q2.10**]{style="color:green;"} How many observations does the file `offlinedataLearners.txt` contain?

```{r}
#| echo: false
#| label: "Q2.10"

check_question("700",
               options = c("700", "701", "5", "3500", "3505"),
               type = "radio",
random_answer_order = TRUE,
q_id = "Q2.10",
button_label = "Check answer",
right = "That's right! Each observation corresponds to a row. And there is an additional row for the column headers.",
wrong = "No. In this table, each observation corresponds to a row. And there is an additional row for the column headers.")
```

[**Q2.11**]{style="color:green;"} In this table, what does each observation correspond to?

```{r}
#| echo: false
#| label: "Q2.11"

check_question("A single participant\'s response to a single sentence gap.",
               options = c("A single participant\'s response to a single sentence gap.", "All the responses from a single participant.", "All the responses to a single sentence in a single language.", "All the responses in a single language."),
               type = "checkbox",
random_answer_order = TRUE,
q_id = "Q2.11",
button_label = "Check answer",
right = "That's right!",
wrong = "I'm afraid not.")
```
::::
:::::

::: {.callout-important collapse="false"}
#### What if I absolutely have to open a DSV file in Excel? `r emoji("cold_sweat")`

If you absolutely must open a DSV file (e.g. a `.csv` or `.tsv` file) in Excel (for example because you do not have sufficient permissions to install LibreOffice on the computer that you are using), do *not* open the file by double clicking on the file as this will automatically trigger Excel's problematic auto-formatting behaviour (see @sec-ExcelWarning)! Instead, first launch Excel and create a new blank workbook. Then navigate to the 'Data' tab, select the 'Get Data' option, and then 'From Text/CSV' (see @fig-ExcelGetData). In the following dialogue, you can specify how the data should be imported. The options are very similar to the ones offered in LibreOffice (see above).

Note that with this method it *may* be possible to prevent Excel from automatically (and irreversibly!) applying transformations to your data. However, sadly, this may not suffice. Read on to find out more...

![Importing DSV data into Excel](images/ExcelGetData.png){#fig-ExcelGetData fig-alt="Screenshot of the option to import data into excel from an Text/CSV file. It can be found under 'Get Data', 'From File', and 'From Text/CSV'. The info message for this option reads 'Import data from a text, comma-separated value or formatted text (space-delimited) file.'" width="50%"}
:::

## A word of warning about spreadsheet programs `r emoji("warning")`️ {#sec-ExcelWarning}

You should be aware that opening DSV files in spreadsheet programs can corrupt the files! Once a file is corrupted, it is often not possible to retrieve the original data so this is very bad news, indeed. Such problems are particularly frequent when opening DSV files with Microsoft Excel and Google Sheets. This is because the default settings in these programs surreptitiously modify files upon opening.

These 'auto-format' modifications include replacing certain values by dates (e.g. changing `3-4` to `March, 4th`) or numbers (e.g. changing `1.23E5` to `123000`)[^2_data-3], removing leading zeros (e.g. changing `001` to `1`), or misinterpreting certain characters (e.g. the value `-ism` will generate an error because the hyphen is interpreted as minus sign).

[^2_data-3]: In scientific notation, "E" stands for "exponent", which refers to the number of times a number needs to be multiplied by 10. This notation is used as a shorthand way of writing very large or very small numbers. This is why "1.23E5" is interpreted by Excel as 1.23 multiplied by 10 to the power of 5, which is to say: 1.23 multiplied by 100,000. This operation shifts the decimal point five places to the right, resulting in the number 123000.

Not only can these auto-format modifications lead to inaccurate data analysis but, in the worst of cases, they can even cause data loss. The crux of the problem is that often users do not realise what the program has done in the background. How bad can this be? Find out by completing the [**task**]{style="color:green;"} below.

:::: {.content-visible when-profile="OER"}
::: {.callout-tip collapse="false"}
#### Your turn! {.unnumbered}

In this [**task**]{style="color:green;"}, you will find out how genetics researchers who use spreadsheets for their analyses regularly have their data so badly damaged that it affects the results of their publication. Though we have no statistics on how spreadsheet errors affect the work of linguists, it is (unfortunately) very likely to be just as bad as in genetics.

@ziemannGeneNameErrors2016 reported that a fifth of genetics publications with supplementary `.xls` or `.xlsx` files with gene lists contained errors caused by Excel's auto-formatting behaviour. The results of this study shocked the research community and a report about it went viral. Click on the link below to read the open-access article "Gene name errors: Lessons not learnt" by @abeysooriyaGeneNameErrors2021 to find out whether the situation has improved since 2016 and answer the questions below.

> Abeysooriya, Mandhri, Megan Soria, Mary Sravya Kasu & Mark Ziemann. 2021. Gene name errors: Lessons not learned. PLOS Computational Biology. Public Library of Science 17(7). e1008984. <https://doi.org/10.1371/journal.pcbi.1008984>.

[**Q2.12**]{style="color:green;"} Has the proportion of genetics publications with Excel gene lists affected by auto-formatting errors decreased since 2016?

```{r}
#| echo: false
#| label: "Q2.12"

check_question("No, it has remained stable.",
               options = c("No, it has remained stable.", "No, it increased between 2016 and 2020.", "Yes, it decreased between 2016 and 2020."),
type = "radio",
q_id = "Q2.12",
random_answer_order = TRUE,
button_label = "Check answer",
right = "That's right: Even though the first widely-read report about this was published in 2016, it was just as bad a problem in 2020!",
wrong = "No. Take another look at Fig. 1C from the paper.")
```

[**Q2.13**]{style="color:green;"} Does using LibreOffice Calc (see @sec-OpenSource) also cause these same issues?

```{r}
#| echo: false
#| label: "Q2.13"

check_question("No, but whilst LibreOffice is better than Excel or Google Sheets, it is still less than ideal for data analysis.",
               options = c("No, but whilst LibreOffice is better than Excel or Google Sheets, it is still less than ideal for data analysis.", "Yes, LibreOffice is just as likely to cause such errors.", "No, if you cannot afford Excel, then LibreOffice Calc is an excellent open-source alternative."),
               type = "radio",
random_answer_order = TRUE,
q_id = "Q2.13",
button_label = "Check answer",
right = "That's right. Abeysooriya et al. (2021) found that LibreOffice and Gnumeric did not convert gene names to dates automatically, which means that they are safer to use than Excel or Google Sheets. However, other issues may still occur and the authors conclude that it is best to use R or Python for data analysis.",
wrong = "Not quite.")
```

[**Q2.14**]{style="color:green;"} Did highly reputable journals publish fewer articles with erroneous Excel gene lists?

```{r}
#| echo: false
#| label: "Q2.14"

check_question("No, they published more.",
               options = c("No, they published more.", "Yes, they published fewer.", "No, publications with problematic Excel files were found in more or less equal proportions in all journals."),
type = "radio",
q_id = "Q2.14",
random_answer_order = TRUE,
button_label = "Check answer",
right = "Yes, that's right. The authors explain why they think that's the case in the discussion section of the paper.",
wrong = "No. Re-read the section on the correlation between journal impact factor and error proportion.")

```
:::
::::

It is worth noting that, for some Windows users, these auto-formatting issues can corrupt files that they have *never* actively opened in Excel! `r emoji("exploding_head")`️ This happens when Windows applies Excel's default settings to all CSV files, regardless of what program they are actually opened with. To ensure that this does not happen to you, check that Excel is *definitely not* your default app to open `.csv` and `.tsv` files (see below for instructions).

::::: {.callout-tip collapse="false"}
#### Opening a `.csv` or `.tsv` file in LibreOffice from a File Finder/Explorer window {#sec-DefaultApp}

Remember that to open a `.csv` or `.tsv` file on your computer, should *never ever* double-click on it and let the default program open it! As we saw in @sec-ExcelWarning, this can break or 'corrupt' the file. To avoid accidentally double-clicking on a `.csv` or `.tsv` file and having the file corrupted, I recommend making either [LibreOffice](https://www.libreoffice.org/) or a plain-text editor (e.g. [Notepad++](https://notepad-plus-plus.org/) or [BBEdit](http://www.barebones.com/products/bbedit/)) your default application to open up such files.

On **MacOS**, you can change the default application used to open files of any file extensions by right-clicking a filename with this particular extension and than selecting 'Get Info' (@fig-ChangeDefaultAppMac01). In the example below, Numbers is the default application for all `.csv` files (see @fig-ChangeDefaultAppMac02). In the dropdown menu 'Open with:', you can then select LibreOffice (provided you have installed it beforehand!) and finally click on 'Change All...' (@fig-ChangeDefaultAppMac03). You will be asked to confirm your choice.

::: {#fig-DefaultExcelMac layout-ncol="3" layout-valign="center"}
![ ](images/ChangeDefaultAppMac01.png){#fig-ChangeDefaultAppMac01 fig-alt="A screenshot showing the context menu of a macOS file that can be accessed with a right click. The drop-down option 'Get Info' is highlighted."}

![ ](images/ChangeDefaultAppMac02.png){#fig-ChangeDefaultAppMac02 fig-alt="A screenshot showing the info dialog of a macOS File Finder window with the 'Open with' section highlighted. It shows that the Numbers app is the default to open such files on this particular computer."}

![ ](images/ChangeDefaultAppMac03.png){#fig-ChangeDefaultAppMac03 fig-alt="A screenshot showing the 'Open with' settings with the 'Change All...' button highlighted, allowing users to permanently change the default application for opening all files of this particular type."}

Changing the default application for a file extension on MacOS
:::

If your operating system is **Windows**, you should look in your Windows' settings for the option 'Default Apps' (see @fig-DefaultAppsWindows).

![Default apps in Windows settings](images/DefaultAppsSetting.png){#fig-DefaultAppsWindows fig-alt="Screenshot of the menu showing the location of the default apps setting in the Windows settings. It can be found under the section 'Apps'." width="70%"}

In the next step, select 'Choose default apps by file type'. Here, you can search for `.csv` as a file type, and choose which program you want to set as the default program for opening `.csv` files. If Excel is currently your default (as in @fig-ExcelDefaultApp), you can click on Excel and choose a different program. LibreOffice is a sensible, open-source alternative (see @fig-DefaultAppsChangeDialog). A plain-text editor such as Notepad would also be fine (also listed on @fig-DefaultAppsChangeDialog).

::: {#fig-DefaultExcelWindows layout="[[50,-2,50], [100]]" layout-valign="center"}
![ ](images/DefaultAppsExcel.png){#fig-ExcelDefaultApp fig-alt="Screenshot of the default setting for the file type '.csv'" width=80%}

![ ](images/DefaultAppsChangeDialog.png){#fig-DefaultAppsChangeDialog fig-alt="Screenshot of the selection dialogue for default apps." width=80%}

Changing the default app for opening `.csv` files in Windows
:::

If it is not possible to adjust the default app settings, either due to insufficient permissions or because you only have temporary access to this PC, do *not* to open `.csv` or `.tsv` files with the default program. Instead, right-click on the filename and, using the 'Open with' option, select the option to open the file with LibreOffice, if available, or else with a plain-text editor.
:::::

::: {.content-visible when-profile="OER"}
### Check your progress 🌟 {.unnumbered}

You have successfully completed [`r checkdown::insert_score()` out of 14 questions]{style="color:green;"} in this chapter.
:::

::: {.content-visible when-format="pdf"}
### Check your progress {.unnumbered}

It's time to complete this chapter's [tasks and quizzes](https://elenlefoll.github.io/RstatsTextbook/quizzes/2_quiz.html). They involve Playmobil and the story of a real-life Excel disaster!
:::

Are you confident that you can...?

-   [ ] Distinguish different types of research data (@sec-ResearchData)
-   [ ] Find and download openly available research data and materials (@sec-Sharing)
-   [ ] Distinguish different data formats using the file extensions (@sec-FileExtensions)
-   [ ] Open delimiter-separated values (DSV) files in LibreOffice Calc (@sec-DSV) - (@sec-DSVLibreOffice)
-   [ ] Explain the risks of opening DSV files in Microsoft Excel and Google sheets (@sec-ExcelWarning)

In @sec-DataManagement, you will learn how to name, save, and back-up research data files so as to facilitate sound data analysis.