workshop-r-intro/04_basic_data_processing.qmd at main · CSCAR/workshop-r-intro · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
# Basic data processing

Now we can apply our understanding of R to work with pre-made files of data.
The first step is to locate our working directory, which is the default
location where R will look for files we want to load and where it will put any
files we save. This directory is different on each computer, but we can find
it by running:
```{r get working directory}
#| eval: false
getwd()
```

```{r print fake working directory}
#| echo: false
print("C:/Users/user_name/workshop_folder/learning_r/code")
```

We can move our working directory to any folder on our computer by writing a
new [file path](https://www.codecademy.com/resources/docs/general/file-paths)
inside the function `setwd()`. I prefer to set my working directory to a folder
dedicated exclusively to whichever project I am currently working on. This way,
every file related to my project is in the same place. For example:

```{r}
#| eval: false
setwd("C:/Users/user_name/workshop_folder/learning_r/code")
```

We can also change our working directory by clicking on
`Session > Set Working Directory > Choose Directory` in the RStudio menu bar.
The graphical user interfaces in Windows and Mac  have similar options. If we
start R from a UNIX command line (as on Linux machines), the working directory
will be whichever directory we were in when we called R.

`list.files()` will show us what files are in our working directory. If the
file that we want to open is in our working directory, then we are ready to
proceed.

## Loading data

Once we can locate files in our computer, we can load them into R. Note,
however, that we need specific ways to open different file formats.

### Plain text files

A plain-text file stores a table of data in a text document. Each row of the
table is saved in its own line, and a simple symbol separates the cells within
a row. This symbol is most often a comma, and sometimes a tab or a pipe
delimiter `|`, but it can also be any other character. Each file uses only one
symbol to separate cells, which minimizes confusion.

Plain-text files are simple and many programs can read them. This is why many
organizations (e.g., the Census Bureau and the Social Security Administration)
publish their data as plain-text files.

We will work with data
from [this](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower.csv)^[You can find the original file [here](https://alexd106.github.io/intro2R/data.html),
courtesy of Douglas et al. (see references).] plain text file. Use
`Ctrl+Shift+s` to download the file. I will save it in a folder called
*data_files* inside my working directory under the name *flower.csv*. You can
save it wherever you want as long as you can keep track of it.

We can use `read.table()` to load plain-text files into R. The first argument
of `read.table()` is a string with the name of our file (if it is in your
working directory), or with the file path to our file (if it is not in our
working directory).
```{r loading flower_df}
flower_df <- read.table(
    "data_files/flower.csv",
    header = TRUE,
    sep = ",",
    encoding = "UTF-8"
)
```

In the code above, I added arguments `header`, `sep`, and `encoding`. `header`
tells R whether the first line of the file contains variable names instead of
values; this will help us identify the variables in the data frame. `sep`
tells R the symbol that the file uses to separate the cells; this will help us
preserve the correct location of the data cells. `encoding` tells R the type of
[machine-speak](https://en.wikipedia.org/wiki/Character_encoding) that it needs
to understand the symbols in the file; this can prevent having weird, mistaken
symbols in our loaded data.

Other useful arguments are `skip` and `nrow`. `skip` tells R to omit a
specific number of lines before it starts reading values from the file. This
argument is helpful when the file starts with text that is not part of the data
set, or when we want to read only part of a data set. `nrow` tells R to only
read a certain number of lines, starting from the top. Keep in mind that `nrow`
does not count the header in the number of rows it reads.

```{r}
flower_df_chunk <- read.table(
    "data_files/flower.csv",
    header = TRUE,
    sep = ",",
    skip = 0,
    nrow = 3
)
flower_df_chunk
```

R has shortcut functions that call `read.table()` in the background with
different default values for popular types of files:

+ `read.table` is the general purpose read function.
+ `read.csv` reads comma-separated values (.csv) files.
+ `read.delim` reads tab-delimited files.
+ `read.csv2` reads .csv files with European decimal format.
+ `read.delim2` reads tab-delimited files with European decimal format.

`read.table()` and its shortcuts allow us to load data files directly from a
website. Instead of using the file's path or name, we can directly use a web
address in the `file` argument of the function. Make sure to use the web
address that links directly to the file, not to a web page that has a link to
the file.

There is a special type of plain-text file called *fixed-width file* (.fwf).
Instead of a symbol, a fixed-width file uses its layout to separate data cells.
Each row is still in a single line, and each column begins at a specific number
of characters from the left-hand side of the document. To correctly position
its data, the file adds an arbitrary number of character spaces between data
entries.

If our flowers data came in a fixed-width file, the first few lines would look
like this:
```{r flowers as a fwf}
#| eval: false
treat  nitrogen block  height  weight  leafarea  shootarea  flowers
tip    medium   1      7.5     7.62    11.7      31.9       1
tip    medium   1      10.7    12.14   14.1      46.0       10
tip    medium   1      11.2    12.76   7.1       66.7       10
tip    medium   1      10.4    8.78    11.9      20.3       1
tip    medium   1      10.4    13.58   14.5      26.9       4
tip    medium   1      9.8     10.08   12.2      72.7       9
```

Fixed-width files may be visually intuitive, but computers find them difficult
to work with. This may explain why R has a function for reading fixed-width
files, but not for creating or saving them.

We can read fixed-width files into R with the function `read.fwf()`. This
function adds another argument to the ones from `read.table()`: `widths`, which
should be a vector of numbers. Each ith entry of the `widths` vector should
state the width (in characters) of the ith column of the data set.

```{r}
#| include: false
flowers_fwf_df <- read.fwf(
    "data_files/flowers.fwf",
    widths = c(6, 9, 6, 7, 7, 9, 10, 9),
    header = FALSE,
    skip = 1
)
flowers_fwf_df
```

### Excel files

The best way to load data from Excel files (.xlsx) is to first save these files
as .csv or .txt files and then use `read.table`. Excel files can include
multiple spreadsheets, macros, colors, dynamic tables, and other complicated
features that make it difficult for R to read the files properly. Plain text
files are simpler, so we can load and transfer them more easily.

Still, it is possible to load Excel files directly if we *really* need to. R
has no native way of loading these files, but we can use the package `readxl`,
which works on Windows, OS X, and Linux. We install it using
`install.packages("readxl")` and then load it using `library(readxl)`. Once we
load the package, we can use the function `read_excel()` to load files of the
type .xls and .xlsx (see `help("read_excel")` for more information).

### Files from other programs

As with Excel files, I suggest that you first try to transform files from other
programs to plain-text files. This transformation is usually the best way to
verify that our data are transcribed properly.

Still, sometimes we can't transform the file to a plain-text format---maybe
because we can't access the program that created the file (e.g., SAS or SPSS).
In these cases, we can resort to one of several libraries:

+ `haven`, for reading files from SAS, SPSS, and Stata.
+ `R.matlab` for reading files for versions MAT 4 and MAT 5.
+ `foreign` for reading minitab and Systat file formats. This library can also
read files from SAS, SPSS, and Stata, but I prefer to use `haven` in these
cases.

## Cleaning data

Once we load our data files as data.frames, we should verify that all of the
information has an appropriate format. The process of identifying, removing,
and correcting inaccurate information is often referred to as *data cleaning*.
We will practice data cleaning using a messy version of the flower data that
we loaded above. You can get this messy version from
[here](https://github.com/CSCAR/workshop-r-intro/blob/main/data_files/flower_messy.csv).
Again, you can use `Ctrl+Shift+s` to download the file.

Since this is a .csv file, we can load it using:
```{r loading messy flower data}
flower_messy_df = read.csv(
    file = "data_files/flower_messy.csv",
    header = TRUE,
    encoding = "UTF-8"
)
```

First, we should ensure the column names follow the rules we saw in section 1.
This will facilitate accessing the data in the columns later. We can check
these column names using the `colnames()` function:
```{r check colnames}
colnames(flower_messy_df)
```

If we opened the data file using something like Excel or Notepad, we would see
that the names for columns 6 and 7 had blank spaces inside them. When loading
the data, `read.csv()` automatically substitutes these blank spaces with
periods `.`, so that the names conform to R's conventions. `read.csv()` is
pretty good at checking column names and other things, but it's not perfect.
So, it's always a good idea to double-check everything ourselves.

The column names of `flower_messy_df` are legible, but we don't want to
struggle with their mix of upper and lower-case letters. Let's rewrite all the
names in lower case, which is quick and easy if we use `tolower()`.
```{r colnames to lower case}
new_colnames <- tolower(colnames(flower_messy_df)) # Modify column names
new_colnames
```

These new column names are better, but we still need to change them inside
`flower_messy_df`. Before moving on, let's create a new data set called
`flower_clean_df`.
```{r create flower_clean_df}
flower_clean_df <- flower_messy_df
```

Using a copy of the original data set makes it easier to track our changes
because we can always look back at the original version. It also makes it
faster to mend mistakes because it avoids reloading our original data, which
can take a long time with large files.

Now we can use our improved column names.
```{r}
colnames(flower_clean_df) <- new_colnames # Replace column names in data frame
colnames(flower_clean_df) # Verify replacement
```

The last change to these column names will be to substitute the periods with
underscores. In R, this is purely out of personal preference, but it's a good
excuse to meet `gsub()`, which substitutes patterns of strings:
```{r substitute periods with underscores in colnames}
colnames(flower_clean_df) <- gsub(
    pattern = "\\.", # What we want to remove
    replacement = "_", # What we want to have instead
    x = colnames(flower_clean_df) # The object we want to modify
)
colnames(flower_clean_df)
```

Note that I had to use `"\\."` instead of simply `"."` to match the period. The
reason is that `gsub()` interprets `"."` as saying "match any character". This
may sound silly but it helps when working with [regular expressions](https://en.wikipedia.org/wiki/Regular_expression)---a
syntax to find many different, complicated patterns in strings. Regular
expressions are too complicated to explain here, but if you expect to work with
text data regularly, I encourage you to learn more about them.

With our improved column names it will be easier to focus on giving every
column an appropriate format: numbers should be of type "double" or "integer",
and text should be of type "character" or "factor". Let's check the types of
the columns in our current data set.

```{r check column types}
str(flower_clean_df)
```

Column "flowers" seems to contain numbers but is classified as type
"character". The reason is that there are quotes around the first value in this
column:
```{r}
head(flower_clean_df[["flowers"]])
```

R recognizes that the value itself has quotes, so it adds a backslash `\` to
differentiate them from the quotes it uses to print strings. Let's remove these
confusing quotes.

```{r eliminate quotes from flowers column}
flower_clean_df["flowers"] <- gsub(
    pattern = "\"", # \" the backlash tells R to match quotes
    replacement = "", # This is how we write "nothing"
    x = flower_clean_df$flowers # x needs to be a vector, so use
                                # double brackets or dollar sign
)
head(flower_clean_df$flowers)
```

Now we can coerce the column "flowers" to be of type "double".
```{r coerce flowers column into double}
flower_clean_df["flowers"] <- as.numeric(flower_clean_df$flowers)
typeof(flower_clean_df$flowers)
head(flower_clean_df$flowers)
```

Columns "treat" and "nitrogen" are of type character. This is not wrong, but
they will be easier to handle if we convert them to factors.
```{r}
flower_clean_df["treat"] <- factor(flower_clean_df$treat)
flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
str(flower_clean_df)
```

Column "treat" looks fine, but column "nitrogen" looks suspicious. It is
supposed to have only three levels ("low", "medium", and "high"), but its
description counts eight. Let's examine these levels:
```{r check levels of nitrogen column}
levels(flower_clean_df$nitrogen)
```

Remember that R is case sensitive, so it interprets each of spelling "high"
and "low" as a different value. We can fix this using `tolower()` once more.
Note that this will convert the "nitrogen" column back to a simple character
type, so we have to reconvert it to factor.
```{r nitrogen column to all lowercase}
flower_clean_df["nitrogen"] <- tolower(flower_clean_df$nitrogen)
flower_clean_df["nitrogen"] <- factor(flower_clean_df$nitrogen)
levels(flower_clean_df$nitrogen)
```

Unless I have a good reason not to, I usually transform all character columns
to have only lower case letters.

## Data summaries and visualizations

Now that our data are clean, we can get a complete summary to understand them
better. Function `summary()` recognizes the type of each column and displays
some notable features:

```{r summary of flower_clean_df}
summary(flower_clean_df)
```

Now let's imagine we want to study the distribution of values for weight. We
can use a histogram to check the shape of this distribution.

```{r histogram for weight}
hist(
    flower_clean_df$weight,
    breaks = 15,
    xlim = c(5, 25),
    xlab = "Weight",
    main = "Few weights are above 20"
)
```

Or we can get a simpler description using a box plot.
```{r boxplot for weight}
boxplot(
    flower_clean_df$weight,
    ylab = "Weight",
    col = "darkgreen",
    main = "Most weights are between 9 and 15"
)
```


A single box plot is less descriptive than a histogram. But it is easier to
compare box plots to look for big differences between distributions. Let's
compare the distributions of height by nitrogen level:

```{r height by nitrogen boxplots}
boxplot(
    height ~ nitrogen,
    data = flower_clean_df,
    col = c("yellow", "blue", "pink"),
    main = "No clear association between height and nitrogen"
)
```

Now let's investigate the relationship between shoot area and leaf area. And
let's check whether this relationship changes depending on the value of treat.
We can use a scatter plot with shoot area and leaf area, and we can color each
point by their treat value.

```{r leaf area vs shoot area by treat}
plot(
    x = flower_clean_df$leaf_area,
    y = flower_clean_df$shoot_area,
    col = flower_clean_df$treat,
    main = "Shoot area seems proportional to leaf area",
    xlab = "Leaf area",
    ylab = "Shoot area"
)
# Add a legend to the plot
legend(
    x = "bottomright",
    legend = levels(flower_clean_df$treat),
    col = 1:length(levels(flower_clean_df$treat)),
    pch = 16
)
```

Now let's see how frequently the values of nitrogen and treat combine with each
other, but only for flowers with a leaf area greater than 13. We can use a
mosaic plot for this.

```{r mosaic plot for nitrogen vs treat}
nitrogen_by_treat_table = xtabs(
    formula = ~ nitrogen + treat,
    data = flower_clean_df[which(flower_clean_df$leaf_area > 13),]
)
nitrogen_by_treat_table
mosaicplot(nitrogen_by_treat_table, main = "Nitrogen by treat table")
```

## Success!

Dear reader, you are now a capable useR. I hope this humble introduction helps
you learn more about many different topics in R. Be curious, be bold, and,
above all, be patient. Rome wasn't built in a day. Best of luck!

## References

The subsection on loading data is based on ["Hands-On Programming with R"](https://rstudio-education.github.io/hopr/),
by Garret Grolemund; and on ["An Introduction to R"](https://intro2r.com/), by
Alex Douglas, Deon Roos, Francesca Mancini, Ana Couto & David Lusseau.