r4strings/data-cleaning.Rmd at master · simularis/r4strings · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
# Data Cleaning {#cleaning}

```{r echo = FALSE, message = FALSE}
library(rvest)
library(stringr)
```

## Introduction

In this application we are going to work with the Men's Long Jump World Record Progression data from wikipedia (see screenshot below).

https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression#Low_altitude_record_progression_1965%E2%80%931991

```{r echo = FALSE, out.width = NULL, fig.cap="Men's Long Jump World Record Progression"}
knitr::include_graphics("images/men-long-jump.png")
```

```{r echo = FALSE}
long_jump <- read_html('data/men-long-jump.html')
tbl <- html_table(html_node(long_jump, 'table'))
```


## Import Data

To import the data of the Record Progression table you can use a couple of functions from the package `rvest`.

```{r eval = FALSE}
library(rvest)

wiki_jump <- 'https://en.wikipedia.org/wiki/Men%27s_long_jump_world_record_progression'

long_jump <- read_html(wiki_jump)
tbl <- html_table(html_node(long_jump, 'table'))
```

The function `read_html()` reads the html file of the wikipedia page. This will produce an object of type `"xml_document"` which we can further manipulate with other functions in `"rvest"`.

Because the _Record progression_ data is in an html `table` node, you can use `html_node()` to locate such table in the XML document. And then _extract_ it with `html_table()`.

```{r}
str(tbl, vec.len = 1)
```

As you can tell, the extracted table `tbl` is a data frame with `r nrow(tbl)` rows and `r ncol(tbl)` columns.


## Extracting Meters

The first task consists of looking at the values in column `Mark`, and find how to retrieve the distance values expressed in meters. For example, the first element in `Mark` is:

```{r}
tbl$Mark[1]
```

The goal is to obtain the number `7.61`. One way to achieve this task is via the `substr()` function.

```{r}
substr(tbl$Mark[1], start = 1, stop = 4)
```

We can do that for the entire vector:

```{r}
meters <- substr(tbl$Mark, start = 1, stop = 4)
meters
```

Notice that the meter values are not really numeric but character. In order to have `meters` as numbers, we should coerce them with `as.numeric()`

```{r}
meters <- as.numeric(substr(tbl$Mark, start = 1, stop = 4))
meters
```


#### Extracting Meters with Regular Expressions {-}

Instead of using the function `substr()` to obtain the distance values, let's see how to achieve the same task using regular expressions. To do this we must determine a pattern to be matched. So, what is the pattern that all the distance values (in meters) have in common?

```{r}
mark1 <- tbl$Mark[1]
mark1
```

If you look at the `Mark` content, you will notice that the target pattern is formed by: a digit, followed by a dot, followed by two digits. Such pattern can be codified as: `"[0-9]\\.[0-9][0-9]"`. So let's test it and see if there's match:

```{r}
str_detect(mark1, pattern = "[0-9]\\.[0-9][0-9]")
```

To extract the distance pattern we use `str_extract()`

```{r}
str_extract(mark1, pattern = "[0-9]\\.[0-9][0-9]")
```

And then we can apply it on the entire column to get:

```{r}
str_extract(tbl$Mark, pattern = "[0-9]\\.[0-9][0-9]")
```


## Extracting Country

Consider the column `Athlete`. The first value corresponds to `Petter O'Connor` from Ireland.

```{r}
tbl$Athlete[1]
```

Let's create a vector `peter` for this athlete:

```{r}
peter <- tbl$Athlete[1]
```

How can we get the country abbreviation?

```{r}
substr(peter, nchar(peter)-4, nchar(peter))
```

That works but it is preferable to exclude the parentheses, that is, the third to last character, as well as the last character:

```{r}
substr(peter, nchar(peter)-3, nchar(peter)-1)
```

Now we can apply the `substr()` command with all the athletes:

```{r}
# extract country
substr(tbl$Athlete, nchar(tbl$Athlete)-4, nchar(tbl$Athlete))
country <- substr(tbl$Athlete, nchar(tbl$Athlete)-3, nchar(tbl$Athlete)-1)
country
```


## Cleaning Dates

Now let's consider tha values in column `Date`:

```{r}
# first 5 dates
tbl$Date[1:5]
```

Notice that all the date values are formed by the day-number, the name of the month, the year, and then the characters `[1]`. Obviously we don't need those last three characters `[1]`.

```{r}
date1 <- tbl$Date[1]
date1
```

First let's see how to match the pattern `[1]`. Perhaps the first option that an inexperience user would try is:

```{r}
str_detect(date1, pattern = "[1]")
```

According to `str_detect()`, there's is a match, so let's see what exactly `"[1]"` is matching:

```{r}
str_match(date1, pattern = "[1]")
```

Mmmm, not quite right. We are matching the character `"1"` but not `"[1]"`. Why? Because brackets are metacharacters. So in order to match brackets _as brackets_ we need to escape them:

```{r}
str_match(date1, pattern = "\\[1\\]")
```

Now we are talking. The next step involves using `str_replace()` to match the pattern `"\\[1\\]"` and replace it with an empty string `""`:

```{r}
str_replace(date1, pattern = "\\[1\\]", replacement = "")
```

Then, we can get an entire vector of clean dates:

```{r}
# clean dates
dates <- str_replace(tbl$Date, pattern = "\\[1\\]", replacement = "")
dates
```


## Month and Day

We can further manipulate the dates. For example, say we are interested in extracting the name of the month. In the first date, this corresponds to extracting `"August"`:

```{r}
dates[1]
```

How can we do that? Several approaches can be applied in this case. For example, let's inspect the format of the month names:

```{r}
dates[1:5]
```

They all begin with an upper case letter, followed by the rest of the characters in lower case. If we want to match month names formed by four letters (e.g. June, July), we could look for the pattern `"[A-Z][a-z][a-z][a-z]"`

```{r}
str_extract(dates, pattern = "[A-Z][a-z][a-z][a-z]")
```

The previous pattern `"[A-Z][a-z][a-z][a-z]"` not only matches `"June"` and `"July"` but also `"Augu"`, `"Sept"`, `"Octo"`. In addition, we have some missing values.

Because the month names have variable lengths, we can use a repetition or quantifier operator. More specifically, we could look for the pattern `"[A-Z][a-z]+"`, that is: an upper case letter, followed by a lower case letter, repeated one or more times. The plus `+` tells the regex engine to attempt to match the preceding token once or more:

```{r}
month_names <- str_extract(dates, pattern = "[A-Z][a-z]+")
month_names
```

Having extracted the name of the months, we can take advantage of a similar pattern to extract the days. How? Using a pattern formed by one digit range and the plus sign: `"[0-9]+"`

```{r}
str_extract(dates, pattern = "[0-9]+")
```


## Extracting Year

What about extracting the year number?

```{r}
dates[1]
```

One option that we have discussed already is to use `substr()` or `str_sub()`

```{r}
str_sub(dates, start = nchar(dates)-3, end = nchar(dates))
```

or simply indicate a negative starting position (to counting from the end of the string):

```{r}
str_sub(dates, start = -4)
```

Another option consists in using a pattern formed by four digits: `"[0-9][0-9][0-9][0-9]"`:

```{r}
str_extract(dates[1], pattern = "[0-9][0-9][0-9][0-9]")
```

An additional option consists in using an _end of string anchor_ with the metacharacter `"$"` (dollar sign), and combine with a repetition operator `"+"` like: `"[0-9]+$"`:

```{r}
str_extract(dates[1], pattern = "[0-9]+$")
```

What is this pattern doing? The part of the pattern `"[0-9]+"` indicates that we want to match one or more digits. In order to tell the engine to match the pattern at the end of the string, we must use the anchor `"$"`.

The same task can be achieved with a digit character class `\\d` and the repetition operator `+`:

```{r}
str_extract(dates[1], pattern = "\\d+$")
```


### Athlete's Name

```{r}
# First name
str_extract(tbl$Athlete, pattern = "[A-Z][a-z]+")
```

The pattern `"[A-Z][a-z]+"` fails to match the name of the fourth athlete `r tbl$Athlete[4]`. One way to match an optional upper case in the third position is with the following pattern: `"[A-Z][a-z][A-Z]?[a-z]+"`:

```{r}
# First name
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+")
```

An alternative option is to use the _word_ character class `\\w` repeated one or more times: `"\\w+"`

```{r}
# First name
str_extract(tbl$Athlete, pattern = "\\w+")
```


## Athlete Names

Now let's try to extract the athletes' first and last names. We could specify a regex pattern for the first name `[A-Z][a-z][A-Z]?[a-z]+`, followed by a space, followed by an uper case letter, and one or more lower case letters ` [A-Z][a-z]+`:

```{r}
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z][a-z]+")
```

What about the first athlete Peter O'Connor? The previous pattern does not include the apostrophe.

```{r}
# works for Peter O'Connor only
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]'[A-Z][a-z]+")
```

What about this other pattern?

```{r}
# still only works for Peter O'Connor
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]'[A-Z]?[a-z]+")
```

Recall that the quantifier (or repetition) operators have an effect on the preceding token. So, the pattern `"[A-Z]'[A-Z]?[a-z]+"` means: an upper case letter, followed by an apostrophe, followed by an optional upper case, followed by one or more lower case letters. In other words, the quantifier `"?"` only has an effect on the second upper case letter.

In reality, we want both the apostrophe and the second upper case letters to be optional, so we need to add quantifiers `"?"` to both of them:

```{r}
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]'?[A-Z]?[a-z]+")
```

If you want to treat a set of characters as a single unit, you must wrap them inside parentheses:

```{r}
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]('[A-Z])?[a-z]+")
```

We still have an issue with athlete `Igor Ter-Ovanesyan`. The patterns used so far are only matching the the characters in his last name before the hyphen. We can start by adding a escaped hyphen inside the character set `"[a-z\\-]"` at the end of the pattern:

```{r}
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]('[A-Z])?[a-z\\-]+")
```

Notice that this pattern does match the hyphen but fails to match the second part of the last name (the one after the hyphen). This is because our token is only matching lower case letters. So we also need to include upper case letters in the character set: `"[a-zA-Z\\-]"`

```{r}
str_extract(tbl$Athlete, pattern = "[A-Z][a-z][A-Z]?[a-z]+ [A-Z]('[A-Z])?[a-zA-Z\\-]+")
```

The regex patterns that involve a set such as `"[a-zA-Z]"` can be simplified with a repeated __word__ character class `"\\w+"` (recall that `"\\w+"` is equivalent to `"[0-9A-Za-z_]"`). We can try to use two repeated word classes:

```{r}
str_extract(tbl$Athlete, pattern = "\\w+ \\w+")
```

As you know, we also need to include an apostrphe and the hyphen. In this case, we can include them inside parentheses and separating them with the OR operator `"|"`:

```{r}
str_extract(tbl$Athlete, pattern = "\\w+ (\\w|-|')+")
```


-----

#### Make a donation {-}

If you find this resource useful, please consider making a one-time donation in any amount. Your support really matters.

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_donations" />
<input type="hidden" name="business" value="ZF6U7K5MW25W2" />
<input type="hidden" name="currency_code" value="USD" />
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" title="PayPal - The safer, easier way to pay online!" alt="Donate with PayPal button" />
<img alt="" border="0" src="https://www.paypal.com/en_US/i/scr/pixel.gif" width="1" height="1" />
</form>