wikitablr/README.Rmd at master · baumer-lab/wikitablr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# wikitablr <img src="man/figures/wikitablr_hex_logo.png" align="right" height=150/>

<!-- badges: start -->
[![Travis-CI Build Status](https://travis-ci.org/jkeast/wikitablr.svg?branch=master)](https://travis-ci.org/jkeast/wikitablr)
<!-- badges: end -->


`wikitablr` is an R package that has the tools to simply webscrape tables from wikipedia, and clean for common formatting issues. The intention here is to empower beginners to explore data on practically any subject that interests them (as long as there's a wikipedia table on it), but anyone can utilize this package.

`wikitablr` takes data that looks like this:

```{r, warning = FALSE, message = FALSE, echo = FALSE}
library(wikitablr)
head(read_wiki_raw("https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles", 2))

```

and makes it look like this:

```{r, warning = FALSE, message = FALSE, echo = FALSE}
head(read_wiki_table("https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles", 2))

```

## Installation

You can install the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("jkeast/wikitablr")
```
## Example

`wikitablr` allows you to either read in and clean your code all at once:

```{r}
#read in first table on page
head(read_wiki_table("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Massachusetts"))
```

Or use its cleaning functions seperately:

```{r}
#read in first table from url without cleaning it
example <- read_wiki_raw("https://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_Massachusetts")
head(example)
```

```{r}
#clean names of table
example <- clean_wiki_names(example)

head(example)
```

```{r}
#remove footnotes from table
head(remove_footnotes(example))
```

#### Read all tables

`wikitablr` also allows you to read in and clean all tables on a page at once, putting them into a list.

This is the first table on the wikipedia page "List of World Series champions":

```{r}
example2 <- read_all_tables("https://en.wikipedia.org/wiki/List_of_World_Series_champions")

head(example2[[1]])
```

and the rest:
```{r}
head(example2[[2]])

head(example2[[3]])

head(example2[[4]])
```