diff --git a/manuscript/apply.Rmd b/manuscript/apply.Rmd index c4f8878..90c63f8 100644 --- a/manuscript/apply.Rmd +++ b/manuscript/apply.Rmd @@ -47,7 +47,7 @@ Note that the actual looping is done internally in C code for efficiency reasons It's important to remember that `lapply()` always returns a list, regardless of the class of the input. -Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, the the names will be preserved in the output. +Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output. ```{r} @@ -86,7 +86,7 @@ x <- 1:4 lapply(x, runif, min = 0, max = 10) ``` -So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10. +So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10. The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These are functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace. @@ -165,7 +165,7 @@ where - `f` is a factor (or coerced to one) or a list of factors - `drop` indicates whether empty factors levels should be dropped -The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. +The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable. @@ -413,13 +413,13 @@ With `mapply()`, instead we can do This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument. -Here's another example for simulating randon Normal variables. +Here's another example for simulating random Normal variables. ```{r} noise <- function(n, mean, sd) { rnorm(n, mean, sd) } -## Simulate 5 randon numbers +## Simulate 5 random numbers noise(5, 1, 2) ## This only simulates 1 set of numbers, not 5 @@ -484,9 +484,9 @@ Pretty cool, right? * The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form -* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results. +* The operation of a loop function involves iterating over an R object (e.g. 
a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results. -* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere +* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere. -* The `split()` function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions. +* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions. diff --git a/manuscript/control.Rmd b/manuscript/control.Rmd index b9f98ef..1869326 100644 --- a/manuscript/control.Rmd +++ b/manuscript/control.Rmd @@ -28,7 +28,7 @@ Commonly used control structures are - `next`: skip an interation of a loop Most control structures are not used in interactive sessions, but -rather when writing functions or longer expresisons. However, these +rather when writing functions or longer expressions. However, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions. @@ -58,8 +58,7 @@ an `else` clause. ```r if() { ## do something -} -else { +} else { ## do something else } ``` @@ -123,12 +122,12 @@ if() { [Watch a video of this section](https://youtu.be/FbT1dGXCCxU) -For loops are pretty much the only looping construct that you will +`for` loops are pretty much the only looping construct that you will need in R. While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop wasn't sufficient. -In R, for loops take an interator variable and assign it successive +In R, for loops take an iterator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.) @@ -210,7 +209,7 @@ functions (discussed later). [Watch a video of this section](https://youtu.be/VqrS1Wghq1c) -While loops begin by testing a condition. If it is true, then they +`while` loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits. @@ -223,7 +222,7 @@ while(count < 10) { } ``` -While loops can potentially result in infinite loops if not written +`while` loops can potentially result in infinite loops if not written properly. Use with care! Sometimes there will be more than one condition in the test. @@ -259,7 +258,7 @@ not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a `repeat` loop is to call `break`. -One possible paradigm might be in an iterative algorith where you may +One possible paradigm might be in an iterative algorithm where you may be searching for a solution and you don't want to stop until you're close enough to the solution. In this kind of situation, you often don't know in advance how many iterations it's going to take to get @@ -322,8 +321,8 @@ for(i in 1:100) { ## Summary -- Control structures like `if`, `while`, and `for` allow you to - control the flow of an R program +- Control structures, like `if`, `while`, and `for`, allow you to + control the flow of an R program. 
- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct. diff --git a/manuscript/debugging.Rmd b/manuscript/debugging.Rmd index 53333e3..a8c1705 100644 --- a/manuscript/debugging.Rmd +++ b/manuscript/debugging.Rmd @@ -128,7 +128,7 @@ You can see now that the correct messages are printed without any warning or err ## Figuring Out What's Wrong -The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important first understand what you were expecting to occur. Then you need to idenfity what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are +The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important to first understand what you were expecting to occur. Then you need to identify what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are - What was your input? How did you call the function? - What were you expecting? Output, messages, other results? @@ -269,11 +269,11 @@ Enter a frame number, or 0 to exit Selection: ``` -The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. +The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. ## Summary -- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal -- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation -- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions +- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal. +- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation. +- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions. - Debugging tools are not a substitute for thinking! diff --git a/manuscript/dplyr.Rmd b/manuscript/dplyr.Rmd index ce79ca1..8bf464d 100644 --- a/manuscript/dplyr.Rmd +++ b/manuscript/dplyr.Rmd @@ -40,7 +40,7 @@ Some of the key "verbs" provided by the `dplyr` package are * `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline -The `dplyr` package as a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about. 
+The `dplyr` package has a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about. @@ -52,7 +52,7 @@ All of the functions that we will discuss in this Chapter will have a few common 2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names). -3. The return result of a function is a new data frame +3. The return result of a function is a new data frame. 4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation. @@ -84,7 +84,7 @@ You may get some warnings when the package is loaded because there are functions ## `select()` -For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my web site. +For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my website. After unzipping the archive, you can load the data into R using the `readRDS()` function. ```{r} @@ -101,7 +101,7 @@ str(chicago) The `select()` function can be used to select columns of a data frame that you want to focus on. Often you'll have a large data frame containing "all" of the data, but any *given* analysis might only use a subset of variables or observations. The `select()` function allows you to get the few columns you might need. -Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly. +Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly. ```{r} names(chicago)[1:3] @@ -197,7 +197,7 @@ and the last few rows. tail(select(chicago, date, pm25tmean2), 3) ``` -Columns can be arranged in descending order too by useing the special `desc()` operator. +Columns can be arranged in descending order too by using the special `desc()` operator. ```{r} chicago <- arrange(chicago, desc(date)) @@ -221,7 +221,7 @@ Here you can see the names of the first five variables in the `chicago` data fra head(chicago[, 1:5], 3) ``` -The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible. +The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably need to be renamed to something more sensible. 
```{r} chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2) @@ -365,11 +365,11 @@ Here we can see that `o3` tends to be low in the winter months and high in the s The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`. -Once you learn the `dplyr` grammar there are a few additional benefits +Once you learn the `dplyr` grammar there are a few additional benefits: -* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package +* `dplyr` can work with other data frame "backends", such as SQL databases. There is a SQL interface for relational databases via the DBI package. -* `dplyr` can be integrated with the `data.table` package for large fast tables +* `dplyr` can be integrated with the `data.table` package for large fast tables. -The `dplyr` package is handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time! +The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time! diff --git a/manuscript/example.Rmd b/manuscript/example.Rmd index 3b5f0a6..3b0df0d 100644 --- a/manuscript/example.Rmd +++ b/manuscript/example.Rmd @@ -4,7 +4,7 @@ knitr::opts_chunk$set(comment = NA, fig.path = "images/", collapse = TRUE, prompt = TRUE) ``` -This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions. +This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agency's freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions. [Watch a video of this chapter](https://youtu.be/VE-6bQvyfTQ) @@ -16,12 +16,12 @@ In this chapter we aim to describe the changes in fine particle (PM2.5) outdoor ## Loading and Processing the Raw Data -From the [EPA Air Quality System](http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html) we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012. +From the [EPA's Air Quality System](https://aqs.epa.gov/aqsweb/airdata/download_files.html), we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012. ### Reading in the 1999 data -We first read in the 1999 data from the raw text file included in the zip archive. 
The data is a delimited file were fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data. +We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file where fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data. ```{r read 1999 data,cache=TRUE,tidy=FALSE} @@ -99,10 +99,11 @@ Interestingly, from the summary of `x1` it appears there are some negative value ```{r check negative values} negative <- x1 < 0 +length(negative[negative=="TRUE"])/length(x1) # proportion of negative values mean(negative, na.rm = T) -```` +``` -There is a relatively small proportion of values that are negative, which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. +There is a relatively small proportion of values that are negative (`r round(length(negative[negative=="TRUE"])/length(x1),3)`), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. ```{r converting dates,cache=TRUE} dates <- pm1$Date @@ -118,7 +119,7 @@ tab <- table(factor(missing.months, levels = month.name)) round(100 * tab / sum(tab)) ``` -From the table above it appears that bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now. +From the table above it appears the bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now. ### Changes in PM levels at an individual monitor @@ -141,7 +142,7 @@ str(site0) str(site1) ``` -Finaly, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods. +Finally, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods. ```{r} both <- intersect(site0, site1) @@ -195,11 +196,11 @@ plot(dates1, x1sub, pch = 20, ylim = rng, xlab = "", ylab = expression(PM[2.5] * abline(h = median(x1sub, na.rm = T)) ``` -From the plot above, we can that median levels of PM (horizontal solid line) have decreased a little from `r median(x0sub,na.rm=TRUE)` in 1999 to `r median(x1sub,na.rm=TRUE)` in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. 
This suggest that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on. +From the plot above, we can see that median levels of PM (horizontal solid line) have decreased a little from `r median(x0sub,na.rm=TRUE)` in 1999 to `r median(x1sub,na.rm=TRUE)` in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggests that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on. ### Changes in state-wide PM levels -Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that that the are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor. +Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that they are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor. What we do here is calculate the mean of PM for each state in 1999 and 2012. diff --git a/manuscript/functions.Rmd b/manuscript/functions.Rmd index 80da217..0e5e002 100644 --- a/manuscript/functions.Rmd +++ b/manuscript/functions.Rmd @@ -29,7 +29,7 @@ treated much like any other R object. Importantly, - Functions can be passed as arguments to other functions. This is very handy for the various apply functions, like `lapply()` and `sapply()`. - Functions can be nested, so that you can define a function inside of - another function + another function. If you're familiar with common language like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis. @@ -63,7 +63,7 @@ The last aspect of a basic function is the *function arguments*. These are the options that you can specify to the user that the user may -explicity set. For this basic function, we can add an argument that +explicitly set. For this basic function, we can add an argument that determines how many times "Hello, world!" is printed to the console. ```{r} @@ -75,7 +75,7 @@ f <- function(num) { f(3) ``` -Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? 
Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times the need to see "Hello, world!". +Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see "Hello, world!". > In general, if you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function. @@ -98,7 +98,7 @@ print(meaningoflife) In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function. -Note that there is a `return()` function that can be used to return an explicity value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). +Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error. @@ -145,7 +145,7 @@ Specifying an argument by its name is sometimes useful if a function has many ar ## Argument Matching -Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handing when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()` +Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handy when doing interactive work at the command line. R function arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()` ```{r} str(rnorm) @@ -301,14 +301,10 @@ paste("a", "b", se = ":") ## Summary -* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object - -* Functions have can be defined with named arguments; these function arguments can have default values - -* Functions arguments can be specified by name or by position in the argument list - -* Functions always return the last expression evaluated in the function body - +* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object. +* Functions can be defined with named arguments; these function arguments can have default values. +* Function arguments can be specified by name or by position in the argument list. +* Functions always return the last expression evaluated in the function body. * A variable number of arguments can be specified using the special `...` argument in a function definition. 
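To see that last point in action, here is a minimal sketch of how `...` forwards extra arguments to another function; the `trimmed_mean()` helper and the values passed to it are hypothetical, not taken from the manuscript.

```r
## Arguments not named in the definition (e.g. na.rm) are captured by `...`
## and passed straight through to mean()
trimmed_mean <- function(x, trim = 0.1, ...) {
    mean(x, trim = trim, ...)
}
trimmed_mean(c(1, 5, NA, 100), na.rm = TRUE)  ## na.rm is forwarded to mean()
```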
diff --git a/manuscript/gettingstarted.Rmd b/manuscript/gettingstarted.Rmd index 31b199f..5f7e91d 100644 --- a/manuscript/gettingstarted.Rmd +++ b/manuscript/gettingstarted.Rmd @@ -16,7 +16,7 @@ There is also an integrated development environment available for R that is built by RStudio. I really like this IDE---it has a nice editor with syntax highlighting, there is an R object viewer, and there are a number of other nice features that are integrated. You can -see how to install RStudio here +see how to install RStudio here: - [Installing RStudio](https://youtu.be/bM7Sfz-LADM) @@ -28,7 +28,7 @@ site](http://rstudio.com). After you install R you will need to launch it and start writing R code. Before we get to exactly how to write R code, it's useful to get a sense of how the system is organized. In these two videos I talk -about where to write code and how set your working directory, which +about where to write code and how to set your working directory, which let's R know where to find all of your files. - [Writing code and setting your working directory on the Mac](https://youtu.be/8xT3hmJQskU) diff --git a/manuscript/nutsbolts.Rmd b/manuscript/nutsbolts.Rmd index 2d83046..6b23e4a 100644 --- a/manuscript/nutsbolts.Rmd +++ b/manuscript/nutsbolts.Rmd @@ -132,7 +132,7 @@ used in ordinary calculations; e.g. `1 / Inf` is 0. The value `NaN` represents an undefined value ("not a number"); e.g. 0 / 0; `NaN` can also be thought of as a missing value (more on that -later) +later). ## Attributes @@ -244,7 +244,7 @@ from R. Matrices are vectors with a _dimension_ attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, -number of columns) +number of columns). ```{r} m <- matrix(nrow = 2, ncol = 3) @@ -297,7 +297,7 @@ x ``` We can also create an empty list of a prespecified length with the -`vector()` function +`vector()` function. ```{r} x <- vector("list", length = 5) @@ -311,7 +311,7 @@ x Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a _label_. Factors are important in statistical modeling -and are treated specially by modelling functions like `lm()` and +and are treated specially by modeling functions like `lm()` and `glm()`. Using factors with labels is _better_ than using integers because @@ -347,7 +347,7 @@ x ## Missing Values -Missing values are denoted by `NA` or `NaN` for q undefined +Missing values are denoted by `NA` or `NaN` for undefined mathematical operations. - `is.na()` is used to test objects if they are `NA` @@ -381,7 +381,7 @@ is.nan(x) Data frames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modeling applications. Hadley Wickham's package -[dplyr](https://github.com/hadley/dplyr) has an optimized set of +[dplyr](https://github.com/tidyverse/dplyr) has an optimized set of functions designed to work efficiently with data frames. Data frames are represented as a special type of list where every @@ -467,8 +467,8 @@ know its confusing. Here's a quick summary: ## Summary -There are a variety of different builtin-data types in R. In this -chapter we have reviewed the following +There are a variety of different built-in data types in R. 
In this +chapter we have reviewed the following: - atomic classes: numeric, logical, character, integer, complex diff --git a/manuscript/overview.Rmd b/manuscript/overview.Rmd index cf15008..82ec124 100644 --- a/manuscript/overview.Rmd +++ b/manuscript/overview.Rmd @@ -56,7 +56,7 @@ figuring out how to make data analysis easier, first for themselves, and then eventually for others. In [Stages in the Evolution of -S](http://www.stat.bell-labs.com/S/history.html ), John Chambers +S](https://web.archive.org/web/20150305201743/http://www.stat.bell-labs.com/S/history.html), John Chambers writes: > “[W]e wanted users to be able to begin in an interactive environment, @@ -67,7 +67,7 @@ writes: The key part here was the transition from *user* to *developer*. They wanted to build a language that could easily service both -"people". More technically, they needed to build language that would +"people". More technically, they needed to build a language that would be suitable for interactive data analysis (more command-line based) as well as for writing longer programs (more traditional programming language-like). @@ -133,7 +133,7 @@ beginning and has generally been better than competing packages. Today, with many more visualization packages available than before, that trend continues. R's base graphics system allows for very fine control over essentially every aspect of a plot or graph. Other -newer graphics systems, like lattice and ggplot2 allow for complex and +newer graphics systems, like lattice and ggplot2, allow for complex and sophisticated visualizations of high-dimensional data. R has maintained the original S philosophy, which is that it provides a @@ -196,9 +196,9 @@ functionality of R. The R system is divided into 2 conceptual parts: 1. The "base" R system that you download from CRAN: -[Linux](http://cran.r-project.org/bin/linux/) -[Windows](http://cran.r-project.org/bin/windows/) -[Mac](http://cran.r-project.org/bin/macosx/) [Source +[Linux](http://cran.r-project.org/bin/linux/), +[Windows](http://cran.r-project.org/bin/windows/), +[Mac](http://cran.r-project.org/bin/macosx/), [Source Code](http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz) 2. Everything else. @@ -221,7 +221,7 @@ When you download a fresh installation of R from CRAN, you get all of the above, which represents a substantial amount of functionality. However, there are many other packages available: -- There are over 4000 packages on CRAN that have been developed by +- There are over 4,000 packages on CRAN that have been developed by users and programmers around the world. - There are also many packages associated with the [Bioconductor @@ -305,7 +305,7 @@ this book. Also, available from [CRAN](http://cran.r-project.org) are - [R Installation and Administration](http://cran.r-project.org/doc/manuals/r-release/R-admin.html): - This is mostly for building R from the source code) + This is mostly for building R from the source code - [R Internals](http://cran.r-project.org/doc/manuals/r-release/R-ints.html): diff --git a/manuscript/parallel.Rmd b/manuscript/parallel.Rmd index 25b3eb3..a16c2ca 100644 --- a/manuscript/parallel.Rmd +++ b/manuscript/parallel.Rmd @@ -6,7 +6,7 @@ knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE, fig.path = " Many computations in R can be made faster by the use of parallel computation. Generally, parallel computation is the simultaneous execution of different pieces of a larger computation across multiple computing processors or cores. 
The basic idea is that if you can execute a computation in $X$ seconds on a single processor, then you should be able to execute it in $X/n$ seconds on $n$ processors. Such a speed-up is generally not possible because of overhead and various barriers to splitting up a problem into $n$ pieces, but it is often possible to come close in simple problems. -It used to be that parallel computation was squarely in the domain of "high-performance computing", where expensive machines were linked together via high-speed networking to create large clusters of computers. In those kinds of settings, it was important to have sophisticated software to manage the communication of data between different computers in the cluster. Parallel computing in that setting was a highly tuned, and carefully customized operation and not something you could just saunter into. +It used to be that parallel computation was squarely in the domain of "high-performance computing", where expensive machines were linked together via high-speed networking to create large clusters of computers. In those kinds of settings, it was important to have sophisticated software to manage the communication of data between different computers in the cluster. Parallel computing in that setting was a highly tuned, carefully customized operation, and not something you could just saunter into. These days though, almost all computers contain multiple processors or cores on them. Even Apple's iPhone 6S comes with a [dual-core CPU](https://en.wikipedia.org/wiki/Apple_A9) as part of its A9 system-on-a-chip. Getting access to a "cluster" of CPUs, in this case all built into the same computer, is much easier than it used to be and this has opened the door to parallel computing for a wide range of people. @@ -18,7 +18,7 @@ You may be computing in parallel without even knowing it! These days, many compu ### Parallel BLAS -A common example in R is the use of linear algebra functions. Some versions of R that you use may be linked to on optimized Basic Linear Algebra Subroutines (BLAS) library. Such libraries are custom coded for specific CPUs/chipsets to take advantage of the architecture of the chip. It's important to realize that while R can do linear algebra out of the box, its default BLAS library is a *reference implementation* that is not necessarily optimized to any particular chipset. +A common example in R is the use of linear algebra functions. Some versions of R that you use may be linked to an optimized Basic Linear Algebra Subroutines (BLAS) library. Such libraries are custom coded for specific CPUs/chipsets to take advantage of the architecture of the chip. It's important to realize that while R can do linear algebra out of the box, its default BLAS library is a *reference implementation* that is not necessarily optimized to any particular chipset. When possible, it's always a good idea to install an optimized BLAS on your system because it can dramatically improve the performance of those kinds of computations. Part of the increase in performance comes from the customization of the code to a particular chipset while part of it comes from the multi-threading that many libraries use to parallelize their computations. 
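As a rough illustration of that point, one way to see whether a multi-threaded BLAS is active is to time a large matrix operation and compare the user and elapsed times. This is only a sketch: the matrix size below is an arbitrary choice, and actual timings will vary from machine to machine.

```r
## With an optimized, multi-threaded BLAS the elapsed ("wall clock") time for
## this operation is typically noticeably smaller than the user (CPU) time;
## with the reference BLAS the two are usually about the same
X <- matrix(rnorm(1e6), nrow = 1000)
system.time(crossprod(X))
```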
@@ -44,22 +44,22 @@ Here, you can see that the `user` time is just under 1 second while the `elapsed Here's a summary of some of the optimized BLAS libraries out there: -* The [AMD Core Math Library](http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/) (ACML) is built for AMD chips and contains a full set of BLAS and LAPACK routines. The library is closed-source and is maintained/released by AMD. +* The [AMD Optimizing CPU Libraries](http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/) (AOCL) is built for AMD chips and contains BLIS (BLAS-like), LAPACK, and FFT routines. The library is mostly open-source and is maintained/released by AMD. -* The [Intel Math Kernel](https://software.intel.com/en-us/intel-mkl) is an analogous optimized library for Intel-based chips +* The [Intel Math Kernel](https://software.intel.com/en-us/intel-mkl) is an analogous optimized library for Intel-based chips. * The [Accelerate framework](https://developer.apple.com/library/tvos/documentation/Accelerate/Reference/AccelerateFWRef/index.html) on the Mac contains an optimized BLAS built by Apple. * The [Automatically Tuned Linear Algebra Software](http://math-atlas.sourceforge.net) (ATLAS) library is a special "adaptive" software package that is designed to be compiled on the computer where it will be used. As part of the build process, the library extracts detailed CPU information and optimizes the code as it goes along. The ATLAS library is hence a generic package that can be built on a wider array of CPUs. -Detailed instructions on how to use R with optimized BLAS libraries can be found in the [R Installation and Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS) manual. In some cases, you may need to build R from the sources in order to link it with the optimized BLAS library. +Detailed instructions on how to use R with optimized BLAS libraries can be found in the [R Installation and Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS) manual. In some cases, you may need to build R from source in order to link it with an optimized BLAS library. ## Embarrassing Parallelism -Many problems in statistics and data science can be executed in an "embarrassingly parallel" way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). Despite the name, there's nothing really "embarrassing" about taking advantage of the structure of the problem and using it speed up your computation. In fact, embarrassingly parallel computation is a common paradigm in statistics and data science. +Many problems in statistics and data science can be executed in an "embarrassingly parallel" way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). Despite the name, there's nothing really "embarrassing" about taking advantage of the structure of the problem and using it to speed up your computation. In fact, embarrassingly parallel computation is a common paradigm in statistics and data science. 
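To make that structure concrete, here is a minimal sketch of a task with the embarrassingly parallel shape described above; the `datasets` list is a made-up example, not data used in this chapter. Each element is summarized independently of the others, so the iterations could run in any order, or all at once.

```r
## Ten independent pieces of work: summarizing one dataset requires no
## information about any other dataset
datasets <- replicate(10, rnorm(100), simplify = FALSE)
results <- lapply(datasets, mean)  ## a serial loop here, but trivially parallelizable
```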
-A> In general, it is NOT a good idea to use the functions described in this chapter with graphical user interfaces (GUIs) because, to summarize the help page for `mclapply()`, bad things can happen. That said, the functions in the `parallel` package seem two work okay in RStudio. +A> In general, it is NOT a good idea to use the functions described in this chapter with graphical user interfaces (GUIs) because, to summarize the help page for `mclapply()`, bad things can happen. That said, the functions in the `parallel` package seem to work okay in RStudio. The basic mode of an embarrassingly parallel operation can be seen with the `lapply()` function, which we have reviewed in a [previous chapter](#loop-functions). Recall that the `lapply()` function has two arguments: @@ -69,7 +69,7 @@ The basic mode of an embarrassingly parallel operation can be seen with the `lap Finally, recall that `lapply()` always returns a list whose length is equal to the length of the input list. -The `lapply()` function works much like a loop--it cycles through each element of the list and applies the supplied function to that element. While `lapply()` is applying your function to a list element, the other elements of the list are just...sitting around in memory. Note that in the description of `lapply()` above, there's no mention of the different elements of the list communicating with each other, and the function being applied to a given list element does not need to know about other list elements. +The `lapply()` function works much like a loop -- it cycles through each element of the list and applies the supplied function to that element. While `lapply()` is applying your function to a list element, the other elements of the list are just... sitting around in memory. Note that in the description of `lapply()` above, there's no mention of the different elements of the list communicating with each other, and the function being applied to a given list element does not need to know about other list elements. Just about any operation that is handled by the `lapply()` function can be parallelized. This approach is analogous to the ["map-reduce"](https://en.wikipedia.org/wiki/MapReduce) approach in large-scale cluster systems. The idea is that a list object can be split across multiple cores of a processor and then the function can be applied to each subset of the list object on each of the cores. Conceptually, the steps in the parallel procedure are @@ -85,9 +85,9 @@ The differences between the many packages/functions in R essentially come down t ## The Parallel Package -The `parallel` package which comes with your R installation. It represents a combining of two historical packages--the `multicore` and `snow` packages, and the functions in `parallel` have overlapping names with those older packages. For our purposes, it's not necessary to know anything about the `multicore` or `snow` packages, but long-time users of R may remember them from back in the day. +The `parallel` package comes with your R installation. It represents a combining of two historical packages -- the `multicore` and `snow` packages, and the functions in `parallel` have overlapping names with those older packages. For our purposes, it's not necessary to know anything about the `multicore` or `snow` packages, but long-time users of R may remember them from back in the day. -The `mclapply()` function essentially parallelizes calls to `lapply()`. The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. 
However, `mclapply()` has further arguments (that must be named), the most important of which is the `mc.cores` argument which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores on it, you might specify `mc.cores = 4` to break your parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R). +The `mclapply()` function essentially parallelizes calls to `lapply()`. The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. However, `mclapply()` has further arguments (that must be named), the most important of which is the `mc.cores` argument which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores on it, you might specify `mc.cores = 4` to break your operation (parallelize) across 4 cores (although this may not be the best idea if you are running other operations in the background besides R). The `mclapply()` function (and related `mc*` functions) works via the fork mechanism on Unix-style operating systems. Briefly, your R session is the main process and when you call a function like `mclapply()`, you fork a series of sub-processes that operate independently from the main process (although they share a few low-level features). These sub-processes then execute your function on their subsets of the data, presumably on separate cores of your CPU. Once the computation is complete, each sub-process returns its results and then the sub-process is killed. The `parallel` package manages the logistics of forking the sub-processes and handling them once they've finished. @@ -120,9 +120,9 @@ r <- mclapply(1:10, function(i) { While this "job" was running, I took a screen shot of the system activity monitor ("top"). Here's what it looks like on Mac OS X. -![Multiple sub-processes spawned by `mclapply()`](images/topparallel.png) +![Multiple `rsession` sub-processes are spawned by `mclapply()`.](images/topparallel.png) -In case you are not used to viewing this output, each row of the table is an application or process running on your computer. You can see that there are 11 rows where the COMMAND is labelled `rsession`. One of these is my primary R session (being run through RStudio), and the other 10 are the sub-processes spawned by the `mclapply()` function. +In case you are not used to viewing this output, each row of the table is an application or process running on your computer. You can see that there are 11 rows where the COMMAND is labeled `rsession`. One of these is my primary R session (being run through RStudio), and the other 10 are the sub-processes spawned by the `mclapply()` function. We will use as a second (slightly more realistic) example processing data from multiple files. Often this is something that can be easily parallelized. @@ -182,7 +182,7 @@ When either `mclapply()` or `mcmapply()` are called, the functions supplied will This error handling behavior is a significant difference from the usual call to `lapply()`. With `lapply()`, if the supplied function fails on one component of the list, the entire function call to `lapply()` fails and you only get an error as a result. -With `mclapply()`, when a sub-process fails, the return value for that sub-process will be an R object that inherits from the class `"try-error"`, which is something you can test with the `inherits()` function. 
Conceptually, each child process is executed with the `try()` function wrapped around it. The code below deliberately causes an error in the 3 element of the list. +With `mclapply()`, when a sub-process fails, the return value for that sub-process will be an R object that inherits from the class `"try-error"`, which is something you can test with the `inherits()` function. Conceptually, each child process is executed with the `try()` function wrapped around it. The code below deliberately causes an error in the 3rd element of the list. ```{r} r <- mclapply(1:5, function(i) { @@ -240,7 +240,7 @@ We can see from the histogram that the distribution of sulfate is skewed to the summary(sulf) ``` -How can we construct confidence interval for the median of sulfate for this monitor? The bootstrap is simple procedure that can work well. Here's how we might do it in the usual (non-parallel) way. +How can we construct a confidence interval for the median of sulfate for this monitor? The bootstrap is a simple procedure that can work well. Here's how we might do it in the usual (non-parallel) way. ```{r} set.seed(1) @@ -256,7 +256,7 @@ A 95% confidence interval would then take the 2.5th and 97.5th percentiles of th quantile(med.boot, c(0.025, 0.975)) ``` -How could be done in parallel? We could simply wrap the expression passed to `replicate()` in a function and pass it to `mclapply()`. However, one thing we need to be careful of is generating random numbers. +How could this be done in parallel? We could simply wrap the expression passed to `replicate()` in a function and pass it to `mclapply()`. However, one thing we need to be careful of is generating random numbers. ### Generating Random Numbers diff --git a/manuscript/profiler.Rmd b/manuscript/profiler.Rmd index 0db0a57..b8e5824 100644 --- a/manuscript/profiler.Rmd +++ b/manuscript/profiler.Rmd @@ -9,9 +9,9 @@ knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE) ``` -R comes with a profiler to help you optimize your code and improve its performance. In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. +R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. -Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly should optimize the parts of your code that are running slowly, but how do we know what parts those are? +Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly you should optimize the parts of your code that are running slowly, but how do we know what parts those are? This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. 
@@ -26,9 +26,9 @@ the code spends most of its time. This cannot be done without some sort of rigor The basic principles of optimizing your code are: -* Design first, then optimize +* Design first, then optimize. -* Remember: Premature optimization is the root of all evil +* Remember: Premature optimization is the root of all evil. * Measure (collect data), don’t guess. @@ -37,15 +37,15 @@ The basic principles of optimizing your code are: ## Using `system.time()` -They `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information: +The `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information: - *user time*: time charged to the CPU(s) for this expression - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there -Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involes some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). +Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). -The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallell` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. +The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. 
Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. Here's an example of where the elapsed time is greater than the user time. @@ -95,12 +95,12 @@ If your expression is getting pretty long (more than 2 or 3 lines), it might be [Watch a video of this section](https://youtu.be/BZVcMPtlJ4A) -Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on it that piece of code. What if you don’t know where to start? +Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on that piece of code. What if you don’t know where to start? This is where the profiler comes in handy. The `Rprof()` function starts the profiler in R. Note that R must be compiled with profiler support (but this is usually the case). In conjunction with `Rprof()`, we will use the `summaryRprof()` function which summarizes the output from `Rprof()` (otherwise it’s not really readable). Note that you should NOT use `system.time()` and `Rprof()` together, or you will be sad. -`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But of your code runs that fast, you probably don't need the profiler. +`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But if your code runs that fast, you probably don't need the profiler. The profiler is started by calling the `Rprof()` function. @@ -192,7 +192,7 @@ $by.self Now you can see that only about 4% of the runtime is spent in the actual `lm()` function, whereas over 40% of the time is spent in `lm.fit()`. In this case, this is no surprise since the `lm.fit()` function is the function that actually fits the linear model. -You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of pre-processing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. +You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). 
This is because the `lm()` function does a bit of preprocessing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. The final bit of output that `summaryRprof()` provides is the sampling interval and the total runtime. @@ -206,13 +206,13 @@ $sampling.time ## Summary -* `Rprof()` runs the profiler for performance of analysis of R code +* `Rprof()` runs the profiler for performance analysis of R code. * `summaryRprof()` summarizes the output of `Rprof()` and gives percent of time spent in each function (with two types of - normalization) + normalization). * Good to break your code into functions so that the profiler can give - useful information about where time is being spent + useful information about where time is being spent. -* C or Fortran code is not profiled +* C or Fortran code is not profiled. diff --git a/manuscript/readwritedata.Rmd b/manuscript/readwritedata.Rmd index 69a4f23..78b05eb 100644 --- a/manuscript/readwritedata.Rmd +++ b/manuscript/readwritedata.Rmd @@ -64,15 +64,17 @@ The `read.table()` function has a few important arguments: your file, it's worth setting this to be the empty string `""`. * `skip`, the number of lines to skip from the beginning * `stringsAsFactors`, should character variables be coded as factors? - This defaults to `TRUE` because back in the old days, if you had - data that were stored as strings, it was because those strings - represented levels of a categorical variable. Now we have lots of + This defaults to `FALSE` as of R 4.0.0. In 2020, + [the default was changed](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/) + from `TRUE` to `FALSE` due to reproducibility and to stay consistent + with modern alternatives to data frames. Now we have lots of data that is text data and they don't always represent categorical - variables. So you may want to set this to be `FALSE` in those - cases. If you *always* want this to be `FALSE`, you can set a global - option via `options(stringsAsFactors = FALSE)`. I've never seen so - much heat generated on discussion forums about an R function - argument than the `stringsAsFactors` argument. Seriously. + variables. So setting it as `FALSE` makes sense in those + cases. With older versions of R, if you *always* want this to be + `FALSE`, you can set a global option via + `options(stringsAsFactors = FALSE)`. I've never seen so much heat + generated on discussion forums about an R function argument than the + `stringsAsFactors` argument. Seriously. For small to moderately sized datasets, you can usually call @@ -102,8 +104,7 @@ argument). With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking. -* Read the help page for read.table, which contains many hints - +* Read the help page for read.table, which contains many hints. * Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you @@ -133,8 +134,8 @@ know a few things about your system. * How much memory is available on your system? * What other applications are in use? Can you close any of them? * Are there other users logged into the same system? 
-* What operating system ar you using? Some operating systems can limit - the amount of memory a single process can access +* What operating system are you using? Some operating systems can limit + the amount of memory a single process can access. ## Calculating Memory Requirements for R Objects @@ -153,13 +154,12 @@ required to store this data frame? Well, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that -information, you can do the following calculation - +information, you can do the following calculation: -| 1,500,000 × 120 × 8 bytes/numeric | = 1,440,000,000 bytes | -| | = 1,440,000,000 / 2^20^ bytes/MB -| | = 1,373.29 MB -| | = 1.34 GB +``` +> 1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes +> 1,440,000,000 / 2^20 bytes/MB = 1,373.29 MB = 1.34 GB +``` So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware @@ -172,19 +172,19 @@ Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is usually an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in -the worst case. So make sure to do a rough calculation of memeory +the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. You'll thank me later. # Using the `readr` Package -The `readr` package is recently developed by Hadley Wickham to deal -with reading in large flat files quickly. The package provides -replacements for functions like `read.table()` and `read.csv()`. The -analogous functions in `readr` are `read_table()` and -`read_csv()`. These functions are often *much* faster than their base -R analogues and provide a few other nice features such as progress -meters. +The [readr](https://github.com/tidyverse/readr) package is recently +developed by Hadley Wickham to deal with reading in large flat files +quickly. The package provides replacements for functions like +`read.table()` and `read.csv()`. The analogous functions in `readr` +are `read_table()` and `read_csv()`. These functions are often +*much* faster than their base R analogues and provide a few other +nice features such as progress meters. For the most part, you can read use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`. In @@ -251,9 +251,6 @@ Now the `date` column is stored as a `Date` object which can be used for relevan A> The `read_csv` function has a `progress` option that defaults to TRUE. This options provides a nice progress meter while the CSV file is being read. However, if you are using `read_csv` in a function, or perhaps embedding it in a loop, it's probably best to set `progress = FALSE`. - - - # Using Textual and Binary Formats for Storing Data [Watch a video of this chapter](https://youtu.be/5mIPigbNDfk) @@ -341,6 +338,7 @@ str(y) x ``` + ## Binary Formats The complement to the textual format is the binary format, which is @@ -404,8 +402,6 @@ losing precision or any metadata. If that is what you need, then `serialize()` is the function for you. - - # Interfaces to the Outside World [Watch a video of this chapter](https://youtu.be/Pb01WoJRUtY) @@ -422,7 +418,7 @@ made to files (most common) or to other more exotic things. 
In general, connections are powerful tools that let you navigate files or other external objects. Connections can be thought of as a translator that lets you talk to objects that are outside of R. Those -outside objects could be anything from a data base, a simple text +outside objects could be anything from a database, a simple text file, or a a web service API. Connections allow R functions to talk to all these different external objects without you having to write custom code for each object. @@ -445,10 +441,10 @@ detail here. The `open` argument allows for the following options: -- "r" open file in read only mode -- "w" open a file for writing (and initializing a new file) -- "a" open a file for appending -- "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows) +- "r", open file in read only mode +- "w", open a file for writing (and initializing a new file) +- "a", open a file for appending +- "rb", "wb", "ab", reading, writing, or appending in binary mode (Windows) In practice, we often don't need to deal with the connection interface @@ -482,7 +478,7 @@ In the background, `read.csv()` opens a connection to the file `foo.txt`, reads from it, and closes the connection when it's done. The above example shows the basic approach to using -connections. Connections must be opened, then the are read from or +connections. Connections must be opened, then they are read from or written to, and then they are closed. @@ -499,7 +495,7 @@ x <- readLines(con, 10) x ``` -For more structured text data like CSV files or tab-delimited files, +For more structured text data, like CSV files or tab-delimited files, there are other functions like `read.csv()` or `read.table()`. The above example used the `gzfile()` function which is used to create @@ -515,7 +511,7 @@ time to a text file. ## Reading From a URL Connection The `readLines()` function can be useful for reading in lines of -webpages. Since web pages are basically text files that are stored on +web pages. Since web pages are basically text files that are stored on a remote server, there is conceptually not much difference between a web page and a local text file. However, we need R to negotiate the communication between your computer and the web server. This is what @@ -526,23 +522,23 @@ This code might take time depending on your connection speed. ```{r} ## Open a URL connection for reading -con <- url("http://www.jhsph.edu", "r") +con <- url("https://en.wikipedia.org","r") ## Read the web page x <- readLines(con) ## Print out the first few lines -head(x) +head(x,5) ``` While reading in a simple web page is sometimes useful, particularly if data are embedded in the web page somewhere. However, more commonly -we can use URL connection to read in specific data files that are +we can use URL connections to read in specific data files that are stored on web servers. Using URL connections can be useful for producing a reproducible analysis, because the code essentially documents where the data came -from and how they were obtained. This is approach is preferable to +from and how they were obtained. This approach is preferable to opening a web browser and downloading a dataset by hand. Of course, the code you write with connections may not be executable at a later date if things on the server side are changed or reorganized. 
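To make the URL-connection idea concrete, here is a minimal sketch of reading a remote CSV file straight into a data frame. The address below is a placeholder, not a real data file; substitute the URL of the dataset you actually want.

```r
## Sketch: read a remote CSV file through a URL connection
## (the URL is a placeholder for illustration only)
con <- url("https://example.com/path/to/data.csv")
dat <- read.csv(con)  ## read.csv() opens, reads, and closes the connection
head(dat)
```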
diff --git a/manuscript/regex.Rmd b/manuscript/regex.Rmd index 33b0611..e1ece96 100644 --- a/manuscript/regex.Rmd +++ b/manuscript/regex.Rmd @@ -21,15 +21,15 @@ If you want a very quick introduction to the general notion of regular expressio The primary R functions for dealing with regular expressions are -- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match +- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match. -- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match. -- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string +- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string. - `regexec()`: This function searches a character vector for a regular expression, much like `regexpr()`, but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration. -For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the web site/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the web site are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. +For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the website/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the website are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. Here is an excerpt of the Baltimore City homicides dataset: @@ -42,14 +42,14 @@ homicides[1] homicides[1000] ``` -The data set is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. 
Notice that the data are riddled with HTML tags because they were scraped directly from the web site. +The dataset is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the website. A few interesting features stand out: We have the latitude and longitude of where the victim was found; then there's the street address; the age, race, and gender of the victim; the date on which the victim was found; in which hospital the victim ultimately died; the cause of death. ## `grep()` Suppose we wanted to identify the records for all the victims of shootings (as opposed -to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to idenfity all of the shootings. +to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to identify all of the shootings. Here I use `grep()` to match the literal `iconHomicideShooting` into the character vector of homicides. @@ -58,14 +58,14 @@ g <- grep("iconHomicideShooting", homicides) length(g) ``` -Using this approach I get `r length(g)` shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for web site maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator? +Using this approach I get `r length(g)` shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for website maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator? ```{r} g <- grep("iconHomicideShooting|icon_homicide_shooting", homicides) length(g) ``` -Now we have `r length(g)` shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. +Now we have `r scales::comma(length(g))` shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. Another possible way to do this is to `grep()` on the cause of death field, which seems to have the format `Cause: shooting`. We can `grep()` on this literally and get @@ -74,21 +74,21 @@ g <- grep("Cause: shooting", homicides) length(g) ``` -Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a captial "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. +Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. 
```{r}
g <- grep("Cause: [Ss]hooting", homicides)
length(g)
```

-One thing you have to be careful of when processing text data is not not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`.
+One thing you have to be careful of when processing text data is to not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`.

```{r}
g <- grep("[Ss]hooting", homicides)
length(g)
```

-Notice that we see to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions.
+Notice that we seem to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions.

First we can get the indices for the first expression match.

@@ -147,11 +147,9 @@ state.name[g]

Here, we can see that `grepl()` returns a logical vector that can be used to subset the original `state.name` vector.

-
-
## `regexpr()`

-Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or what the match is for a more complicated regular expression.
+Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or if the match is for a more complicated regular expression.

The `regexpr()` function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. `regexpr()` only gives you the *first* match of the string (reading left to right). `gregexpr()` will give you *all* of the matches in a given string if there is more than one match.

@@ -221,7 +219,7 @@ Notice that the `sub()` function found the first match (at the beginning of the
gsub("<dd>[F|f]ound on |</dd>", "", x)
```

-The `sub() and `gsub()` functions can take vector arguments so we don't have to process each string one by one.
+The `sub()` and `gsub()` functions can take vector arguments so we don't have to process each string one by one.

```{r}
r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
@@ -257,7 +255,7 @@ By contrast, if we only use the `regexpr()` function, we get
regexec("<dd>[F|f]ound on .*?</dd>
", homicides[1]) ``` -We can use the `substr()` function to demonstrate which parts of a strings are matched by the `regexec()` function. +We can use the `substr()` function to demonstrate which parts of the strings are matched by the `regexec()` function. Here's the output for `regexec()`. @@ -310,7 +308,7 @@ We can see from the picture that homicides do not occur uniformly throughout the ## The `stringr` Package -The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps they underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. +The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps the underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. Given what we have discussed so far, there is a fairly straightforward mapping from the base R functions to the `stringr` functions. In general, for the `stringr` functions, the data are the first argument and the regular expression is the second argument, with optional arguments afterwards. @@ -344,14 +342,14 @@ Note how the second column of the output contains the values of the parenthesize The primary R functions for dealing with regular expressions are - `grep()`, `grepl()`: Search for matches of a regular expression/pattern in a - character vector + character vector. -- `regexpr()`, `gregexpr(): Search a character vector for regular expression matches and +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction - with `regmatches()` + with `regmatches()`. - `sub()`, `gsub()`: Search a character vector for regular expression matches and - replace that match with another string + replace that match with another string. - `regexec()`: Gives you indices of parethensized sub-expressions. diff --git a/manuscript/scoping.Rmd b/manuscript/scoping.Rmd index db34eac..1245744 100644 --- a/manuscript/scoping.Rmd +++ b/manuscript/scoping.Rmd @@ -22,7 +22,7 @@ how does R know what value to assign to the symbol `lm`? Why doesn’t it give i When R tries to bind a value to a symbol, it searches through a series of `environments` to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order in which things occur is roughly 1. Search the global environment (i.e. your workspace) for a symbol name matching the one requested. -2. Search the namespaces of each of the packages on the search list +2. Search the namespaces of each of the packages on the search list. The search list can be found by using the `search()` function. @@ -41,10 +41,9 @@ Note that R has separate namespaces for functions and non-functions so it’s po The scoping rules for R are the main feature that make it different from the original S language (in case you care about that). 
This may seem like an esoteric aspect of R, but it's one of its more interesting and useful features.

-The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations
+The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations.

-Related to the scoping rules is how R uses the *search list* to bind a value to a
-symbol
+Related to the scoping rules is how R uses the *search list* to bind a value to a symbol.

Consider the following function.

@@ -72,7 +71,7 @@ A function, together with an environment, makes up what is called a _closure_ or

How do we associate a value to a free variable? There is a search process that goes as follows:

- If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the _parent environment_.
-- The search continues down the sequence of parent environments until we hit the _top-level environment_; this usually the global environment (workspace) or the namespace of a package.
+- The search continues down the sequence of parent environments until we hit the _top-level environment_; this is usually the global environment (workspace) or the namespace of a package.
- After the top-level environment, the search continues down the search list until we hit the _empty environment_.

If a value for a given symbol cannot be found once the empty environment is arrived at, then an error is thrown.

@@ -257,7 +256,7 @@ Another nice feature that you can take advantage of is plotting the negative log

Here is the function when `mu` is fixed.

```{r nLLFixMu}
-## Fix 'mu' to be equalt o 1
+## Fix 'mu' to be equal to 1
nLL <- make.NegLogLik(normals, c(1, FALSE))

x <- seq(1.7, 1.9, len = 100)
@@ -281,7 +280,7 @@ plot(x, exp(-(y - min(y))), type = "l")

## Summary

-- Objective functions can be "built" which contain all of the necessary data for evaluating the function
+- Objective functions can be "built" which contain all of the necessary data for evaluating the function.
- No need to carry around long argument lists — useful for interactive and exploratory work.
-- Code can be simplified and cleaned up
+- Code can be simplified and cleaned up.
- Reference: Robert Gentleman and Ross Ihaka (2000). "Lexical Scope and Statistical Computing," _JCGS_, 9, 491–508.

diff --git a/manuscript/simulation.Rmd b/manuscript/simulation.Rmd
index 2477ec1..f04c36c 100644
--- a/manuscript/simulation.Rmd
+++ b/manuscript/simulation.Rmd
@@ -12,14 +12,14 @@ set.seed(10)

Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.
-R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R +R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R - `rnorm`: generate random Normal variates with a given mean and standard deviation - `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points) - `pnorm`: evaluate the cumulative distribution function for a Normal distribution - `rpois`: generate random Poisson variates with a given rate -For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a +For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a - `d` for density - `r` for random number generation @@ -28,7 +28,7 @@ For each probability distribution there are typically four functions available t If you're only interested in simulating random numbers, then you will likely only need the "r" functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too. -Probably the most common probability distribution to work with the is the Normal distribution (also known as the Gaussian). Working with the Normal distributions requires using these four functions +Probably the most common probability distribution to work with is the Normal distribution (also known as the Gaussian). Working with the Normal distribution requires using these four functions ```r dnorm(x, mean = 0, sd = 1, log = FALSE) @@ -37,7 +37,7 @@ qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) rnorm(n, mean = 0, sd = 1) ``` -Here we simulate standard Normal random numbers with mean 0 and standard deviation 1. +Here we simulate 10 standard Normal random numbers with mean 0 and standard deviation 1. ```{r} ## Simulate standard Normal random numbers @@ -45,7 +45,7 @@ x <- rnorm(10) x ``` -We can modify the default parameters to simulate numbers with mean 20 and standard deviation 2. +We can modify the default parameters to simulate 10 numbers with mean 20 and standard deviation 2. ```{r} x <- rnorm(10, 20, 2) @@ -149,7 +149,7 @@ plot(x, y) ``` -We can also simulate from *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For examples, suppose we want to simulate from a Poisson log-linear model where +We can also simulate from a *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. 
For example, suppose we want to simulate from a Poisson log-linear model where \[ Y \sim Poisson(\mu) @@ -197,7 +197,7 @@ sample(letters, 5) sample(1:10) sample(1:10) -## Sample w/replacement +## Sample w/ replacement sample(1:10, replace = TRUE) ``` @@ -229,7 +229,7 @@ Other more complex objects can be sampled in this way, as long as there's a way ## Summary -- Drawing samples from specific probability distributions can be done with "r" functions +- Drawing samples from specific probability distributions can be done with "r" functions. - Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc. -- The `sample()` function can be used to draw random samples from arbitrary vectors -- Setting the random number generator seed via `set.seed()` is critical for reproducibility +- The `sample()` function can be used to draw random samples from arbitrary vectors. +- Setting the random number generator seed via `set.seed()` is critical for reproducibility. diff --git a/manuscript/vectorized.Rmd b/manuscript/vectorized.Rmd index 21b6e36..de830bf 100644 --- a/manuscript/vectorized.Rmd +++ b/manuscript/vectorized.Rmd @@ -64,7 +64,7 @@ x / y ## Vectorized Matrix Operations -Matrix operations are also vectorized, making for nicly compact +Matrix operations are also vectorized, making for nicely compact notation. This way, we can do element-by-element operations on matrices without having to loop over every element.
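As a small sketch of what that buys you, element-wise arithmetic on two matrices needs no explicit loops at all, while true matrix multiplication remains available through the `%*%` operator.

```r
x <- matrix(1:4, nrow = 2, ncol = 2)
y <- matrix(rep(10, 4), nrow = 2, ncol = 2)

x * y    ## element-wise multiplication
x / y    ## element-wise division
x %*% y  ## true matrix multiplication, not element-wise
```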