From e12695ace512b110d1a105a3b1f38af89c51804e Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 00:18:11 -0400
Subject: [PATCH 01/24] dead link at bell-labs; use web-archive version

---
 manuscript/overview.Rmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/manuscript/overview.Rmd b/manuscript/overview.Rmd
index cf15008..8a9564d 100644
--- a/manuscript/overview.Rmd
+++ b/manuscript/overview.Rmd
@@ -56,7 +56,7 @@ figuring out how to make data analysis easier, first for themselves,
 and then eventually for others.

 In [Stages in the Evolution of
-S](http://www.stat.bell-labs.com/S/history.html ), John Chambers
+S](https://web.archive.org/web/20150305201743/http://www.stat.bell-labs.com/S/history.html), John Chambers
 writes:

 > “[W]e wanted users to be able to begin in an interactive environment,

From 3f71e7894c3fecfd4db34f5d071f99310e6feddc Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 00:47:43 -0400
Subject: [PATCH 02/24] Overview chapter: syntax and typos

---
 manuscript/overview.Rmd  |  14 +-
 manuscript/overview.html | 347 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 354 insertions(+), 7 deletions(-)
 create mode 100644 manuscript/overview.html

diff --git a/manuscript/overview.Rmd b/manuscript/overview.Rmd
index 8a9564d..82ec124 100644
--- a/manuscript/overview.Rmd
+++ b/manuscript/overview.Rmd
@@ -67,7 +67,7 @@ writes:

 The key part here was the transition from *user* to *developer*. They
 wanted to build a language that could easily service both
-"people". More technically, they needed to build language that would
+"people". More technically, they needed to build a language that would
 be suitable for interactive data analysis (more command-line based)
 as well as for writing longer programs (more traditional programming
 language-like).
@@ -133,7 +133,7 @@ beginning and has generally been better than competing packages.
 Today, with many more visualization packages available than before,
 that trend continues. R's base graphics system allows for very fine
 control over essentially every aspect of a plot or graph. Other
-newer graphics systems, like lattice and ggplot2 allow for complex and
+newer graphics systems, like lattice and ggplot2, allow for complex and
 sophisticated visualizations of high-dimensional data.

 R has maintained the original S philosophy, which is that it provides a
@@ -196,9 +196,9 @@ functionality of R.
 The R system is divided into 2 conceptual parts:

 1. The "base" R system that you download from CRAN:
-[Linux](http://cran.r-project.org/bin/linux/)
-[Windows](http://cran.r-project.org/bin/windows/)
-[Mac](http://cran.r-project.org/bin/macosx/) [Source
+[Linux](http://cran.r-project.org/bin/linux/),
+[Windows](http://cran.r-project.org/bin/windows/),
+[Mac](http://cran.r-project.org/bin/macosx/), [Source
 Code](http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz)

 2. Everything else.
@@ -221,7 +221,7 @@ When you download a fresh installation of R from CRAN, you get all of
 the above, which represents a substantial amount of functionality.
 However, there are many other packages available:

-- There are over 4000 packages on CRAN that have been developed by
+- There are over 4,000 packages on CRAN that have been developed by
   users and programmers around the world.

 - There are also many packages associated with the [Bioconductor
@@ -305,7 +305,7 @@ this book.
 Also, available from [CRAN](http://cran.r-project.org) are

 - [R Installation and
   Administration](http://cran.r-project.org/doc/manuals/r-release/R-admin.html):
-  This is mostly for building R from the source code)
+  This is mostly for building R from the source code

 - [R
   Internals](http://cran.r-project.org/doc/manuals/r-release/R-ints.html):
diff --git a/manuscript/overview.html b/manuscript/overview.html
new file mode 100644
index 0000000..1b8e88e
--- /dev/null
+++ b/manuscript/overview.html
@@ -0,0 +1,347 @@
+[... 347 lines of knitr-rendered HTML output (overview.knit) omitted ...]
From 7de756b2ea775e78ebb956fbdc46295308276cea Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 00:50:22 -0400
Subject: [PATCH 03/24] remove unnecessary html output

---
 manuscript/overview.html | 347 ---------------------------------------
 1 file changed, 347 deletions(-)
 delete mode 100644 manuscript/overview.html

diff --git a/manuscript/overview.html b/manuscript/overview.html
deleted file mode 100644
index 1b8e88e..0000000
--- a/manuscript/overview.html
+++ /dev/null
@@ -1,347 +0,0 @@
-[... the same 347 lines of knitr-rendered HTML output, removed ...]
From 5574868b7db3edda954bfab019ff176e0288d4d6 Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 00:53:12 -0400
Subject: [PATCH 04/24] output md file

---
 manuscript/overview.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/manuscript/overview.md b/manuscript/overview.md
index cf15008..82ec124 100644
--- a/manuscript/overview.md
+++ b/manuscript/overview.md
@@ -56,7 +56,7 @@ figuring out how to make data analysis easier, first for themselves,
 and then eventually for others.

 In [Stages in the Evolution of
-S](http://www.stat.bell-labs.com/S/history.html ), John Chambers
+S](https://web.archive.org/web/20150305201743/http://www.stat.bell-labs.com/S/history.html), John Chambers
 writes:

 > “[W]e wanted users to be able to begin in an interactive environment,
@@ -67,7 +67,7 @@ writes:

 The key part here was the transition from *user* to *developer*. They
 wanted to build a language that could easily service both
-"people". More technically, they needed to build language that would
+"people". More technically, they needed to build a language that would
 be suitable for interactive data analysis (more command-line based)
 as well as for writing longer programs (more traditional programming
 language-like).
@@ -133,7 +133,7 @@ beginning and has generally been better than competing packages.
 Today, with many more visualization packages available than before,
 that trend continues. R's base graphics system allows for very fine
 control over essentially every aspect of a plot or graph. Other
-newer graphics systems, like lattice and ggplot2 allow for complex and
+newer graphics systems, like lattice and ggplot2, allow for complex and
 sophisticated visualizations of high-dimensional data.

 R has maintained the original S philosophy, which is that it provides a
@@ -196,9 +196,9 @@ functionality of R.
 The R system is divided into 2 conceptual parts:

 1. The "base" R system that you download from CRAN:
-[Linux](http://cran.r-project.org/bin/linux/)
-[Windows](http://cran.r-project.org/bin/windows/)
-[Mac](http://cran.r-project.org/bin/macosx/) [Source
+[Linux](http://cran.r-project.org/bin/linux/),
+[Windows](http://cran.r-project.org/bin/windows/),
+[Mac](http://cran.r-project.org/bin/macosx/), [Source
 Code](http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz)

 2. Everything else.
@@ -221,7 +221,7 @@ When you download a fresh installation of R from CRAN, you get all of
 the above, which represents a substantial amount of functionality.
 However, there are many other packages available:

-- There are over 4000 packages on CRAN that have been developed by
+- There are over 4,000 packages on CRAN that have been developed by
   users and programmers around the world.

 - There are also many packages associated with the [Bioconductor
@@ -305,7 +305,7 @@ this book.
 Also, available from [CRAN](http://cran.r-project.org) are

 - [R Installation and
   Administration](http://cran.r-project.org/doc/manuals/r-release/R-admin.html):
-  This is mostly for building R from the source code)
+  This is mostly for building R from the source code

 - [R
   Internals](http://cran.r-project.org/doc/manuals/r-release/R-ints.html):

From d3868b99bb4506fbc2f7fc0ef7d8e0e0c88ed470 Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 00:55:38 -0400
Subject: [PATCH 05/24] gettingstarted: typos

---
 manuscript/gettingstarted.Rmd | 4 ++--
 manuscript/gettingstarted.md  | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/manuscript/gettingstarted.Rmd b/manuscript/gettingstarted.Rmd
index 31b199f..5f7e91d 100644
--- a/manuscript/gettingstarted.Rmd
+++ b/manuscript/gettingstarted.Rmd
@@ -16,7 +16,7 @@ There is also an integrated development environment available for R
 that is built by RStudio. I really like this IDE---it has a nice
 editor with syntax highlighting, there is an R object viewer, and
 there are a number of other nice features that are integrated. You can
-see how to install RStudio here
+see how to install RStudio here:

 - [Installing RStudio](https://youtu.be/bM7Sfz-LADM)

@@ -28,7 +28,7 @@ site](http://rstudio.com).
 After you install R you will need to launch it and start writing R
 code. Before we get to exactly how to write R code, it's useful to
 get a sense of how the system is organized. In these two videos I talk
-about where to write code and how set your working directory, which
+about where to write code and how to set your working directory, which
 let's R know where to find all of your files.

 - [Writing code and setting your working directory on the Mac](https://youtu.be/8xT3hmJQskU)
diff --git a/manuscript/gettingstarted.md b/manuscript/gettingstarted.md
index 31b199f..5f7e91d 100644
--- a/manuscript/gettingstarted.md
+++ b/manuscript/gettingstarted.md
@@ -16,7 +16,7 @@ There is also an integrated development environment available for R
 that is built by RStudio. I really like this IDE---it has a nice
 editor with syntax highlighting, there is an R object viewer, and
 there are a number of other nice features that are integrated. You can
-see how to install RStudio here
+see how to install RStudio here:

 - [Installing RStudio](https://youtu.be/bM7Sfz-LADM)

@@ -28,7 +28,7 @@ site](http://rstudio.com).
 After you install R you will need to launch it and start writing R
 code. Before we get to exactly how to write R code, it's useful to
 get a sense of how the system is organized. In these two videos I talk
-about where to write code and how set your working directory, which
+about where to write code and how to set your working directory, which
 let's R know where to find all of your files.

 - [Writing code and setting your working directory on the Mac](https://youtu.be/8xT3hmJQskU)

From 7dcfa283dd0422932fa6c1568565e58448e30c33 Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 01:38:49 -0400
Subject: [PATCH 06/24] nutsbolts: syntax, wording, redirected url

---
 manuscript/nutsbolts.Rmd |  14 ++--
 manuscript/nutsbolts.md  | 134 ++++++++++++++++-----------------------
 2 files changed, 62 insertions(+), 86 deletions(-)

diff --git a/manuscript/nutsbolts.Rmd b/manuscript/nutsbolts.Rmd
index 2d83046..b1552dd 100644
--- a/manuscript/nutsbolts.Rmd
+++ b/manuscript/nutsbolts.Rmd
@@ -132,7 +132,7 @@ used in ordinary calculations; e.g. `1 / Inf` is 0. The value
 `NaN` represents an undefined value ("not a number"); e.g.
0 / 0; `NaN` can also be thought of as a missing value (more on that -later) +later). ## Attributes @@ -244,7 +244,7 @@ from R. Matrices are vectors with a _dimension_ attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, -number of columns) +number of columns). ```{r} m <- matrix(nrow = 2, ncol = 3) @@ -297,7 +297,7 @@ x ``` We can also create an empty list of a prespecified length with the -`vector()` function +`vector()` function. ```{r} x <- vector("list", length = 5) @@ -347,7 +347,7 @@ x ## Missing Values -Missing values are denoted by `NA` or `NaN` for q undefined +Missing values are denoted by `NA` or `NaN` for undefined mathematical operations. - `is.na()` is used to test objects if they are `NA` @@ -381,7 +381,7 @@ is.nan(x) Data frames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modeling applications. Hadley Wickham's package -[dplyr](https://github.com/hadley/dplyr) has an optimized set of +[dplyr](https://github.com/tidyverse/dplyr) has an optimized set of functions designed to work efficiently with data frames. Data frames are represented as a special type of list where every @@ -467,8 +467,8 @@ know its confusing. Here's a quick summary: ## Summary -There are a variety of different builtin-data types in R. In this -chapter we have reviewed the following +There are a variety of different built-in data types in R. In this +chapter we have reviewed the following: - atomic classes: numeric, logical, character, integer, complex diff --git a/manuscript/nutsbolts.md b/manuscript/nutsbolts.md index 2e5d1b4..104e0ae 100644 --- a/manuscript/nutsbolts.md +++ b/manuscript/nutsbolts.md @@ -10,23 +10,21 @@ At the R prompt we type expressions. The `<-` symbol is the assignment operator. -{line-numbers=off} -~~~~~~~~ +```r > x <- 1 > print(x) [1] 1 > x [1] 1 > msg <- "hello" -~~~~~~~~ +``` The grammar of the language determines whether an expression is complete or not. -{line-numbers=off} -~~~~~~~~ +```r x <- ## Incomplete expression -~~~~~~~~ +``` The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored. This is the only comment @@ -41,14 +39,13 @@ and the result of the evaluated expression is returned. The result may be *auto-printed*. -{line-numbers=off} -~~~~~~~~ +```r > x <- 5 ## nothing printed > x ## auto-printing occurs [1] 5 > print(x) ## explicit printing [1] 5 -~~~~~~~~ +``` The `[1]` shown in the output indicates that `x` is a vector and `5` is its first element. @@ -67,13 +64,12 @@ see this integer sequence of length 20. -{line-numbers=off} -~~~~~~~~ +```r > x <- 11:30 > x [1] 11 12 13 14 15 16 17 18 19 20 21 22 [13] 23 24 25 26 27 28 29 30 -~~~~~~~~ +``` @@ -139,7 +135,7 @@ used in ordinary calculations; e.g. `1 / Inf` is 0. The value `NaN` represents an undefined value ("not a number"); e.g. 0 / 0; `NaN` can also be thought of as a missing value (more on that -later) +later). ## Attributes @@ -174,15 +170,14 @@ The `c()` function can be used to create vectors of objects by concatenating things together. -{line-numbers=off} -~~~~~~~~ +```r > x <- c(0.5, 0.6) ## numeric > x <- c(TRUE, FALSE) ## logical > x <- c(T, F) ## logical > x <- c("a", "b", "c") ## character > x <- 9:29 ## integer > x <- c(1+0i, 2+4i) ## complex -~~~~~~~~ +``` Note that in the above example, `T` and `F` are short-hand ways to specify `TRUE` and `FALSE`. However, in general one should try to use @@ -193,12 +188,11 @@ feeling lazy. 
You can also use the `vector()` function to initialize vectors. -{line-numbers=off} -~~~~~~~~ +```r > x <- vector("numeric", length = 10) > x [1] 0 0 0 0 0 0 0 0 0 0 -~~~~~~~~ +``` ## Mixing Objects @@ -207,12 +201,11 @@ together. Sometimes this happens by accident but it can also happen on purpose. So what happens with the following code? -{line-numbers=off} -~~~~~~~~ +```r > y <- c(1.7, "a") ## character > y <- c(TRUE, 2) ## numeric > y <- c("a", TRUE) ## character -~~~~~~~~ +``` In each case above, we are mixing objects of two different classes in a vector. But remember that the only rule about vectors says this is @@ -233,8 +226,7 @@ Objects can be explicitly coerced from one class to another using the `as.*` functions, if available. -{line-numbers=off} -~~~~~~~~ +```r > x <- 0:6 > class(x) [1] "integer" @@ -244,14 +236,13 @@ Objects can be explicitly coerced from one class to another using the [1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE > as.character(x) [1] "0" "1" "2" "3" "4" "5" "6" -~~~~~~~~ +``` Sometimes, R can't figure out how to coerce an object and this can result in `NA`s being produced. -{line-numbers=off} -~~~~~~~~ +```r > x <- c("a", "b", "c") > as.numeric(x) Warning: NAs introduced by coercion @@ -261,7 +252,7 @@ Warning: NAs introduced by coercion > as.complex(x) Warning: NAs introduced by coercion [1] NA NA NA -~~~~~~~~ +``` When nonsensical coercion takes place, you will usually get a warning from R. @@ -271,11 +262,10 @@ from R. Matrices are vectors with a _dimension_ attribute. The dimension attribute is itself an integer vector of length 2 (number of rows, -number of columns) +number of columns). -{line-numbers=off} -~~~~~~~~ +```r > m <- matrix(nrow = 2, ncol = 3) > m [,1] [,2] [,3] @@ -286,27 +276,25 @@ number of columns) > attributes(m) $dim [1] 2 3 -~~~~~~~~ +``` Matrices are constructed _column-wise_, so entries can be thought of starting in the "upper left" corner and running down the columns. -{line-numbers=off} -~~~~~~~~ +```r > m <- matrix(1:6, nrow = 2, ncol = 3) > m [,1] [,2] [,3] [1,] 1 3 5 [2,] 2 4 6 -~~~~~~~~ +``` Matrices can also be created directly from vectors by adding a dimension attribute. -{line-numbers=off} -~~~~~~~~ +```r > m <- 1:10 > m [1] 1 2 3 4 5 6 7 8 9 10 @@ -315,14 +303,13 @@ dimension attribute. [,1] [,2] [,3] [,4] [,5] [1,] 1 3 5 7 9 [2,] 2 4 6 8 10 -~~~~~~~~ +``` Matrices can be created by _column-binding_ or _row-binding_ with the `cbind()` and `rbind()` functions. -{line-numbers=off} -~~~~~~~~ +```r > x <- 1:3 > y <- 10:12 > cbind(x, y) @@ -334,7 +321,7 @@ Matrices can be created by _column-binding_ or _row-binding_ with the [,1] [,2] [,3] x 1 2 3 y 10 11 12 -~~~~~~~~ +``` ## Lists @@ -347,8 +334,7 @@ Lists can be explicitly created using the `list()` function, which takes an arbitrary number of arguments. -{line-numbers=off} -~~~~~~~~ +```r > x <- list(1, "a", TRUE, 1 + 4i) > x [[1]] @@ -362,14 +348,13 @@ takes an arbitrary number of arguments. [[4]] [1] 1+4i -~~~~~~~~ +``` We can also create an empty list of a prespecified length with the -`vector()` function +`vector()` function. -{line-numbers=off} -~~~~~~~~ +```r > x <- vector("list", length = 5) > x [[1]] @@ -386,7 +371,7 @@ NULL [[5]] NULL -~~~~~~~~ +``` ## Factors @@ -405,8 +390,7 @@ and "Female" is better than a variable that has values 1 and 2. Factor objects can be created with the `factor()` function. 
-{line-numbers=off} -~~~~~~~~ +```r > x <- factor(c("yes", "yes", "no", "yes", "no")) > x [1] yes yes no yes no @@ -420,7 +404,7 @@ x [1] 2 2 1 2 1 attr(,"levels") [1] "no" "yes" -~~~~~~~~ +``` Often factors will be automatically created for you when you read a dataset in using a function like `read.table()`. Those functions often @@ -432,8 +416,7 @@ argument to `factor()`. This can be important in linear modelling because the first level is used as the baseline level. -{line-numbers=off} -~~~~~~~~ +```r > x <- factor(c("yes", "yes", "no", "yes", "no")) > x ## Levels are put in alphabetical order [1] yes yes no yes no @@ -443,11 +426,11 @@ Levels: no yes > x [1] yes yes no yes no Levels: yes no -~~~~~~~~ +``` ## Missing Values -Missing values are denoted by `NA` or `NaN` for q undefined +Missing values are denoted by `NA` or `NaN` for undefined mathematical operations. - `is.na()` is used to test objects if they are `NA` @@ -461,8 +444,7 @@ mathematical operations. -{line-numbers=off} -~~~~~~~~ +```r > ## Create a vector with NAs in it > x <- c(1, 2, NA, 10, 3) > ## Return a logical vector indicating which elements are NA @@ -471,25 +453,24 @@ mathematical operations. > ## Return a logical vector indicating which elements are NaN > is.nan(x) [1] FALSE FALSE FALSE FALSE FALSE -~~~~~~~~ +``` -{line-numbers=off} -~~~~~~~~ +```r > ## Now create a vector with both NA and NaN values > x <- c(1, 2, NaN, NA, 4) > is.na(x) [1] FALSE FALSE TRUE TRUE FALSE > is.nan(x) [1] FALSE FALSE TRUE FALSE FALSE -~~~~~~~~ +``` ## Data Frames Data frames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modeling applications. Hadley Wickham's package -[dplyr](https://github.com/hadley/dplyr) has an optimized set of +[dplyr](https://github.com/tidyverse/dplyr) has an optimized set of functions designed to work efficiently with data frames. Data frames are represented as a special type of list where every @@ -516,8 +497,7 @@ should be used to coerce a data frame to a matrix, almost always, what you want is the result of `data.matrix()`. -{line-numbers=off} -~~~~~~~~ +```r > x <- data.frame(foo = 1:4, bar = c(T, T, F, F)) > x foo bar @@ -529,7 +509,7 @@ you want is the result of `data.matrix()`. [1] 4 > ncol(x) [1] 2 -~~~~~~~~ +``` ## Names @@ -538,8 +518,7 @@ code and self-describing objects. Here is an example of assigning names to an integer vector. -{line-numbers=off} -~~~~~~~~ +```r > x <- 1:3 > names(x) NULL @@ -549,13 +528,12 @@ NULL 1 2 3 > names(x) [1] "New York" "Seattle" "Los Angeles" -~~~~~~~~ +``` Lists can also have names, which is often very useful. -{line-numbers=off} -~~~~~~~~ +```r > x <- list("Los Angeles" = 1, Boston = 2, London = 3) > x $`Los Angeles` @@ -568,34 +546,32 @@ $London [1] 3 > names(x) [1] "Los Angeles" "Boston" "London" -~~~~~~~~ +``` Matrices can have both column and row names. -{line-numbers=off} -~~~~~~~~ +```r > m <- matrix(1:4, nrow = 2, ncol = 2) > dimnames(m) <- list(c("a", "b"), c("c", "d")) > m c d a 1 3 b 2 4 -~~~~~~~~ +``` Column names and row names can be set separately using the `colnames()` and `rownames()` functions. -{line-numbers=off} -~~~~~~~~ +```r > colnames(m) <- c("h", "f") > rownames(m) <- c("x", "z") > m h f x 1 3 z 2 4 -~~~~~~~~ +``` Note that for data frames, there is a separate function for setting the row names, the `row.names()` function. Also, data frames do not @@ -611,8 +587,8 @@ know its confusing. 
Here's a quick summary:

 ## Summary

-There are a variety of different builtin-data types in R. In this
-chapter we have reviewed the following
+There are a variety of different built-in data types in R. In this
+chapter we have reviewed the following:

 - atomic classes: numeric, logical, character, integer, complex

From ae2e6009ef2b048656b587a4e446feca5b9cc7a4 Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 03:16:12 -0400
Subject: [PATCH 07/24] readwritedata: stringsAsFactors update, syntax,
 calculation markdown readability, change example from jhsph.edu to
 en.wikipedia.org (jhsph has more complex header now), various data outputs
 have changed

---
 manuscript/readwritedata.Rmd |  82 +++++----
 manuscript/readwritedata.md  | 323 ++++++++++++++++-------------------
 2 files changed, 185 insertions(+), 220 deletions(-)

diff --git a/manuscript/readwritedata.Rmd b/manuscript/readwritedata.Rmd
index 69a4f23..78b05eb 100644
--- a/manuscript/readwritedata.Rmd
+++ b/manuscript/readwritedata.Rmd
@@ -64,15 +64,17 @@ The `read.table()` function has a few important arguments:

 your file, it's worth setting this to be the empty string `""`.
 * `skip`, the number of lines to skip from the beginning
 * `stringsAsFactors`, should character variables be coded as factors?
-  This defaults to `TRUE` because back in the old days, if you had
-  data that were stored as strings, it was because those strings
-  represented levels of a categorical variable. Now we have lots of
+  This defaults to `FALSE` as of R 4.0.0. In 2020,
+  [the default was changed](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/)
+  from `TRUE` to `FALSE` due to reproducibility and to stay consistent
+  with modern alternatives to data frames. Now we have lots of
   data that is text data and they don't always represent categorical
-  variables. So you may want to set this to be `FALSE` in those
-  cases. If you *always* want this to be `FALSE`, you can set a global
-  option via `options(stringsAsFactors = FALSE)`. I've never seen so
-  much heat generated on discussion forums about an R function
-  argument than the `stringsAsFactors` argument. Seriously.
+  variables. So setting it as `FALSE` makes sense in those
+  cases. With older versions of R, if you *always* want this to be
+  `FALSE`, you can set a global option via
+  `options(stringsAsFactors = FALSE)`. I've never seen so much heat
+  generated on discussion forums about an R function argument than the
+  `stringsAsFactors` argument. Seriously.


 For small to moderately sized datasets, you can usually call
@@ -102,8 +104,7 @@ argument).

 With much larger datasets, there are a few things that you can do
 that will make your life easier and will prevent R from choking.

-* Read the help page for read.table, which contains many hints
-
+* Read the help page for read.table, which contains many hints.
 * Make a rough calculation of the memory required to store your
   dataset (see the next section for an example of how to do this). If
   the dataset is larger than the amount of RAM on your computer, you
@@ -133,8 +134,8 @@ know a few things about your system.

 * How much memory is available on your system?
 * What other applications are in use? Can you close any of them?
 * Are there other users logged into the same system?
-* What operating system ar you using? Some operating systems can limit
-  the amount of memory a single process can access
+* What operating system are you using? Some operating systems can limit
+  the amount of memory a single process can access.
## Calculating Memory Requirements for R Objects @@ -153,13 +154,12 @@ required to store this data frame? Well, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. Given that -information, you can do the following calculation - +information, you can do the following calculation: -| 1,500,000 × 120 × 8 bytes/numeric | = 1,440,000,000 bytes | -| | = 1,440,000,000 / 2^20^ bytes/MB -| | = 1,373.29 MB -| | = 1.34 GB +``` +> 1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes +> 1,440,000,000 / 2^20 bytes/MB = 1,373.29 MB = 1.34 GB +``` So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware @@ -172,19 +172,19 @@ Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is usually an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in -the worst case. So make sure to do a rough calculation of memeory +the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. You'll thank me later. # Using the `readr` Package -The `readr` package is recently developed by Hadley Wickham to deal -with reading in large flat files quickly. The package provides -replacements for functions like `read.table()` and `read.csv()`. The -analogous functions in `readr` are `read_table()` and -`read_csv()`. These functions are often *much* faster than their base -R analogues and provide a few other nice features such as progress -meters. +The [readr](https://github.com/tidyverse/readr) package is recently +developed by Hadley Wickham to deal with reading in large flat files +quickly. The package provides replacements for functions like +`read.table()` and `read.csv()`. The analogous functions in `readr` +are `read_table()` and `read_csv()`. These functions are often +*much* faster than their base R analogues and provide a few other +nice features such as progress meters. For the most part, you can read use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`. In @@ -251,9 +251,6 @@ Now the `date` column is stored as a `Date` object which can be used for relevan A> The `read_csv` function has a `progress` option that defaults to TRUE. This options provides a nice progress meter while the CSV file is being read. However, if you are using `read_csv` in a function, or perhaps embedding it in a loop, it's probably best to set `progress = FALSE`. - - - # Using Textual and Binary Formats for Storing Data [Watch a video of this chapter](https://youtu.be/5mIPigbNDfk) @@ -341,6 +338,7 @@ str(y) x ``` + ## Binary Formats The complement to the textual format is the binary format, which is @@ -404,8 +402,6 @@ losing precision or any metadata. If that is what you need, then `serialize()` is the function for you. - - # Interfaces to the Outside World [Watch a video of this chapter](https://youtu.be/Pb01WoJRUtY) @@ -422,7 +418,7 @@ made to files (most common) or to other more exotic things. In general, connections are powerful tools that let you navigate files or other external objects. Connections can be thought of as a translator that lets you talk to objects that are outside of R. 
Those -outside objects could be anything from a data base, a simple text +outside objects could be anything from a database, a simple text file, or a a web service API. Connections allow R functions to talk to all these different external objects without you having to write custom code for each object. @@ -445,10 +441,10 @@ detail here. The `open` argument allows for the following options: -- "r" open file in read only mode -- "w" open a file for writing (and initializing a new file) -- "a" open a file for appending -- "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows) +- "r", open file in read only mode +- "w", open a file for writing (and initializing a new file) +- "a", open a file for appending +- "rb", "wb", "ab", reading, writing, or appending in binary mode (Windows) In practice, we often don't need to deal with the connection interface @@ -482,7 +478,7 @@ In the background, `read.csv()` opens a connection to the file `foo.txt`, reads from it, and closes the connection when it's done. The above example shows the basic approach to using -connections. Connections must be opened, then the are read from or +connections. Connections must be opened, then they are read from or written to, and then they are closed. @@ -499,7 +495,7 @@ x <- readLines(con, 10) x ``` -For more structured text data like CSV files or tab-delimited files, +For more structured text data, like CSV files or tab-delimited files, there are other functions like `read.csv()` or `read.table()`. The above example used the `gzfile()` function which is used to create @@ -515,7 +511,7 @@ time to a text file. ## Reading From a URL Connection The `readLines()` function can be useful for reading in lines of -webpages. Since web pages are basically text files that are stored on +web pages. Since web pages are basically text files that are stored on a remote server, there is conceptually not much difference between a web page and a local text file. However, we need R to negotiate the communication between your computer and the web server. This is what @@ -526,23 +522,23 @@ This code might take time depending on your connection speed. ```{r} ## Open a URL connection for reading -con <- url("http://www.jhsph.edu", "r") +con <- url("https://en.wikipedia.org","r") ## Read the web page x <- readLines(con) ## Print out the first few lines -head(x) +head(x,5) ``` While reading in a simple web page is sometimes useful, particularly if data are embedded in the web page somewhere. However, more commonly -we can use URL connection to read in specific data files that are +we can use URL connections to read in specific data files that are stored on web servers. Using URL connections can be useful for producing a reproducible analysis, because the code essentially documents where the data came -from and how they were obtained. This is approach is preferable to +from and how they were obtained. This approach is preferable to opening a web browser and downloading a dataset by hand. Of course, the code you write with connections may not be executable at a later date if things on the server side are changed or reorganized. diff --git a/manuscript/readwritedata.md b/manuscript/readwritedata.md index 9e9fb17..162ac42 100644 --- a/manuscript/readwritedata.md +++ b/manuscript/readwritedata.md @@ -62,25 +62,26 @@ The `read.table()` function has a few important arguments: your file, it's worth setting this to be the empty string `""`. 
* `skip`, the number of lines to skip from the beginning * `stringsAsFactors`, should character variables be coded as factors? - This defaults to `TRUE` because back in the old days, if you had - data that were stored as strings, it was because those strings - represented levels of a categorical variable. Now we have lots of + This defaults to `FALSE` as of R 4.0.0. In 2020, + [the default was changed](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/) + from `TRUE` to `FALSE` due to reproducibility and to stay consistent + with modern alternatives to data frames. Now we have lots of data that is text data and they don't always represent categorical - variables. So you may want to set this to be `FALSE` in those - cases. If you *always* want this to be `FALSE`, you can set a global - option via `options(stringsAsFactors = FALSE)`. I've never seen so - much heat generated on discussion forums about an R function - argument than the `stringsAsFactors` argument. Seriously. + variables. So setting it as `FALSE` makes sense in those + cases. With older versions of R, if you *always* want this to be + `FALSE`, you can set a global option via + `options(stringsAsFactors = FALSE)`. I've never seen so much heat + generated on discussion forums about an R function argument than the + `stringsAsFactors` argument. Seriously. For small to moderately sized datasets, you can usually call read.table without specifying any other arguments -{line-numbers=off} -~~~~~~~~ +```r > data <- read.table("foo.txt") -~~~~~~~~ +``` In this case, R will automatically @@ -102,8 +103,7 @@ argument). With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking. -* Read the help page for read.table, which contains many hints - +* Read the help page for read.table, which contains many hints. * Make a rough calculation of the memory required to store your dataset (see the next section for an example of how to do this). If the dataset is larger than the amount of RAM on your computer, you @@ -118,12 +118,11 @@ will make your life easier and will prevent R from choking. following: -{line-numbers=off} -~~~~~~~~ +```r > initial <- read.table("datatable.txt", nrows = 100) > classes <- sapply(initial, class) > tabAll <- read.table("datatable.txt", colClasses = classes) -~~~~~~~~ +``` * Set `nrows`. This doesn’t make R run faster but it helps with memory usage. A mild overestimate is okay. You can use the Unix tool `wc` @@ -135,8 +134,8 @@ know a few things about your system. * How much memory is available on your system? * What other applications are in use? Can you close any of them? * Are there other users logged into the same system? -* What operating system ar you using? Some operating systems can limit - the amount of memory a single process can access +* What operating system are you using? Some operating systems can limit + the amount of memory a single process can access. ## Calculating Memory Requirements for R Objects @@ -155,13 +154,12 @@ required to store this data frame? Well, on most modern computers [double precision floating point numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format) are stored using 64 bits of memory, or 8 bytes. 
Given that -information, you can do the following calculation - +information, you can do the following calculation: -| 1,500,000 × 120 × 8 bytes/numeric | = 1,440,000,000 bytes | -| | = 1,440,000,000 / 2^20^ bytes/MB -| | = 1,373.29 MB -| | = 1.34 GB +``` +> 1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes +> 1,440,000,000 / 2^20 bytes/MB = 1,373.29 MB = 1.34 GB +``` So the dataset would require about 1.34 GB of RAM. Most computers these days have at least that much RAM. However, you need to be aware @@ -174,19 +172,19 @@ Reading in a large dataset for which you do not have enough RAM is one easy way to freeze up your computer (or at least your R session). This is usually an unpleasant experience that usually requires you to kill the R process, in the best case scenario, or reboot your computer, in -the worst case. So make sure to do a rough calculation of memeory +the worst case. So make sure to do a rough calculation of memory requirements before reading in a large dataset. You'll thank me later. # Using the `readr` Package -The `readr` package is recently developed by Hadley Wickham to deal -with reading in large flat files quickly. The package provides -replacements for functions like `read.table()` and `read.csv()`. The -analogous functions in `readr` are `read_table()` and -`read_csv()`. These functions are often *much* faster than their base -R analogues and provide a few other nice features such as progress -meters. +The [readr](https://github.com/tidyverse/readr) package is recently +developed by Hadley Wickham to deal with reading in large flat files +quickly. The package provides replacements for functions like +`read.table()` and `read.csv()`. The analogous functions in `readr` +are `read_table()` and `read_csv()`. These functions are often +*much* faster than their base R analogues and provide a few other +nice features such as progress meters. For the most part, you can read use `read_table()` and `read_csv()` pretty much anywhere you might use `read.table()` and `read.csv()`. In @@ -208,31 +206,33 @@ specifying column types. A typical call to `read_csv` will look as follows. -{line-numbers=off} -~~~~~~~~ +```r > library(readr) > teams <- read_csv("data/team_standings.csv") -Parsed with column specification: -cols( - Standing = col_integer(), - Team = col_character() -) +Rows: 32 Columns: 2 +-- Column specification ------------------------------------------------------------------------------------------------------------------------------------ +Delimiter: "," +chr (1): Team +dbl (1): Standing + +i Use `spec()` to retrieve the full column specification for this data. +i Specify the column types or set `show_col_types = FALSE` to quiet this message. > teams -# A tibble: 32 × 2 - Standing Team - -1 1 Spain -2 2 Netherlands -3 3 Germany -4 4 Uruguay -5 5 Argentina -6 6 Brazil -7 7 Ghana -8 8 Paraguay -9 9 Japan -10 10 Chile +# A tibble: 32 x 2 + Standing Team + + 1 1 Spain + 2 2 Netherlands + 3 3 Germany + 4 4 Uruguay + 5 5 Argentina + 6 6 Brazil + 7 7 Ghana + 8 8 Paraguay + 9 9 Japan +10 10 Chile # ... with 22 more rows -~~~~~~~~ +``` By default, `read_csv` will open a CSV file and read it in line-by-line. It will also (by default), read in the first few rows of the table in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv` help page: @@ -243,88 +243,77 @@ You can specify the type of each column with the `col_types` argument. In general, it's a good idea to specify the column types explicitly. 
This rules out any possible guessing errors on the part of `read_csv`. Also, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it. -{line-numbers=off} -~~~~~~~~ +```r > teams <- read_csv("data/team_standings.csv", col_types = "cc") -~~~~~~~~ +``` Note that the `col_types` argument accepts a compact representation. Here `"cc"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values). The `read_csv` function will also read compressed files automatically. There is no need to decompress the file first or use the `gzfile` connection function. The following call reads a gzip-compressed CSV file containing download logs from the RStudio CRAN mirror. -{line-numbers=off} -~~~~~~~~ +```r > logs <- read_csv("data/2016-07-19.csv.bz2", n_max = 10) -Parsed with column specification: -cols( - date = col_date(format = ""), - time = col_time(format = ""), - size = col_integer(), - r_version = col_character(), - r_arch = col_character(), - r_os = col_character(), - package = col_character(), - version = col_character(), - country = col_character(), - ip_id = col_integer() -) -~~~~~~~~ +Rows: 10 Columns: 10 +-- Column specification ------------------------------------------------------------------------------------------------------------------------------------ +Delimiter: "," +chr (6): r_version, r_arch, r_os, package, version, country +dbl (2): size, ip_id +date (1): date +time (1): time + +i Use `spec()` to retrieve the full column specification for this data. +i Specify the column types or set `show_col_types = FALSE` to quiet this message. +``` Note that the warnings indicate that `read_csv` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument. -{line-numbers=off} -~~~~~~~~ +```r > logs <- read_csv("data/2016-07-19.csv.bz2", col_types = "ccicccccci", n_max = 10) > logs -# A tibble: 10 × 10 - date time size r_version r_arch r_os package - -1 2016-07-19 22:00:00 1887881 3.3.0 x86_64 mingw32 data.table -2 2016-07-19 22:00:05 45436 3.3.1 x86_64 mingw32 assertthat -3 2016-07-19 22:00:03 14259016 3.3.1 x86_64 mingw32 stringi -4 2016-07-19 22:00:05 1887881 3.3.1 x86_64 mingw32 data.table -5 2016-07-19 22:00:06 389615 3.3.1 x86_64 mingw32 foreach -6 2016-07-19 22:00:08 48842 3.3.1 x86_64 linux-gnu tree -7 2016-07-19 22:00:12 525 3.3.1 x86_64 darwin13.4.0 survival -8 2016-07-19 22:00:08 3225980 3.3.1 x86_64 mingw32 Rcpp -9 2016-07-19 22:00:09 556091 3.3.1 x86_64 mingw32 tibble -10 2016-07-19 22:00:10 151527 3.3.1 x86_64 mingw32 magrittr -# ... 
with 3 more variables: version , country , ip_id -~~~~~~~~ +# A tibble: 10 x 10 + date time size r_version r_arch r_os package version country ip_id + + 1 2016-07-19 22:00:00 1887881 3.3.0 x86_64 mingw32 data.table 1.9.6 US 1 + 2 2016-07-19 22:00:05 45436 3.3.1 x86_64 mingw32 assertthat 0.1 US 2 + 3 2016-07-19 22:00:03 14259016 3.3.1 x86_64 mingw32 stringi 1.1.1 DE 3 + 4 2016-07-19 22:00:05 1887881 3.3.1 x86_64 mingw32 data.table 1.9.6 US 4 + 5 2016-07-19 22:00:06 389615 3.3.1 x86_64 mingw32 foreach 1.4.3 US 4 + 6 2016-07-19 22:00:08 48842 3.3.1 x86_64 linux-gnu tree 1.0-37 CO 5 + 7 2016-07-19 22:00:12 525 3.3.1 x86_64 darwin13.4.0 survival 2.39-5 US 6 + 8 2016-07-19 22:00:08 3225980 3.3.1 x86_64 mingw32 Rcpp 0.12.5 US 2 + 9 2016-07-19 22:00:09 556091 3.3.1 x86_64 mingw32 tibble 1.1 US 2 +10 2016-07-19 22:00:10 151527 3.3.1 x86_64 mingw32 magrittr 1.5 US 2 +``` You can specify the column type in a more detailed fashion by using the various `col_*` functions. For example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a Date variable. If we wanted to just read in that first column, we could do -{line-numbers=off} -~~~~~~~~ +```r > logdates <- read_csv("data/2016-07-19.csv.bz2", + col_types = cols_only(date = col_date()), + n_max = 10) > logdates -# A tibble: 10 × 1 - date - -1 2016-07-19 -2 2016-07-19 -3 2016-07-19 -4 2016-07-19 -5 2016-07-19 -6 2016-07-19 -7 2016-07-19 -8 2016-07-19 -9 2016-07-19 +# A tibble: 10 x 1 + date + + 1 2016-07-19 + 2 2016-07-19 + 3 2016-07-19 + 4 2016-07-19 + 5 2016-07-19 + 6 2016-07-19 + 7 2016-07-19 + 8 2016-07-19 + 9 2016-07-19 10 2016-07-19 -~~~~~~~~ +``` Now the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the `lubridate` package). A> The `read_csv` function has a `progress` option that defaults to TRUE. This options provides a nice progress meter while the CSV file is being read. However, if you are using `read_csv` in a function, or perhaps embedding it in a loop, it's probably best to set `progress = FALSE`. - - - # Using Textual and Binary Formats for Storing Data [Watch a video of this chapter](https://youtu.be/5mIPigbNDfk) @@ -368,15 +357,14 @@ One way to pass data around is by deparsing the R object with `dput()` and reading it back in (parsing it) using `dget()`. -{line-numbers=off} -~~~~~~~~ +```r > ## Create a data frame > y <- data.frame(a = 1, b = "a") > ## Print 'dput' output to console > dput(y) -structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", -"b"), row.names = c(NA, -1L), class = "data.frame") -~~~~~~~~ +structure(list(a = 1, b = "a"), class = "data.frame", row.names = c(NA, +-1L)) +``` Notice that the `dput()` output is in the form of R code and that it preserves metadata like the class of the object, the row names, and @@ -385,8 +373,7 @@ the column names. The output of `dput()` can also be saved directly to a file. -{line-numbers=off} -~~~~~~~~ +```r > ## Send 'dput' output to a file > dput(y, file = "y.R") > ## Read in 'dput' output from a file @@ -394,41 +381,39 @@ The output of `dput()` can also be saved directly to a file. > new.y a b 1 1 a -~~~~~~~~ +``` Multiple objects can be deparsed at once using the dump function and read back in using `source`. -{line-numbers=off} -~~~~~~~~ +```r > x <- "foo" > y <- data.frame(a = 1L, b = "a") -~~~~~~~~ +``` We can `dump()` R objects to a file by passing a character vector of their names. 
-{line-numbers=off}
-~~~~~~~~
+```r
 > dump(c("x", "y"), file = "data.R")
 > rm(x, y)
-~~~~~~~~
+```
 
 The inverse of `dump()` is `source()`.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > source("data.R")
 > str(y)
 'data.frame':	1 obs. of  2 variables:
  $ a: int 1
- $ b: Factor w/ 1 level "a": 1
+ $ b: chr "a"
 > x
 [1] "foo"
-~~~~~~~~
+```
+
 
 ## Binary Formats
 
@@ -443,8 +428,7 @@ The key functions for converting R objects into a binary format are
 be saved to a file using the `save()` function.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > a <- data.frame(x = rnorm(100), y = runif(100))
 > b <- c(3, 4.4, 1 / 3)
 > 
 > ## Save 'a' and 'b' to a file
 > save(a, b, file = "mydata.rda")
 > 
 > ## Load 'a' and 'b' into your workspace
 > load("mydata.rda")
-~~~~~~~~
+```
 
 If you have a lot of objects that you want to save to a file, you can
 save all objects in your workspace using the `save.image()` function.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > ## Save everything to a file
 > save.image(file = "mydata.RData")
 > 
 > ## load all objects in this file
 > load("mydata.RData")
-~~~~~~~~
+```
 
 Notice that I've used the `.rda` extension when using `save()` and the
 `.RData` extension when using `save.image()`. This is just my personal
@@ -484,15 +467,12 @@ When you call `serialize()` on an R object, the output will be a raw
 vector coded in hexadecimal format.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x <- list(1, 2, 3)
 > serialize(x, NULL)
- [1] 58 0a 00 00 00 02 00 03 03 02 00 02 03 00 00 00 00 13 00 00 00 03 00
-[24] 00 00 0e 00 00 00 01 3f f0 00 00 00 00 00 00 00 00 00 0e 00 00 00 01
-[47] 40 00 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 40 08 00 00 00 00 00
-[70] 00
-~~~~~~~~
+ [1] 58 0a 00 00 00 03 00 04 01 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 13 00 00 00 03 00 00 00 0e 00 00 00 01 3f f0 00 00 00 00 00 00 00 00
+[51] 00 0e 00 00 00 01 40 00 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 40 08 00 00 00 00 00 00
+```
 
 If you want, this can be sent to a file, but in that case you are
 better off using something like `save()`.
@@ -503,8 +483,6 @@ losing precision or any metadata. If that is what you need, then
 `serialize()` is the function for you.
 
 
-
-
 # Interfaces to the Outside World
 
 [Watch a video of this chapter](https://youtu.be/Pb01WoJRUtY)
@@ -521,7 +499,7 @@ made to files (most common) or to other more exotic things.
 In general, connections are powerful tools that let you navigate files
 or other external objects. Connections can be thought of as a
 translator that lets you talk to objects that are outside of R. Those
-outside objects could be anything from a data base, a simple text
+outside objects could be anything from a database, a simple text
 file, or a web service API. Connections allow R functions to talk
 to all these different external objects without you having to write
 custom code for each object.
 
 
## File Connections
 
 Connections to text files can be created with the `file()` function.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > str(file)
-function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), 
-    raw = FALSE, method = getOption("url.method", "default"))
-~~~~~~~~
+function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), raw = FALSE, method = getOption("url.method", "default"))
+```
 
 The `file()` function has a number of arguments that are common to
 many other connection functions so it's worth going into a little
 detail here.
The `open` argument allows for the following options: -- "r" open file in read only mode -- "w" open a file for writing (and initializing a new file) -- "a" open a file for appending -- "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows) +- "r", open file in read only mode +- "w", open a file for writing (and initializing a new file) +- "a", open a file for appending +- "rb", "wb", "ab", reading, writing, or appending in binary mode (Windows) In practice, we often don't need to deal with the connection interface @@ -562,8 +538,7 @@ For example, if one were to explicitly use connections to read a CSV file in to R, it might look like this, -{line-numbers=off} -~~~~~~~~ +```r > ## Create a connection to 'foo.txt' > con <- file("foo.txt") > @@ -575,21 +550,20 @@ file in to R, it might look like this, > > ## Close the connection > close(con) -~~~~~~~~ +``` which is the same as -{line-numbers=off} -~~~~~~~~ +```r > data <- read.csv("foo.txt") -~~~~~~~~ +``` In the background, `read.csv()` opens a connection to the file `foo.txt`, reads from it, and closes the connection when it's done. The above example shows the basic approach to using -connections. Connections must be opened, then the are read from or +connections. Connections must be opened, then they are read from or written to, and then they are closed. @@ -600,17 +574,15 @@ function. This function is useful for reading text files that may be unstructured or contain non-standard data. -{line-numbers=off} -~~~~~~~~ +```r > ## Open connection to gz-compressed text file > con <- gzfile("words.gz") > x <- readLines(con, 10) > x - [1] "1080" "10-point" "10th" "11-point" "12-point" "16-point" - [7] "18-point" "1st" "2" "20-point" -~~~~~~~~ + [1] "1080" "10-point" "10th" "11-point" "12-point" "16-point" "18-point" "1st" "2" "20-point" +``` -For more structured text data like CSV files or tab-delimited files, +For more structured text data, like CSV files or tab-delimited files, there are other functions like `read.csv()` or `read.table()`. The above example used the `gzfile()` function which is used to create @@ -626,7 +598,7 @@ time to a text file. ## Reading From a URL Connection The `readLines()` function can be useful for reading in lines of -webpages. Since web pages are basically text files that are stored on +web pages. Since web pages are basically text files that are stored on a remote server, there is conceptually not much difference between a web page and a local text file. However, we need R to negotiate the communication between your computer and the web server. This is what @@ -636,32 +608,29 @@ a web server. This code might take time depending on your connection speed. -{line-numbers=off} -~~~~~~~~ +```r > ## Open a URL connection for reading -> con <- url("http://www.jhsph.edu", "r") +> con <- url("https://en.wikipedia.org","r") > > ## Read the web page > x <- readLines(con) +Warning in readLines(con): incomplete final line found on 'https://en.wikipedia.org' > > ## Print out the first few lines -> head(x) -[1] "" -[2] "" -[3] "" -[4] "" -[5] "" -[6] "Johns Hopkins Bloomberg School of Public Health" -~~~~~~~~ +> head(x,5) +[1] "" "" +[3] "" "" +[5] "Wikipedia, the free encyclopedia" +``` While reading in a simple web page is sometimes useful, particularly if data are embedded in the web page somewhere. However, more commonly -we can use URL connection to read in specific data files that are +we can use URL connections to read in specific data files that are stored on web servers. 
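For example, if a plain CSV file is sitting on a web server, `read.csv()` will accept the URL directly and manage the connection for you. Here is a small sketch; the URL below is just a placeholder:

```r
## Read a remote CSV file straight from a web server
dat <- read.csv("https://example.com/path/to/data.csv")
head(dat)
```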
Using URL connections can be useful for producing a reproducible
 analysis, because the code essentially documents where the data came
-from and how they were obtained. This is approach is preferable to
+from and how they were obtained. This approach is preferable to
 opening a web browser and downloading a dataset by hand. Of course,
 the code you write with connections may not be executable at a later
 date if things on the server side are changed or reorganized.

From e7d54eeb5f0563a3dd2a408d52ea14ddee97424d Mon Sep 17 00:00:00 2001
From: Richard
Date: Fri, 5 Nov 2021 03:41:21 -0400
Subject: [PATCH 08/24] vectorized: spelling

---
 manuscript/vectorized.Rmd |  2 +-
 manuscript/vectorized.md  | 32 +++++++++++++-------------------
 2 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/manuscript/vectorized.Rmd b/manuscript/vectorized.Rmd
index 21b6e36..de830bf 100644
--- a/manuscript/vectorized.Rmd
+++ b/manuscript/vectorized.Rmd
@@ -64,7 +64,7 @@ x / y
 
 ## Vectorized Matrix Operations
 
-Matrix operations are also vectorized, making for nicly compact
+Matrix operations are also vectorized, making for nicely compact
 notation. This way, we can do element-by-element operations on
 matrices without having to loop over every element.
 
diff --git a/manuscript/vectorized.md b/manuscript/vectorized.md
index 8c47b31..56ac57a 100644
--- a/manuscript/vectorized.md
+++ b/manuscript/vectorized.md
@@ -12,27 +12,25 @@ languages.
 
 The simplest example is when adding two vectors together.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x <- 1:4
 > y <- 6:9
 > z <- x + y
 > z
 [1]  7  9 11 13
-~~~~~~~~
+```
 
 Natural, right? Without vectorization, you'd have to do something
 like
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 z <- numeric(length(x))
 for(i in seq_along(x)) {
       z[i] <- x[i] + y[i]
 }
 z
 [1]  7  9 11 13
-~~~~~~~~
+```
 
 If you had to do that every time you wanted to add two vectors, your
 hands would get very tired from all the typing.
 
@@ -42,26 +40,24 @@ comparisons. So suppose you wanted to know which elements of a
 vector were greater than 2. You could do the following.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x
 [1] 1 2 3 4
 > x > 2
 [1] FALSE FALSE  TRUE  TRUE
-~~~~~~~~
+```
 
 Here are other vectorized logical operations.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x >= 2
 [1] FALSE  TRUE  TRUE  TRUE
 > x < 3
 [1]  TRUE  TRUE FALSE FALSE
 > y == 8
 [1] FALSE FALSE  TRUE FALSE
-~~~~~~~~
+```
 
 Notice that these logical operations return a logical vector of `TRUE`
 and `FALSE`.
 
 
 Of course, subtraction, multiplication and division are also
 vectorized.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x - y
 [1] -5 -5 -5 -5
 > x * y
 [1]  6 14 24 36
 > x / y
 [1] 0.1666667 0.2857143 0.3750000 0.4444444
-~~~~~~~~
+```
 
 
 ## Vectorized Matrix Operations
 
-Matrix operations are also vectorized, making for nicly compact
+Matrix operations are also vectorized, making for nicely compact
 notation. This way, we can do element-by-element operations on
 matrices without having to loop over every element.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > x <- matrix(1:4, 2, 2)
 > y <- matrix(rep(10, 4), 2, 2)
 > 
[,1] [,2] [1,] 40 40 [2,] 60 60 -~~~~~~~~ +``` From c9de63a4218e93dd95e2c766e605398d21e9da75 Mon Sep 17 00:00:00 2001 From: Richard Date: Fri, 5 Nov 2021 04:30:41 -0400 Subject: [PATCH 09/24] dplyr: spelling, wording, syntax --- manuscript/dplyr.Rmd | 20 +-- manuscript/dplyr.md | 302 ++++++++++++++++++++----------------------- 2 files changed, 147 insertions(+), 175 deletions(-) diff --git a/manuscript/dplyr.Rmd b/manuscript/dplyr.Rmd index ce79ca1..8bf464d 100644 --- a/manuscript/dplyr.Rmd +++ b/manuscript/dplyr.Rmd @@ -40,7 +40,7 @@ Some of the key "verbs" provided by the `dplyr` package are * `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline -The `dplyr` package as a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about. +The `dplyr` package has a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about. @@ -52,7 +52,7 @@ All of the functions that we will discuss in this Chapter will have a few common 2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names). -3. The return result of a function is a new data frame +3. The return result of a function is a new data frame. 4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation. @@ -84,7 +84,7 @@ You may get some warnings when the package is loaded because there are functions ## `select()` -For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my web site. +For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my website. After unzipping the archive, you can load the data into R using the `readRDS()` function. @@ -101,7 +101,7 @@ str(chicago) The `select()` function can be used to select columns of a data frame that you want to focus on. Often you'll have a large data frame containing "all" of the data, but any *given* analysis might only use a subset of variables or observations. The `select()` function allows you to get the few columns you might need. -Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly. +Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly. ```{r} names(chicago)[1:3] @@ -197,7 +197,7 @@ and the last few rows. 
tail(select(chicago, date, pm25tmean2), 3)
 ```
 
-Columns can be arranged in descending order too by useing the special `desc()` operator.
+Columns can be arranged in descending order too by using the special `desc()` operator.
 
 ```{r}
 chicago <- arrange(chicago, desc(date))
@@ -221,7 +221,7 @@ Here you can see the names of the first five variables in the `chicago` data fra
 head(chicago[, 1:5], 3)
 ```
 
-The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible.
+The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably need to be renamed to something more sensible.
 
 ```{r}
 chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
@@ -365,11 +365,11 @@ Here we can see that `o3` tends to be low in the winter months and high in the s
 
 The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.
 
-Once you learn the `dplyr` grammar there are a few additional benefits
+Once you learn the `dplyr` grammar there are a few additional benefits:
 
-* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package
+* `dplyr` can work with other data frame "backends", such as SQL databases. There is a SQL interface for relational databases via the DBI package.
 
-* `dplyr` can be integrated with the `data.table` package for large fast tables
+* `dplyr` can be integrated with the `data.table` package for large fast tables.
 
-The `dplyr` package is handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
+The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
 
diff --git a/manuscript/dplyr.md b/manuscript/dplyr.md
index 3375593..081fa52 100644
--- a/manuscript/dplyr.md
+++ b/manuscript/dplyr.md
@@ -37,7 +37,7 @@ Some of the key "verbs" provided by the `dplyr` package are
 
 * `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline
 
-The `dplyr` package as a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.
+The `dplyr` package has a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.
 
 
 
@@ -49,7 +49,7 @@ All of the functions that we will discuss in this Chapter will have a few common
 
 2. The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).
 
-3. 
The return result of a function is a new data frame +3. The return result of a function is a new data frame. 4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation. @@ -61,24 +61,21 @@ The `dplyr` package can be installed from CRAN or from GitHub using the `devtool To install from CRAN, just run -{line-numbers=off} -~~~~~~~~ +```r > install.packages("dplyr") -~~~~~~~~ +``` To install from GitHub you can run -{line-numbers=off} -~~~~~~~~ +```r > install_github("hadley/dplyr") -~~~~~~~~ +``` After installing the package it is important that you load it into your R session with the `library()` function. -{line-numbers=off} -~~~~~~~~ +```r > library(dplyr) Attaching package: 'dplyr' @@ -88,28 +85,26 @@ The following objects are masked from 'package:stats': The following objects are masked from 'package:base': intersect, setdiff, setequal, union -~~~~~~~~ +``` You may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. For now you can ignore the warnings. ## `select()` -For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my web site. +For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my website. After unzipping the archive, you can load the data into R using the `readRDS()` function. -{line-numbers=off} -~~~~~~~~ +```r > chicago <- readRDS("chicago.rds") -~~~~~~~~ +``` You can see some basic characteristics of the dataset with the `dim()` and `str()` functions. -{line-numbers=off} -~~~~~~~~ +```r > dim(chicago) [1] 6940 8 > str(chicago) @@ -122,15 +117,14 @@ You can see some basic characteristics of the dataset with the `dim()` and `str( $ pm10tmean2: num 34 NA 34.2 47 NA ... $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ... $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ... -~~~~~~~~ +``` The `select()` function can be used to select columns of a data frame that you want to focus on. Often you'll have a large data frame containing "all" of the data, but any *given* analysis might only use a subset of variables or observations. The `select()` function allows you to get the few columns you might need. -Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly. +Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly. -{line-numbers=off} -~~~~~~~~ +```r > names(chicago)[1:3] [1] "city" "tmpd" "dptp" > subset <- select(chicago, city:dptp) @@ -142,36 +136,33 @@ Suppose we wanted to take the first 3 columns only. There are a few ways to do t 4 chic 29.0 28.625 5 chic 32.0 28.875 6 chic 40.0 35.125 -~~~~~~~~ +``` Note that the `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names. 
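As a quick sanity check, position-based selection and name-range selection should give the same result here. A minimal sketch:

```r
## Both forms pull out the first three columns of 'chicago'
subset1 <- select(chicago, 1:3)
subset2 <- select(chicago, city:dptp)
identical(subset1, subset2)  ## TRUE
```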
You can also *omit* variables using the `select()` function by using the negative sign. With `select()` you can do -{line-numbers=off} -~~~~~~~~ +```r > select(chicago, -(city:dptp)) -~~~~~~~~ +``` which indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be -{line-numbers=off} -~~~~~~~~ +```r > i <- match("city", names(chicago)) > j <- match("dptp", names(chicago)) > head(chicago[, -(i:j)]) -~~~~~~~~ +``` Not super intuitive, right? The `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a "2", we could do -{line-numbers=off} -~~~~~~~~ +```r > subset <- select(chicago, ends_with("2")) > str(subset) 'data.frame': 6940 obs. of 4 variables: @@ -179,19 +170,18 @@ The `select()` function also allows a special syntax that allows you to specify $ pm10tmean2: num 34 NA 34.2 47 NA ... $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ... $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ... -~~~~~~~~ +``` Or if we wanted to keep every variable that starts with a "d", we could do -{line-numbers=off} -~~~~~~~~ +```r > subset <- select(chicago, starts_with("d")) > str(subset) 'data.frame': 6940 obs. of 2 variables: $ dptp: num 31.5 29.9 27.4 28.6 28.9 ... $ date: Date, format: "1987-01-01" "1987-01-02" ... -~~~~~~~~ +``` You can also use more general regular expressions if necessary. See the help page (`?select`) for more details. @@ -203,8 +193,7 @@ The `filter()` function is used to extract subsets of rows from a data frame. Th Suppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do -{line-numbers=off} -~~~~~~~~ +```r > chic.f <- filter(chicago, pm25tmean2 > 30) > str(chic.f) 'data.frame': 194 obs. of 8 variables: @@ -216,24 +205,22 @@ Suppose we wanted to extract the rows of the `chicago` data frame where the leve $ pm10tmean2: num 32.5 38.7 34 28.5 35 ... $ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ... $ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ... -~~~~~~~~ +``` You can see that there are now only 194 rows in the data frame and the distribution of the `pm25tmean2` values is. -{line-numbers=off} -~~~~~~~~ +```r > summary(chic.f$pm25tmean2) Min. 1st Qu. Median Mean 3rd Qu. Max. 30.05 32.12 35.04 36.63 39.53 61.50 -~~~~~~~~ +``` We can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit. -{line-numbers=off} -~~~~~~~~ +```r > chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80) > select(chic.f, date, tmpd, pm25tmean2) date tmpd pm25tmean2 @@ -254,7 +241,7 @@ We can place an arbitrarily complex logical sequence inside of `filter()`, so we 15 2005-06-28 85 31.20000 16 2005-07-17 84 32.70000 17 2005-08-03 84 37.90000 -~~~~~~~~ +``` Now there are only 17 observations where both of those conditions are met. @@ -268,48 +255,43 @@ of other columns) is normally a pain to do in R. The `arrange()` function simpli Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation. 
-{line-numbers=off}
-~~~~~~~~
+```r
 > chicago <- arrange(chicago, date)
-~~~~~~~~
+```
 
 We can now check the first few rows
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > head(select(chicago, date, pm25tmean2), 3)
         date pm25tmean2
 1 1987-01-01         NA
 2 1987-01-02         NA
 3 1987-01-03         NA
-~~~~~~~~
+```
 
 and the last few rows.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > tail(select(chicago, date, pm25tmean2), 3)
           date pm25tmean2
 6938 2005-12-29    7.45000
 6939 2005-12-30   15.05714
 6940 2005-12-31   15.00000
-~~~~~~~~
+```
 
-Columns can be arranged in descending order too by useing the special `desc()` operator.
+Columns can be arranged in descending order too by using the special `desc()` operator.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > chicago <- arrange(chicago, desc(date))
-~~~~~~~~
+```
 
 Looking at the first three and last three rows shows the dates in descending order.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > head(select(chicago, date, pm25tmean2), 3)
         date pm25tmean2
 1 2005-12-31   15.00000
 2 2005-12-30   15.05714
 3 2005-12-29    7.45000
 > tail(select(chicago, date, pm25tmean2), 3)
           date pm25tmean2
 6938 1987-01-03         NA
 6939 1987-01-02         NA
 6940 1987-01-01         NA
-~~~~~~~~
+```
 
 
 ## `rename()`
 
@@ -330,27 +312,25 @@ Renaming a variable in a data frame in R is surprisingly hard to do! The `rename
 
 Here you can see the names of the first five variables in the `chicago` data frame.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > head(chicago[, 1:5], 3)
   city tmpd dptp       date pm25tmean2
 1 chic   35 30.1 2005-12-31   15.00000
 2 chic   36 31.0 2005-12-30   15.05714
 3 chic   35 29.4 2005-12-29    7.45000
-~~~~~~~~
+```
 
-The `dptp` column is supposed to represent the dew point temperature adn the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably be renamed to something more sensible.
+The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably need to be renamed to something more sensible.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
 > head(chicago[, 1:5], 3)
   city tmpd dewpoint       date     pm25
 1 chic   35     30.1 2005-12-31 15.00000
 2 chic   36     31.0 2005-12-30 15.05714
 3 chic   35     29.4 2005-12-29  7.45000
-~~~~~~~~
+```
 
 The syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.
 
@@ -365,8 +345,7 @@ For example, with air pollution data, we often want to *detrend* the data by sub
 
 Here we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
 > head(chicago)
   city tmpd dewpoint       date     pm25 pm10tmean2  o3tmean2 no2tmean2
 1 chic   35     30.1 2005-12-31 15.00000       23.5  2.531250  13.25000
 2 chic   36     31.0 2005-12-30 15.05714       19.2  3.034420  22.80556
 3 chic   35     29.4 2005-12-29  7.45000       23.5  6.794837  19.97222
 4 chic   37     34.5 2005-12-28 17.75000       27.5  3.260417  19.28563
 5 chic   40     33.6 2005-12-27 23.56000       27.0  4.468750  23.50000
 6 chic   35     29.6 2005-12-26  8.40000        8.5 14.041667  16.81944
   pm25detrend
 1   -1.230958
 2   -1.173815
 3   -8.780958
 4    1.519042
 5    7.329042
 6   -7.830958
-~~~~~~~~
+```
 
 There is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.
 
 Here we detrend the PM10 and ozone (O3) variables.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > head(transmute(chicago, 
 +                pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
 +                o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))
   pm10detrend  o3detrend
 1  -10.395206 -16.904263
 2  -14.695206 -16.401093
 3  -10.395206 -12.640676
 4   -6.395206 -16.175096
 5   -6.895206 -14.966763
 6  -25.395206  -5.393846
-~~~~~~~~
+```
 
 Note that there are only two columns in the transmuted data frame.
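It is also worth noting that `mutate()` can create several variables in a single call, and later definitions can refer to earlier ones. Here is a small sketch using the renamed `tmpd` and `dewpoint` columns; the new variable names are just illustrations:

```r
## 'tempdiff' is defined first and then reused within the same call
chicago <- mutate(chicago,
                  tempdiff = tmpd - dewpoint,
                  tempdiff2 = tempdiff^2)
```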
@@ -416,50 +394,48 @@ The general operation here is a combination of splitting a data frame into separ First, we can create a `year` varible using `as.POSIXlt()`. -{line-numbers=off} -~~~~~~~~ +```r > chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900) -~~~~~~~~ +``` Now we can create a separate data frame that splits the original data frame by year. -{line-numbers=off} -~~~~~~~~ +```r > years <- group_by(chicago, year) -~~~~~~~~ +``` Finally, we compute summary statistics for each year in the data frame with the `summarize()` function. -{line-numbers=off} -~~~~~~~~ +```r > summarize(years, pm25 = mean(pm25, na.rm = TRUE), + o3 = max(o3tmean2, na.rm = TRUE), -+ no2 = median(no2tmean2, na.rm = TRUE)) -# A tibble: 19 × 4 - year pm25 o3 no2 - -1 1987 NaN 62.96966 23.49369 -2 1988 NaN 61.67708 24.52296 -3 1989 NaN 59.72727 26.14062 -4 1990 NaN 52.22917 22.59583 -5 1991 NaN 63.10417 21.38194 -6 1992 NaN 50.82870 24.78921 -7 1993 NaN 44.30093 25.76993 -8 1994 NaN 52.17844 28.47500 -9 1995 NaN 66.58750 27.26042 -10 1996 NaN 58.39583 26.38715 -11 1997 NaN 56.54167 25.48143 -12 1998 18.26467 50.66250 24.58649 -13 1999 18.49646 57.48864 24.66667 -14 2000 16.93806 55.76103 23.46082 -15 2001 16.92632 51.81984 25.06522 -16 2002 15.27335 54.88043 22.73750 -17 2003 15.23183 56.16608 24.62500 -18 2004 14.62864 44.48240 23.39130 -19 2005 16.18556 58.84126 22.62387 -~~~~~~~~ ++ no2 = median(no2tmean2, na.rm = TRUE), ++ .groups = "drop") +# A tibble: 19 x 4 + year pm25 o3 no2 + + 1 1987 NaN 63.0 23.5 + 2 1988 NaN 61.7 24.5 + 3 1989 NaN 59.7 26.1 + 4 1990 NaN 52.2 22.6 + 5 1991 NaN 63.1 21.4 + 6 1992 NaN 50.8 24.8 + 7 1993 NaN 44.3 25.8 + 8 1994 NaN 52.2 28.5 + 9 1995 NaN 66.6 27.3 +10 1996 NaN 58.4 26.4 +11 1997 NaN 56.5 25.5 +12 1998 18.3 50.7 24.6 +13 1999 18.5 57.5 24.7 +14 2000 16.9 55.8 23.5 +15 2001 16.9 51.8 25.1 +16 2002 15.3 54.9 22.7 +17 2003 15.2 56.2 24.6 +18 2004 14.6 44.5 23.4 +19 2005 16.2 58.8 22.6 +``` `summarize()` returns a data frame with `year` as the first column, and then the annual averages of `pm25`, `o3`, and `no2`. @@ -468,37 +444,35 @@ In a slightly more complicated example, we might want to know what are the avera First, we can create a categorical variable of `pm25` divided into quintiles. -{line-numbers=off} -~~~~~~~~ +```r > qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE) > chicago <- mutate(chicago, pm25.quint = cut(pm25, qq)) -~~~~~~~~ +``` Now we can group the data frame by the `pm25.quint` variable. -{line-numbers=off} -~~~~~~~~ +```r > quint <- group_by(chicago, pm25.quint) -~~~~~~~~ +``` Finally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`. -{line-numbers=off} -~~~~~~~~ +```r > summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE), -+ no2 = mean(no2tmean2, na.rm = TRUE)) -# A tibble: 6 × 3 - pm25.quint o3 no2 - -1 (1.7,8.7] 21.66401 17.99129 -2 (8.7,12.4] 20.38248 22.13004 -3 (12.4,16.7] 20.66160 24.35708 -4 (16.7,22.6] 19.88122 27.27132 -5 (22.6,61.5] 20.31775 29.64427 -6 NA 18.79044 25.77585 -~~~~~~~~ ++ no2 = mean(no2tmean2, na.rm = TRUE), ++ .groups = "drop") +# A tibble: 6 x 3 + pm25.quint o3 no2 + +1 (1.7,8.7] 21.7 18.0 +2 (8.7,12.4] 20.4 22.1 +3 (12.4,16.7] 20.7 24.4 +4 (16.7,22.6] 19.9 27.3 +5 (22.6,61.5] 20.3 29.6 +6 18.8 25.8 +``` From the table, it seems there isn't a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`. 
More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.
 
 ## `%>%`
 
 The pipeline operator `%>%` is very handy for stringing together multiple `dplyr` functions in a sequence of operations. Notice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence of nested function calls that is difficult to read, i.e.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > third(second(first(x)))
-~~~~~~~~
+```
 
 This nesting is not a natural way to think about a sequence of operations. The `%>%` operator allows you to string operations in a left-to-right fashion, i.e.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > first(x) %>% second %>% third
-~~~~~~~~
+```
 
 Take the example that we just did in the last section where we computed the mean of `o3` and `no2` within quintiles of `pm25`. There we had to
 
 1. create a new variable `pm25.quint`
 2. split the data frame by that new variable
 3. compute the mean of `o3` and `no2` in the sub-groups defined by `pm25.quint`
 
 That can be done with the following sequence in a single R expression.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > mutate(chicago, pm25.quint = cut(pm25, qq)) %>% 
 +         group_by(pm25.quint) %>% 
 +         summarize(o3 = mean(o3tmean2, na.rm = TRUE), 
-+                   no2 = mean(no2tmean2, na.rm = TRUE))
-# A tibble: 6 × 3
-   pm25.quint       o3      no2
-       <fctr>    <dbl>    <dbl>
-1   (1.7,8.7] 21.66401 17.99129
-2  (8.7,12.4] 20.38248 22.13004
-3 (12.4,16.7] 20.66160 24.35708
-4 (16.7,22.6] 19.88122 27.27132
-5 (22.6,61.5] 20.31775 29.64427
-6          NA 18.79044 25.77585
-~~~~~~~~
++                   no2 = mean(no2tmean2, na.rm = TRUE), 
++                   .groups = "drop")
+# A tibble: 6 x 3
+  pm25.quint     o3   no2
+  <fct>       <dbl> <dbl>
+1 (1.7,8.7]    21.7  18.0
+2 (8.7,12.4]   20.4  22.1
+3 (12.4,16.7]  20.7  24.4
+4 (16.7,22.6]  19.9  27.3
+5 (22.6,61.5]  20.3  29.6
+6 <NA>         18.8  25.8
+```
 
 This way we don't have to create a set of temporary variables along the way or create a massive nested sequence of function calls.
 
 Notice in the above code that I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass the first argument to `group_by()` or `summarize()`. Once you travel down the pipeline with `%>%`, the first argument is taken to be the output of the previous element in the pipeline.
 
 Another example might be computing the average pollutant level by month. This could be useful to see if there are any seasonal trends in the data.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 > mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>% 
 +         group_by(month) %>% 
 +         summarize(pm25 = mean(pm25, na.rm = TRUE), 
 +                   o3 = max(o3tmean2, na.rm = TRUE), 
-+                   no2 = median(no2tmean2, na.rm = TRUE))
-# A tibble: 12 × 4
-   month     pm25       o3      no2
-   <dbl>    <dbl>    <dbl>    <dbl>
-1      1 17.76996 28.22222 25.35417
-2      2 20.37513 37.37500 26.78034
-3      3 17.40818 39.05000 26.76984
-4      4 13.85879 47.94907 25.03125
-5      5 14.07420 52.75000 24.22222
-6      6 15.86461 66.58750 25.01140
-7      7 16.57087 59.54167 22.38442
-8      8 16.93380 53.96701 22.98333
-9      9 15.91279 57.48864 24.47917
-10    10 14.23557 47.09275 24.15217
-11    11 15.15794 29.45833 23.56537
-12    12 17.52221 27.70833 24.45773
-~~~~~~~~
++                   no2 = median(no2tmean2, na.rm = TRUE), 
++                   .groups = "drop")
+# A tibble: 12 x 4
+   month  pm25    o3   no2
+   <dbl> <dbl> <dbl> <dbl>
+ 1     1  17.8  28.2  25.4
+ 2     2  20.4  37.4  26.8
+ 3     3  17.4  39.0  26.8
+ 4     4  13.9  47.9  25.0
+ 5     5  14.1  52.8  24.2
+ 6     6  15.9  66.6  25.0
+ 7     7  16.6  59.5  22.4
+ 8     8  16.9  54.0  23.0
+ 9     9  15.9  57.5  24.5
+10    10  14.2  47.1  24.2
+11    11  15.2  29.5  23.6
+12    12  17.5  27.7  24.5
+```
 
 Here we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.
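The same left-to-right style works with any of the verbs. For instance, the earlier filter-and-select example can be rewritten as a short pipeline. A sketch, using the renamed `pm25` column:

```r
## High-pollution, hot days, as a single pipeline
filter(chicago, pm25 > 30 & tmpd > 80) %>%
        select(date, tmpd, pm25)
```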
@@ -585,11 +557,11 @@ Here we can see that `o3` tends to be low in the winter months and high in the s The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`. -Once you learn the `dplyr` grammar there are a few additional benefits +Once you learn the `dplyr` grammar there are a few additional benefits: -* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package +* `dplyr` can work with other data frame "backends", such as SQL databases. There is a SQL interface for relational databases via the DBI package. -* `dplyr` can be integrated with the `data.table` package for large fast tables +* `dplyr` can be integrated with the `data.table` package for large fast tables. -The `dplyr` package is handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time! +The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time! From 0360873864cc25064f16fbba821cbd44a910aa95 Mon Sep 17 00:00:00 2001 From: Richard Date: Fri, 5 Nov 2021 04:49:45 -0400 Subject: [PATCH 10/24] change output --- manuscript/dplyr.md | 8 -------- 1 file changed, 8 deletions(-) diff --git a/manuscript/dplyr.md b/manuscript/dplyr.md index 081fa52..ee38b53 100644 --- a/manuscript/dplyr.md +++ b/manuscript/dplyr.md @@ -77,14 +77,6 @@ After installing the package it is important that you load it into your R sessio ```r > library(dplyr) - -Attaching package: 'dplyr' -The following objects are masked from 'package:stats': - - filter, lag -The following objects are masked from 'package:base': - - intersect, setdiff, setequal, union ``` You may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. For now you can ignore the warnings. From dbddd0ba362892571f1d7da50dbff22819bb7753 Mon Sep 17 00:00:00 2001 From: Richard Date: Fri, 5 Nov 2021 04:50:47 -0400 Subject: [PATCH 11/24] control: if-else example was malformed, spelling, syntax --- manuscript/control.Rmd | 19 ++++---- manuscript/control.md | 104 +++++++++++++++++------------------------ 2 files changed, 52 insertions(+), 71 deletions(-) diff --git a/manuscript/control.Rmd b/manuscript/control.Rmd index b9f98ef..1869326 100644 --- a/manuscript/control.Rmd +++ b/manuscript/control.Rmd @@ -28,7 +28,7 @@ Commonly used control structures are - `next`: skip an interation of a loop Most control structures are not used in interactive sessions, but -rather when writing functions or longer expresisons. However, these +rather when writing functions or longer expressions. However, these constructs do not have to be used in functions and it's a good idea to become familiar with them before we delve into functions. @@ -58,8 +58,7 @@ an `else` clause. ```r if() { ## do something -} -else { +} else { ## do something else } ``` @@ -123,12 +122,12 @@ if() { [Watch a video of this section](https://youtu.be/FbT1dGXCCxU) -For loops are pretty much the only looping construct that you will +`for` loops are pretty much the only looping construct that you will need in R. 
While you may occasionally find a need for other types of
 loops, in my experience doing data analysis, I've found very few
 situations where a for loop wasn't sufficient.
 
-In R, for loops take an interator variable and assign it successive
+In R, for loops take an iterator variable and assign it successive
 values from a sequence or vector. For loops are most commonly used
 for iterating over the elements of an object (list, vector, etc.)
 
@@ -210,7 +209,7 @@ functions (discussed later).
 
 [Watch a video of this section](https://youtu.be/VqrS1Wghq1c)
 
-While loops begin by testing a condition. If it is true, then they
+`while` loops begin by testing a condition. If it is true, then they
 execute the loop body. Once the loop body is executed, the condition
 is tested again, and so forth, until the condition is false, after
 which the loop exits.
@@ -223,7 +222,7 @@ while(count < 10) {
 }
 ```
 
-While loops can potentially result in infinite loops if not written
+`while` loops can potentially result in infinite loops if not written
 properly. Use with care!
 
 Sometimes there will be more than one condition in the test.
@@ -259,7 +258,7 @@ not commonly used in statistical or data analysis applications but
 they do have their uses. The only way to exit a `repeat` loop is to
 call `break`.
 
-One possible paradigm might be in an iterative algorith where you may
+One possible paradigm might be in an iterative algorithm where you may
 be searching for a solution and you don't want to stop until you're
 close enough to the solution. In this kind of situation, you often
 don't know in advance how many iterations it's going to take to get
@@ -322,8 +321,8 @@ for(i in 1:100) {
 
 ## Summary
 
-- Control structures like `if`, `while`, and `for` allow you to
-  control the flow of an R program
+- Control structures, like `if`, `while`, and `for`, allow you to
+  control the flow of an R program.
 
 - Infinite loops should generally be avoided, even if (you believe)
   they are theoretically correct.
 
diff --git a/manuscript/control.md b/manuscript/control.md
index 17609a9..029df02 100644
--- a/manuscript/control.md
+++ b/manuscript/control.md
@@ -26,7 +26,7 @@ Commonly used control structures are
 
 - `next`: skip an iteration of a loop
 
 Most control structures are not used in interactive sessions, but
-rather when writing functions or longer expresisons. However, these
+rather when writing functions or longer expressions. However, these
 constructs do not have to be used in functions and it's a good idea
 to become familiar with them before we delve into functions.
 
@@ -42,33 +42,29 @@ false.
 
 For starters, you can just use the `if` statement.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 if() {
         ## do something
 }
 ## Continue with rest of code
-~~~~~~~~
+```
 
 The above code does nothing if the condition is false. If you
 have an action you want to execute when the condition is false, then
 you need an `else` clause.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 if() {
         ## do something
-}
-else {
+} else {
         ## do something else
 }
-~~~~~~~~
+```
 
 You can have a series of tests by following the initial `if` with any
 number of `else if`s.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 if() {
         ## do something
 } else if() {
         ## do something different
 } else {
         ## do something different
 }
-~~~~~~~~
+```
 
 
 Here is an example of a valid if/else structure.
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
 ## Generate a uniform random number
 x <- runif(1, 0, 10)
 if(x > 3) {
         y <- 10
 } else {
         y <- 0
 }
-~~~~~~~~
+```
 
 The value of `y` is set depending on whether `x > 3` or not.
This expression can also be written a different, but equivalent, way in R.
 
-{line-numbers=off}
-~~~~~~~~
+```r
y <- if(x > 3) {
 10
} else {
 0
}
-~~~~~~~~
+```
 
Neither way of writing this expression is more correct than the
other. Which one you use will depend on your preference and perhaps
those of the team you may be working with.
 
Of course, the `else` clause is not necessary. You could have a series
of if clauses that always get executed if their respective conditions
are true.
 
-{line-numbers=off}
-~~~~~~~~
+```r
if(<condition1>) {
 
}
 
if(<condition2>) {
 
}
-~~~~~~~~
+```
 
 
## `for` Loops
 
[Watch a video of this section](https://youtu.be/FbT1dGXCCxU)
 
-For loops are pretty much the only looping construct that you will
+`for` loops are pretty much the only looping construct that you will
 need in R. While you may occasionally find a need for other types of
 loops, in my experience doing data analysis, I've found very few
 situations where a for loop wasn't sufficient.
 
-In R, for loops take an interator variable and assign it successive
+In R, for loops take an iterator variable and assign it successive
 values from a sequence or vector. For loops are most commonly used for
 iterating over the elements of an object (list, vector, etc.)
 
-{line-numbers=off}
-~~~~~~~~
+```r
> for(i in 1:10) {
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
-~~~~~~~~
+```
 
This loop takes the `i` variable and in each iteration of the loop
gives it values 1, 2, 3, ..., 10, executes the code within the curly
braces, and then the loop exits.
 
The following three loops all have the same behavior.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> x <- c("a", "b", "c", "d")
> 
> for(i in 1:4) {
+ ## Print out each element of 'x'
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-~~~~~~~~
+```
 
The `seq_along()` function is commonly used in conjunction with for
loops in order to generate an integer sequence based on the length of
an object (in this case, the object `x`).
 
-{line-numbers=off}
-~~~~~~~~
+```r
> ## Generate a sequence based on length of 'x'
> for(i in seq_along(x)) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-~~~~~~~~
+```
 
It is not necessary to use an index-type variable.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> for(letter in x) {
+ print(letter)
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-~~~~~~~~
+```
 
For one-line loops, the curly braces are not strictly necessary.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-~~~~~~~~
+```
 
However, I like to use curly braces even for one-line loops, because
that way if you decide to expand the loop to multiple lines, you won't
be burned because you forgot to add curly braces (and you *will* be
burned by this).
 
`for` loops can be nested inside of each other.
 
-{line-numbers=off}
-~~~~~~~~
+```r
x <- matrix(1:6, 2, 3)
 
for(i in seq_len(nrow(x))) {
 for(j in seq_len(ncol(x))) {
 print(x[i, j])
 }
}
-~~~~~~~~
+```
 
Nested loops are commonly needed for multidimensional or hierarchical
data structures (e.g. matrices, lists). Be careful with nesting
though. Nesting beyond 2 to 3 levels often makes it difficult to
read/understand the code. If you find yourself in need of a large
number of nested loops, you may want to break up the loops by using
functions (discussed later).
 
## `while` Loops
 
[Watch a video of this section](https://youtu.be/VqrS1Wghq1c)
 
-While loops begin by testing a condition. If it is true, then they
+`while` loops begin by testing a condition. If it is true, then they
 execute the loop body.
Once the loop body is executed, the condition is tested again, and so forth, until the condition is false, after which the loop exits. -{line-numbers=off} -~~~~~~~~ +```r > count <- 0 > while(count < 10) { + print(count) @@ -276,16 +262,15 @@ which the loop exits. [1] 7 [1] 8 [1] 9 -~~~~~~~~ +``` -While loops can potentially result in infinite loops if not written +`while` loops can potentially result in infinite loops if not written properly. Use with care! Sometimes there will be more than one condition in the test. -{line-numbers=off} -~~~~~~~~ +```r > z <- 5 > set.seed(1) > @@ -300,7 +285,7 @@ Sometimes there will be more than one condition in the test. + } > print(z) [1] 2 -~~~~~~~~ +``` Conditions are always evaluated from left to right. For example, in the above code, if `z` were less than 3, the second test would not @@ -317,15 +302,14 @@ not commonly used in statistical or data analysis applications but they do have their uses. The only way to exit a `repeat` loop is to call `break`. -One possible paradigm might be in an iterative algorith where you may +One possible paradigm might be in an iterative algorithm where you may be searching for a solution and you don't want to stop until you're close enough to the solution. In this kind of situation, you often don't know in advance how many iterations it's going to take to get "close enough" to the solution. -{line-numbers=off} -~~~~~~~~ +```r x0 <- 1 tol <- 1e-8 @@ -338,7 +322,7 @@ repeat { x0 <- x1 } } -~~~~~~~~ +``` Note that the above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this @@ -356,8 +340,7 @@ report whether convergence was achieved or not. `next` is used to skip an iteration of a loop. -{line-numbers=off} -~~~~~~~~ +```r for(i in 1:100) { if(i <= 20) { ## Skip the first 20 iterations @@ -365,14 +348,13 @@ for(i in 1:100) { } ## Do something here } -~~~~~~~~ +``` `break` is used to exit a loop immediately, regardless of what iteration the loop may be on. -{line-numbers=off} -~~~~~~~~ +```r for(i in 1:100) { print(i) @@ -381,13 +363,13 @@ for(i in 1:100) { break } } -~~~~~~~~ +``` ## Summary -- Control structures like `if`, `while`, and `for` allow you to - control the flow of an R program +- Control structures, like `if`, `while`, and `for`, allow you to + control the flow of an R program. - Infinite loops should generally be avoided, even if (you believe) they are theoretically correct. From 3bb3610fe92f1aa6f6c12d76110a0b3751a72c12 Mon Sep 17 00:00:00 2001 From: Richard Date: Fri, 5 Nov 2021 05:22:52 -0400 Subject: [PATCH 12/24] functions: spelling, syntax --- manuscript/functions.Rmd | 22 +++--- manuscript/functions.md | 148 ++++++++++++++++----------------------- 2 files changed, 70 insertions(+), 100 deletions(-) diff --git a/manuscript/functions.Rmd b/manuscript/functions.Rmd index 80da217..0e5e002 100644 --- a/manuscript/functions.Rmd +++ b/manuscript/functions.Rmd @@ -29,7 +29,7 @@ treated much like any other R object. Importantly, - Functions can be passed as arguments to other functions. This is very handy for the various apply functions, like `lapply()` and `sapply()`. - Functions can be nested, so that you can define a function inside of - another function + another function. If you're familiar with common language like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis. @@ -63,7 +63,7 @@ f() The last aspect of a basic function is the *function arguments*. 
These are the options that you can specify to the user that the user may -explicity set. For this basic function, we can add an argument that +explicitly set. For this basic function, we can add an argument that determines how many times "Hello, world!" is printed to the console. ```{r} @@ -75,7 +75,7 @@ f <- function(num) { f(3) ``` -Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times the need to see "Hello, world!". +Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see "Hello, world!". > In general, if you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function. @@ -98,7 +98,7 @@ print(meaningoflife) In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function. -Note that there is a `return()` function that can be used to return an explicity value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). +Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error. @@ -145,7 +145,7 @@ Specifying an argument by its name is sometimes useful if a function has many ar ## Argument Matching -Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handing when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()` +Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handy when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. 
So in the following call to `rnorm()` ```{r} str(rnorm) @@ -301,14 +301,10 @@ paste("a", "b", se = ":") ## Summary -* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object - -* Functions have can be defined with named arguments; these function arguments can have default values - -* Functions arguments can be specified by name or by position in the argument list - -* Functions always return the last expression evaluated in the function body - +* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object. +* Functions can be defined with named arguments; these function arguments can have default values. +* Functions arguments can be specified by name or by position in the argument list. +* Functions always return the last expression evaluated in the function body. * A variable number of arguments can be specified using the special `...` argument in a function definition. diff --git a/manuscript/functions.md b/manuscript/functions.md index 9c12963..072ed1c 100644 --- a/manuscript/functions.md +++ b/manuscript/functions.md @@ -27,7 +27,7 @@ treated much like any other R object. Importantly, - Functions can be passed as arguments to other functions. This is very handy for the various apply functions, like `lapply()` and `sapply()`. - Functions can be nested, so that you can define a function inside of - another function + another function. If you're familiar with common language like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis. @@ -40,8 +40,7 @@ objects of class "function". Here's a simple function that takes no arguments and does nothing. -{line-numbers=off} -~~~~~~~~ +```r > f <- function() { + ## This is an empty function + } @@ -51,29 +50,27 @@ Here's a simple function that takes no arguments and does nothing. > ## Execute this function > f() NULL -~~~~~~~~ +``` Not very interesting, but it's a start. The next thing we can do is create a function that actually has a non-trivial *function body*. -{line-numbers=off} -~~~~~~~~ +```r > f <- function() { + cat("Hello, world!\n") + } > f() Hello, world! -~~~~~~~~ +``` The last aspect of a basic function is the *function arguments*. These are the options that you can specify to the user that the user may -explicity set. For this basic function, we can add an argument that +explicitly set. For this basic function, we can add an argument that determines how many times "Hello, world!" is printed to the console. -{line-numbers=off} -~~~~~~~~ +```r > f <- function(num) { + for(i in seq_len(num)) { + cat("Hello, world!\n") @@ -83,9 +80,9 @@ determines how many times "Hello, world!" is printed to the console. Hello, world! Hello, world! Hello, world! -~~~~~~~~ +``` -Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times the need to see "Hello, world!". +Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see "Hello, world!". 
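To see concretely what writing the function buys you, here is a small sketch (the `greet()` helper is hypothetical, not a function defined in this book): once the repeated code lives in a function, a change to the message is a one-line edit rather than a copy-editing session.

```r
## Hypothetical helper: the message lives in exactly one place
greet <- function(num, msg = "Hello, world!\n") {
        for(i in seq_len(num)) {
                cat(msg)
        }
}
greet(2)                      ## same effect as f(2)
greet(2, "Goodbye, world!\n") ## changing the message needs no cut-and-paste
```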
> In general, if you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function. @@ -94,8 +91,7 @@ Finally, the function above doesn't *return* anything. It just prints "Hello, wo This next function returns the total number of characters printed to the console. -{line-numbers=off} -~~~~~~~~ +```r > f <- function(num) { + hello <- "Hello, world!\n" + for(i in seq_len(num)) { @@ -110,28 +106,26 @@ Hello, world! Hello, world! > print(meaningoflife) [1] 42 -~~~~~~~~ +``` In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function. -Note that there is a `return()` function that can be used to return an explicity value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). +Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter). Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error. -{line-numbers=off} -~~~~~~~~ +```r > f() Error in f(): argument "num" is missing, with no default -~~~~~~~~ +``` We can modify this behavior by setting a *default value* for the argument `num`. Any function argument can have a default value, if you wish to specify it. Sometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. This relieves the user from having to specify the value of that argument every single time the function is called. Here, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print "Hello, world!" to the console once. -{line-numbers=off} -~~~~~~~~ +```r > f <- function(num = 1) { + hello <- "Hello, world!\n" + for(i in seq_len(num)) { @@ -147,7 +141,7 @@ Hello, world! Hello, world! Hello, world! [1] 28 -~~~~~~~~ +``` Remember that the function still returns the number of characters printed to the console. @@ -163,87 +157,80 @@ At this point, we have written a function that Functions have _named arguments_ which can optionally have default values. Because all function arguments have names, they can be specified using their name. -{line-numbers=off} -~~~~~~~~ +```r > f(num = 2) Hello, world! Hello, world! [1] 28 -~~~~~~~~ +``` Specifying an argument by its name is sometimes useful if a function has many arguments and it may not always be clear which argument is being specified. Here, our function only has one argument so there's no confusion. ## Argument Matching -Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handing when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()` +Calling an R function with arguments can be done in a variety of ways. 
This may be confusing at first, but it's really handy when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()` -{line-numbers=off} -~~~~~~~~ +```r > str(rnorm) function (n, mean = 0, sd = 1) > mydata <- rnorm(100, 2, 1) ## Generate some data -~~~~~~~~ +``` 100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching. The following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent. Note that `sd()` has two arguments: `x` indicates the vector of numbers and `na.rm` is a logical indicating whether missing values should be removed or not. -{line-numbers=off} -~~~~~~~~ +```r > ## Positional match first argument, default for 'na.rm' > sd(mydata) -[1] 0.8707092 +[1] 0.873495 > ## Specify 'x' argument by name, default for 'na.rm' > sd(x = mydata) -[1] 0.8707092 +[1] 0.873495 > ## Specify both arguments by name > sd(x = mydata, na.rm = FALSE) -[1] 0.8707092 -~~~~~~~~ +[1] 0.873495 +``` When specifying the function arguments by name, it doesn't matter in what order you specify them. In the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition. -{line-numbers=off} -~~~~~~~~ +```r > ## Specify both arguments by name > sd(na.rm = FALSE, x = mydata) -[1] 0.8707092 -~~~~~~~~ +[1] 0.873495 +``` You can mix positional matching with matching by name. When an argument is matched by name, it is “taken out” of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition. -{line-numbers=off} -~~~~~~~~ +```r > sd(na.rm = FALSE, mydata) -[1] 0.8707092 -~~~~~~~~ +[1] 0.873495 +``` Here, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified. Below is the argument list for the `lm()` function, which fits linear models to a dataset. -{line-numbers=off} -~~~~~~~~ +```r > args(lm) function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) NULL -~~~~~~~~ +``` The following two calls are equivalent. -{line-numbers=off} -~~~~~~~~ +```r lm(data = mydata, y ~ x, model = FALSE, 1:100) lm(y ~ x, mydata, 1:100, model = FALSE) -~~~~~~~~ +``` Even though it’s legal, I don’t recommend messing around with the order of the arguments too much, since it can lead to some confusion. @@ -260,12 +247,11 @@ Partial matching should be avoided when writing longer code or programs, because In addition to not specifying a default value, you can also set an argument value to `NULL`. -{line-numbers=off} -~~~~~~~~ +```r f <- function(a, b = 1, c = 2, d = NULL) { } -~~~~~~~~ +``` You can check to see whether an R object is `NULL` with the `is.null()` function. It is sometimes useful to allow an argument to take the `NULL` value, which might indicate that the function should take some specific action. @@ -277,22 +263,20 @@ Arguments to functions are evaluated _lazily_, so they are evaluated only as nee In this example, the function `f()` has two arguments: `a` and `b`. 
-{line-numbers=off} -~~~~~~~~ +```r > f <- function(a, b) { + a^2 + } > f(2) [1] 4 -~~~~~~~~ +``` This function never actually uses the argument `b`, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`. This behavior can be good or bad. It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error. This example also shows lazy evaluation at work, but does eventually result in an error. -{line-numbers=off} -~~~~~~~~ +```r > f <- function(a, b) { + print(a) + print(b) @@ -300,7 +284,7 @@ This example also shows lazy evaluation at work, but does eventually result in a > f(45) [1] 45 Error in print(b): argument "b" is missing, with no default -~~~~~~~~ +``` Notice that "45" got printed first before the error was triggered. This is because `b` did not have to be evaluated until after `print(a)`. Once the function tried to evaluate `print(b)` the function had to throw an error. @@ -311,40 +295,37 @@ There is a special argument in R known as the `...` argument, which indicate a v For example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = "l"` (the original default was `type = "p"`). -{line-numbers=off} -~~~~~~~~ +```r myplot <- function(x, y, type = "l", ...) { plot(x, y, type = type, ...) ## Pass '...' to 'plot' function } -~~~~~~~~ +``` Generic functions use `...` so that extra arguments can be passed to methods. -{line-numbers=off} -~~~~~~~~ +```r > mean function (x, ...) UseMethod("mean") - + -~~~~~~~~ +``` The `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. This is clear in functions like `paste()` and `cat()`. -{line-numbers=off} -~~~~~~~~ +```r > args(paste) -function (..., sep = " ", collapse = NULL) +function (..., sep = " ", collapse = NULL, recycle0 = FALSE) NULL > args(cat) function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE) NULL -~~~~~~~~ +``` Because both `paste()` and `cat()` print out text to the console by combining multiple character vectors together, it is impossible for those functions to know in advance how many character vectors will be passed to the function by the user. So the first argument to either function is `...`. @@ -355,44 +336,37 @@ One catch with `...` is that any arguments that appear _after_ `...` on the argu Take a look at the arguments to the `paste()` function. -{line-numbers=off} -~~~~~~~~ +```r > args(paste) -function (..., sep = " ", collapse = NULL) +function (..., sep = " ", collapse = NULL, recycle0 = FALSE) NULL -~~~~~~~~ +``` With the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used. Here I specify that I want "a" and "b" to be pasted together and separated by a colon. -{line-numbers=off} -~~~~~~~~ +```r > paste("a", "b", sep = ":") [1] "a:b" -~~~~~~~~ +``` If I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result. 
-{line-numbers=off} -~~~~~~~~ +```r > paste("a", "b", se = ":") [1] "a b :" -~~~~~~~~ +``` ## Summary -* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object - -* Functions have can be defined with named arguments; these function arguments can have default values - -* Functions arguments can be specified by name or by position in the argument list - -* Functions always return the last expression evaluated in the function body - +* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object. +* Functions can be defined with named arguments; these function arguments can have default values. +* Functions arguments can be specified by name or by position in the argument list. +* Functions always return the last expression evaluated in the function body. * A variable number of arguments can be specified using the special `...` argument in a function definition. From e7dac0212293d4d429e80cf23425afca714e4065 Mon Sep 17 00:00:00 2001 From: Richard Date: Fri, 5 Nov 2021 15:01:05 -0400 Subject: [PATCH 13/24] scoping: spelling --- manuscript/scoping.Rmd | 2 +- manuscript/scoping.md | 102 +++++++++++++++++------------------------ 2 files changed, 43 insertions(+), 61 deletions(-) diff --git a/manuscript/scoping.Rmd b/manuscript/scoping.Rmd index db34eac..45852ae 100644 --- a/manuscript/scoping.Rmd +++ b/manuscript/scoping.Rmd @@ -257,7 +257,7 @@ Another nice feature that you can take advantage of is plotting the negative log Here is the function when `mu` is fixed. ```{r nLLFixMu} -## Fix 'mu' to be equalt o 1 +## Fix 'mu' to be equal to 1 nLL <- make.NegLogLik(normals, c(1, FALSE)) x <- seq(1.7, 1.9, len = 100) diff --git a/manuscript/scoping.md b/manuscript/scoping.md index df00ea3..9d61993 100644 --- a/manuscript/scoping.md +++ b/manuscript/scoping.md @@ -11,12 +11,11 @@ How does R know which value to assign to which symbol? When I type -{line-numbers=off} -~~~~~~~~ +```r > lm <- function(x) { x * x } > lm function(x) { x * x } -~~~~~~~~ +``` how does R know what value to assign to the symbol `lm`? Why doesn’t it give it the value of `lm` that is in the `stats` package? @@ -28,13 +27,11 @@ When R tries to bind a value to a symbol, it searches through a series of `envir The search list can be found by using the `search()` function. -{line-numbers=off} -~~~~~~~~ +```r > search() -[1] ".GlobalEnv" "package:knitr" "package:stats" -[4] "package:graphics" "package:grDevices" "package:utils" -[7] "package:datasets" "Autoloads" "package:base" -~~~~~~~~ + [1] ".GlobalEnv" "package:dplyr" "package:readr" "tools:rstudio" "package:stats" "package:graphics" "package:grDevices" + [8] "package:utils" "package:datasets" "package:methods" "Autoloads" "package:base" +``` The _global environment_ or the user’s workspace is always the first element of the search list and the `base` package is always the last. For better or for worse, the order of the packages on the search list matters, particularly if there are multiple objects with the same name in different packages. @@ -55,12 +52,11 @@ symbol Consider the following function. -{line-numbers=off} -~~~~~~~~ +```r > f <- function(x, y) { + x^2 + y / z + } -~~~~~~~~ +``` This function has 2 formal arguments `x` and `y`. In the body of the function there is another symbol `z`. In this case `z` is called a _free variable_. 
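If you want to see which symbols in a function are free variables, one quick sketch (assuming the recommended `codetools` package, which ships with R) uses the `findGlobals()` function:

```r
> library(codetools)
> f <- function(x, y) {
+         x^2 + y / z
+ }
> ## 'z' is reported as a global (free) variable because it is neither
> ## a formal argument nor a local variable of 'f'
> findGlobals(f, merge = FALSE)$variables
[1] "z"
```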
@@ -96,64 +92,59 @@ Typically, a function is defined in the global environment, so that the values o Here is an example of a function that returns another function as its return value. Remember, in R functions are treated like any other object and so this is perfectly valid. -{line-numbers=off} -~~~~~~~~ +```r > make.power <- function(n) { + pow <- function(x) { + x^n + } + pow + } -~~~~~~~~ +``` The `make.power()` function is a kind of "constructor function" that can be used to construct other functions. -{line-numbers=off} -~~~~~~~~ +```r > cube <- make.power(3) > square <- make.power(2) > cube(3) [1] 27 > square(3) [1] 9 -~~~~~~~~ +``` Let's take a look at the `cube()` function's code. -{line-numbers=off} -~~~~~~~~ +```r > cube function(x) { x^n } - -~~~~~~~~ + +``` Notice that `cube()` has a free variable `n`. What is the value of `n` here? Well, its value is taken from the environment where the function was defined. When I defined the `cube()` function it was when I called `make.power(3)`, so the value of `n` at that time was 3. We can explore the environment of a function to see what objects are there and their values. -{line-numbers=off} -~~~~~~~~ +```r > ls(environment(cube)) [1] "n" "pow" > get("n", environment(cube)) [1] 3 -~~~~~~~~ +``` We can also take a look at the `square()` function. -{line-numbers=off} -~~~~~~~~ +```r > ls(environment(square)) [1] "n" "pow" > get("n", environment(square)) [1] 2 -~~~~~~~~ +``` ## Lexical vs. Dynamic Scoping @@ -161,8 +152,7 @@ We can also take a look at the `square()` function. We can use the following example to demonstrate the difference between lexical and dynamic scoping rules. -{line-numbers=off} -~~~~~~~~ +```r > y <- 10 > > f <- function(x) { @@ -173,14 +163,13 @@ We can use the following example to demonstrate the difference between lexical a > g <- function(x) { + x*y + } -~~~~~~~~ +``` What is the value of the following expression? -{line-numbers=off} -~~~~~~~~ +```r f(3) -~~~~~~~~ +``` With lexical scoping the value of `y` in the function `g` is looked up in the environment in which the function was defined, in this case the global environment, so the value of `y` is 10. With dynamic scoping, the value of `y` is looked up in the environment from which the function was _called_ (sometimes referred to as the _calling environment_). In R the calling environment is known as the _parent frame_. In this case, the value of `y` would be 2. @@ -191,8 +180,7 @@ Consider this example. -{line-numbers=off} -~~~~~~~~ +```r > g <- function(x) { + a <- 3 + x+a+y @@ -203,7 +191,7 @@ Error in g(2): object 'y' not found > y <- 3 > g(2) [1] 8 -~~~~~~~~ +``` Here, `y` is defined in the global environment, which also happens to be where the function `g()` is defined. @@ -230,8 +218,7 @@ Optimization routines in R like `optim()`, `nlm()`, and `optimize()` require you Here is an example of a "constructor" function that creates a negative log-likelihood function that can be minimized to find maximum likelihood estimates in a statistical model. -{line-numbers=off} -~~~~~~~~ +```r > make.NegLogLik <- function(data, fixed = c(FALSE, FALSE)) { + params <- fixed + function(p) { @@ -245,15 +232,14 @@ Here is an example of a "constructor" function that creates a negative log-likel + -(a + b) + } + } -~~~~~~~~ +``` **Note**: Optimization functions in R _minimize_ functions, so you need to use the negative log-likelihood. Now we can generate some data and then construct our negative log-likelihood. 
-{line-numbers=off} -~~~~~~~~ +```r > set.seed(1) > normals <- rnorm(100, 1, 2) > nLL <- make.NegLogLik(normals) @@ -268,46 +254,44 @@ function(p) { b <- -0.5*sum((data-mu)^2) / (sigma^2) -(a + b) } - + + > > ## What's in the function environment? > ls(environment(nLL)) [1] "data" "fixed" "params" -~~~~~~~~ +``` Now that we have our `nLL()` function, we can try to minimize it with `optim()` to estimate the parameters. -{line-numbers=off} -~~~~~~~~ +```r > optim(c(mu = 0, sigma = 1), nLL)$par mu sigma 1.218239 1.787343 -~~~~~~~~ +``` You can see that the algorithm converged and obtained an estimate of `mu` and `sigma`. We can also try to estimate one parameter while holding another parameter fixed. Here we fix `sigma` to be equal to 2. -{line-numbers=off} -~~~~~~~~ +```r > nLL <- make.NegLogLik(normals, c(FALSE, 2)) > optimize(nLL, c(-1, 3))$minimum [1] 1.217775 -~~~~~~~~ +``` Because we now have a one-dimensional problem, we can use the simpler `optimize()` function rather than `optim()`. We can also try to estimate `sigma` while holding `mu` fixed at 1. -{line-numbers=off} -~~~~~~~~ +```r > nLL <- make.NegLogLik(normals, c(1, FALSE)) > optimize(nLL, c(1e-6, 10))$minimum [1] 1.800596 -~~~~~~~~ +``` ## Plotting the Likelihood @@ -316,24 +300,22 @@ Another nice feature that you can take advantage of is plotting the negative log Here is the function when `mu` is fixed. -{line-numbers=off} -~~~~~~~~ -> ## Fix 'mu' to be equalt o 1 +```r +> ## Fix 'mu' to be equal to 1 > nLL <- make.NegLogLik(normals, c(1, FALSE)) > x <- seq(1.7, 1.9, len = 100) > > ## Evaluate 'nLL()' at every point in 'x' > y <- sapply(x, nLL) > plot(x, exp(-(y - min(y))), type = "l") -~~~~~~~~ +``` ![plot of chunk nLLFixMu](images/nLLFixMu-1.png) Here is the function when `sigma` is fixed. -{line-numbers=off} -~~~~~~~~ +```r > ## Fix 'sigma' to be equal to 2 > nLL <- make.NegLogLik(normals, c(FALSE, 2)) > x <- seq(0.5, 1.5, len = 100) @@ -341,7 +323,7 @@ Here is the function when `sigma` is fixed. > ## Evaluate 'nLL()' at every point in 'x' > y <- sapply(x, nLL) > plot(x, exp(-(y - min(y))), type = "l") -~~~~~~~~ +``` ![plot of chunk nLLFixSigma](images/nLLFixSigma-1.png) From d931163f91de12aeac1c474d4f6d9e0ba08cab96 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 02:45:33 -0400 Subject: [PATCH 14/24] scoping: spelling, syntax --- manuscript/scoping.Rmd | 11 +++++------ manuscript/scoping.md | 17 ++++++++--------- 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/manuscript/scoping.Rmd b/manuscript/scoping.Rmd index 45852ae..1245744 100644 --- a/manuscript/scoping.Rmd +++ b/manuscript/scoping.Rmd @@ -22,7 +22,7 @@ how does R know what value to assign to the symbol `lm`? Why doesn’t it give i When R tries to bind a value to a symbol, it searches through a series of `environments` to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order in which things occur is roughly 1. Search the global environment (i.e. your workspace) for a symbol name matching the one requested. -2. Search the namespaces of each of the packages on the search list +2. Search the namespaces of each of the packages on the search list. The search list can be found by using the `search()` function. @@ -41,10 +41,9 @@ Note that R has separate namespaces for functions and non-functions so it’s po The scoping rules for R are the main feature that make it different from the original S language (in case you care about that). 
This may seem like an esoteric aspect of R, but it's one of its more interesting and useful features. -The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations +The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations. Related to the scoping rules is how R uses the *search list* to bind a value to a -symbol Consider the following function. @@ -72,7 +71,7 @@ A function, together with an environment, makes up what is called a _closure_ or How do we associate a value to a free variable? There is a search process that occurs that goes as follows: - If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the _parent environment_. -- The search continues down the sequence of parent environments until we hit the _top-level environment_; this usually the global environment (workspace) or the namespace of a package. +- The search continues down the sequence of parent environments until we hit the _top-level environment_; this is usually the global environment (workspace) or the namespace of a package. - After the top-level environment, the search continues down the search list until we hit the _empty environment_. If a value for a given symbol cannot be found once the empty environment is arrived at, then an error is thrown. @@ -281,7 +280,7 @@ plot(x, exp(-(y - min(y))), type = "l") ## Summary -- Objective functions can be "built" which contain all of the necessary data for evaluating the function +- Objective functions can be "built" which contain all of the necessary data for evaluating the function. - No need to carry around long argument lists — useful for interactive and exploratory work. -- Code can be simplified and cleaned up +- Code can be simplified and cleaned up. - Reference: Robert Gentleman and Ross Ihaka (2000). "Lexical Scope and Statistical Computing," _JCGS_, 9, 491–508. diff --git a/manuscript/scoping.md b/manuscript/scoping.md index 9d61993..9eb2691 100644 --- a/manuscript/scoping.md +++ b/manuscript/scoping.md @@ -22,7 +22,7 @@ how does R know what value to assign to the symbol `lm`? Why doesn’t it give i When R tries to bind a value to a symbol, it searches through a series of `environments` to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order in which things occur is roughly 1. Search the global environment (i.e. your workspace) for a symbol name matching the one requested. -2. Search the namespaces of each of the packages on the search list +2. Search the namespaces of each of the packages on the search list. The search list can be found by using the `search()` function. 
@@ -44,10 +44,9 @@ Note that R has separate namespaces for functions and non-functions so it’s po The scoping rules for R are the main feature that make it different from the original S language (in case you care about that). This may seem like an esoteric aspect of R, but it's one of its more interesting and useful features. -The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations +The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations. Related to the scoping rules is how R uses the *search list* to bind a value to a -symbol Consider the following function. @@ -76,7 +75,7 @@ A function, together with an environment, makes up what is called a _closure_ or How do we associate a value to a free variable? There is a search process that occurs that goes as follows: - If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the _parent environment_. -- The search continues down the sequence of parent environments until we hit the _top-level environment_; this usually the global environment (workspace) or the namespace of a package. +- The search continues down the sequence of parent environments until we hit the _top-level environment_; this is usually the global environment (workspace) or the namespace of a package. - After the top-level environment, the search continues down the search list until we hit the _empty environment_. If a value for a given symbol cannot be found once the empty environment is arrived at, then an error is thrown. @@ -121,7 +120,7 @@ Let's take a look at the `cube()` function's code. function(x) { x^n } - + ``` Notice that `cube()` has a free variable `n`. What is the value of `n` here? Well, its value is taken from the environment where the function was defined. When I defined the `cube()` function it was when I called `make.power(3)`, so the value of `n` at that time was 3. @@ -254,8 +253,8 @@ function(p) { b <- -0.5*sum((data-mu)^2) / (sigma^2) -(a + b) } - - + + > > ## What's in the function environment? > ls(environment(nLL)) @@ -330,7 +329,7 @@ Here is the function when `sigma` is fixed. ## Summary -- Objective functions can be "built" which contain all of the necessary data for evaluating the function +- Objective functions can be "built" which contain all of the necessary data for evaluating the function. - No need to carry around long argument lists — useful for interactive and exploratory work. -- Code can be simplified and cleaned up +- Code can be simplified and cleaned up. - Reference: Robert Gentleman and Ross Ihaka (2000). "Lexical Scope and Statistical Computing," _JCGS_, 9, 491–508. 
From 6b2618387f661a6ec88736da2419ee3d2359745e Mon Sep 17 00:00:00 2001
From: Richard
Date: Sat, 6 Nov 2021 16:35:13 -0400
Subject: [PATCH 15/24] apply: spelling, syntax

---
 manuscript/apply.Rmd |  16 +--
 manuscript/apply.md  | 278 +++++++++++++++++--------------------------
 2 files changed, 119 insertions(+), 175 deletions(-)

diff --git a/manuscript/apply.Rmd b/manuscript/apply.Rmd
index c4f8878..90c63f8 100644
--- a/manuscript/apply.Rmd
+++ b/manuscript/apply.Rmd
@@ -47,7 +47,7 @@ Note that the actual looping is done internally in C code for efficiency reasons
 
 It's important to remember that `lapply()` always returns a list, regardless of the class of the input.
 
-Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, the the names will be preserved in the output.
+Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output.
 
 
 ```{r}
@@ -86,7 +86,7 @@ x <- 1:4
 lapply(x, runif, min = 0, max = 10)
 ```
 
-So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10.
+So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.
 
 The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.
 
@@ -165,7 +165,7 @@ where
 
 - `f` is a factor (or coerced to one) or a list of factors
 - `drop` indicates whether empty factor levels should be dropped
 
-The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts.
+The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts.
 
 Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable.
 
@@ -413,13 +413,13 @@ With `mapply()`, instead we can do
 
 This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument.
 
-Here's another example for simulating randon Normal variables.
+Here's another example for simulating random Normal variables.
 
 ```{r}
 noise <- function(n, mean, sd) {
 rnorm(n, mean, sd)
 }
-## Simulate 5 randon numbers
+## Simulate 5 random numbers
 noise(5, 1, 2)
 
 ## This only simulates 1 set of numbers, not 5
@@ -484,9 +484,9 @@ Pretty cool, right?
* The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form -* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results. +* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results. -* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere +* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere. -* The `split()` function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions. +* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions. diff --git a/manuscript/apply.md b/manuscript/apply.md index 60d71cf..ced9e33 100644 --- a/manuscript/apply.md +++ b/manuscript/apply.md @@ -37,8 +37,7 @@ This function takes three arguments: (1) a list `X`; (2) a function (or the name The body of the `lapply()` function can be seen here. -{line-numbers=off} -~~~~~~~~ +```r > lapply function (X, FUN, ...) { @@ -47,20 +46,19 @@ function (X, FUN, ...) X <- as.list(X) .Internal(lapply(X, FUN)) } - + -~~~~~~~~ +``` Note that the actual looping is done internally in C code for efficiency reasons. It's important to remember that `lapply()` always returns a list, regardless of the class of the input. -Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, the the names will be preserved in the output. +Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output. -{line-numbers=off} -~~~~~~~~ +```r > x <- list(a = 1:5, b = rnorm(10)) > lapply(x, mean) $a @@ -68,7 +66,7 @@ $a $b [1] 0.1322028 -~~~~~~~~ +``` Notice that here we are passing the `mean()` function as an argument to the `lapply()` function. Functions in R can be used this way and can be passed back and forth as arguments just like any other object. When you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are *calling* a function. @@ -76,8 +74,7 @@ Here is another example of using `lapply()`. -{line-numbers=off} -~~~~~~~~ +```r > x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5)) > lapply(x, mean) $a @@ -91,14 +88,13 @@ $c $d [1] 5.051388 -~~~~~~~~ +``` You can use `lapply()` to evaluate a function multiple times each with a different argument. Below, is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers. 
-{line-numbers=off}
-~~~~~~~~
+```r
> x <- 1:4
> lapply(x, runif)
[[1]]
[1] 0.02778712

[[2]]
[1] 0.5273108 0.8803191

[[3]]
[1] 0.37306337 0.04795913 0.13862825

[[4]]
[1] 0.3214921 0.1548316 0.1322282 0.2213059
-~~~~~~~~
+```

When you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you are applying. In the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.

Functions that you pass to `lapply()` may have other arguments. For example, the `runif()` function has a `min` and `max` argument too.

Here is where the `...` argument to `lapply()` comes into play. Any arguments that you place in the `...` argument will get passed down to the function being applied to the elements of the list.

Here, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.

-{line-numbers=off}
-~~~~~~~~
+```r
> x <- 1:4
> lapply(x, runif, min = 0, max = 10)
[[1]]
[1] 2.263808

[[2]]
[1] 1.314165 9.815635

[[3]]
[1] 3.270137 5.069395 6.814425

[[4]]
[1] 0.9916910 1.1890256 0.5043966 9.2925392
-~~~~~~~~
+```

-So now, instead of the random numbers being between 0 and 1 (the default), the are all between 0 and 10.
+So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.

The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.

Here I am creating a list that contains two matrices.

-{line-numbers=off}
-~~~~~~~~
+```r
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2)) 
> x
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
-~~~~~~~~
+```

Suppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.

-{line-numbers=off}
-~~~~~~~~
+```r
> lapply(x, function(elt) { elt[,1] })
$a
[1] 1 2

$b
[1] 1 2 3
-~~~~~~~~
+```

Notice that I put the `function()` definition right in the call to `lapply()`. This is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.

For example, I could have done the following.

-{line-numbers=off}
-~~~~~~~~
+```r
> f <- function(elt) {
+ elt[, 1]
+ }
> lapply(x, f)
$a
[1] 1 2

$b
[1] 1 2 3
-~~~~~~~~
+```

Now the function is no longer anonymous; its name is `f`. Whether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you're going to need a lot in other parts of your code, you might want to define it separately. But if you're just going to use it for this call to `lapply()`, then it's probably simpler to use an anonymous function.


## `sapply()`

The `sapply()` function behaves similarly to `lapply()`; the only real difference is in the return value. `sapply()` will try to simplify the result of `lapply()` if possible. Essentially, `sapply()` calls `lapply()` on its input and then applies the following algorithm:

- If the result is a list where every element is length 1, then a vector is returned

- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.

- If it can't figure things out, a list is returned

Here's the result of calling `lapply()`.

-{line-numbers=off}
-~~~~~~~~
+```r
> x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)
$a
[1] 2.5

$b
[1] -0.251483

$c
[1] 1.481246

$d
[1] 4.968715
-~~~~~~~~
+```

Notice that `lapply()` returns a list (as usual), but that each element of the list has length 1.

Here's the result of calling `sapply()` on the same list.
-{line-numbers=off} -~~~~~~~~ +```r > sapply(x, mean) a b c d 2.500000 -0.251483 1.481246 4.968715 -~~~~~~~~ +``` Because the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list. @@ -253,11 +243,10 @@ The `split()` function takes a vector or other objects and splits it into groups The arguments to `split()` are -{line-numbers=off} -~~~~~~~~ +```r > str(split) function (x, f, drop = FALSE, ...) -~~~~~~~~ +``` where @@ -265,34 +254,29 @@ where - `f` is a factor (or coerced to one) or a list of factors - `drop` indicates whether empty factors levels should be dropped -The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. +The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable. -{line-numbers=off} -~~~~~~~~ +```r > x <- c(rnorm(10), runif(10), rnorm(10, 1)) > f <- gl(3, 10) > split(x, f) $`1` - [1] 0.3981302 -0.4075286 1.3242586 -0.7012317 -0.5806143 -1.0010722 - [7] -0.6681786 0.9451850 0.4337021 1.0051592 + [1] 0.3981302 -0.4075286 1.3242586 -0.7012317 -0.5806143 -1.0010722 -0.6681786 0.9451850 0.4337021 1.0051592 $`2` - [1] 0.34822440 0.94893818 0.64667919 0.03527777 0.59644846 0.41531800 - [7] 0.07689704 0.52804888 0.96233331 0.70874005 + [1] 0.34822440 0.94893818 0.64667919 0.03527777 0.59644846 0.41531800 0.07689704 0.52804888 0.96233331 0.70874005 $`3` - [1] 1.13444766 1.76559900 1.95513668 0.94943430 0.69418458 - [6] 1.89367370 -0.04729815 2.97133739 0.61636789 2.65414530 -~~~~~~~~ + [1] 1.13444766 1.76559900 1.95513668 0.94943430 0.69418458 1.89367370 -0.04729815 2.97133739 0.61636789 2.65414530 +``` A common idiom is `split` followed by an `lapply`. -{line-numbers=off} -~~~~~~~~ +```r > lapply(split(x, f), mean) $`1` [1] 0.07478098 @@ -302,13 +286,12 @@ $`2` $`3` [1] 1.458703 -~~~~~~~~ +``` ## Splitting a Data Frame -{line-numbers=off} -~~~~~~~~ +```r > library(datasets) > head(airquality) Ozone Solar.R Wind Temp Month Day @@ -318,14 +301,13 @@ $`3` 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 -~~~~~~~~ +``` We can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month. -{line-numbers=off} -~~~~~~~~ +```r > s <- split(airquality, airquality$Month) > str(s) List of 5 @@ -364,13 +346,12 @@ List of 5 ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ... ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ... ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ... -~~~~~~~~ +``` Then we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame. 
-{line-numbers=off} -~~~~~~~~ +```r > lapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")]) + }) @@ -393,13 +374,12 @@ $`8` $`9` Ozone Solar.R Wind NA 167.4333 10.1800 -~~~~~~~~ +``` Using `sapply()` might be better here for a more readable output. -{line-numbers=off} -~~~~~~~~ +```r > sapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")]) + }) @@ -407,13 +387,12 @@ Using `sapply()` might be better here for a more readable output. Ozone NA NA NA NA NA Solar.R NA 190.16667 216.483871 NA 167.4333 Wind 11.62258 10.26667 8.941935 8.793548 10.1800 -~~~~~~~~ +``` Unfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean. -{line-numbers=off} -~~~~~~~~ +```r > sapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")], + na.rm = TRUE) @@ -422,13 +401,12 @@ Unfortunately, there are `NA`s in the data so we cannot simply take the means of Ozone 23.61538 29.44444 59.115385 59.961538 31.44828 Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 Wind 11.62258 10.26667 8.941935 8.793548 10.18000 -~~~~~~~~ +``` Occasionally, we may want to split an R object according to levels defined in more than one variable. We can do this by creating an interaction of the variables with the `interaction()` function. -{line-numbers=off} -~~~~~~~~ +```r > x <- rnorm(10) > f1 <- gl(2, 5) > f2 <- gl(5, 2) @@ -442,13 +420,12 @@ Levels: 1 2 3 4 5 > interaction(f1, f2) [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5 Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5 -~~~~~~~~ +``` With multiple factors and many levels, creating an interaction can result in many levels that are empty. -{line-numbers=off} -~~~~~~~~ +```r > str(split(x, list(f1, f2))) List of 10 $ 1.1: num [1:2] 1.512 0.083 @@ -461,13 +438,12 @@ List of 10 $ 2.4: num [1:2] 0.0991 -0.4541 $ 1.5: num(0) $ 2.5: num [1:2] -0.6558 -0.0359 -~~~~~~~~ +``` Notice that there are 4 categories with no data. But we can drop empty levels when we call the `split()` function. -{line-numbers=off} -~~~~~~~~ +```r > str(split(x, list(f1, f2), drop = TRUE)) List of 6 $ 1.1: num [1:2] 1.512 0.083 @@ -476,7 +452,7 @@ List of 6 $ 2.3: num 1.04 $ 2.4: num [1:2] 0.0991 -0.4541 $ 2.5: num [1:2] -0.6558 -0.0359 -~~~~~~~~ +``` ## tapply @@ -486,11 +462,10 @@ List of 6 `tapply()` is used to apply a function over subsets of a vector. It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the "t" in `tapply()` refers to "table", but that is unconfirmed. -{line-numbers=off} -~~~~~~~~ +```r > str(tapply) -function (X, INDEX, FUN = NULL, ..., simplify = TRUE) -~~~~~~~~ +function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) +``` The arguments to `tapply()` are as follows: @@ -503,8 +478,7 @@ The arguments to `tapply()` are as follows: Given a vector of numbers, one simple operation is to take group means. -{line-numbers=off} -~~~~~~~~ +```r > ## Simulate some data > x <- c(rnorm(10), runif(10), rnorm(10, 1)) > ## Define some groups with a factor variable @@ -515,13 +489,12 @@ Levels: 1 2 3 > tapply(x, f, mean) 1 2 3 0.1896235 0.5336667 0.9568236 -~~~~~~~~ +``` We can also take the group means without simplifying the result, which will give us a list. For functions that return a single value, usually, this is not what we want, but it can be done. 
-{line-numbers=off} -~~~~~~~~ +```r > tapply(x, f, mean, simplify = FALSE) $`1` [1] 0.1896235 @@ -531,14 +504,13 @@ $`2` $`3` [1] 0.9568236 -~~~~~~~~ +``` We can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the range of each sub-group. -{line-numbers=off} -~~~~~~~~ +```r > tapply(x, f, range) $`1` [1] -1.869789 1.497041 @@ -548,7 +520,7 @@ $`2` $`3` [1] -0.5690822 2.3644349 -~~~~~~~~ +``` ## `apply()` @@ -559,11 +531,10 @@ The `apply()` function is used to a evaluate a function (often an anonymous one) -{line-numbers=off} -~~~~~~~~ +```r > str(apply) -function (X, MARGIN, FUN, ...) -~~~~~~~~ +function (X, MARGIN, FUN, ..., simplify = TRUE) +``` The arguments to `apply()` are @@ -576,25 +547,20 @@ The arguments to `apply()` are Here I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column. -{line-numbers=off} -~~~~~~~~ +```r > x <- matrix(rnorm(200), 20, 10) > apply(x, 2, mean) ## Take the mean of each column - [1] 0.02218266 -0.15932850 0.09021391 0.14723035 -0.22431309 - [6] -0.49657847 0.30095015 0.07703985 -0.20818099 0.06809774 -~~~~~~~~ + [1] 0.02218266 -0.15932850 0.09021391 0.14723035 -0.22431309 -0.49657847 0.30095015 0.07703985 -0.20818099 0.06809774 +``` I can also compute the sum of each row. -{line-numbers=off} -~~~~~~~~ +```r > apply(x, 1, sum) ## Take the mean of each row - [1] -0.48483448 5.33222301 -3.33862932 -1.39998450 2.37859098 - [6] 0.01082604 -6.29457190 -0.26287700 0.71133578 -3.38125293 -[11] -4.67522818 3.01900232 -2.39466347 -2.16004389 5.33063755 -[16] -2.92024635 3.52026401 -1.84880901 -4.10213912 5.30667310 -~~~~~~~~ + [1] -0.48483448 5.33222301 -3.33862932 -1.39998450 2.37859098 0.01082604 -6.29457190 -0.26287700 0.71133578 -3.38125293 -4.67522818 3.01900232 +[13] -2.39466347 -2.16004389 5.33063755 -2.92024635 3.52026401 -1.84880901 -4.10213912 5.30667310 +``` Note that in both calls to `apply()`, the return value was a vector of numbers. @@ -603,18 +569,16 @@ You've probably noticed that the second argument is either a 1 or a 2, depending The `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain. So when taking the mean of each column, I specify -{line-numbers=off} -~~~~~~~~ +```r > apply(x, 2, mean) -~~~~~~~~ +``` because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run -{line-numbers=off} -~~~~~~~~ +```r > apply(x, 1, mean) -~~~~~~~~ +``` because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension). @@ -635,51 +599,42 @@ The shortcut functions are heavily optimized and hence are _much_ faster, but yo You can do more than take sums and means with the `apply()` function. For example, you can compute quantiles of the rows of a matrix using the `quantile()` function. 
-{line-numbers=off} -~~~~~~~~ +```r > x <- matrix(rnorm(200), 20, 10) > ## Get row quantiles > apply(x, 1, quantile, probs = c(0.25, 0.75)) - [,1] [,2] [,3] [,4] [,5] [,6] -25% -1.0884151 -0.6693040 0.2908481 -0.4602083 -1.0432010 -1.12773555 -75% 0.1843547 0.8210295 1.3667301 0.4424153 0.3571219 0.03653687 - [,7] [,8] [,9] [,10] [,11] [,12] -25% -1.4571706 -0.2406991 -0.3226845 -0.329898 -0.8677524 -0.2023664 -75% -0.1705336 0.6504486 1.1460854 1.247092 0.4138139 0.9145331 - [,13] [,14] [,15] [,16] [,17] [,18] -25% -0.9796050 -1.3551031 -0.1823252 -1.260911898 -0.9954289 -0.3767354 -75% 0.5448777 -0.5396766 0.7795571 0.002908451 0.4323192 0.7542638 - [,19] [,20] -25% -0.8557544 -0.7000363 -75% 0.5440158 0.5432995 -~~~~~~~~ + [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] +25% -1.0884151 -0.6693040 0.2908481 -0.4602083 -1.0432010 -1.12773555 -1.4571706 -0.2406991 -0.3226845 -0.329898 -0.8677524 -0.2023664 -0.9796050 +75% 0.1843547 0.8210295 1.3667301 0.4424153 0.3571219 0.03653687 -0.1705336 0.6504486 1.1460854 1.247092 0.4138139 0.9145331 0.5448777 + [,14] [,15] [,16] [,17] [,18] [,19] [,20] +25% -1.3551031 -0.1823252 -1.260911898 -0.9954289 -0.3767354 -0.8557544 -0.7000363 +75% -0.5396766 0.7795571 0.002908451 0.4323192 0.7542638 0.5440158 0.5432995 +``` Notice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`. -For a higher dimensional example, I can create an array of {$$}2\times2{/$$} matrices and the compute the average of the matrices in the array. +For a higher dimensional example, I can create an array of $2\times2$ matrices and the compute the average of the matrices in the array. -{line-numbers=off} -~~~~~~~~ +```r > a <- array(rnorm(2 * 2 * 10), c(2, 2, 10)) > apply(a, c(1, 2), mean) [,1] [,2] [1,] 0.1681387 -0.1039673 [2,] 0.3519741 -0.4029737 -~~~~~~~~ +``` In the call to `apply()` here, I indicated via the `MARGIN` argument that I wanted to preserve the first and second dimensions and to collapse the third dimension by taking the mean. There is a faster way to do this specific operation via the `colMeans()` function. -{line-numbers=off} -~~~~~~~~ +```r > rowMeans(a, dims = 2) ## Faster [,1] [,2] [1,] 0.1681387 -0.1039673 [2,] 0.3519741 -0.4029737 -~~~~~~~~ +``` In this situation, I might argue that the use of `rowMeans()` is less readable, but it is substantially faster with large arrays. @@ -691,11 +646,10 @@ In this situation, I might argue that the use of `rowMeans()` is less readable, The `mapply()` function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that `lapply()` and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what `mapply()` is for. -{line-numbers=off} -~~~~~~~~ +```r > str(mapply) function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE) -~~~~~~~~ +``` The arguments to `mapply()` are @@ -713,8 +667,7 @@ For example, the following is tedious to type With `mapply()`, instead we can do -{line-numbers=off} -~~~~~~~~ +```r > mapply(rep, 1:4, 4:1) [[1]] [1] 1 1 1 1 @@ -727,34 +680,32 @@ With `mapply()`, instead we can do [[4]] [1] 4 -~~~~~~~~ +``` This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument. -Here's another example for simulating randon Normal variables. +Here's another example for simulating random Normal variables. 
-{line-numbers=off} -~~~~~~~~ +```r > noise <- function(n, mean, sd) { + rnorm(n, mean, sd) + } -> ## Simulate 5 randon numbers +> ## Simulate 5 random numbers > noise(5, 1, 2) [1] -0.5196913 3.2979182 -0.6849525 1.7828267 2.7827545 > > ## This only simulates 1 set of numbers, not 5 > noise(1:5, 1:5, 2) [1] -1.670517 2.796247 2.776826 5.351488 3.422804 -~~~~~~~~ +``` Here we can use `mapply()` to pass the sequence `1:5` separately to the `noise()` function so that we can get 5 sets of random numbers, each with a different length and mean. -{line-numbers=off} -~~~~~~~~ +```r > mapply(noise, 1:5, 1:5, 2) [[1]] [1] 0.8260273 @@ -770,13 +721,12 @@ Here we can use `mapply()` to pass the sequence `1:5` separately to the `noise() [[5]] [1] 2.826182 1.347834 6.990564 4.976276 3.800743 -~~~~~~~~ +``` The above call to `mapply()` is the same as -{line-numbers=off} -~~~~~~~~ +```r > list(noise(1, 1, 2), noise(2, 2, 2), + noise(3, 3, 2), noise(4, 4, 2), + noise(5, 5, 2)) @@ -794,56 +744,50 @@ The above call to `mapply()` is the same as [[5]] [1] 8.959267 6.593589 1.581448 1.672663 5.982219 -~~~~~~~~ +``` ## Vectorizing a Function The `mapply()` function can be use to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions. -Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is {$$}\sum_{i=1}^n(x_i-\mu)^2/\sigma^2{/$$}. +Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\sum_{i=1}^n(x_i-\mu)^2/\sigma^2$. -{line-numbers=off} -~~~~~~~~ +```r > sumsq <- function(mu, sigma, x) { + sum(((x - mu) / sigma)^2) + } -~~~~~~~~ +``` This function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`. In many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`. However, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized. -{line-numbers=off} -~~~~~~~~ +```r > x <- rnorm(100) ## Generate some data > sumsq(1:10, 1:10, x) ## This is not what we want [1] 110.2594 -~~~~~~~~ +``` Note that the call to `sumsq()` only produced one value instead of 10 values. However, we can do what we want to do by using `mapply()`. -{line-numbers=off} -~~~~~~~~ +```r > mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x)) - [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 - [8] 100.3745 100.1685 100.0332 -~~~~~~~~ + [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 100.3745 100.1685 100.0332 +``` There's even a function in R called `Vectorize()` that automatically can create a vectorized version of your function. So we could create a `vsumsq()` function that is fully vectorized as follows. -{line-numbers=off} -~~~~~~~~ +```r > vsumsq <- Vectorize(sumsq, c("mu", "sigma")) > vsumsq(1:10, 1:10, x) - [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 - [8] 100.3745 100.1685 100.0332 -~~~~~~~~ + [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 100.3745 100.1685 100.0332 +``` Pretty cool, right? @@ -852,9 +796,9 @@ Pretty cool, right? 
* The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form -* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results. +* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results. -* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere +* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere. -* The `split()` function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions. +* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions. From e8f5c5144425a66f44735469745bae04b5527409 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 17:34:29 -0400 Subject: [PATCH 16/24] regex: markdown issues, spelling, syntax --- manuscript/regex.Rmd | 40 ++++--- manuscript/regex.md | 250 ++++++++++++++++++------------------------- 2 files changed, 121 insertions(+), 169 deletions(-) diff --git a/manuscript/regex.Rmd b/manuscript/regex.Rmd index 33b0611..e1ece96 100644 --- a/manuscript/regex.Rmd +++ b/manuscript/regex.Rmd @@ -21,15 +21,15 @@ If you want a very quick introduction to the general notion of regular expressio The primary R functions for dealing with regular expressions are -- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match +- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match. -- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match. -- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string +- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string. - `regexec()`: This function searches a character vector for a regular expression, much like `regexpr()`, but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration. -For this chapter, we will use a running example using data from homicides in Baltimore City. 
The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the web site/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the web site are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. +For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the website/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the website are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. Here is an excerpt of the Baltimore City homicides dataset: @@ -42,14 +42,14 @@ homicides[1] homicides[1000] ``` -The data set is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the web site. +The dataset is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the website. A few interesting features stand out: We have the latitude and longitude of where the victim was found; then there's the street address; the age, race, and gender of the victim; the date on which the victim was found; in which hospital the victim ultimately died; the cause of death. ## `grep()` Suppose we wanted to identify the records for all the victims of shootings (as opposed -to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to idenfity all of the shootings. +to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to identify all of the shootings. Here I use `grep()` to match the literal `iconHomicideShooting` into the character vector of homicides. @@ -58,14 +58,14 @@ g <- grep("iconHomicideShooting", homicides) length(g) ``` -Using this approach I get `r length(g)` shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for web site maintainers to change the names of files or update files. 
What happens if we now `grep()` on both icon names using the `|` operator? +Using this approach I get `r length(g)` shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for website maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator? ```{r} g <- grep("iconHomicideShooting|icon_homicide_shooting", homicides) length(g) ``` -Now we have `r length(g)` shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. +Now we have `r scales::comma(length(g))` shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. Another possible way to do this is to `grep()` on the cause of death field, which seems to have the format `Cause: shooting`. We can `grep()` on this literally and get @@ -74,21 +74,21 @@ g <- grep("Cause: shooting", homicides) length(g) ``` -Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a captial "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. +Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. ```{r} g <- grep("Cause: [Ss]hooting", homicides) length(g) ``` -One thing you have to be careful of when processing text data is not not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. +One thing you have to be careful of when processing text data is to not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. ```{r} g <- grep("[Ss]hooting", homicides) length(g) ``` -Notice that we see to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions. +Notice that we seem to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions. First we can get the indices for the first expresssion match. @@ -147,11 +147,9 @@ state.name[g] Here, we can see that `grepl()` returns a logical vector that can be used to subset the original `state.name` vector. - - ## `regexpr()` -Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or what the match is for a more complicated regular expression. +Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or if the match is for a more complicated regular expression. The `regexpr()` function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. `regexpr()` only gives you the *first* match of the string (reading left to right). `gregexpr()` will give you *all* of the matches in a given string if there are is more than one match. 
@@ -221,7 +219,7 @@ Notice that the `sub()` function found the first match (at the beginning of the gsub("
[F|f]ound on |
", "", x) ``` -The `sub() and `gsub()` functions can take vector arguments so we don't have to process each string one by one. +The `sub()` and `gsub()` functions can take vector arguments so we don't have to process each string one by one. ```{r} r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) @@ -257,7 +255,7 @@ By contrast, if we only use the `regexpr()` function, we get regexec("
[F|f]ound on .*?
", homicides[1]) ``` -We can use the `substr()` function to demonstrate which parts of a strings are matched by the `regexec()` function. +We can use the `substr()` function to demonstrate which parts of the strings are matched by the `regexec()` function. Here's the output for `regexec()`. @@ -310,7 +308,7 @@ We can see from the picture that homicides do not occur uniformly throughout the ## The `stringr` Package -The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps they underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. +The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps the underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. Given what we have discussed so far, there is a fairly straightforward mapping from the base R functions to the `stringr` functions. In general, for the `stringr` functions, the data are the first argument and the regular expression is the second argument, with optional arguments afterwards. @@ -344,14 +342,14 @@ Note how the second column of the output contains the values of the parenthesize The primary R functions for dealing with regular expressions are - `grep()`, `grepl()`: Search for matches of a regular expression/pattern in a - character vector + character vector. -- `regexpr()`, `gregexpr(): Search a character vector for regular expression matches and +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction - with `regmatches()` + with `regmatches()`. - `sub()`, `gsub()`: Search a character vector for regular expression matches and - replace that match with another string + replace that match with another string. - `regexec()`: Gives you indices of parethensized sub-expressions. diff --git a/manuscript/regex.md b/manuscript/regex.md index 51297d9..6303cae 100644 --- a/manuscript/regex.md +++ b/manuscript/regex.md @@ -18,21 +18,20 @@ If you want a very quick introduction to the general notion of regular expressio The primary R functions for dealing with regular expressions are -- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match +- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match. 
-- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match. -- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string +- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string. - `regexec()`: This function searches a character vector for a regular expression, much like `regexpr()`, but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration. -For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the web site/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the web site are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. +For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publically available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the website/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the website are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013. Here is an excerpt of the Baltimore City homicides dataset: -{line-numbers=off} -~~~~~~~~ +```r > homicides <- readLines("homicides.txt") > > ## Total number of events recorded @@ -42,110 +41,101 @@ Here is an excerpt of the Baltimore City homicides dataset: [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '
Leon Nelson
3400 Clifton Ave.
Baltimore, MD 21216
black male, 17 years old
Found on January 1, 2007
Victim died at Shock Trauma
Cause: shooting
'" > homicides[1000] [1] "39.33626300000, -76.55553990000, icon_homicide_shooting, 'p1200', '
Davon Diggs
4100 Parkwood Ave
Baltimore, MD 21206
Race: Black
Gender: male
Age: 21 years old
Found on November 5, 2011
Victim died at Johns Hopkins Bayview Medical Center
Cause: Shooting

Originally reported in 5000 Belair Road; later determined to be rear alley of 4100 block Parkwood

'" -~~~~~~~~ +``` -The data set is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the web site. +The dataset is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the website. A few interesting features stand out: We have the latitude and longitude of where the victim was found; then there's the street address; the age, race, and gender of the victim; the date on which the victim was found; in which hospital the victim ultimately died; the cause of death. ## `grep()` Suppose we wanted to identify the records for all the victims of shootings (as opposed -to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to idenfity all of the shootings. +to other causes)? How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with a `iconHomicideShooting` label. Perhaps we can use this aspect of the data to identify all of the shootings. Here I use `grep()` to match the literal `iconHomicideShooting` into the character vector of homicides. -{line-numbers=off} -~~~~~~~~ +```r > g <- grep("iconHomicideShooting", homicides) > length(g) [1] 228 -~~~~~~~~ +``` -Using this approach I get 228 shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for web site maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator? +Using this approach I get 228 shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for website maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator? -{line-numbers=off} -~~~~~~~~ +```r > g <- grep("iconHomicideShooting|icon_homicide_shooting", homicides) > length(g) [1] 1263 -~~~~~~~~ +``` -Now we have 1263 shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. +Now we have 1,263 shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths. Another possible way to do this is to `grep()` on the cause of death field, which seems to have the format `Cause: shooting`. We can `grep()` on this literally and get -{line-numbers=off} -~~~~~~~~ +```r > g <- grep("Cause: shooting", homicides) > length(g) [1] 228 -~~~~~~~~ +``` -Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a captial "S" while other entries use a lower case "s". 
We can handle this variation by using a character class in our regular expression. +Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression. -{line-numbers=off} -~~~~~~~~ +```r > g <- grep("Cause: [Ss]hooting", homicides) > length(g) [1] 1263 -~~~~~~~~ +``` -One thing you have to be careful of when processing text data is not not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. +One thing you have to be careful of when processing text data is to not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`. -{line-numbers=off} -~~~~~~~~ +```r > g <- grep("[Ss]hooting", homicides) > length(g) [1] 1265 -~~~~~~~~ +``` -Notice that we see to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions. +Notice that we seem to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions. First we can get the indices for the first expresssion match. -{line-numbers=off} -~~~~~~~~ +```r > i <- grep("[cC]ause: [Ss]hooting", homicides) > str(i) int [1:1263] 1 2 6 7 8 9 10 11 12 13 ... -~~~~~~~~ +``` Then we can get the indices for just matching on `[Ss]hooting`. -{line-numbers=off} -~~~~~~~~ +```r > j <- grep("[Ss]hooting", homicides) > str(j) int [1:1265] 1 2 6 7 8 9 10 11 12 13 ... -~~~~~~~~ +``` Now we just need to identify which are the entries that the vectors `i` and `j` do *not* have in common. -{line-numbers=off} -~~~~~~~~ +```r > setdiff(i, j) integer(0) > setdiff(j, i) [1] 318 859 -~~~~~~~~ +``` Here we can see that the index vector `j` has two entries that are not in `i`: entries 318, 859. We can take a look at these entries directly to see what makes them different. -{line-numbers=off} -~~~~~~~~ +```r > homicides[859] [1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce, 'p914', '
Steven Harris
4200 Pimlico Road
Baltimore, MD 21215
Race: Black
Gender: male
Age: 38 years old
Found on July 29, 2010
Victim died at Scene
Cause: Blunt Force

Harris was found dead July 22 and ruled a shooting victim; an autopsy subsequently showed that he had not been shot,...

'" -~~~~~~~~ +``` Here we can see that the word "shooting" appears in the narrative text that accompanies the data, but the ultimate cause of death was in fact blunt force. @@ -155,73 +145,62 @@ A> When developing a regular expression to extract entries from a large dataset, Sometimes we want to identify elements of a character vector that match a pattern, but instead of returning their indices we want the actual values that satisfy the match. For example, we may want to identify all of the states in the United States whose names start with "New". -{line-numbers=off} -~~~~~~~~ +```r > grep("^New", state.name) [1] 29 30 31 32 -~~~~~~~~ +``` This gives us the indices into the `state.name` variable that match, but setting `value = TRUE` returns the actual elements of the character vector that match. -{line-numbers=off} -~~~~~~~~ +```r > grep("^New", state.name, value = TRUE) [1] "New Hampshire" "New Jersey" "New Mexico" "New York" -~~~~~~~~ +``` ## `grepl()` The function `grepl()` works much like `grep()` except that it differs in its return value. `grepl()` returns a logical vector indicating which element of a character vector contains the match. For example, suppose we want to know which states in the United States begin with word "New". -{line-numbers=off} -~~~~~~~~ +```r > g <- grepl("^New", state.name) > g - [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE -[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE -[25] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE -[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE -[49] FALSE FALSE + [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE +[26] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > state.name[g] [1] "New Hampshire" "New Jersey" "New Mexico" "New York" -~~~~~~~~ +``` Here, we can see that `grepl()` returns a logical vector that can be used to subset the original `state.name` vector. - - ## `regexpr()` -Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or what the match is for a more complicated regular expression. +Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or if the match is for a more complicated regular expression. The `regexpr()` function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. `regexpr()` only gives you the *first* match of the string (reading left to right). `gregexpr()` will give you *all* of the matches in a given string if there are is more than one match. In our Baltimore City homicides dataset, we might be interested in finding the date on which each victim was found. Taking a look at the dataset -{line-numbers=off} -~~~~~~~~ +```r > homicides[1] [1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '
Leon Nelson
3400 Clifton Ave.
Baltimore, MD 21216
black male, 17 years old
Found on January 1, 2007
Victim died at Shock Trauma
Cause: shooting
'" -~~~~~~~~ +``` it seems that we might be able to just `grep` on the word "Found". However, the word "found" may be found elsewhere in the entry, such as in this entry, where the word "found" appears in the narrative text at the end. -{line-numbers=off} -~~~~~~~~ +```r > homicides[954] [1] "39.30677400000, -76.59891100000, icon_homicide_shooting, 'p816', '
Kenly Wheeler
1400 N Caroline St
Baltimore, MD 21213
Race: Black
Gender: male
Age: 29 years old
Found on March 3, 2010
Victim died at Scene
Cause: Shooting

Wheeler\\'s body was found on the grounds of Dr. Bernard Harris Sr. Elementary School

'" -~~~~~~~~ +``` But we can see that the date is typically preceded by "Found on" and is surrounded by `
` tags, so let's use the pattern `
[F|f]ound(.*)
` and see what it brings up. -{line-numbers=off} -~~~~~~~~ +```r > regexpr("
[F|f]ound(.*)
", homicides[1:10]) [1] 177 178 188 189 178 182 178 187 182 183 attr(,"match.length") @@ -230,22 +209,20 @@ attr(,"index.type") [1] "chars" attr(,"useBytes") [1] TRUE -~~~~~~~~ +``` We can use the `substr()` function to extract the first match in the first string. -{line-numbers=off} -~~~~~~~~ +```r > substr(homicides[1], 177, 177 + 93 - 1) [1] "
Found on January 1, 2007
Victim died at Shock Trauma
Cause: shooting
" -~~~~~~~~ +``` Immediately, we can see that the regular expression picked up too much information. This is because the previous pattern was too greedy and matched too much of the string. We need to use the `?` metacharacter to make the regular expression "lazy" so that it stops at the *first* `` tag. -{line-numbers=off} -~~~~~~~~ +```r > regexpr("
[F|f]ound(.*?)
", homicides[1:10]) [1] 177 178 188 189 178 182 178 187 182 183 attr(,"match.length") @@ -254,28 +231,25 @@ attr(,"index.type") [1] "chars" attr(,"useBytes") [1] TRUE -~~~~~~~~ +``` Now when we look at the substrings indicated by the `regexpr()` output, we get -{line-numbers=off} -~~~~~~~~ +```r > substr(homicides[1], 177, 177 + 33 - 1) [1] "
Found on January 1, 2007
" -~~~~~~~~ +``` While it's straightforward to take the output of `regexpr()` and feed it into `substr()` to get the matches out of the original data, one handy function is `regmatches()` which extracts the matches in the strings for you without you having to use `substr()`. -{line-numbers=off} -~~~~~~~~ +```r > r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) > regmatches(homicides[1:5], r) -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" +[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" "
Found on January 2, 2007
" "
Found on January 3, 2007
" [5] "
Found on January 5, 2007
" -~~~~~~~~ +``` @@ -284,58 +258,51 @@ While it's straightforward to take the output of `regexpr()` and feed it into `s Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the date from this string? -{line-numbers=off} -~~~~~~~~ +```r > x <- substr(homicides[1], 177, 177 + 33 - 1) > x [1] "
Found on January 1, 2007
" -~~~~~~~~ +``` We want to strip out the stuff surrounding the "January 1, 2007" portion. We can do that by matching on the text that comes before and after it using the `|` operator and then replacing it with the empty string. -{line-numbers=off} -~~~~~~~~ +```r > sub("
[F|f]ound on |
", "", x) [1] "January 1, 2007" -~~~~~~~~ +``` Notice that the `sub()` function found the first match (at the beginning of the string) and replaced it and then stopped. However, there was another match at the end of the string that we also wanted to replace. To get both matches, we need the `gsub()` function. -{line-numbers=off} -~~~~~~~~ +```r > gsub("
[F|f]ound on |
", "", x) [1] "January 1, 2007" -~~~~~~~~ +``` -The `sub() and `gsub()` functions can take vector arguments so we don't have to process each string one by one. +The `sub()` and `gsub()` functions can take vector arguments so we don't have to process each string one by one. -{line-numbers=off} -~~~~~~~~ +```r > r <- regexpr("
[F|f]ound(.*?)
", homicides[1:5]) > m <- regmatches(homicides[1:5], r) > m -[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" -[3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" +[1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" "
Found on January 2, 2007
" "
Found on January 3, 2007
" [5] "
Found on January 5, 2007
" > d <- gsub("
[F|f]ound on |
", "", m) > > ## Nice and clean > d -[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007" -[5] "January 5, 2007" -~~~~~~~~ +[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007" "January 5, 2007" +``` Finally, it may be useful to convert these strings to the `Date` class so that we can do some date-related computations. -{line-numbers=off} -~~~~~~~~ +```r > as.Date(d, "%B %d, %Y") [1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05" -~~~~~~~~ +``` ## `regexec()` @@ -344,8 +311,7 @@ The `regexec()` function works like `regexpr()` except it gives you the indices for parenthesized sub-expressions. For example, take a look at the following expression. -{line-numbers=off} -~~~~~~~~ +```r > regexec("
[F|f]ound on (.*?)
", homicides[1]) [[1]] [1] 177 190 @@ -355,15 +321,14 @@ attr(,"index.type") [1] "chars" attr(,"useBytes") [1] TRUE -~~~~~~~~ +``` Notice first that the regular expression itself has a portion in parentheses `()`. That is the portion of the expression that I presume will contain the date. In the output, you'll notice that there are two indices and two "match.length" values. The first index tells you where the overall match begins (character 177) and the second index tells you where the expression in the parentheses begins (character 190). By contrast, if we only use the `regexpr()` function, we get -{line-numbers=off} -~~~~~~~~ +```r > regexec("
[F|f]ound on .*?
", homicides[1]) [[1]] [1] 177 @@ -373,15 +338,14 @@ attr(,"index.type") [1] "chars" attr(,"useBytes") [1] TRUE -~~~~~~~~ +``` -We can use the `substr()` function to demonstrate which parts of a strings are matched by the `regexec()` function. +We can use the `substr()` function to demonstrate which parts of the strings are matched by the `regexec()` function. Here's the output for `regexec()`. -{line-numbers=off} -~~~~~~~~ +```r > regexec("
[F|f]ound on (.*?)
", homicides[1]) [[1]] [1] 177 190 @@ -391,31 +355,28 @@ attr(,"index.type") [1] "chars" attr(,"useBytes") [1] TRUE -~~~~~~~~ +``` Here's the overall expression match. -{line-numbers=off} -~~~~~~~~ +```r > substr(homicides[1], 177, 177 + 33 - 1) [1] "
Found on January 1, 2007
" -~~~~~~~~ +``` And here's the parenthesized sub-expression. -{line-numbers=off} -~~~~~~~~ +```r > substr(homicides[1], 190, 190 + 15 - 1) [1] "January 1, 2007" -~~~~~~~~ +``` All this can be done much more easily with the `regmatches()` function. -{line-numbers=off} -~~~~~~~~ +```r > r <- regexec("
[F|f]ound on (.*?)
", homicides[1:2]) > regmatches(homicides[1:2], r) [[1]] @@ -423,35 +384,32 @@ All this can be done much more easily with the `regmatches()` function. [[2]] [1] "
Found on January 2, 2007
" "January 2, 2007" -~~~~~~~~ +``` Notice that `regmatches()` returns a list in this case, where each element of the list contains two strings: the overall match and the parenthesized sub-expression. As an example, we can make a plot of monthly homicide counts. First we need a regular expression to capture the dates. -{line-numbers=off} -~~~~~~~~ +```r > r <- regexec("
[F|f]ound on (.*?)
", homicides) > m <- regmatches(homicides, r) -~~~~~~~~ +``` Then we can loop through the list returned by `regmatches()` and extract the second element of each (the parenthesized sub-expression). -{line-numbers=off} -~~~~~~~~ +```r > dates <- sapply(m, function(x) x[2]) -~~~~~~~~ +``` Finally, we can convert the date strings into the `Date` class and make a histogram of the counts. -{line-numbers=off} -~~~~~~~~ +```r > dates <- as.Date(dates, "%B %d, %Y") > hist(dates, "month", freq = TRUE, main = "Monthly Homicides in Baltimore") -~~~~~~~~ +``` ![plot of chunk unnamed-chunk-35](images/regex-unnamed-chunk-35-1.png) @@ -459,40 +417,36 @@ We can see from the picture that homicides do not occur uniformly throughout the ## The `stringr` Package -The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps they underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. +The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps the underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering. Given what we have discussed so far, there is a fairly straightforward mapping from the base R functions to the `stringr` functions. In general, for the `stringr` functions, the data are the first argument and the regular expression is the second argument, with optional arguments afterwards. `str_subset()` is much like `grep(value = TRUE)` and returns a character vector of strings that contain a given match. -{line-numbers=off} -~~~~~~~~ +```r > library(stringr) > g <- str_subset(homicides, "iconHomicideShooting") > length(g) [1] 228 -~~~~~~~~ -`str_detect()` is essentially `grepl()` +``` + +`str_detect()` is essentially equivalent `grepl()`. `str_extract()` plays the role of `regexpr()` and `regmatches()`, extracting the matches from the output. -{line-numbers=off} -~~~~~~~~ +```r > str_extract(homicides[1:10], "
[F|f]ound(.*?)
") - [1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" - [3] "
Found on January 2, 2007
" "
Found on January 3, 2007
" - [5] "
Found on January 5, 2007
" "
Found on January 5, 2007
" - [7] "
Found on January 5, 2007
" "
Found on January 7, 2007
" + [1] "
Found on January 1, 2007
" "
Found on January 2, 2007
" "
Found on January 2, 2007
" "
Found on January 3, 2007
" + [5] "
Found on January 5, 2007
" "
Found on January 5, 2007
" "
Found on January 5, 2007
" "
Found on January 7, 2007
" [9] "
Found on January 8, 2007
" "
Found on January 8, 2007
" -~~~~~~~~ +``` Finally, `str_match()` does the job of `regexec()` by provide a matrix containing the parenthesized sub-expressions. -{line-numbers=off} -~~~~~~~~ +```r > str_match(homicides[1:5], "
[F|f]ound on (.*?)
") [,1] [,2] [1,] "
Found on January 1, 2007
" "January 1, 2007" @@ -500,7 +454,7 @@ Finally, `str_match()` does the job of `regexec()` by provide a matrix containin [3,] "
Found on January 2, 2007
" "January 2, 2007" [4,] "
Found on January 3, 2007
" "January 3, 2007" [5,] "
Found on January 5, 2007
" "January 5, 2007" -~~~~~~~~ +``` Note how the second column of the output contains the values of the parenthesized sub-expressions. We could now obtain these values by extracting the second column of the matrix. If there had been more parenthesized sub-expressions, there would have been more columns in the output matrix. @@ -510,14 +464,14 @@ Note how the second column of the output contains the values of the parenthesize The primary R functions for dealing with regular expressions are - `grep()`, `grepl()`: Search for matches of a regular expression/pattern in a - character vector + character vector. -- `regexpr()`, `gregexpr(): Search a character vector for regular expression matches and +- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction - with `regmatches()` + with `regmatches()`. - `sub()`, `gsub()`: Search a character vector for regular expression matches and - replace that match with another string + replace that match with another string. - `regexec()`: Gives you indices of parethensized sub-expressions. From 24394e2163e362fa7d2ba5d82665c3b7021f7319 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 18:24:44 -0400 Subject: [PATCH 17/24] debugging: spelling, syntax --- manuscript/debugging.Rmd | 10 ++--- manuscript/debugging.md | 96 ++++++++++++++++------------------------ 2 files changed, 44 insertions(+), 62 deletions(-) diff --git a/manuscript/debugging.Rmd b/manuscript/debugging.Rmd index 53333e3..a8c1705 100644 --- a/manuscript/debugging.Rmd +++ b/manuscript/debugging.Rmd @@ -128,7 +128,7 @@ You can see now that the correct messages are printed without any warning or err ## Figuring Out What's Wrong -The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important first understand what you were expecting to occur. Then you need to idenfity what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are +The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important to first understand what you were expecting to occur. Then you need to identify what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are - What was your input? How did you call the function? - What were you expecting? Output, messages, other results? @@ -269,11 +269,11 @@ Enter a frame number, or 0 to exit Selection: ``` -The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. +The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. 
## Summary -- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal -- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation -- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions +- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal. +- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation. +- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions. - Debugging tools are not a substitute for thinking! diff --git a/manuscript/debugging.md b/manuscript/debugging.md index 19b2345..292a07c 100644 --- a/manuscript/debugging.md +++ b/manuscript/debugging.md @@ -17,12 +17,11 @@ R has a number of ways to indicate to you that something’s not right. There ar Here is an example of a warning that you might receive in the course of using R. -{line-numbers=off} -~~~~~~~~ +```r > log(-1) Warning in log(-1): NaNs produced [1] NaN -~~~~~~~~ +``` This warning lets you know that taking the log of a negative number results in a `NaN` value because you can't take the log of negative numbers. Nevertheless, R doesn't give an error, because it has a useful value that it can return, the `NaN` value. The warning is just there to let you know that something unexpected happen. Depending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code. @@ -30,8 +29,7 @@ Here is another function that is designed to print a message to the console depe -{line-numbers=off} -~~~~~~~~ +```r > printmessage <- function(x) { + if(x > 0) + print("x is greater than zero") @@ -39,7 +37,7 @@ Here is another function that is designed to print a message to the console depe + print("x is less than or equal to zero") + invisible(x) + } -~~~~~~~~ +``` This function is simple---it prints a message telling you whether `x` is greater than zero or less than or equal to zero. It also returns its input *invisibly*, which is a common practice with "print" functions. Returning an object invisibly means that the return value does not get auto-printed when the function is called. @@ -48,20 +46,18 @@ Take a hard look at the function above and see if you can identify any bugs or p We can execute the function as follows. -{line-numbers=off} -~~~~~~~~ +```r > printmessage(1) [1] "x is greater than zero" -~~~~~~~~ +``` The function seems to work fine at this point. No errors, warnings, or messages. -{line-numbers=off} -~~~~~~~~ +```r > printmessage(NA) Error in if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero"): missing value where TRUE/FALSE needed -~~~~~~~~ +``` What happened? @@ -70,8 +66,7 @@ Well, the first thing the function does is test if `x > 0`. But you can't do tha We can fix this problem by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function. 
-{line-numbers=off} -~~~~~~~~ +```r > printmessage2 <- function(x) { + if(is.na(x)) + print("x is a missing value!") @@ -81,33 +76,29 @@ We can fix this problem by anticipating the possibility of `NA` values and check + print("x is less than or equal to zero") + invisible(x) + } -~~~~~~~~ +``` Now we can run the following. -{line-numbers=off} -~~~~~~~~ +```r > printmessage2(NA) [1] "x is a missing value!" -~~~~~~~~ +``` And all is fine. Now what about the following situation. -{line-numbers=off} -~~~~~~~~ +```r > x <- log(c(-1, 2)) Warning in log(c(-1, 2)): NaNs produced > printmessage2(x) -Warning in if (is.na(x)) print("x is a missing value!") else if (x > 0) -print("x is greater than zero") else print("x is less than or equal to -zero"): the condition has length > 1 and only the first element will be -used +Warning in if (is.na(x)) print("x is a missing value!") else if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero"): the +condition has length > 1 and only the first element will be used [1] "x is a missing value!" -~~~~~~~~ +``` Now what?? Why are we getting this warning? The warning says "the condition has length > 1 and only the first element will be used". @@ -118,8 +109,7 @@ We can solve this problem two ways. One is by simply not allowing vector argumen For the first way, we simply need to check the length of the input. -{line-numbers=off} -~~~~~~~~ +```r > printmessage3 <- function(x) { + if(length(x) > 1L) + stop("'x' has length > 1") @@ -131,34 +121,32 @@ For the first way, we simply need to check the length of the input. + print("x is less than or equal to zero") + invisible(x) + } -~~~~~~~~ +``` Now when we pass `printmessage3()` a vector we should get an error. -{line-numbers=off} -~~~~~~~~ +```r > printmessage3(1:2) Error in printmessage3(1:2): 'x' has length > 1 -~~~~~~~~ +``` Vectorizing the function can be accomplished easily with the `Vectorize()` function. -{line-numbers=off} -~~~~~~~~ +```r > printmessage4 <- Vectorize(printmessage2) > out <- printmessage4(c(-1, 2)) [1] "x is less than or equal to zero" [1] "x is greater than zero" -~~~~~~~~ +``` You can see now that the correct messages are printed without any warning or error. Note that I stored the return value of `printmessage4()` in a separate R object called `out`. This is because when I use the `Vectorize()` function it no longer preserves the invisibility of the return value. ## Figuring Out What's Wrong -The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important first understand what you were expecting to occur. Then you need to idenfity what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are +The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important to first understand what you were expecting to occur. Then you need to identify what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are - What was your input? How did you call the function? - What were you expecting? Output, messages, other results? @@ -192,13 +180,12 @@ The `traceback()` function prints out the *function call stack* after an error h For example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`. 
If an error occurs, it may not be immediately clear in which function the error occurred. The `traceback()` function shows you how many levels deep you were when the error occurred. -{line-numbers=off} -~~~~~~~~ +```r > mean(x) Error in mean(x) : object 'x' not found > traceback() 1: mean(x) -~~~~~~~~ +``` Here, it's clear that the error occurred inside the `mean()` function because the object `x` does not exist. The `traceback()` function must be called immediately after an error occurs. Once another function is called, you lose the traceback. @@ -206,8 +193,7 @@ The `traceback()` function must be called immediately after an error occurs. Onc Here is a slightly more complicated example using the `lm()` function for linear modeling. -{line-numbers=off} -~~~~~~~~ +```r > lm(y ~ x) Error in eval(expr, envir, enclos) : object ’y’ not found > traceback() @@ -218,7 +204,7 @@ Error in eval(expr, envir, enclos) : object ’y’ not found 3: eval(expr, envir, enclos) 2: eval(mf, parent.frame()) 1: lm(y ~ x) -~~~~~~~~ +``` You can see now that the error did not get thrown until the 7th level of the function call stack, in which case the `eval()` function tried to evaluate the formula `y ~ x` and realized the object `y` did not exist. @@ -230,8 +216,7 @@ The `debug()` function initiates an interactive debugger (also known as the "bro The `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function. -{line-numbers=off} -~~~~~~~~ +```r > debug(lm) ## Flag the 'lm()' function for interactive debugging > lm(y ~ x) debugging in: lm(y ~ x) @@ -245,7 +230,7 @@ debug: { z } Browse[2]> -~~~~~~~~ +``` Now, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function. @@ -257,8 +242,7 @@ The debugger calls the browser at the very top level of the function body. From Here's an example of a browser session with the `lm()` function. -{line-numbers=off} -~~~~~~~~ +```r Browse[2]> n ## Evalute this expression and move to the next one debug: ret.x <- x Browse[2]> n @@ -270,16 +254,15 @@ debug: mf <- match.call(expand.dots = FALSE) Browse[2]> n debug: m <- match(c("formula", "data", "subset", "weights", "na.action", "offset"), names(mf), 0L) -~~~~~~~~ +``` While you are in the browser you can execute any other R function that might be available to you in a regular session. In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment. You can turn off interactive debugging with the `undebug()` function. -{line-numbers=off} -~~~~~~~~ +```r undebug(lm) ## Unflag the 'lm()' function for debugging -~~~~~~~~ +``` ## Using `recover()` @@ -287,8 +270,7 @@ The `recover()` function can be used to modify the error behavior of R when an e With `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified. 
-{line-numbers=off} -~~~~~~~~ +```r > options(error = recover) ## Change default R error behavior > read.csv("nosuchfile") ## This code doesn't work Error in file(file, "rt") : cannot open the connection @@ -303,13 +285,13 @@ Enter a frame number, or 0 to exit 3: file(file, "rt") Selection: -~~~~~~~~ +``` -The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. +The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around. ## Summary -- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal -- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation -- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions +- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal. +- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation. +- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions. - Debugging tools are not a substitute for thinking! From a166ab6d845114c5489ba115d84c73dc62981139 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 19:00:49 -0400 Subject: [PATCH 18/24] nutsbolts: consistent spelling of 'modeling' --- manuscript/nutsbolts.Rmd | 2 +- manuscript/nutsbolts.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/manuscript/nutsbolts.Rmd b/manuscript/nutsbolts.Rmd index b1552dd..6b23e4a 100644 --- a/manuscript/nutsbolts.Rmd +++ b/manuscript/nutsbolts.Rmd @@ -311,7 +311,7 @@ x Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a _label_. Factors are important in statistical modeling -and are treated specially by modelling functions like `lm()` and +and are treated specially by modeling functions like `lm()` and `glm()`. Using factors with labels is _better_ than using integers because diff --git a/manuscript/nutsbolts.md b/manuscript/nutsbolts.md index 104e0ae..9e37824 100644 --- a/manuscript/nutsbolts.md +++ b/manuscript/nutsbolts.md @@ -380,7 +380,7 @@ NULL Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer vector where each integer has a _label_. Factors are important in statistical modeling -and are treated specially by modelling functions like `lm()` and +and are treated specially by modeling functions like `lm()` and `glm()`. 
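For instance, when a factor is passed to `lm()` it is automatically expanded into indicator (dummy) variables; here is a minimal sketch with made-up data:

```r
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> y <- c(1.2, 0.9, 0.3, 1.5, 0.4)
> coef(lm(y ~ x))   ## "no" is the reference level; "yes" gets a dummy variable
(Intercept)        xyes 
       0.35        0.85 
```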
Using factors with labels is _better_ than using integers because From 133ed33c0cbadff91c770879202e172b25c73456 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 19:08:42 -0400 Subject: [PATCH 19/24] profiling: consistent wording, spelling, syntax --- manuscript/profiler.Rmd | 28 +++++++-------- manuscript/profiler.md | 75 ++++++++++++++++++----------------------- 2 files changed, 47 insertions(+), 56 deletions(-) diff --git a/manuscript/profiler.Rmd b/manuscript/profiler.Rmd index 0db0a57..b8e5824 100644 --- a/manuscript/profiler.Rmd +++ b/manuscript/profiler.Rmd @@ -9,9 +9,9 @@ knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE) ``` -R comes with a profiler to help you optimize your code and improve its performance. In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. +R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. -Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly should optimize the parts of your code that are running slowly, but how do we know what parts those are? +Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly you should optimize the parts of your code that are running slowly, but how do we know what parts those are? This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. @@ -26,9 +26,9 @@ the code spends most of its time. This cannot be done without some sort of rigor The basic principles of optimizing your code are: -* Design first, then optimize +* Design first, then optimize. -* Remember: Premature optimization is the root of all evil +* Remember: Premature optimization is the root of all evil. * Measure (collect data), don’t guess. @@ -37,15 +37,15 @@ The basic principles of optimizing your code are: ## Using `system.time()` -They `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information: +The `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. 
The function returns an object of class `proc_time` which contains two useful bits of information: - *user time*: time charged to the CPU(s) for this expression - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there -Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involes some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). +Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). -The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallell` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. +The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. Here's an example of where the elapsed time is greater than the user time. @@ -95,12 +95,12 @@ If your expression is getting pretty long (more than 2 or 3 lines), it might be [Watch a video of this section](https://youtu.be/BZVcMPtlJ4A) -Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on it that piece of code. What if you don’t know where to start? +Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on that piece of code. What if you don’t know where to start? This is where the profiler comes in handy. The `Rprof()` function starts the profiler in R. 
Note that R must be compiled with profiler support (but this is usually the case). In conjunction with `Rprof()`, we will use the `summaryRprof()` function which summarizes the output from `Rprof()` (otherwise it’s not really readable). Note that you should NOT use `system.time()` and `Rprof()` together, or you will be sad. -`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But of your code runs that fast, you probably don't need the profiler. +`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But if your code runs that fast, you probably don't need the profiler. The profiler is started by calling the `Rprof()` function. @@ -192,7 +192,7 @@ $by.self Now you can see that only about 4% of the runtime is spent in the actual `lm()` function, whereas over 40% of the time is spent in `lm.fit()`. In this case, this is no surprise since the `lm.fit()` function is the function that actually fits the linear model. -You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of pre-processing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. +You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of preprocessing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. The final bit of output that `summaryRprof()` provides is the sampling interval and the total runtime. @@ -206,13 +206,13 @@ $sampling.time ## Summary -* `Rprof()` runs the profiler for performance of analysis of R code +* `Rprof()` runs the profiler for performance analysis of R code. * `summaryRprof()` summarizes the output of `Rprof()` and gives percent of time spent in each function (with two types of - normalization) + normalization). * Good to break your code into functions so that the profiler can give - useful information about where time is being spent + useful information about where time is being spent. -* C or Fortran code is not profiled +* C or Fortran code is not profiled. diff --git a/manuscript/profiler.md b/manuscript/profiler.md index 1414e39..8ca9fdb 100644 --- a/manuscript/profiler.md +++ b/manuscript/profiler.md @@ -7,9 +7,9 @@ -R comes with a profiler to help you optimize your code and improve its performance. 
In generall, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. +R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing. -Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly should optimize the parts of your code that are running slowly, but how do we know what parts those are? +Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly you should optimize the parts of your code that are running slowly, but how do we know what parts those are? This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program. @@ -24,9 +24,9 @@ the code spends most of its time. This cannot be done without some sort of rigor The basic principles of optimizing your code are: -* Design first, then optimize +* Design first, then optimize. -* Remember: Premature optimization is the root of all evil +* Remember: Premature optimization is the root of all evil. * Measure (collect data), don’t guess. @@ -35,31 +35,29 @@ The basic principles of optimizing your code are: ## Using `system.time()` -They `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information: +The `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information: - *user time*: time charged to the CPU(s) for this expression - *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there -Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involes some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). +Usually, the user time and elapsed time are relatively close, for straight computing tasks. 
But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection). -The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallell` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. +The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time. Here's an example of where the elapsed time is greater than the user time. -{line-numbers=off} -~~~~~~~~ +```r ## Elapsed time > user time system.time(readLines("http://www.jhsph.edu")) user system elapsed 0.004 0.002 0.431 -~~~~~~~~ +``` Most of the time in this expression is spent waiting for the connection to the web server and waiting for the data to travel back to my computer. This doesn't involve the CPU and so the CPU simply waits around for things to get done. Hence, the user time is small. In this example, the elapsed time is smaller than the user time. -{line-numbers=off} -~~~~~~~~ +```r ## Elapsed time < user time > hilbert <- function(n) { + i <- 1:n @@ -69,7 +67,7 @@ In this example, the elapsed time is smaller than the user time. > system.time(svd(x)) user system elapsed 1.035 0.255 0.462 -~~~~~~~~ +``` In this case I ran a singular value decomposition on the matrix in `x`, which is a common linear algebra procedure. Because my computer is able to split the work across multiple processors, the elapsed time is about half the user time. @@ -79,8 +77,7 @@ In this case I ran a singular value decomposition on the matrix in `x`, which is You can time longer expressions by wrapping them in curly braces within the call to `system.time()`. -{line-numbers=off} -~~~~~~~~ +```r > system.time({ + n <- 1000 + r <- numeric(n) @@ -90,8 +87,8 @@ You can time longer expressions by wrapping them in curly braces within the call + } + }) user system elapsed - 0.105 0.002 0.116 -~~~~~~~~ + 0.06 0.00 0.06 +``` If your expression is getting pretty long (more than 2 or 3 lines), it might be better to either break it into smaller pieces or to use the profiler. 
The problem is that if the expression is too long, you won't be able to identify which part of the code is causing the bottleneck. @@ -99,20 +96,19 @@ If your expression is getting pretty long (more than 2 or 3 lines), it might be [Watch a video of this section](https://youtu.be/BZVcMPtlJ4A) -Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on it that piece of code. What if you don’t know where to start? +Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on that piece of code. What if you don’t know where to start? This is where the profiler comes in handy. The `Rprof()` function starts the profiler in R. Note that R must be compiled with profiler support (but this is usually the case). In conjunction with `Rprof()`, we will use the `summaryRprof()` function which summarizes the output from `Rprof()` (otherwise it’s not really readable). Note that you should NOT use `system.time()` and `Rprof()` together, or you will be sad. -`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But of your code runs that fast, you probably don't need the profiler. +`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But if your code runs that fast, you probably don't need the profiler. The profiler is started by calling the `Rprof()` function. -{line-numbers=off} -~~~~~~~~ +```r > Rprof() ## Turn on the profiler -~~~~~~~~ +``` You don't need any other arguments. By default it will write its output to a file called `Rprof.out`. You can specify the name of the output file if you don't want to use this default. @@ -121,15 +117,13 @@ Once you call the `Rprof()` function, everything that you do from then on will b The profiler can be turned off by passing `NULL` to `Rprof()`. -{line-numbers=off} -~~~~~~~~ +```r > Rprof(NULL) ## Turn off the profiler -~~~~~~~~ +``` The raw output from the profiler looks something like this. Here I'm calling the `lm()` function on some data with the profiler running. -{line-numbers=off} -~~~~~~~~ +```r ## lm(y ~ x) sample.interval=10000 @@ -147,7 +141,7 @@ sample.interval=10000 "lm.fit" "lm" "lm.fit" "lm" "lm.fit" "lm" -~~~~~~~~ +``` At each line of the output, the profiler writes out the function call stack. For example, on the very first line of the output you can see that the code is 8 levels deep in the call stack. This is where you need the `summaryRprof()` function to help you interpret this data. @@ -161,8 +155,7 @@ The `summaryRprof()` function tabulates the R profiler output and calculates how Here is what `summaryRprof()` reports in the "by.total" output. 
-{line-numbers=off} -~~~~~~~~ +```r $by.total total.time total.pct self.time self.pct "lm" 7.41 100.00 0.30 4.05 @@ -177,14 +170,13 @@ $by.total "[" 1.03 13.90 0.00 0.00 "as.list.data.frame" 0.82 11.07 0.82 11.07 "as.list" 0.82 11.07 0.00 0.00 -~~~~~~~~ +``` Because `lm()` is the function that I called from the command line, of course 100% of the time is spent somewhere in that function. However, what this doesn't show is that if `lm()` immediately calls another function (like `lm.fit()`, which does most of the heavy lifting), then in reality, most of the time is spent in *that* function, rather than in the top-level `lm()` function. The "by.self" output corrects for this discrepancy. -{line-numbers=off} -~~~~~~~~ +```r $by.self self.time self.pct total.time total.pct "lm.fit" 2.99 40.35 3.50 47.23 @@ -199,32 +191,31 @@ $by.self "as.character" 0.18 2.43 0.18 2.43 "model.frame.default" 0.12 1.62 2.24 30.23 "anyDuplicated.default" 0.02 0.27 0.02 0.27 -~~~~~~~~ +``` Now you can see that only about 4% of the runtime is spent in the actual `lm()` function, whereas over 40% of the time is spent in `lm.fit()`. In this case, this is no surprise since the `lm.fit()` function is the function that actually fits the linear model. -You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of pre-processing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. +You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of preprocessing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking. The final bit of output that `summaryRprof()` provides is the sampling interval and the total runtime. -{line-numbers=off} -~~~~~~~~ +```r $sample.interval [1] 0.02 $sampling.time [1] 7.41 -~~~~~~~~ +``` ## Summary -* `Rprof()` runs the profiler for performance of analysis of R code +* `Rprof()` runs the profiler for performance analysis of R code. * `summaryRprof()` summarizes the output of `Rprof()` and gives percent of time spent in each function (with two types of - normalization) + normalization). * Good to break your code into functions so that the profiler can give - useful information about where time is being spent + useful information about where time is being spent. -* C or Fortran code is not profiled +* C or Fortran code is not profiled. 
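Putting the pieces above together, a typical profiling session looks something like this sketch (the output file name and the profiled expression are arbitrary choices, and `x` and `y` are assumed to already exist):

```r
Rprof("profile.out")          ## start the profiler, writing samples to a file
fit <- lm(y ~ x)              ## the code being profiled
Rprof(NULL)                   ## turn the profiler off
summaryRprof("profile.out")   ## tabulate by.total, by.self, and sampling time
```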
From 4d86aff289a8b1da3bc0258c6de6f1468537b764 Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 19:37:24 -0400 Subject: [PATCH 20/24] simulation: wording, spelling, syntax --- manuscript/simulation.Rmd | 20 ++--- manuscript/simulation.md | 153 +++++++++++++++++--------------------- 2 files changed, 77 insertions(+), 96 deletions(-) diff --git a/manuscript/simulation.Rmd b/manuscript/simulation.Rmd index 2477ec1..f04c36c 100644 --- a/manuscript/simulation.Rmd +++ b/manuscript/simulation.Rmd @@ -12,14 +12,14 @@ set.seed(10) Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs. -R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R +R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R - `rnorm`: generate random Normal variates with a given mean and standard deviation - `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points) - `pnorm`: evaluate the cumulative distribution function for a Normal distribution - `rpois`: generate random Poisson variates with a given rate -For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a +For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a - `d` for density - `r` for random number generation @@ -28,7 +28,7 @@ For each probability distribution there are typically four functions available t If you're only interested in simulating random numbers, then you will likely only need the "r" functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too. -Probably the most common probability distribution to work with the is the Normal distribution (also known as the Gaussian). Working with the Normal distributions requires using these four functions +Probably the most common probability distribution to work with is the Normal distribution (also known as the Gaussian). Working with the Normal distribution requires using these four functions ```r dnorm(x, mean = 0, sd = 1, log = FALSE) @@ -37,7 +37,7 @@ qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) rnorm(n, mean = 0, sd = 1) ``` -Here we simulate standard Normal random numbers with mean 0 and standard deviation 1. +Here we simulate 10 standard Normal random numbers with mean 0 and standard deviation 1. 
```{r} ## Simulate standard Normal random numbers @@ -45,7 +45,7 @@ x <- rnorm(10) x ``` -We can modify the default parameters to simulate numbers with mean 20 and standard deviation 2. +We can modify the default parameters to simulate 10 numbers with mean 20 and standard deviation 2. ```{r} x <- rnorm(10, 20, 2) @@ -149,7 +149,7 @@ plot(x, y) ``` -We can also simulate from *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For examples, suppose we want to simulate from a Poisson log-linear model where +We can also simulate from a *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For example, suppose we want to simulate from a Poisson log-linear model where \[ Y \sim Poisson(\mu) @@ -197,7 +197,7 @@ sample(letters, 5) sample(1:10) sample(1:10) -## Sample w/replacement +## Sample w/ replacement sample(1:10, replace = TRUE) ``` @@ -229,7 +229,7 @@ Other more complex objects can be sampled in this way, as long as there's a way ## Summary -- Drawing samples from specific probability distributions can be done with "r" functions +- Drawing samples from specific probability distributions can be done with "r" functions. - Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc. -- The `sample()` function can be used to draw random samples from arbitrary vectors -- Setting the random number generator seed via `set.seed()` is critical for reproducibility +- The `sample()` function can be used to draw random samples from arbitrary vectors. +- Setting the random number generator seed via `set.seed()` is critical for reproducibility. diff --git a/manuscript/simulation.md b/manuscript/simulation.md index 8fec973..72898b8 100644 --- a/manuscript/simulation.md +++ b/manuscript/simulation.md @@ -9,14 +9,14 @@ Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs. -R comes with a set of pseuodo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R +R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R - `rnorm`: generate random Normal variates with a given mean and standard deviation - `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points) - `pnorm`: evaluate the cumulative distribution function for a Normal distribution - `rpois`: generate random Poisson variates with a given rate -For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". The "r" function is the one that actually simulates randon numbers from that distribution. The other functions are prefixed with a +For each probability distribution there are typically four functions available that start with a "r", "d", "p", and "q". 
The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a - `d` for density - `r` for random number generation @@ -25,50 +25,44 @@ For each probability distribution there are typically four functions available t If you're only interested in simulating random numbers, then you will likely only need the "r" functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too. -Probably the most common probability distribution to work with the is the Normal distribution (also known as the Gaussian). Working with the Normal distributions requires using these four functions +Probably the most common probability distribution to work with is the Normal distribution (also known as the Gaussian). Working with the Normal distribution requires using these four functions -{line-numbers=off} -~~~~~~~~ +```r dnorm(x, mean = 0, sd = 1, log = FALSE) pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) rnorm(n, mean = 0, sd = 1) -~~~~~~~~ +``` -Here we simulate standard Normal random numbers with mean 0 and standard deviation 1. +Here we simulate 10 standard Normal random numbers with mean 0 and standard deviation 1. -{line-numbers=off} -~~~~~~~~ +```r > ## Simulate standard Normal random numbers > x <- rnorm(10) > x - [1] 0.01874617 -0.18425254 -1.37133055 -0.59916772 0.29454513 - [6] 0.38979430 -1.20807618 -0.36367602 -1.62667268 -0.25647839 -~~~~~~~~ + [1] 0.01874617 -0.18425254 -1.37133055 -0.59916772 0.29454513 0.38979430 -1.20807618 -0.36367602 -1.62667268 -0.25647839 +``` -We can modify the default parameters to simulate numbers with mean 20 and standard deviation 2. +We can modify the default parameters to simulate 10 numbers with mean 20 and standard deviation 2. -{line-numbers=off} -~~~~~~~~ +```r > x <- rnorm(10, 20, 2) > x - [1] 22.20356 21.51156 19.52353 21.97489 21.48278 20.17869 18.09011 - [8] 19.60970 21.85104 20.96596 + [1] 22.20356 21.51156 19.52353 21.97489 21.48278 20.17869 18.09011 19.60970 21.85104 20.96596 > summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. 18.09 19.75 21.22 20.74 21.77 22.20 -~~~~~~~~ +``` If you wanted to know what was the probability of a random Normal variable of being less than, say, 2, you could use the `pnorm()` function to do that calculation. -{line-numbers=off} -~~~~~~~~ +```r > pnorm(2) [1] 0.9772499 -~~~~~~~~ +``` You never know when that calculation will come in handy. @@ -79,30 +73,27 @@ When simulating any random numbers it is essential to set the *random number see For example, I can generate 5 Normal random numbers with `rnorm()`. -{line-numbers=off} -~~~~~~~~ +```r > set.seed(1) > rnorm(5) [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -~~~~~~~~ +``` Note that if I call `rnorm()` again I will of course get a different set of 5 random numbers. -{line-numbers=off} -~~~~~~~~ +```r > rnorm(5) [1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884 -~~~~~~~~ +``` If I want to reproduce the original set of random numbers, I can just reset the seed with `set.seed()`. 
-{line-numbers=off} -~~~~~~~~ +```r > set.seed(1) > rnorm(5) ## Same as before [1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -~~~~~~~~ +``` In general, you should **always set the random number seed when conducting a simulation!** Otherwise, you will not be able to reconstruct the exact numbers that you produced in an analysis. @@ -110,15 +101,14 @@ It is possible to generate random numbers from other probability distributions l -{line-numbers=off} -~~~~~~~~ +```r > rpois(10, 1) ## Counts with a mean of 1 [1] 0 0 1 1 2 1 1 4 1 2 > rpois(10, 2) ## Counts with a mean of 2 [1] 4 1 2 0 1 1 0 1 4 1 > rpois(10, 20) ## Counts with a mean of 20 [1] 19 19 24 23 22 24 23 20 11 22 -~~~~~~~~ +``` ## Simulating a Linear Model @@ -128,16 +118,15 @@ Simulating random numbers is useful but sometimes we want to simulate values tha Suppose we want to simulate from the following linear model -{$$} +\[ y = \beta_0 + \beta_1 x + \varepsilon -{/$$} +\] -where {$$}\varepsilon\sim\mathcal{N}(0,2^2){/$$}. Assume {$$}x\sim\mathcal{N}(0,1^2){/$$}, {$$}\beta_0=0.5{/$$} and {$$}\beta_1=2{/$$}. The variable `x` might represent an important predictor of the outcome `y`. Here's how we could do that in R. +where $\varepsilon\sim\mathcal{N}(0,2^2)$. Assume $x\sim\mathcal{N}(0,1^2)$, $\beta_0=0.5$ and $\beta_1=2$. The variable `x` might represent an important predictor of the outcome `y`. Here's how we could do that in R. -{line-numbers=off} -~~~~~~~~ +```r > ## Always set your seed! > set.seed(20) > @@ -151,16 +140,15 @@ where {$$}\varepsilon\sim\mathcal{N}(0,2^2){/$$}. Assume {$$}x\sim\mathcal{N}(0, > y <- 0.5 + 2 * x + e > summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. --6.4080 -1.5400 0.6789 0.6893 2.9300 6.5050 -~~~~~~~~ +-6.4084 -1.5402 0.6789 0.6893 2.9303 6.5052 +``` We can plot the results of the model simulation. -{line-numbers=off} -~~~~~~~~ +```r > plot(x, y) -~~~~~~~~ +``` ![plot of chunk Linear Model](images/Linear Model-1.png) @@ -168,60 +156,56 @@ We can plot the results of the model simulation. What if we wanted to simulate a predictor variable `x` that is binary instead of having a Normal distribution. We can use the `rbinom()` function to simulate binary random variables. -{line-numbers=off} -~~~~~~~~ +```r > set.seed(10) > x <- rbinom(100, 1, 0.5) > str(x) ## 'x' is now 0s and 1s int [1:100] 1 0 0 1 0 0 0 0 1 0 ... -~~~~~~~~ +``` Then we can procede with the rest of the model as before. -{line-numbers=off} -~~~~~~~~ +```r > e <- rnorm(100, 0, 2) > y <- 0.5 + 2 * x + e > plot(x, y) -~~~~~~~~ +``` ![plot of chunk Linear Model Binary](images/Linear Model Binary-1.png) -We can also simulate from *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For examples, suppose we want to simulate from a Poisson log-linear model where +We can also simulate from a *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For example, suppose we want to simulate from a Poisson log-linear model where -{$$} +\[ Y \sim Poisson(\mu) -{/$$} +\] -{$$} +\[ \log \mu = \beta_0 + \beta_1 x -{/$$} +\] -and {$$}\beta_0=0.5{/$$} and {$$}\beta_1=0.3{/$$}. We need to use the `rpois()` function for this +and $\beta_0=0.5$ and $\beta_1=0.3$. 
We need to use the `rpois()` function for this -{line-numbers=off} -~~~~~~~~ +```r > set.seed(1) > > ## Simulate the predictor variable as before > x <- rnorm(100) -~~~~~~~~ +``` Now we need to compute the log mean of the model and then exponentiate it to get the mean to pass to `rpois()`. -{line-numbers=off} -~~~~~~~~ +```r > log.mu <- 0.5 + 0.3 * x > y <- rpois(100, exp(log.mu)) > summary(y) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00 1.00 1.00 1.55 2.00 6.00 > plot(x, y) -~~~~~~~~ +``` ![plot of chunk Poisson Log-Linear Model](images/Poisson Log-Linear Model-1.png) @@ -234,36 +218,34 @@ You can build arbitrarily complex models like this by simulating more predictors The `sample()` function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions of numbers. -{line-numbers=off} -~~~~~~~~ +```r > set.seed(1) > sample(1:10, 4) -[1] 3 4 5 7 +[1] 9 4 7 1 > sample(1:10, 4) -[1] 3 9 8 5 +[1] 2 7 3 6 > > ## Doesn't have to be numbers > sample(letters, 5) -[1] "q" "b" "e" "x" "p" +[1] "r" "s" "a" "u" "w" > > ## Do a random permutation > sample(1:10) - [1] 4 7 10 6 9 2 8 3 1 5 + [1] 10 6 9 2 1 5 8 4 3 7 > sample(1:10) - [1] 2 3 4 1 9 5 10 8 6 7 + [1] 5 10 2 8 6 1 4 3 9 7 > -> ## Sample w/replacement +> ## Sample w/ replacement > sample(1:10, replace = TRUE) - [1] 2 9 7 8 2 8 5 9 7 8 -~~~~~~~~ + [1] 3 6 10 10 6 4 4 10 9 7 +``` To sample more complicated things, such as rows from a data frame or a list, you can sample the indices into an object rather than the elements of the object itself. Here's how you can sample rows from a data frame. -{line-numbers=off} -~~~~~~~~ +```r > library(datasets) > data(airquality) > head(airquality) @@ -274,14 +256,13 @@ Here's how you can sample rows from a data frame. 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 -~~~~~~~~ +``` Now we just need to create the index vector indexing the rows of the data frame and sample directly from that index vector. -{line-numbers=off} -~~~~~~~~ +```r > set.seed(20) > > ## Create index vector @@ -291,19 +272,19 @@ Now we just need to create the index vector indexing the rows of the data frame > samp <- sample(idx, 6) > airquality[samp, ] Ozone Solar.R Wind Temp Month Day -135 21 259 15.5 76 9 12 -117 168 238 3.4 81 8 25 -43 NA 250 9.2 92 6 12 -80 79 187 5.1 87 7 19 -144 13 238 12.6 64 9 21 -146 36 139 10.3 81 9 23 -~~~~~~~~ +107 NA 64 11.5 79 8 15 +120 76 203 9.7 97 8 28 +130 20 252 10.9 80 9 7 +98 66 NA 4.6 87 8 6 +29 45 252 14.9 81 5 29 +45 NA 332 13.8 80 6 14 +``` Other more complex objects can be sampled in this way, as long as there's a way to index the sub-elements of the object. ## Summary -- Drawing samples from specific probability distributions can be done with "r" functions +- Drawing samples from specific probability distributions can be done with "r" functions. - Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc. -- The `sample()` function can be used to draw random samples from arbitrary vectors -- Setting the random number generator seed via `set.seed()` is critical for reproducibility +- The `sample()` function can be used to draw random samples from arbitrary vectors. +- Setting the random number generator seed via `set.seed()` is critical for reproducibility. 
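As one last illustration, sampling with replacement is the core of the bootstrap mentioned at the start of this chapter; here is a minimal sketch (the data, the statistic, and the replication count are all arbitrary choices, output not shown):

```r
set.seed(25)                      ## always set your seed!
x <- rnorm(100, 20, 2)            ## some simulated "observed" data
meds <- replicate(1000, median(sample(x, replace = TRUE)))
quantile(meds, c(0.025, 0.975))   ## rough 95% bootstrap interval for the median
```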
From 65911e3f6c1f7389cf4c7b5a09ca2d4cc2ee9a29 Mon Sep 17 00:00:00 2001
From: Richard
Date: Sat, 6 Nov 2021 20:18:22 -0400
Subject: [PATCH 21/24] example: fix dead url, add proportion metric to narrative, spelling

---
 manuscript/example.Rmd |  19 ++---
 manuscript/example.md  | 177 +++++++++++++++++------------------
 2 files changed, 81 insertions(+), 115 deletions(-)

diff --git a/manuscript/example.Rmd b/manuscript/example.Rmd
index 3b5f0a6..202ac5e 100644
--- a/manuscript/example.Rmd
+++ b/manuscript/example.Rmd
@@ -4,7 +4,7 @@ knitr::opts_chunk$set(comment = NA, fig.path = "images/", collapse = TRUE,
                       prompt = TRUE)
 ```
 
-This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.
+This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agency's freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.
 
 [Watch a video of this chapter](https://youtu.be/VE-6bQvyfTQ)
 
@@ -16,12 +16,12 @@ In this chapter we aim to describe the changes in fine particle (PM2.5) outdoor
 
 ## Loading and Processing the Raw Data
 
-From the [EPA Air Quality System](http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html) we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.
+From the [EPA's Air Quality System](https://aqs.epa.gov/aqsweb/airdata/download_files.html), we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.
 
 
 ### Reading in the 1999 data
 
-We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file were fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data.
+We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file where fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data.
 
 
 ```{r read 1999 data,cache=TRUE,tidy=FALSE}
@@ -99,10 +99,10 @@ Interestingly, from the summary of `x1` it appears there are some negative value
 
 ```{r check negative values}
 negative <- x1 < 0
-mean(negative, na.rm = T)
-````
+mean(negative, na.rm = TRUE)  # proportion of negative values
+```
 
-There is a relatively small proportion of values that are negative, which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame.
The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.
+There is a relatively small proportion of values that are negative (`r mean(negative, na.rm = TRUE)`), which is perhaps reassuring. In order to investigate this one step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.
 
 ```{r converting dates,cache=TRUE}
 dates <- pm1$Date
@@ -118,7 +118,7 @@ tab <- table(factor(missing.months, levels = month.name))
 round(100 * tab / sum(tab))
 ```
 
-From the table above it appears that bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.
+From the table above it appears the bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.
 
 
 ### Changes in PM levels at an individual monitor
@@ -141,7 +141,7 @@ str(site0)
 str(site1)
 ```
 
-Finaly, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.
+Finally, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.
 
 ```{r}
 both <- intersect(site0, site1)
@@ -195,9 +195,9 @@ plot(dates1, x1sub, pch = 20, ylim = rng, xlab = "", ylab = expression(PM[2.5] *
 abline(h = median(x1sub, na.rm = T))
 ```
 
-From the plot above, we can that median levels of PM (horizontal solid line) have decreased a little from `r median(x0sub,na.rm=TRUE)` in 1999 to `r median(x1sub,na.rm=TRUE)` in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggest that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.
+From the plot above, we can see that median levels of PM (horizontal solid line) have decreased a little from `r median(x0sub,na.rm=TRUE)` in 1999 to `r median(x1sub,na.rm=TRUE)` in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggests that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.
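A quick numerical check of that difference in spread, using the `x0sub` and `x1sub` objects created above (a sketch; output not shown):

```r
IQR(x0sub, na.rm = TRUE)   ## interquartile range of the 1999 values
IQR(x1sub, na.rm = TRUE)   ## interquartile range of the 2012 values, expected to be smaller
```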
 ### Changes in state-wide PM levels
 
-Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that that the are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.
+Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that they are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.
 
 What we do here is calculate the mean of PM for each state in 1999 and 2012.
 
diff --git a/manuscript/example.md b/manuscript/example.md
index b5f0109..49ad5cb 100644
--- a/manuscript/example.md
+++ b/manuscript/example.md
@@ -2,7 +2,7 @@
 
 
 
-This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.
+This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agency's freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.
 
 [Watch a video of this chapter](https://youtu.be/VE-6bQvyfTQ)
 
@@ -14,25 +14,23 @@ In this chapter we aim to describe the changes in fine particle (PM2.5) outdoor
 
 ## Loading and Processing the Raw Data
 
-From the [EPA Air Quality System](http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html) we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.
+From the [EPA's Air Quality System](https://aqs.epa.gov/aqsweb/airdata/download_files.html), we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.
 
 
 ### Reading in the 1999 data
 
-We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file were fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data.
+We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file where fields are delimited with the `|` character and missing values are coded as blank fields.
We skip some commented lines in the beginning of the file and initially we do not read the header data.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> pm0 <- read.table("pm25_data/RD_501_88101_1999-0.txt", comment.char = "#", header = FALSE, sep = "|", na.strings = "")
-~~~~~~~~
+```
 
After reading in the 1999 data, we check the first few rows (there are 117,421 rows in this dataset).
 
-{line-numbers=off}
-~~~~~~~~
+```r
> dim(pm0)
[1] 117421 28
> head(pm0[, 1:13])
@@ -43,55 +41,45 @@ After reading in the 1999 we check the first few rows (there are 117,421) rows i
 4 RD I 1 27 1 88101 1 7 105 120 19990112 00:00 8.841
 5 RD I 1 27 1 88101 1 7 105 120 19990115 00:00 14.920
 6 RD I 1 27 1 88101 1 7 105 120 19990118 00:00 3.878
-~~~~~~~~
+```
 
We then attach the column headers to the dataset and make sure that they are properly formatted for R data frames.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> cnames <- readLines("pm25_data/RD_501_88101_1999-0.txt", 1)
> cnames <- strsplit(cnames, "|", fixed = TRUE)
> ## Ensure names are properly formatted
> names(pm0) <- make.names(cnames[[1]])
> head(pm0[, 1:13])
-  X..RD Action.Code State.Code County.Code Site.ID Parameter POC
-1    RD           I          1          27       1     88101   1
-2    RD           I          1          27       1     88101   1
-3    RD           I          1          27       1     88101   1
-4    RD           I          1          27       1     88101   1
-5    RD           I          1          27       1     88101   1
-6    RD           I          1          27       1     88101   1
-  Sample.Duration Unit Method     Date Start.Time Sample.Value
-1               7  105    120 19990103      00:00           NA
-2               7  105    120 19990106      00:00           NA
-3               7  105    120 19990109      00:00           NA
-4               7  105    120 19990112      00:00        8.841
-5               7  105    120 19990115      00:00       14.920
-6               7  105    120 19990118      00:00        3.878
-~~~~~~~~
+  X..RD Action.Code State.Code County.Code Site.ID Parameter POC Sample.Duration Unit Method Date Start.Time Sample.Value
+1 RD I 1 27 1 88101 1 7 105 120 19990103 00:00 NA
+2 RD I 1 27 1 88101 1 7 105 120 19990106 00:00 NA
+3 RD I 1 27 1 88101 1 7 105 120 19990109 00:00 NA
+4 RD I 1 27 1 88101 1 7 105 120 19990112 00:00 8.841
+5 RD I 1 27 1 88101 1 7 105 120 19990115 00:00 14.920
+6 RD I 1 27 1 88101 1 7 105 120 19990118 00:00 3.878
+```
 
The column we are interested in is the `Sample.Value` column which contains the PM2.5 measurements. Here we extract that column and print a brief summary.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> x0 <- pm0$Sample.Value
> summary(x0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    7.20   11.50   13.74   17.90  157.10   13217 
-~~~~~~~~
+```
 
Missing values are a common problem with environmental data and so we check to see what proportion of the observations are missing (i.e. coded as `NA`).
 
-{line-numbers=off}
-~~~~~~~~
+```r
> mean(is.na(x0))  ## Are missing values important here?
[1] 0.1125608
-~~~~~~~~
+```
 
Because the proportion of missing values is relatively low (0.1125608), we choose to ignore missing values for now.
 
@@ -102,20 +90,18 @@ We then read in the 2012 data in the same manner in which we read the 1999 data
 
 
 
-{line-numbers=off}
-~~~~~~~~
+```r
> pm1 <- read.table("pm25_data/RD_501_88101_2012-0.txt", comment.char = "#", 
+     header = FALSE, sep = "|", na.strings = "", nrow = 1304290)
-~~~~~~~~
+```
 
We also set the column names (they are the same as the 1999 dataset) and extract the `Sample.Value` column from this dataset.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> names(pm1) <- make.names(cnames[[1]])
> x1 <- pm1$Sample.Value
-~~~~~~~~
+```
 
## Results
 
### Entire U.S. analysis
 
In order to show aggregate changes in PM across the entire monitoring network, we can make boxplots of all monitor values in 1999 and 2012.
Here, we take the log of the PM values to adjust for the skew in the data. -{line-numbers=off} -~~~~~~~~ +```r > boxplot(log2(x0), log2(x1)) Warning in boxplot.default(log2(x0), log2(x1)): NaNs produced -Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z -$group == : Outlier (-Inf) in boxplot 1 is not drawn -Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z -$group == : Outlier (-Inf) in boxplot 2 is not drawn -~~~~~~~~ +Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn +Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 2 is not drawn +``` ![plot of chunk unnamed-chunk-5](images/unnamed-chunk-5-1.png) -{line-numbers=off} -~~~~~~~~ +```r > summary(x0) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.00 7.20 11.50 13.74 17.90 157.10 13217 > summary(x1) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's - -10.00 4.00 7.63 9.14 12.00 909.00 73133 -~~~~~~~~ + -10.00 4.00 7.63 9.14 12.00 908.97 73133 +``` Interestingly, from the summary of `x1` it appears there are some negative values of PM, which in general should not occur. We can investigate that somewhat to see if there is anything we should worry about. -{line-numbers=off} -~~~~~~~~ +```r > negative <- x1 < 0 +> length(negative)/length(x1) # proportion of negative values +[1] 1 > mean(negative, na.rm = T) [1] 0.0215034 -~~~~~~~~ +``` -There is a relatively small proportion of values that are negative, which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. +There is a relatively small proportion of values that are negative (1), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. -{line-numbers=off} -~~~~~~~~ +```r > dates <- pm1$Date > dates <- as.Date(as.character(dates), "%Y%m%d") -~~~~~~~~ +``` We can then extract the month from each of the dates with negative values and attempt to identify when negative values occur most often. -{line-numbers=off} -~~~~~~~~ +```r > missing.months <- month.name[as.POSIXlt(dates)$mon + 1] > tab <- table(factor(missing.months, levels = month.name)) > round(100 * tab / sum(tab)) - January February March April May June July - 15 13 15 13 14 13 8 - August September October November December - 6 3 0 0 0 -~~~~~~~~ + January February March April May June July August September October November December + 15 13 15 13 14 13 8 6 3 0 0 0 +``` -From the table above it appears that bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now. +From the table above it appears the bulk of the negative values occur in the first six months of the year (January--June). 
However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.
 
### Changes in PM levels at an individual monitor
 
@@ -192,85 +171,73 @@ So far we have examined the change in PM levels on average across the country. O
 
Our first task is to identify a monitor in New York State that has data in 1999 and 2012 (not all monitors operated during both time periods). First we subset the data frames to only include data from New York (`State.Code == 36`) and only include the `County.Code` and the `Site.ID` (i.e. monitor number) variables.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> site0 <- unique(subset(pm0, State.Code == 36, c(County.Code, Site.ID)))
> site1 <- unique(subset(pm1, State.Code == 36, c(County.Code, Site.ID)))
-~~~~~~~~
+```
 
Then we create a new variable that combines the county code and the site ID into a single string.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> site0 <- paste(site0[,1], site0[,2], sep = ".")
> site1 <- paste(site1[,1], site1[,2], sep = ".")
> str(site0)
- chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" ...
+ chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" "13.11" "27.1004" "29.2" "29.5" "29.1007" "31.3" "47.11" "47.76" "55.6001" "59.5" "59.8" "59.11" ...
> str(site1)
- chr [1:18] "1.5" "1.12" "5.80" "5.133" "13.11" "29.5" ...
+ chr [1:18] "1.5" "1.12" "5.80" "5.133" "13.11" "29.5" "31.3" "47.122" "55.1007" "61.79" "61.134" "63.2008" "67.1015" "71.2" "81.124" "85.55" "101.3" ...
-~~~~~~~~
+```
 
-Finaly, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.
+Finally, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> both <- intersect(site0, site1)
> print(both)
- [1] "1.5"     "1.12"    "5.80"    "13.11"   "29.5"    "31.3"    "63.2008"
- [8] "67.1015" "85.55"   "101.3"  
+ [1] "1.5" "1.12" "5.80" "13.11" "29.5" "31.3" "63.2008" "67.1015" "85.55" "101.3"
-~~~~~~~~
+```
 
Here (above) we can see that there are 10 monitors that were operating in both time periods. However, rather than choose one at random, it might be best to choose one that had a reasonable amount of data in each year.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> ## Find how many observations available at each monitor
> pm0$county.site <- with(pm0, paste(County.Code, Site.ID, sep = "."))
> pm1$county.site <- with(pm1, paste(County.Code, Site.ID, sep = "."))
> cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)
> cnt1 <- subset(pm1, State.Code == 36 & county.site %in% both)
-~~~~~~~~
+```
 
Now that we have subsetted the original data frames to only include the data from the monitors that overlap between 1999 and 2012, we can split the data frames and count the number of observations at each monitor to see which ones have the most observations.
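
The same per-monitor counts can also be obtained in one line with `table()` -- a sketch, assuming the `cnt0` and `cnt1` data frames just created; the `split()`/`sapply()` version below makes the looping explicit, which generalizes more easily to other summaries:

```r
> ## Tabulate the number of rows (observations) for each monitor ID
> table(cnt0$county.site)  ## 1999
> table(cnt1$county.site)  ## 2012
```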
-{line-numbers=off}
-~~~~~~~~
+```r
> ## 1999
> sapply(split(cnt0, cnt0$county.site), nrow)
-   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015
-     61     122     152      61      61     183      61     122     122
-  85.55
-      7
+ 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
+ 61 122 152 61 61 183 61 122 122 7
> ## 2012
> sapply(split(cnt1, cnt1$county.site), nrow)
-   1.12     1.5   101.3   13.11    29.5    31.3    5.80 63.2008 67.1015
-     31      64      31      31      33      15      31      30      31
-  85.55
-     31
+ 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
+ 31 64 31 31 33 15 31 30 31 31
-~~~~~~~~
+```
 
A number of monitors seem suitable from the output, but we will focus here on County 63 and site ID 2008.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> both.county <- 63
> both.id <- 2008
> 
> ## Choose county 63 and site ID 2008
> pm1sub <- subset(pm1, State.Code == 36 & County.Code == both.county & Site.ID == both.id)
> pm0sub <- subset(pm0, State.Code == 36 & County.Code == both.county & Site.ID == both.id)
-~~~~~~~~
+```
 
Now we plot the time series data of PM for the monitor in both years.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> dates1 <- as.Date(as.character(pm1sub$Date), "%Y%m%d")
> x1sub <- pm1sub$Sample.Value
> dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d")
@@ -283,21 +250,20 @@ Now we plot the time series data of PM for the monitor in both years.
> abline(h = median(x0sub, na.rm = T))
> plot(dates1, x1sub, pch = 20, ylim = rng, xlab = "", ylab = expression(PM[2.5] * " (" * mu * g/m^3 * ")"))
> abline(h = median(x1sub, na.rm = T))
-~~~~~~~~
+```
 
![plot of chunk unnamed-chunk-12](images/unnamed-chunk-12-1.png)
 
-From the plot above, we can that median levels of PM (horizontal solid line) have decreased a little from 10.45 in 1999 to 8.29 in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggest that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.
+From the plot above, we can see that median levels of PM (horizontal solid line) have decreased a little from 10.45 in 1999 to 8.29 in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggests that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.
 
 
### Changes in state-wide PM levels
 
-Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that that the are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.
+Although ambient air quality standards are set at the federal level in the U.S.
and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that they are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.
 
What we do here is calculate the mean of PM for each state in 1999 and 2012.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> ## 1999
> mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
> ## 2012
> mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
@@ -315,20 +281,19 @@ What we do here is calculate the mean of PM for each state in 1999 and 2012.
 4         12 11.137139  8.239690
 5         13 19.943240 11.321364
 6         15  4.861821  8.749336
-~~~~~~~~
+```
 
Now make a plot that shows the 1999 state-wide means in one "column" and the 2012 state-wide means in another column. We then draw a line connecting the means for each year in the same state to highlight the trend.
 
-{line-numbers=off}
-~~~~~~~~
+```r
> par(mfrow = c(1, 1))
> rng <- range(mrg[,2], mrg[,3])
> with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5), ylim = rng, xaxt = "n", xlab = "", ylab = "State-wide Mean PM"))
> with(mrg, points(rep(2, 52), mrg[, 3]))
> segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])
> axis(1, c(1, 2), c("1999", "2012"))
-~~~~~~~~
+```
 
![plot of chunk unnamed-chunk-14](images/unnamed-chunk-14-1.png)

From 8685a78ee3bdf04ac612f0103781e4fd003b7119 Mon Sep 17 00:00:00 2001
From: Richard
Date: Sat, 6 Nov 2021 20:26:32 -0400
Subject: [PATCH 22/24] example: fix prop calculation

---
 manuscript/example.Rmd | 4 ++--
 manuscript/example.md  | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/manuscript/example.Rmd b/manuscript/example.Rmd
index 202ac5e..3b0df0d 100644
--- a/manuscript/example.Rmd
+++ b/manuscript/example.Rmd
@@ -99,11 +99,11 @@ Interestingly, from the summary of `x1` it appears there are some negative value
 
 ```{r check negative values}
 negative <- x1 < 0
-length(negative)/length(x1) # proportion of negative values
+length(negative[negative=="TRUE"])/length(x1) # proportion of negative values
 mean(negative, na.rm = T)
 ```
 
-There is a relatively small proportion of values that are negative (`r length(negative)/length(x1)`), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.
+There is a relatively small proportion of values that are negative (`r round(length(negative[negative=="TRUE"])/length(x1),3)`), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.
```{r converting dates,cache=TRUE} dates <- pm1$Date diff --git a/manuscript/example.md b/manuscript/example.md index 49ad5cb..70f8367 100644 --- a/manuscript/example.md +++ b/manuscript/example.md @@ -134,13 +134,13 @@ Interestingly, from the summary of `x1` it appears there are some negative value ```r > negative <- x1 < 0 -> length(negative)/length(x1) # proportion of negative values -[1] 1 +> length(negative[negative=="TRUE"])/length(x1) # proportion of negative values +[1] 0.07636893 > mean(negative, na.rm = T) [1] 0.0215034 ``` -There is a relatively small proportion of values that are negative (1), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. +There is a relatively small proportion of values that are negative (0.076), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation. ```r From 701ed4de3dbd583e92cb944c74a38b11d4cd2e1c Mon Sep 17 00:00:00 2001 From: Richard Date: Sat, 6 Nov 2021 22:08:40 -0400 Subject: [PATCH 23/24] parallel: update AMD library/link, spelling, syntax --- manuscript/parallel.Rmd | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/manuscript/parallel.Rmd b/manuscript/parallel.Rmd index 25b3eb3..a16c2ca 100644 --- a/manuscript/parallel.Rmd +++ b/manuscript/parallel.Rmd @@ -6,7 +6,7 @@ knitr::opts_chunk$set(comment = NA, prompt = TRUE, collapse = TRUE, fig.path = " Many computations in R can be made faster by the use of parallel computation. Generally, parallel computation is the simultaneous execution of different pieces of a larger computation across multiple computing processors or cores. The basic idea is that if you can execute a computation in $X$ seconds on a single processor, then you should be able to execute it in $X/n$ seconds on $n$ processors. Such a speed-up is generally not possible because of overhead and various barriers to splitting up a problem into $n$ pieces, but it is often possible to come close in simple problems. -It used to be that parallel computation was squarely in the domain of "high-performance computing", where expensive machines were linked together via high-speed networking to create large clusters of computers. In those kinds of settings, it was important to have sophisticated software to manage the communication of data between different computers in the cluster. Parallel computing in that setting was a highly tuned, and carefully customized operation and not something you could just saunter into. +It used to be that parallel computation was squarely in the domain of "high-performance computing", where expensive machines were linked together via high-speed networking to create large clusters of computers. In those kinds of settings, it was important to have sophisticated software to manage the communication of data between different computers in the cluster. 
Parallel computing in that setting was a highly tuned, carefully customized operation, and not something you could just saunter into.
 
These days though, almost all computers contain multiple processors or cores on them. Even Apple's iPhone 6S comes with a [dual-core CPU](https://en.wikipedia.org/wiki/Apple_A9) as part of its A9 system-on-a-chip. Getting access to a "cluster" of CPUs, in this case all built into the same computer, is much easier than it used to be and this has opened the door to parallel computing for a wide range of people.
 
@@ -18,7 +18,7 @@ You may be computing in parallel without even knowing it! These days, many compu
 
### Parallel BLAS
 
-A common example in R is the use of linear algebra functions. Some versions of R that you use may be linked to on optimized Basic Linear Algebra Subroutines (BLAS) library. Such libraries are custom coded for specific CPUs/chipsets to take advantage of the architecture of the chip. It's important to realize that while R can do linear algebra out of the box, its default BLAS library is a *reference implementation* that is not necessarily optimized to any particular chipset.
+A common example in R is the use of linear algebra functions. Some versions of R that you use may be linked to an optimized Basic Linear Algebra Subroutines (BLAS) library. Such libraries are custom coded for specific CPUs/chipsets to take advantage of the architecture of the chip. It's important to realize that while R can do linear algebra out of the box, its default BLAS library is a *reference implementation* that is not necessarily optimized to any particular chipset.
 
 When possible, it's always a good idea to install an optimized BLAS on your system because it can dramatically improve the performance of those kinds of computations. Part of the increase in performance comes from the customization of the code to a particular chipset while part of it comes from the multi-threading that many libraries use to parallelize their computations.
 
@@ -44,22 +44,22 @@ Here, you can see that the `user` time is just under 1 second while the `elapsed
 
 Here's a summary of some of the optimized BLAS libraries out there:
 
-* The [AMD Core Math Library](http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/) (ACML) is built for AMD chips and contains a full set of BLAS and LAPACK routines. The library is closed-source and is maintained/released by AMD.
+* The [AMD Optimizing CPU Libraries](http://developer.amd.com/tools-and-sdks/archive/amd-core-math-library-acml/) (AOCL) is built for AMD chips and contains BLIS (BLAS-like), LAPACK, and FFT routines. The library is mostly open-source and is maintained/released by AMD.
 
-* The [Intel Math Kernel](https://software.intel.com/en-us/intel-mkl) is an analogous optimized library for Intel-based chips
+* The [Intel Math Kernel Library](https://software.intel.com/en-us/intel-mkl) (MKL) is an analogous optimized library for Intel-based chips.
 
 * The [Accelerate framework](https://developer.apple.com/library/tvos/documentation/Accelerate/Reference/AccelerateFWRef/index.html) on the Mac contains an optimized BLAS built by Apple.
 
 * The [Automatically Tuned Linear Algebra Software](http://math-atlas.sourceforge.net) (ATLAS) library is a special "adaptive" software package that is designed to be compiled on the computer where it will be used. As part of the build process, the library extracts detailed CPU information and optimizes the code as it goes along. The ATLAS library is hence a generic package that can be built on a wider array of CPUs.
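
If you are unsure which BLAS/LAPACK libraries your R installation is actually linked against, recent versions of R can report this directly -- a quick check, noting that the exact functions available and the output fields depend on your R version and platform:

```{r check blas, eval = FALSE}
## Report the LAPACK version R was built against
La_version()
## In recent versions of R, sessionInfo() also lists the BLAS and
## LAPACK shared libraries currently in use
sessionInfo()
```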
-Detailed instructions on how to use R with optimized BLAS libraries can be found in the [R Installation and Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS) manual. In some cases, you may need to build R from the sources in order to link it with the optimized BLAS library. +Detailed instructions on how to use R with optimized BLAS libraries can be found in the [R Installation and Administration](https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS) manual. In some cases, you may need to build R from source in order to link it with an optimized BLAS library. ## Embarrassing Parallelism -Many problems in statistics and data science can be executed in an "embarrassingly parallel" way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). Despite the name, there's nothing really "embarrassing" about taking advantage of the structure of the problem and using it speed up your computation. In fact, embarrassingly parallel computation is a common paradigm in statistics and data science. +Many problems in statistics and data science can be executed in an "embarrassingly parallel" way, whereby multiple independent pieces of a problem are executed simultaneously because the different pieces of the problem never really have to communicate with each other (except perhaps at the end when all the results are assembled). Despite the name, there's nothing really "embarrassing" about taking advantage of the structure of the problem and using it to speed up your computation. In fact, embarrassingly parallel computation is a common paradigm in statistics and data science. -A> In general, it is NOT a good idea to use the functions described in this chapter with graphical user interfaces (GUIs) because, to summarize the help page for `mclapply()`, bad things can happen. That said, the functions in the `parallel` package seem two work okay in RStudio. +A> In general, it is NOT a good idea to use the functions described in this chapter with graphical user interfaces (GUIs) because, to summarize the help page for `mclapply()`, bad things can happen. That said, the functions in the `parallel` package seem to work okay in RStudio. The basic mode of an embarrassingly parallel operation can be seen with the `lapply()` function, which we have reviewed in a [previous chapter](#loop-functions). Recall that the `lapply()` function has two arguments: @@ -69,7 +69,7 @@ The basic mode of an embarrassingly parallel operation can be seen with the `lap Finally, recall that `lapply()` always returns a list whose length is equal to the length of the input list. -The `lapply()` function works much like a loop--it cycles through each element of the list and applies the supplied function to that element. While `lapply()` is applying your function to a list element, the other elements of the list are just...sitting around in memory. Note that in the description of `lapply()` above, there's no mention of the different elements of the list communicating with each other, and the function being applied to a given list element does not need to know about other list elements. +The `lapply()` function works much like a loop -- it cycles through each element of the list and applies the supplied function to that element. 
While `lapply()` is applying your function to a list element, the other elements of the list are just... sitting around in memory. Note that in the description of `lapply()` above, there's no mention of the different elements of the list communicating with each other, and the function being applied to a given list element does not need to know about other list elements.
 
 Just about any operation that is handled by the `lapply()` function can be parallelized. This approach is analogous to the ["map-reduce"](https://en.wikipedia.org/wiki/MapReduce) approach in large-scale cluster systems. The idea is that a list object can be split across multiple cores of a processor and then the function can be applied to each subset of the list object on each of the cores. Conceptually, the steps in the parallel procedure are
 
@@ -85,9 +85,9 @@ The differences between the many packages/functions in R essentially come down t
 
 ## The Parallel Package
 
-The `parallel` package which comes with your R installation. It represents a combining of two historical packages--the `multicore` and `snow` packages, and the functions in `parallel` have overlapping names with those older packages. For our purposes, it's not necessary to know anything about the `multicore` or `snow` packages, but long-time users of R may remember them from back in the day.
+The `parallel` package comes with your R installation. It represents a combination of two historical packages -- the `multicore` and `snow` packages -- and the functions in `parallel` have overlapping names with those older packages. For our purposes, it's not necessary to know anything about the `multicore` or `snow` packages, but long-time users of R may remember them from back in the day.
 
-The `mclapply()` function essentially parallelizes calls to `lapply()`. The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. However, `mclapply()` has further arguments (that must be named), the most important of which is the `mc.cores` argument which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores on it, you might specify `mc.cores = 4` to break your parallelize your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R).
+The `mclapply()` function essentially parallelizes calls to `lapply()`. The first two arguments to `mclapply()` are exactly the same as they are for `lapply()`. However, `mclapply()` has further arguments (that must be named), the most important of which is the `mc.cores` argument which you can use to specify the number of processors/cores you want to split the computation across. For example, if your machine has 4 cores on it, you might specify `mc.cores = 4` to split (parallelize) your operation across 4 cores (although this may not be the best idea if you are running other operations in the background besides R).
 
 The `mclapply()` function (and related `mc*` functions) works via the fork mechanism on Unix-style operating systems. Briefly, your R session is the main process and when you call a function like `mclapply()`, you fork a series of sub-processes that operate independently from the main process (although they share a few low-level features). These sub-processes then execute your function on their subsets of the data, presumably on separate cores of your CPU. Once the computation is complete, each sub-process returns its results and then the sub-process is killed.
The `parallel` package manages the logistics of forking the sub-processes and handling them once they've finished.
 
@@ -120,9 +120,9 @@ r <- mclapply(1:10, function(i) {
 
 While this "job" was running, I took a screen shot of the system activity monitor ("top"). Here's what it looks like on Mac OS X.
 
-![Multiple sub-processes spawned by `mclapply()`](images/topparallel.png)
+![Multiple `rsession` sub-processes are spawned by `mclapply()`.](images/topparallel.png)
 
-In case you are not used to viewing this output, each row of the table is an application or process running on your computer. You can see that there are 11 rows where the COMMAND is labelled `rsession`. One of these is my primary R session (being run through RStudio), and the other 10 are the sub-processes spawned by the `mclapply()` function.
+In case you are not used to viewing this output, each row of the table is an application or process running on your computer. You can see that there are 11 rows where the COMMAND is labeled `rsession`. One of these is my primary R session (being run through RStudio), and the other 10 are the sub-processes spawned by the `mclapply()` function.
 
 As a second (slightly more realistic) example, we will process data from multiple files. Often this is something that can be easily parallelized.
 
@@ -182,7 +182,7 @@ When either `mclapply()` or `mcmapply()` are called, the functions supplied will
 
 This error handling behavior is a significant difference from the usual call to `lapply()`. With `lapply()`, if the supplied function fails on one component of the list, the entire function call to `lapply()` fails and you only get an error as a result.
 
-With `mclapply()`, when a sub-process fails, the return value for that sub-process will be an R object that inherits from the class `"try-error"`, which is something you can test with the `inherits()` function. Conceptually, each child process is executed with the `try()` function wrapped around it. The code below deliberately causes an error in the 3 element of the list.
+With `mclapply()`, when a sub-process fails, the return value for that sub-process will be an R object that inherits from the class `"try-error"`, which is something you can test with the `inherits()` function. Conceptually, each child process is executed with the `try()` function wrapped around it. The code below deliberately causes an error in the 3rd element of the list.
 
 ```{r}
 r <- mclapply(1:5, function(i) {
@@ -240,7 +240,7 @@ We can see from the histogram that the distribution of sulfate is skewed to the
 
 summary(sulf)
 ```
 
-How can we construct confidence interval for the median of sulfate for this monitor? The bootstrap is simple procedure that can work well. Here's how we might do it in the usual (non-parallel) way.
+How can we construct a confidence interval for the median of sulfate for this monitor? The bootstrap is a simple procedure that can work well. Here's how we might do it in the usual (non-parallel) way.
 
 ```{r}
 set.seed(1)
@@ -256,7 +256,7 @@ A 95% confidence interval would then take the 2.5th and 97.5th percentiles of th
 
 quantile(med.boot, c(0.025, 0.975))
 ```
 
-How could be done in parallel? We could simply wrap the expression passed to `replicate()` in a function and pass it to `mclapply()`. However, one thing we need to be careful of is generating random numbers.
+How could this be done in parallel? We could simply wrap the expression passed to `replicate()` in a function and pass it to `mclapply()`.
However, one thing we need to be careful of is generating random numbers.
 
### Generating Random Numbers

From 8a9d2bf0747b8438f98c567e1407e882656c61be Mon Sep 17 00:00:00 2001
From: Richard
Date: Sat, 6 Nov 2021 22:48:13 -0400
Subject: [PATCH 24/24] revert to previous .md versions

---
 manuscript/apply.md          | 278 ++++++++++++++++++------------
 manuscript/control.md        | 104 ++++++-----
 manuscript/debugging.md      |  96 ++++++-----
 manuscript/dplyr.md          | 310 ++++++++++++++++++---------------
 manuscript/example.md        | 177 +++++++++++--------
 manuscript/functions.md      | 148 +++++++++-------
 manuscript/gettingstarted.md |   4 +-
 manuscript/nutsbolts.md      | 136 +++++++++------
 manuscript/overview.md       |  16 +-
 manuscript/profiler.md       |  75 ++++----
 manuscript/readwritedata.md  | 323 +++++++++++++++++++----------------
 manuscript/regex.md          | 250 ++++++++++++++++-----------
 manuscript/scoping.md        | 113 +++++++-----
 manuscript/simulation.md     | 153 +++++++++--------
 manuscript/vectorized.md     |  32 ++--
 15 files changed, 1279 insertions(+), 936 deletions(-)

diff --git a/manuscript/apply.md b/manuscript/apply.md
index ced9e33..60d71cf 100644
--- a/manuscript/apply.md
+++ b/manuscript/apply.md
@@ -37,7 +37,8 @@ This function takes three arguments: (1) a list `X`; (2) a function (or the name
 
 The body of the `lapply()` function can be seen here.
 
-```r
+{line-numbers=off}
+~~~~~~~~
 > lapply
 function (X, FUN, ...) 
 {
@@ -46,19 +47,20 @@ function (X, FUN, ...) 
     X <- as.list(X)
     .Internal(lapply(X, FUN))
 }
- 
+ 
-```
+~~~~~~~~

Note that the actual looping is done internally in C code for efficiency reasons. It's important to remember that `lapply()` always returns a list, regardless of the class of the input.

-Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output.
+Here's an example of applying the `mean()` function to all elements of a list. If the original list has names, then the names will be preserved in the output.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list(a = 1:5, b = rnorm(10))
> lapply(x, mean)
$a
@@ -66,7 +68,7 @@ $a

$b
[1] 0.1322028
-```
+~~~~~~~~

Notice that here we are passing the `mean()` function as an argument to the `lapply()` function. Functions in R can be used this way and can be passed back and forth as arguments just like any other object. When you pass a function to another function, you do not need to include the open and closed parentheses `()` like you do when you are *calling* a function.

Here is another example of using `lapply()`.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)
$a
@@ -88,13 +91,14 @@ $c

$d
[1] 5.051388
-```
+~~~~~~~~

You can use `lapply()` to evaluate a function multiple times each with a different argument. Below is an example where I call the `runif()` function (to generate uniformly distributed random variables) four times, each time generating a different number of random numbers.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 1:4
> lapply(x, runif)
[[1]]
[1] 0.02778712

[[2]]
[1] 0.5273108 0.8803191

[[3]]
[1] 0.37306337 0.04795913 0.13862825

[[4]]
[1] 0.3214921 0.1548316 0.1322282 0.2213059
-```
+~~~~~~~~

When you pass a function to `lapply()`, `lapply()` takes elements of the list and passes them as the *first argument* of the function you are applying.
In the above example, the first argument of `runif()` is `n`, and so the elements of the sequence `1:4` all got passed to the `n` argument of `runif()`.

@@ -119,7 +123,8 @@ Here is where the `...` argument to `lapply()` comes into play. Any arguments th

Here, the `min = 0` and `max = 10` arguments are passed down to `runif()` every time it gets called.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 1:4
> lapply(x, runif, min = 0, max = 10)
[[1]]
[1] 2.263808

[[2]]
[1] 1.314165 9.815635

[[3]]
[1] 3.270137 5.069395 6.814425

[[4]]
[1] 0.9916910 1.1890256 0.5043966 9.2925392
-```
+~~~~~~~~

-So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.
+So now, instead of the random numbers being between 0 and 1 (the default), they are all between 0 and 10.

The `lapply()` function and its friends make heavy use of _anonymous_ functions. Anonymous functions are like members of [Project Mayhem](http://en.wikipedia.org/wiki/Fight_Club)---they have no names. These functions are generated "on the fly" as you are using `lapply()`. Once the call to `lapply()` is finished, the function disappears and does not appear in the workspace.

Here I am creating a list that contains two matrices.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
> x
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
-```
+~~~~~~~~

Suppose I wanted to extract the first column of each matrix in the list. I could write an anonymous function for extracting the first column of each matrix.

-```r
+{line-numbers=off}
+~~~~~~~~
> lapply(x, function(elt) { elt[,1] })
$a
[1] 1 2

$b
[1] 1 2 3
-```
+~~~~~~~~

Notice that I put the `function()` definition right in the call to `lapply()`. This is perfectly legal and acceptable. You can put an arbitrarily complicated function definition inside `lapply()`, but if it's going to be more complicated, it's probably a better idea to define the function separately.

For example, I could have done the following.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function(elt) {
+         elt[, 1]
+ }
> lapply(x, f)
$a
[1] 1 2

$b
[1] 1 2 3
-```
+~~~~~~~~

Now the function is no longer anonymous; its name is `f`. Whether you use an anonymous function or you define a function first depends on your context. If you think the function `f` is something you're going to need a lot in other parts of your code, you might want to define it separately. But if you're just going to use it for this call to `lapply()`, then it's probably simpler to use an anonymous function.

@@ -203,7 +211,8 @@ The `sapply()` function behaves similarly to `lapply()`; the only real differenc

Here's the result of calling `lapply()`.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
> lapply(x, mean)
$a
@@ -217,18 +226,19 @@ $c

$d
[1] 4.968715
-```
+~~~~~~~~

Notice that `lapply()` returns a list (as usual), but that each element of the list has length 1.

Here's the result of calling `sapply()` on the same list.

-```r
+{line-numbers=off}
+~~~~~~~~
> sapply(x, mean)
       a        b        c        d 
2.500000 -0.251483 1.481246 4.968715 
-```
+~~~~~~~~

Because the result of `lapply()` was a list where each element had length 1, `sapply()` collapsed the output into a numeric vector, which is often more useful than a list.
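
One caveat with `sapply()` is that it guesses how to simplify the result. If you want to be strict about the type of the output, the related base R function `vapply()` lets you declare what each element of the result should look like -- a small sketch using the list `x` from above:

{line-numbers=off}
~~~~~~~~
> ## Like sapply(), but we declare that each result must be
> ## a single numeric value
> vapply(x, mean, numeric(1))
~~~~~~~~

If any element of the result does not match the declared template, `vapply()` signals an error rather than silently falling back to a list.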
@@ -243,10 +253,11 @@ The `split()` function takes a vector or other objects and splits it into groups The arguments to `split()` are -```r +{line-numbers=off} +~~~~~~~~ > str(split) function (x, f, drop = FALSE, ...) -``` +~~~~~~~~ where @@ -254,29 +265,34 @@ where - `f` is a factor (or coerced to one) or a list of factors - `drop` indicates whether empty factors levels should be dropped -The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. +The combination of `split()` and a function like `lapply()` or `sapply()` is a common paradigm in R. The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets. The results of applying tha function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as "map-reduce" in other contexts. Here we simulate some data and split it according to a factor variable. Note that we use the `gl()` function to "generate levels" in a factor variable. -```r +{line-numbers=off} +~~~~~~~~ > x <- c(rnorm(10), runif(10), rnorm(10, 1)) > f <- gl(3, 10) > split(x, f) $`1` - [1] 0.3981302 -0.4075286 1.3242586 -0.7012317 -0.5806143 -1.0010722 -0.6681786 0.9451850 0.4337021 1.0051592 + [1] 0.3981302 -0.4075286 1.3242586 -0.7012317 -0.5806143 -1.0010722 + [7] -0.6681786 0.9451850 0.4337021 1.0051592 $`2` - [1] 0.34822440 0.94893818 0.64667919 0.03527777 0.59644846 0.41531800 0.07689704 0.52804888 0.96233331 0.70874005 + [1] 0.34822440 0.94893818 0.64667919 0.03527777 0.59644846 0.41531800 + [7] 0.07689704 0.52804888 0.96233331 0.70874005 $`3` - [1] 1.13444766 1.76559900 1.95513668 0.94943430 0.69418458 1.89367370 -0.04729815 2.97133739 0.61636789 2.65414530 -``` + [1] 1.13444766 1.76559900 1.95513668 0.94943430 0.69418458 + [6] 1.89367370 -0.04729815 2.97133739 0.61636789 2.65414530 +~~~~~~~~ A common idiom is `split` followed by an `lapply`. -```r +{line-numbers=off} +~~~~~~~~ > lapply(split(x, f), mean) $`1` [1] 0.07478098 @@ -286,12 +302,13 @@ $`2` $`3` [1] 1.458703 -``` +~~~~~~~~ ## Splitting a Data Frame -```r +{line-numbers=off} +~~~~~~~~ > library(datasets) > head(airquality) Ozone Solar.R Wind Temp Month Day @@ -301,13 +318,14 @@ $`3` 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6 -``` +~~~~~~~~ We can split the `airquality` data frame by the `Month` variable so that we have separate sub-data frames for each month. -```r +{line-numbers=off} +~~~~~~~~ > s <- split(airquality, airquality$Month) > str(s) List of 5 @@ -346,12 +364,13 @@ List of 5 ..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ... ..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ... ..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ... -``` +~~~~~~~~ Then we can take the column means for `Ozone`, `Solar.R`, and `Wind` for each sub-data frame. -```r +{line-numbers=off} +~~~~~~~~ > lapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")]) + }) @@ -374,12 +393,13 @@ $`8` $`9` Ozone Solar.R Wind NA 167.4333 10.1800 -``` +~~~~~~~~ Using `sapply()` might be better here for a more readable output. 
-```r +{line-numbers=off} +~~~~~~~~ > sapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")]) + }) @@ -387,12 +407,13 @@ Using `sapply()` might be better here for a more readable output. Ozone NA NA NA NA NA Solar.R NA 190.16667 216.483871 NA 167.4333 Wind 11.62258 10.26667 8.941935 8.793548 10.1800 -``` +~~~~~~~~ Unfortunately, there are `NA`s in the data so we cannot simply take the means of those variables. However, we can tell the `colMeans` function to remove the `NA`s before computing the mean. -```r +{line-numbers=off} +~~~~~~~~ > sapply(s, function(x) { + colMeans(x[, c("Ozone", "Solar.R", "Wind")], + na.rm = TRUE) @@ -401,12 +422,13 @@ Unfortunately, there are `NA`s in the data so we cannot simply take the means of Ozone 23.61538 29.44444 59.115385 59.961538 31.44828 Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333 Wind 11.62258 10.26667 8.941935 8.793548 10.18000 -``` +~~~~~~~~ Occasionally, we may want to split an R object according to levels defined in more than one variable. We can do this by creating an interaction of the variables with the `interaction()` function. -```r +{line-numbers=off} +~~~~~~~~ > x <- rnorm(10) > f1 <- gl(2, 5) > f2 <- gl(5, 2) @@ -420,12 +442,13 @@ Levels: 1 2 3 4 5 > interaction(f1, f2) [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5 Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5 -``` +~~~~~~~~ With multiple factors and many levels, creating an interaction can result in many levels that are empty. -```r +{line-numbers=off} +~~~~~~~~ > str(split(x, list(f1, f2))) List of 10 $ 1.1: num [1:2] 1.512 0.083 @@ -438,12 +461,13 @@ List of 10 $ 2.4: num [1:2] 0.0991 -0.4541 $ 1.5: num(0) $ 2.5: num [1:2] -0.6558 -0.0359 -``` +~~~~~~~~ Notice that there are 4 categories with no data. But we can drop empty levels when we call the `split()` function. -```r +{line-numbers=off} +~~~~~~~~ > str(split(x, list(f1, f2), drop = TRUE)) List of 6 $ 1.1: num [1:2] 1.512 0.083 @@ -452,7 +476,7 @@ List of 6 $ 2.3: num 1.04 $ 2.4: num [1:2] 0.0991 -0.4541 $ 2.5: num [1:2] -0.6558 -0.0359 -``` +~~~~~~~~ ## tapply @@ -462,10 +486,11 @@ List of 6 `tapply()` is used to apply a function over subsets of a vector. It can be thought of as a combination of `split()` and `sapply()` for vectors only. I've been told that the "t" in `tapply()` refers to "table", but that is unconfirmed. -```r +{line-numbers=off} +~~~~~~~~ > str(tapply) -function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE) -``` +function (X, INDEX, FUN = NULL, ..., simplify = TRUE) +~~~~~~~~ The arguments to `tapply()` are as follows: @@ -478,7 +503,8 @@ The arguments to `tapply()` are as follows: Given a vector of numbers, one simple operation is to take group means. -```r +{line-numbers=off} +~~~~~~~~ > ## Simulate some data > x <- c(rnorm(10), runif(10), rnorm(10, 1)) > ## Define some groups with a factor variable @@ -489,12 +515,13 @@ Levels: 1 2 3 > tapply(x, f, mean) 1 2 3 0.1896235 0.5336667 0.9568236 -``` +~~~~~~~~ We can also take the group means without simplifying the result, which will give us a list. For functions that return a single value, usually, this is not what we want, but it can be done. -```r +{line-numbers=off} +~~~~~~~~ > tapply(x, f, mean, simplify = FALSE) $`1` [1] 0.1896235 @@ -504,13 +531,14 @@ $`2` $`3` [1] 0.9568236 -``` +~~~~~~~~ We can also apply functions that return more than a single value. In this case, `tapply()` will not simplify the result and will return a list. Here's an example of finding the range of each sub-group. 
-```r
+{line-numbers=off}
+~~~~~~~~
> tapply(x, f, range)
$`1`
[1] -1.869789  1.497041

$`2`
[1] 0.03527777 0.96233331

$`3`
[1] -0.5690822  2.3644349
-```
+~~~~~~~~

## `apply()`

The `apply()` function is used to evaluate a function (often an anonymous one) over the margins of an array.

-```r
+{line-numbers=off}
+~~~~~~~~
> str(apply)
-function (X, MARGIN, FUN, ..., simplify = TRUE)
+function (X, MARGIN, FUN, ...)
-```
+~~~~~~~~

The arguments to `apply()` are

@@ -547,20 +576,25 @@ The arguments to `apply()` are

Here I create a 20 by 10 matrix of Normal random numbers. I then compute the mean of each column.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean)  ## Take the mean of each column
- [1] 0.02218266 -0.15932850 0.09021391 0.14723035 -0.22431309 -0.49657847 0.30095015 0.07703985 -0.20818099 0.06809774
+ [1] 0.02218266 -0.15932850 0.09021391 0.14723035 -0.22431309
+ [6] -0.49657847 0.30095015 0.07703985 -0.20818099 0.06809774
-```
+~~~~~~~~

I can also compute the sum of each row.

-```r
+{line-numbers=off}
+~~~~~~~~
> apply(x, 1, sum)  ## Take the sum of each row
- [1] -0.48483448 5.33222301 -3.33862932 -1.39998450 2.37859098 0.01082604 -6.29457190 -0.26287700 0.71133578 -3.38125293 -4.67522818 3.01900232
-[13] -2.39466347 -2.16004389 5.33063755 -2.92024635 3.52026401 -1.84880901 -4.10213912 5.30667310
+ [1] -0.48483448 5.33222301 -3.33862932 -1.39998450 2.37859098
+ [6] 0.01082604 -6.29457190 -0.26287700 0.71133578 -3.38125293
+[11] -4.67522818 3.01900232 -2.39466347 -2.16004389 5.33063755
+[16] -2.92024635 3.52026401 -1.84880901 -4.10213912 5.30667310
-```
+~~~~~~~~

Note that in both calls to `apply()`, the return value was a vector of numbers.

You've probably noticed that the second argument is either a 1 or a 2, depending on whether we want column statistics or row statistics. What exactly *is* the second argument to `apply()`?

The `MARGIN` argument essentially indicates to `apply()` which dimension of the array you want to preserve or retain. So when taking the mean of each column, I specify

-```r
+{line-numbers=off}
+~~~~~~~~
> apply(x, 2, mean)
-```
+~~~~~~~~

because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row sums, I run

-```r
+{line-numbers=off}
+~~~~~~~~
> apply(x, 1, sum)
-```
+~~~~~~~~

because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).

@@ -599,42 +635,51 @@ The shortcut functions are heavily optimized and hence are _much_ faster, but yo

You can do more than take sums and means with the `apply()` function. For example, you can compute quantiles of the rows of a matrix using the `quantile()` function.
-```r
+{line-numbers=off}
+~~~~~~~~
> x <- matrix(rnorm(200), 20, 10)
> ## Get row quantiles
> apply(x, 1, quantile, probs = c(0.25, 0.75))
-          [,1]       [,2]      [,3]       [,4]       [,5]        [,6]       [,7]       [,8]       [,9]     [,10]      [,11]      [,12]      [,13]
-25% -1.0884151 -0.6693040 0.2908481 -0.4602083 -1.0432010 -1.12773555 -1.4571706 -0.2406991 -0.3226845 -0.329898 -0.8677524 -0.2023664 -0.9796050
-75%  0.1843547  0.8210295 1.3667301  0.4424153  0.3571219  0.03653687 -0.1705336  0.6504486  1.1460854  1.247092  0.4138139  0.9145331  0.5448777
-         [,14]      [,15]        [,16]      [,17]      [,18]      [,19]      [,20]
-25% -1.3551031 -0.1823252 -1.260911898 -0.9954289 -0.3767354 -0.8557544 -0.7000363
-75% -0.5396766  0.7795571  0.002908451  0.4323192  0.7542638  0.5440158  0.5432995
+          [,1]       [,2]      [,3]       [,4]       [,5]        [,6]
+25% -1.0884151 -0.6693040 0.2908481 -0.4602083 -1.0432010 -1.12773555
+75%  0.1843547  0.8210295 1.3667301  0.4424153  0.3571219  0.03653687
+          [,7]       [,8]       [,9]     [,10]      [,11]      [,12]
+25% -1.4571706 -0.2406991 -0.3226845 -0.329898 -0.8677524 -0.2023664
+75% -0.1705336  0.6504486  1.1460854  1.247092  0.4138139  0.9145331
+         [,13]      [,14]      [,15]        [,16]      [,17]      [,18]
+25% -0.9796050 -1.3551031 -0.1823252 -1.260911898 -0.9954289 -0.3767354
+75%  0.5448777 -0.5396766  0.7795571  0.002908451  0.4323192  0.7542638
+         [,19]      [,20]
+25% -0.8557544 -0.7000363
+75%  0.5440158  0.5432995
-```
+~~~~~~~~

Notice that I had to pass the `probs = c(0.25, 0.75)` argument to `quantile()` via the `...` argument to `apply()`.

-For a higher dimensional example, I can create an array of $2\times2$ matrices and the compute the average of the matrices in the array.
+For a higher dimensional example, I can create an array of {$$}2\times2{/$$} matrices and then compute the average of the matrices in the array.

-```r
+{line-numbers=off}
+~~~~~~~~
> a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
> apply(a, c(1, 2), mean)
          [,1]       [,2]
[1,] 0.1681387 -0.1039673
[2,] 0.3519741 -0.4029737
-```
+~~~~~~~~

In the call to `apply()` here, I indicated via the `MARGIN` argument that I wanted to preserve the first and second dimensions and to collapse the third dimension by taking the mean.

There is a faster way to do this specific operation via the `rowMeans()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> rowMeans(a, dims = 2)    ## Faster
          [,1]       [,2]
[1,] 0.1681387 -0.1039673
[2,] 0.3519741 -0.4029737
-```
+~~~~~~~~

In this situation, I might argue that the use of `rowMeans()` is less readable, but it is substantially faster with large arrays.

## `mapply()`

The `mapply()` function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that `lapply()` and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what `mapply()` is for.

-```r
+{line-numbers=off}
+~~~~~~~~
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
-```
+~~~~~~~~

The arguments to `mapply()` are

@@ -667,7 +713,8 @@ For example, the following is tedious to type

With `mapply()`, instead we can do

-```r
+{line-numbers=off}
+~~~~~~~~
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
-```
+~~~~~~~~

This passes the sequence `1:4` to the first argument of `rep()` and the sequence `4:1` to the second argument.

-Here's another example for simulating random Normal variables.
+Here's another example for simulating random Normal variables.
-```r
+{line-numbers=off}
+~~~~~~~~
> noise <- function(n, mean, sd) {
+       rnorm(n, mean, sd)
+ }
-> ## Simulate 5 random numbers
+> ## Simulate 5 random numbers
> noise(5, 1, 2)
[1] -0.5196913 3.2979182 -0.6849525 1.7828267 2.7827545
> 
> ## This only simulates 1 set of numbers, not 5
> noise(1:5, 1:5, 2)
[1] -1.670517 2.796247 2.776826 5.351488 3.422804
-```
+~~~~~~~~

Here we can use `mapply()` to pass the sequence `1:5` separately to the `noise()` function so that we can get 5 sets of random numbers, each with a different length and mean.

-```r
+{line-numbers=off}
+~~~~~~~~
> mapply(noise, 1:5, 1:5, 2)
[[1]]
[1] 0.8260273

[[2]]
[1] 4.764568 2.336980

[[3]]
[1] 4.6463819 2.5582108 0.9412167

[[4]]
[1] 3.978149 1.550018 -0.1523958 2.173207

[[5]]
[1] 2.826182 1.347834 6.990564 4.976276 3.800743
-```
+~~~~~~~~

The above call to `mapply()` is the same as

-```r
+{line-numbers=off}
+~~~~~~~~
> list(noise(1, 1, 2), noise(2, 2, 2),
+      noise(3, 3, 2), noise(4, 4, 2),
+      noise(5, 5, 2))
[[1]]
[1] 0.644104

[[2]]
[1] 1.148037 3.993318

[[3]]
[1] 4.4553214 -0.4532612 3.7067970

[[4]]
[1] 5.4536273 5.3365220 -0.8486346 3.5292851

[[5]]
[1] 8.959267 6.593589 1.581448 1.672663 5.982219
-```
+~~~~~~~~

## Vectorizing a Function

The `mapply()` function can be used to automatically "vectorize" a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions.

-Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is $\sum_{i=1}^n(x_i-\mu)^2/\sigma^2$.
+Here's an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation. The formula is {$$}\sum_{i=1}^n(x_i-\mu)^2/\sigma^2{/$$}.

-```r
+{line-numbers=off}
+~~~~~~~~
> sumsq <- function(mu, sigma, x) {
+         sum(((x - mu) / sigma)^2)
+ }
-```
+~~~~~~~~

This function takes a mean `mu`, a standard deviation `sigma`, and some data in a vector `x`.

In many statistical applications, we want to minimize the sum of squares to find the optimal `mu` and `sigma`. Before we do that, we may want to evaluate or plot the function for many different values of `mu` or `sigma`. However, passing a vector of `mu`s or `sigma`s won't work with this function because it's not vectorized.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- rnorm(100)       ## Generate some data
> sumsq(1:10, 1:10, x)  ## This is not what we want
[1] 110.2594
-```
+~~~~~~~~

Note that the call to `sumsq()` only produced one value instead of 10 values. However, we can do what we want to do by using `mapply()`.

-```r
+{line-numbers=off}
+~~~~~~~~
> mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x))
- [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 100.3745 100.1685 100.0332
+ [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998
+ [8] 100.3745 100.1685 100.0332
-```
+~~~~~~~~

There's even a function in R called `Vectorize()` that automatically can create a vectorized version of your function. So we could create a `vsumsq()` function that is fully vectorized as follows.

-```r
+{line-numbers=off}
+~~~~~~~~
> vsumsq <- Vectorize(sumsq, c("mu", "sigma"))
> vsumsq(1:10, 1:10, x)
- [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998 100.3745 100.1685 100.0332
+ [1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998
+ [8] 100.3745 100.1685 100.0332
-```
+~~~~~~~~

Pretty cool, right?

## Summary
@@ -796,9 +852,9 @@

* The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form

-* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and the collating the results and returning the collated results.
+* The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.

-* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere
+* Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere.

-* The `split()` function can be used to divide an R object in to subsets determined by another variable which can subsequently be looped over using loop functions.
+* The `split()` function can be used to divide an R object into subsets determined by another variable which can subsequently be looped over using loop functions.

diff --git a/manuscript/control.md b/manuscript/control.md
index 029df02..17609a9 100644
--- a/manuscript/control.md
+++ b/manuscript/control.md
@@ -26,7 +26,7 @@ Commonly used control structures are

- `next`: skip an iteration of a loop

Most control structures are not used in interactive sessions, but
-rather when writing functions or longer expresisons. However, these
+rather when writing functions or longer expressions. However, these
constructs do not have to be used in functions and it's a good idea
to become familiar with them before we delve into functions.

@@ -42,29 +42,33 @@ false.

For starters, you can just use the `if` statement.

-```r
+{line-numbers=off}
+~~~~~~~~
if(<condition>) {
## do something
}
## Continue with rest of code
-```
+~~~~~~~~

The above code does nothing if the condition is false. If you have an action you want to execute when the condition is false, then you need an `else` clause.

-```r
+{line-numbers=off}
+~~~~~~~~
if(<condition>) {
## do something
-}
-else {
+} else {
## do something else
}
-```
+~~~~~~~~

You can have a series of tests by following the initial `if` with any number of `else if`s.

-```r
+{line-numbers=off}
+~~~~~~~~
if(<condition1>) {
## do something
} else if(<condition2>) {
## do something different
} else {
## do something different
}
-```
+~~~~~~~~

Here is an example of a valid if/else structure.

@@ -72,12 +76,13 @@

-```r
+{line-numbers=off}
+~~~~~~~~
## Generate a uniform random number
x <- runif(1, 0, 10)
if(x > 3) {
y <- 10
} else {
y <- 0
}
-```
+~~~~~~~~

The value of `y` is set depending on whether `x > 3` or not. This expression can also be written a different, but equivalent, way in R.

@@ -85,19 +90,20 @@

-```r
+{line-numbers=off}
+~~~~~~~~
y <- if(x > 3) {
10
} else {
0
}
-```
+~~~~~~~~

Neither way of writing this expression is more correct than the other. Which one you use will depend on your preference and perhaps

@@ -107,7 +113,8 @@

Of course, the `else` clause is not necessary. You could have a series of if clauses that always get executed if their respective conditions are true.

-```r
+{line-numbers=off}
+~~~~~~~~
if(<condition1>) {

}

if(<condition2>) {

}
-```
+~~~~~~~~

@@ -115,24 +122,25 @@

## `for` Loops

[Watch a video of this section](https://youtu.be/FbT1dGXCCxU)

-For loops are pretty much the only looping construct that you will
+`for` loops are pretty much the only looping construct that you will
need in R.
While you may occasionally find a need for other types of loops, in my experience doing data analysis, I've found very few situations where a for loop wasn't sufficient.

-In R, for loops take an interator variable and assign it successive
+In R, for loops take an iterator variable and assign it successive
values from a sequence or vector. For loops are most commonly used for
iterating over the elements of an object (list, vector, etc.)

-```r
+{line-numbers=off}
+~~~~~~~~
> for(i in 1:10) {
+ print(i)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
-```
+~~~~~~~~

This loop takes the `i` variable and in each iteration of the loop gives it values 1, 2, 3, ..., 10, executes the code within the curly
braces, and then the loop exits.

The following three loops all have the same behavior.

@@ -155,7 +163,8 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- c("a", "b", "c", "d")
> 
> for(i in 1:4) {
+ ## Print out each element of 'x'
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-```
+~~~~~~~~

The `seq_along()` function is commonly used in conjunction with for loops in order to generate an integer sequence based on the length of
an object (in this case, the object `x`).

@@ -166,14 +175,15 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Generate a sequence based on length of 'x'
> for(i in seq_along(x)) {
+ print(x[i])
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-```
+~~~~~~~~

It is not necessary to use an index-type variable.

@@ -182,12 +192,13 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> for(letter in x) {
+ print(letter)
+ }
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-```
+~~~~~~~~

For one line loops, the curly braces are not strictly necessary.

@@ -195,18 +206,19 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
-```
+~~~~~~~~

However, I like to use curly braces even for one-line loops, because that way if you decide to expand the loop to multiple lines, you won't

burned by this).

`for` loops can be nested inside of each other.

@@ -218,7 +230,8 @@

-```r
+{line-numbers=off}
+~~~~~~~~
x <- matrix(1:6, 2, 3)

for(i in seq_len(nrow(x))) {
for(j in seq_len(ncol(x))) {
print(x[i, j])
}
}
-```
+~~~~~~~~

Nested loops are commonly needed for multidimensional or hierarchical data structures (e.g. matrices, lists). Be careful with nesting

functions (discussed later).

@@ -240,13 +253,14 @@

## `while` Loops

[Watch a video of this section](https://youtu.be/VqrS1Wghq1c)

-While loops begin by testing a condition. If it is true, then they
+`while` loops begin by testing a condition. If it is true, then they
execute the loop body. Once the loop body is executed, the condition
is tested again, and so forth, until the condition is false, after
which the loop exits.

-```r
+{line-numbers=off}
+~~~~~~~~
> count <- 0
> while(count < 10) {
+ print(count)
+ count <- count + 1
+ }
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
-```
+~~~~~~~~

-While loops can potentially result in infinite loops if not written
+`while` loops can potentially result in infinite loops if not written
properly. Use with care!

Sometimes there will be more than one condition in the test.

@@ -262,15 +276,16 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> z <- 5
> set.seed(1)
> 
> while(z >= 3 && z <= 10) {
+ coin <- rbinom(1, 1, 0.5)
+ 
+ if(coin == 1) { ## random walk
+ z <- z + 1
+ } else {
+ z <- z - 1
+ }
+ }
> print(z)
[1] 2
-```
+~~~~~~~~

@@ -285,7 +300,7 @@

Conditions are always evaluated from left to right.
For example, in the above code, if `z` were less than 3, the second test would not
have been evaluated.

@@ -302,14 +317,15 @@ not commonly used in statistical or data analysis applications but
they do have their uses. The only way to exit a `repeat` loop is to call `break`.

One possible paradigm might be in an iterative algorithm where you may
be searching for a solution and you don't want to stop until you're
close enough to the solution. In this kind of situation, you often
don't know in advance how many iterations it's going to take to get
"close enough" to the solution.

-```r
+{line-numbers=off}
+~~~~~~~~
x0 <- 1
tol <- 1e-8

repeat {
x1 <- computeEstimate()

if(abs(x1 - x0) < tol) { ## Close enough?
break
} else {
x0 <- x1
}
}
-```
+~~~~~~~~

@@ -322,7 +338,7 @@

Note that the above code will not run if the `computeEstimate()` function is not defined (I just made it up for the purposes of this

report whether convergence was achieved or not.

@@ -340,7 +356,8 @@

`next` is used to skip an iteration of a loop.

-```r
+{line-numbers=off}
+~~~~~~~~
for(i in 1:100) {
if(i <= 20) {
## Skip the first 20 iterations
next
}
## Do something here
}
-```
+~~~~~~~~

@@ -348,13 +365,14 @@

`break` is used to exit a loop immediately, regardless of what iteration the loop may be on.

-```r
+{line-numbers=off}
+~~~~~~~~
for(i in 1:100) {
print(i)

if(i > 20) {
## Stop loop after 20 iterations
break
}
}
-```
+~~~~~~~~

@@ -363,13 +381,13 @@

## Summary

-- Control structures like `if`, `while`, and `for` allow you to
- control the flow of an R program
+- Control structures, like `if`, `while`, and `for`, allow you to
+ control the flow of an R program.

- Infinite loops should generally be avoided, even if (you believe) they are theoretically correct.

diff --git a/manuscript/debugging.md b/manuscript/debugging.md
index 292a07c..19b2345 100644
--- a/manuscript/debugging.md
+++ b/manuscript/debugging.md
@@ -17,11 +17,12 @@ R has a number of ways to indicate to you that something’s not right. There ar

Here is an example of a warning that you might receive in the course of using R.

-```r
+{line-numbers=off}
+~~~~~~~~
> log(-1)
Warning in log(-1): NaNs produced
[1] NaN
-```
+~~~~~~~~

This warning lets you know that taking the log of a negative number results in a `NaN` value because you can't take the log of negative numbers. Nevertheless, R doesn't give an error, because it has a useful value that it can return, the `NaN` value. The warning is just there to let you know that something unexpected happened. Depending on what you are programming, you may have intentionally taken the log of a negative number in order to move on to another section of code.

@@ -29,7 +30,8 @@ Here is another function that is designed to print a message to the console depe

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage <- function(x) {
+ if(x > 0)
+ print("x is greater than zero")
+ else
+ print("x is less than or equal to zero")
+ invisible(x)
+ }
-```
+~~~~~~~~

This function is simple---it prints a message telling you whether `x` is greater than zero or less than or equal to zero. It also returns its input *invisibly*, which is a common practice with "print" functions. Returning an object invisibly means that the return value does not get auto-printed when the function is called.

Take a hard look at the function above and see if you can identify any bugs or problems.
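(If the behavior of `invisible()` is unfamiliar, here is a tiny self-contained sketch of mine, not from the text, that isolates it:)

{line-numbers=off}
~~~~~~~~
> g <- function(x) {
+ invisible(x)
+ }
> g(5) ## nothing is auto-printed
> y <- g(5) ## but the value is still returned
> y
[1] 5
~~~~~~~~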
We can execute the function as follows.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage(1)
[1] "x is greater than zero"
-```
+~~~~~~~~

The function seems to work fine at this point. No errors, warnings, or messages.

@@ -46,18 +48,20 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage(NA)
Error in if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero"): missing value where TRUE/FALSE needed
-```
+~~~~~~~~

What happened?

@@ -66,7 +70,8 @@ Well, the first thing the function does is test if `x > 0`. But you can't do tha

We can fix this problem by anticipating the possibility of `NA` values and checking to see if the input is `NA` with the `is.na()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage2 <- function(x) {
+ if(is.na(x))
+ print("x is a missing value!")
+ else if(x > 0)
+ print("x is greater than zero")
+ else
+ print("x is less than or equal to zero")
+ invisible(x)
+ }
-```
+~~~~~~~~

@@ -76,29 +81,33 @@

Now we can run the following.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage2(NA)
[1] "x is a missing value!"
-```
+~~~~~~~~

And all is fine. Now what about the following situation.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- log(c(-1, 2))
Warning in log(c(-1, 2)): NaNs produced
> printmessage2(x)
-Warning in if (is.na(x)) print("x is a missing value!") else if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero"): the
-condition has length > 1 and only the first element will be used
+Warning in if (is.na(x)) print("x is a missing value!") else if (x > 0)
+print("x is greater than zero") else print("x is less than or equal to
+zero"): the condition has length > 1 and only the first element will be
+used
[1] "x is a missing value!"
-```
+~~~~~~~~

Now what?? Why are we getting this warning? The warning says "the condition has length > 1 and only the first element will be used".

@@ -109,7 +118,8 @@ We can solve this problem two ways. One is by simply not allowing vector argumen

For the first way, we simply need to check the length of the input.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage3 <- function(x) {
+ if(length(x) > 1L)
+ stop("'x' has length > 1")
+ if(is.na(x))
+ print("x is a missing value!")
+ else if(x > 0)
+ print("x is greater than zero")
+ else
+ print("x is less than or equal to zero")
+ invisible(x)
+ }
-```
+~~~~~~~~

@@ -121,32 +131,34 @@

Now when we pass `printmessage3()` a vector we should get an error.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage3(1:2)
Error in printmessage3(1:2): 'x' has length > 1
-```
+~~~~~~~~

Vectorizing the function can be accomplished easily with the `Vectorize()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> printmessage4 <- Vectorize(printmessage2)
> out <- printmessage4(c(-1, 2))
[1] "x is less than or equal to zero"
[1] "x is greater than zero"
-```
+~~~~~~~~

You can see now that the correct messages are printed without any warning or error. Note that I stored the return value of `printmessage4()` in a separate R object called `out`. This is because when I use the `Vectorize()` function it no longer preserves the invisibility of the return value.
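A third option, sketched here as a hypothetical `printmessage5()` (my variation, not from the text), is to vectorize by hand with `ifelse()`, which evaluates the `NA` check and the sign check for every element at once:

{line-numbers=off}
~~~~~~~~
> printmessage5 <- function(x) {
+ msg <- ifelse(is.na(x), "x is a missing value!",
+ ifelse(x > 0, "x is greater than zero",
+ "x is less than or equal to zero"))
+ print(msg)
+ invisible(x)
+ }
> out <- printmessage5(c(-1, 2, NA))
[1] "x is less than or equal to zero" "x is greater than zero"
[3] "x is a missing value!"
~~~~~~~~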
## Figuring Out What's Wrong

-The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important first understand what you were expecting to occur. Then you need to idenfity what *did* occur and how did it deviate from your expectations. Some basic questions you need to ask are
+The primary task of debugging any R code is correctly diagnosing what the problem is. When diagnosing a problem with your code (or somebody else's), it's important to first understand what you were expecting to occur. Then you need to identify what *did* occur and how it deviated from your expectations. Some basic questions you need to ask are

- What was your input? How did you call the function?
- What were you expecting? Output, messages, other results?

@@ -180,12 +192,13 @@ The `traceback()` function prints out the *function call stack* after an error h

For example, you may have a function `a()` which subsequently calls function `b()` which calls `c()` and then `d()`. If an error occurs, it may not be immediately clear in which function the error occurred. The `traceback()` function shows you how many levels deep you were when the error occurred.

-```r
+{line-numbers=off}
+~~~~~~~~
> mean(x)
Error in mean(x) : object 'x' not found
> traceback()
1: mean(x)
-```
+~~~~~~~~

Here, it's clear that the error occurred inside the `mean()` function because the object `x` does not exist.

The `traceback()` function must be called immediately after an error occurs. Once another function is called, you lose the traceback.

@@ -193,7 +206,8 @@

Here is a slightly more complicated example using the `lm()` function for linear modeling.

-```r
+{line-numbers=off}
+~~~~~~~~
> lm(y ~ x)
Error in eval(expr, envir, enclos) : object ’y’ not found
> traceback()

3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(y ~ x)
-```
+~~~~~~~~

You can see now that the error did not get thrown until the 7th level of the function call stack, in which case the `eval()` function tried to evaluate the formula `y ~ x` and realized the object `y` did not exist.

@@ -216,7 +230,8 @@ The `debug()` function initiates an interactive debugger (also known as the "bro

The `debug()` function takes a function as its first argument. Here is an example of debugging the `lm()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> debug(lm) ## Flag the 'lm()' function for interactive debugging
> lm(y ~ x)
debugging in: lm(y ~ x)
debug: {
ret.x <- x
ret.y <- y
cl <- match.call()
...
z
}
Browse[2]>
-```
+~~~~~~~~

Now, every time you call the `lm()` function it will launch the interactive debugger. To turn this behavior off you need to call the `undebug()` function.

@@ -242,7 +257,8 @@ The debugger calls the browser at the very top level of the function body. From

Here's an example of a browser session with the `lm()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
Browse[2]> n ## Evaluate this expression and move to the next one
debug: ret.x <- x
Browse[2]> n
debug: ret.y <- y
Browse[2]> n
debug: cl <- match.call()
Browse[2]> n
debug: mf <- match.call(expand.dots = FALSE)
Browse[2]> n
debug: m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
-```
+~~~~~~~~

@@ -254,15 +270,16 @@

While you are in the browser you can execute any other R function that might be available to you in a regular session. In particular, you can use `ls()` to see what is in your current environment (the function environment) and `print()` to print out the values of R objects in the function environment.
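The same workflow applies to functions you write yourself; a minimal sketch of my own (not from the text):

{line-numbers=off}
~~~~~~~~
> f <- function(x) {
+ y <- x * 2
+ y + 1
+ }
> debug(f) ## flag f() for interactive debugging
> f(2) ## drops into the browser at the first line of f()
~~~~~~~~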
You can turn off interactive debugging with the `undebug()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
undebug(lm) ## Unflag the 'lm()' function for debugging
-```
+~~~~~~~~

@@ -270,7 +287,8 @@ The `recover()` function can be used to modify the error behavior of R when an e

With `recover()` you can tell R that when an error occurs, it should halt execution at the exact point at which the error occurred. That can give you the opportunity to poke around in the environment in which the error occurred. This can be useful to see if there are any R objects or data that have been corrupted or mistakenly modified.

-```r
+{line-numbers=off}
+~~~~~~~~
> options(error = recover) ## Change default R error behavior
> read.csv("nosuchfile") ## This code doesn't work
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file ’nosuchfile’: No such file or
directory

Enter a frame number, or 0 to exit

1: read.csv("nosuchfile")
2: read.table(file = file, header = header, sep = sep, quote = quote, dec = dec, 
3: file(file, "rt")

Selection:
-```
+~~~~~~~~

@@ -285,13 +303,13 @@

-The `recover()` function will first print out the function call stack when an error occurrs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.
+The `recover()` function will first print out the function call stack when an error occurs. Then, you can choose to jump around the call stack and investigate the problem. When you choose a frame number, you will be put in the browser (just like the interactive debugger triggered with `debug()`) and will have the ability to poke around.

## Summary

-- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal
-- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation
-- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions
+- There are three main indications of a problem/condition: `message`, `warning`, `error`; only an `error` is fatal.
+- When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation.
+- Interactive debugging tools `traceback`, `debug`, `browser`, `trace`, and `recover` can be used to find problematic code in functions.

- Debugging tools are not a substitute for thinking!

diff --git a/manuscript/dplyr.md b/manuscript/dplyr.md
index ee38b53..3375593 100644
--- a/manuscript/dplyr.md
+++ b/manuscript/dplyr.md
@@ -37,7 +37,7 @@ Some of the key "verbs" provided by the `dplyr` package are

* `%>%`: the "pipe" operator is used to connect multiple verb actions together into a pipeline

-The `dplyr` package as a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.
+The `dplyr` package has a number of its own data types that it takes advantage of. For example, there is a handy `print` method that prevents you from printing a lot of data to the console. Most of the time, these additional data types are transparent to the user and do not need to be worried about.

@@ -49,7 +49,7 @@ All of the functions that we will discuss in this Chapter will have a few common

2.
The subsequent arguments describe what to do with the data frame specified in the first argument, and you can refer to columns in the data frame directly without using the $ operator (just use the column names).

-3. The return result of a function is a new data frame
+3. The return result of a function is a new data frame.

4. Data frames must be properly formatted and annotated for this to all be useful. In particular, the data must be [tidy](http://www.jstatsoft.org/v59/i10/paper). In short, there should be one observation per row, and each column should represent a feature or characteristic of that observation.

@@ -61,42 +61,55 @@ The `dplyr` package can be installed from CRAN or from GitHub using the `devtool

To install from CRAN, just run

-```r
+{line-numbers=off}
+~~~~~~~~
> install.packages("dplyr")
-```
+~~~~~~~~

To install from GitHub you can run

-```r
+{line-numbers=off}
+~~~~~~~~
> install_github("hadley/dplyr")
-```
+~~~~~~~~

After installing the package it is important that you load it into your R session with the `library()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> library(dplyr)
-```
+
+Attaching package: 'dplyr'
+The following objects are masked from 'package:stats':
+
+ filter, lag
+The following objects are masked from 'package:base':
+
+ intersect, setdiff, setequal, union
+~~~~~~~~

You may get some warnings when the package is loaded because there are functions in the `dplyr` package that have the same name as functions in other packages. For now you can ignore the warnings.

## `select()`

-For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my web site.
+For the examples in this chapter we will be using a dataset containing air pollution and temperature data for the [city of Chicago](http://www.biostat.jhsph.edu/~rpeng/leanpub/rprog/chicago_data.zip) in the U.S. The dataset is available from my website.

After unzipping the archive, you can load the data into R using the `readRDS()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- readRDS("chicago.rds")
-```
+~~~~~~~~

You can see some basic characteristics of the dataset with the `dim()` and `str()` functions.

@@ -109,14 +122,15 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> dim(chicago)
[1] 6940 8
> str(chicago)
'data.frame': 6940 obs. of 8 variables:
 $ city : chr "chic" "chic" "chic" "chic" ...
 $ tmpd : num 31.5 33 33 29 32 40 34.5 29 26.5 32.5 ...
 $ dptp : num 31.5 29.9 27.4 28.6 28.9 ...
 $ date : Date, format: "1987-01-01" "1987-01-02" ...
 $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num 34 NA 34.2 47 NA ...
 $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
-```
+~~~~~~~~

The `select()` function can be used to select columns of a data frame that you want to focus on. Often you'll have a large data frame containing "all" of the data, but any *given* analysis might only use a subset of variables or observations. The `select()` function allows you to get the few columns you might need.

-Suppose we wanted to take the first 3 columns only. There are a few ways to do this. We could for example use numerical indices. But we can also use the names directly.
+Suppose we wanted to take the first 3 columns only.
There are a few ways to do this. We could, for example, use numerical indices. But we can also use the names directly.

-```r
+{line-numbers=off}
+~~~~~~~~
> names(chicago)[1:3]
[1] "city" "tmpd" "dptp"
> subset <- select(chicago, city:dptp)
> head(subset)
 city tmpd dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
-```
+~~~~~~~~

@@ -128,33 +142,36 @@

Note that the `:` normally cannot be used with names or strings, but inside the `select()` function you can use it to specify a range of variable names.

You can also *omit* variables using the `select()` function by using the negative sign. With `select()` you can do

-```r
+{line-numbers=off}
+~~~~~~~~
> select(chicago, -(city:dptp))
-```
+~~~~~~~~

which indicates that we should include every variable *except* the variables `city` through `dptp`. The equivalent code in base R would be

-```r
+{line-numbers=off}
+~~~~~~~~
> i <- match("city", names(chicago))
> j <- match("dptp", names(chicago))
> head(chicago[, -(i:j)])
-```
+~~~~~~~~

Not super intuitive, right?

The `select()` function also allows a special syntax that allows you to specify variable names based on patterns. So, for example, if you wanted to keep every variable that ends with a "2", we could do

-```r
+{line-numbers=off}
+~~~~~~~~
> subset <- select(chicago, ends_with("2"))
> str(subset)
'data.frame': 6940 obs. of 4 variables:
 $ pm25tmean2: num NA NA NA NA NA NA NA NA NA NA ...
 $ pm10tmean2: num 34 NA 34.2 47 NA ...
 $ o3tmean2 : num 4.25 3.3 3.33 4.38 4.75 ...
 $ no2tmean2 : num 20 23.2 23.8 30.4 30.3 ...
-```
+~~~~~~~~

@@ -162,18 +179,19 @@

Or if we wanted to keep every variable that starts with a "d", we could do

-```r
+{line-numbers=off}
+~~~~~~~~
> subset <- select(chicago, starts_with("d"))
> str(subset)
'data.frame': 6940 obs. of 2 variables:
 $ dptp: num 31.5 29.9 27.4 28.6 28.9 ...
 $ date: Date, format: "1987-01-01" "1987-01-02" ...
-```
+~~~~~~~~

You can also use more general regular expressions if necessary. See the help page (`?select`) for more details.

@@ -185,7 +203,8 @@ The `filter()` function is used to extract subsets of rows from a data frame. Th

Suppose we wanted to extract the rows of the `chicago` data frame where the levels of PM2.5 are greater than 30 (which is a reasonably high level), we could do

-```r
+{line-numbers=off}
+~~~~~~~~
> chic.f <- filter(chicago, pm25tmean2 > 30)
> str(chic.f)
'data.frame': 194 obs. of 8 variables:
 $ city : chr "chic" "chic" "chic" "chic" ...
 $ tmpd : num 23 28 19 26 26 75 61 57 52 57 ...
 $ dptp : num 21.9 25.8 20.3 19.1 19.8 ...
 $ date : Date, format: "1998-01-17" "1998-01-23" ...
 $ pm25tmean2: num 38.1 34 39.4 35.4 33.3 ...
 $ pm10tmean2: num 32.5 38.7 34 28.5 35 ...
 $ o3tmean2 : num 3.18 1.75 10.79 14.3 20.66 ...
 $ no2tmean2 : num 25.3 29.4 25.3 31.4 26.8 ...
-```
+~~~~~~~~

@@ -197,22 +216,24 @@

You can see that there are now only 194 rows in the data frame and the distribution of the `pm25tmean2` values is as follows.

-```r
+{line-numbers=off}
+~~~~~~~~
> summary(chic.f$pm25tmean2)
 Min. 1st Qu. Median Mean 3rd Qu. Max.
 30.05 32.12 35.04 36.63 39.53 61.50
-```
+~~~~~~~~

We can place an arbitrarily complex logical sequence inside of `filter()`, so we could for example extract the rows where PM2.5 is greater than 30 *and* temperature is greater than 80 degrees Fahrenheit.

-```r
+{line-numbers=off}
+~~~~~~~~
> chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
> select(chic.f, date, tmpd, pm25tmean2)
 date tmpd pm25tmean2
1 1998-08-23 81 39.60000
2 1998-09-06 81 31.50000
3 2001-07-20 82 32.30000
4 2001-08-01 84 43.70000
5 2001-08-08 85 38.83750
6 2001-08-09 84 38.20000
7 2002-06-20 82 33.00000
8 2002-06-23 82 42.50000
9 2002-07-08 81 33.10000
10 2002-07-18 82 38.85000
11 2003-06-25 82 33.90000
12 2003-07-04 84 32.90000
13 2005-06-24 86 31.85714
14 2005-06-27 82 51.53750
15 2005-06-28 85 31.20000
16 2005-07-17 84 32.70000
17 2005-08-03 84 37.90000
-```
+~~~~~~~~

@@ -233,7 +254,7 @@

Now there are only 17 observations where both of those conditions are met.

@@ -247,43 +268,48 @@ of other columns) is normally a pain to do in R.
The `arrange()` function simplifies the process quite a bit.

Here we can order the rows of the data frame by date, so that the first row is the earliest (oldest) observation and the last row is the latest (most recent) observation.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- arrange(chicago, date)
-```
+~~~~~~~~

We can now check the first few rows

-```r
+{line-numbers=off}
+~~~~~~~~
> head(select(chicago, date, pm25tmean2), 3)
 date pm25tmean2
1 1987-01-01 NA
2 1987-01-02 NA
3 1987-01-03 NA
-```
+~~~~~~~~

and the last few rows.

-```r
+{line-numbers=off}
+~~~~~~~~
> tail(select(chicago, date, pm25tmean2), 3)
 date pm25tmean2
6938 2005-12-29 7.45000
6939 2005-12-30 15.05714
6940 2005-12-31 15.00000
-```
+~~~~~~~~

-Columns can be arranged in descending order too by useing the special `desc()` operator.
+Columns can be arranged in descending order too by using the special `desc()` operator.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- arrange(chicago, desc(date))
-```
+~~~~~~~~

Looking at the first three and last three rows shows the dates in descending order.

@@ -294,7 +320,7 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> head(select(chicago, date, pm25tmean2), 3)
 date pm25tmean2
1 2005-12-31 15.00000
2 2005-12-30 15.05714
3 2005-12-29 7.45000
> tail(select(chicago, date, pm25tmean2), 3)
 date pm25tmean2
6938 1987-01-03 NA
6939 1987-01-02 NA
6940 1987-01-01 NA
-```
+~~~~~~~~

## `rename()`

@@ -304,25 +330,27 @@ Renaming a variable in a data frame in R is surprisingly hard to do! The `rename

Here you can see the names of the first five variables in the `chicago` data frame.

-```r
+{line-numbers=off}
+~~~~~~~~
> head(chicago[, 1:5], 3)
 city tmpd dptp date pm25tmean2
1 chic 35 30.1 2005-12-31 15.00000
2 chic 36 31.0 2005-12-30 15.05714
3 chic 35 29.4 2005-12-29 7.45000
-```
+~~~~~~~~

The `dptp` column is supposed to represent the dew point temperature and the `pm25tmean2` column provides the PM2.5 data. However, these names are pretty obscure or awkward and probably need to be renamed to something more sensible.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2)
> head(chicago[, 1:5], 3)
 city tmpd dewpoint date pm25
1 chic 35 30.1 2005-12-31 15.00000
2 chic 36 31.0 2005-12-30 15.05714
3 chic 35 29.4 2005-12-29 7.45000
-```
+~~~~~~~~

The syntax inside the `rename()` function is to have the new name on the left-hand side of the `=` sign and the old name on the right-hand side.

@@ -337,7 +365,8 @@ For example, with air pollution data, we often want to *detrend* the data by sub

Here we create a `pm25detrend` variable that subtracts the mean from the `pm25` variable.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- mutate(chicago, pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
> head(chicago)
 city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2

4 1.519042
5 7.329042
6 -7.830958
-```
+~~~~~~~~

@@ -354,14 +383,15 @@

There is also the related `transmute()` function, which does the same thing as `mutate()` but then *drops all non-transformed variables*.
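A quick way to see the difference (a check of my own, using a hypothetical `tmpd.c` variable that converts the temperature to Celsius) is to count columns in the two results:

{line-numbers=off}
~~~~~~~~
> ## mutate() returns every existing column plus the new one...
> ncol(mutate(chicago, tmpd.c = (tmpd - 32) / 1.8)) == ncol(chicago) + 1
[1] TRUE
> ## ...while transmute() returns only the new one
> ncol(transmute(chicago, tmpd.c = (tmpd - 32) / 1.8))
[1] 1
~~~~~~~~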
Here we detrend the PM10 and ozone (O3) variables.

-```r
+{line-numbers=off}
+~~~~~~~~
> head(transmute(chicago,
+ pm10detrend = pm10tmean2 - mean(pm10tmean2, na.rm = TRUE),
+ o3detrend = o3tmean2 - mean(o3tmean2, na.rm = TRUE)))
 pm10detrend o3detrend

4 -6.395206 -16.175096
5 -6.895206 -14.966763
6 -25.395206 -5.393846
-```
+~~~~~~~~

Note that there are only two columns in the transmuted data frame.

@@ -372,7 +402,7 @@

## `group_by()`

@@ -386,48 +416,50 @@ The general operation here is a combination of splitting a data frame into separ

First, we can create a `year` variable using `as.POSIXlt()`.

-```r
+{line-numbers=off}
+~~~~~~~~
> chicago <- mutate(chicago, year = as.POSIXlt(date)$year + 1900)
-```
+~~~~~~~~

Now we can create a separate data frame that splits the original data frame by year.

-```r
+{line-numbers=off}
+~~~~~~~~
> years <- group_by(chicago, year)
-```
+~~~~~~~~

Finally, we compute summary statistics for each year in the data frame with the `summarize()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> summarize(years, pm25 = mean(pm25, na.rm = TRUE),
+ o3 = max(o3tmean2, na.rm = TRUE),
-+ no2 = median(no2tmean2, na.rm = TRUE),
-+ .groups = "drop")
-# A tibble: 19 x 4
- year pm25 o3 no2
- <dbl> <dbl> <dbl> <dbl>
- 1 1987 NaN 63.0 23.5
- 2 1988 NaN 61.7 24.5
- 3 1989 NaN 59.7 26.1
- 4 1990 NaN 52.2 22.6
- 5 1991 NaN 63.1 21.4
- 6 1992 NaN 50.8 24.8
- 7 1993 NaN 44.3 25.8
- 8 1994 NaN 52.2 28.5
- 9 1995 NaN 66.6 27.3
-10 1996 NaN 58.4 26.4
-11 1997 NaN 56.5 25.5
-12 1998 18.3 50.7 24.6
-13 1999 18.5 57.5 24.7
-14 2000 16.9 55.8 23.5
-15 2001 16.9 51.8 25.1
-16 2002 15.3 54.9 22.7
-17 2003 15.2 56.2 24.6
-18 2004 14.6 44.5 23.4
-19 2005 16.2 58.8 22.6
-```
++ no2 = median(no2tmean2, na.rm = TRUE))
+# A tibble: 19 × 4
+ year pm25 o3 no2
+ <dbl> <dbl> <dbl> <dbl>
+1 1987 NaN 62.96966 23.49369
+2 1988 NaN 61.67708 24.52296
+3 1989 NaN 59.72727 26.14062
+4 1990 NaN 52.22917 22.59583
+5 1991 NaN 63.10417 21.38194
+6 1992 NaN 50.82870 24.78921
+7 1993 NaN 44.30093 25.76993
+8 1994 NaN 52.17844 28.47500
+9 1995 NaN 66.58750 27.26042
+10 1996 NaN 58.39583 26.38715
+11 1997 NaN 56.54167 25.48143
+12 1998 18.26467 50.66250 24.58649
+13 1999 18.49646 57.48864 24.66667
+14 2000 16.93806 55.76103 23.46082
+15 2001 16.92632 51.81984 25.06522
+16 2002 15.27335 54.88043 22.73750
+17 2003 15.23183 56.16608 24.62500
+18 2004 14.62864 44.48240 23.39130
+19 2005 16.18556 58.84126 22.62387
+~~~~~~~~

`summarize()` returns a data frame with `year` as the first column, and then the annual averages of `pm25`, `o3`, and `no2`.

In a slightly more complicated example, we might want to know what the average levels of ozone (`o3`) and nitrogen dioxide (`no2`) are within quintiles of `pm25`.

First, we can create a categorical variable of `pm25` divided into quintiles.

-```r
+{line-numbers=off}
+~~~~~~~~
> qq <- quantile(chicago$pm25, seq(0, 1, 0.2), na.rm = TRUE)
> chicago <- mutate(chicago, pm25.quint = cut(pm25, qq))
-```
+~~~~~~~~

Now we can group the data frame by the `pm25.quint` variable.

@@ -436,35 +468,37 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> quint <- group_by(chicago, pm25.quint)
-```
+~~~~~~~~
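Before summarizing, a quick sanity check of my own (not part of the original analysis): by construction, `cut()` with quintile breakpoints should give roughly equal counts per group, with the days where `pm25` is missing counted separately.

{line-numbers=off}
~~~~~~~~
> ## Counts per pm25 quintile, including days with missing pm25
> table(chicago$pm25.quint, useNA = "ifany")
~~~~~~~~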
Finally, we can compute the mean of `o3` and `no2` within quintiles of `pm25`.

-```r
+{line-numbers=off}
+~~~~~~~~
> summarize(quint, o3 = mean(o3tmean2, na.rm = TRUE),
-+ no2 = mean(no2tmean2, na.rm = TRUE),
-+ .groups = "drop")
-# A tibble: 6 x 3
- pm25.quint o3 no2
- <fct> <dbl> <dbl>
-1 (1.7,8.7] 21.7 18.0
-2 (8.7,12.4] 20.4 22.1
-3 (12.4,16.7] 20.7 24.4
-4 (16.7,22.6] 19.9 27.3
-5 (22.6,61.5] 20.3 29.6
-6 <NA> 18.8 25.8
-```
++ no2 = mean(no2tmean2, na.rm = TRUE))
+# A tibble: 6 × 3
+ pm25.quint o3 no2
+ <fct> <dbl> <dbl>
+1 (1.7,8.7] 21.66401 17.99129
+2 (8.7,12.4] 20.38248 22.13004
+3 (12.4,16.7] 20.66160 24.35708
+4 (16.7,22.6] 19.88122 27.27132
+5 (22.6,61.5] 20.31775 29.64427
+6 NA 18.79044 25.77585
+~~~~~~~~

From the table, it seems there isn't a strong relationship between `pm25` and `o3`, but there appears to be a positive correlation between `pm25` and `no2`. More sophisticated statistical modeling can help to provide precise answers to these questions, but a simple application of `dplyr` functions can often get you most of the way there.

@@ -473,16 +507,18 @@

The pipeline operator `%>%` is very handy for stringing together multiple `dplyr` functions in a sequence of operations. Notice above that every time we wanted to apply more than one function, the sequence gets buried in a sequence of nested function calls that is difficult to read, i.e.

-```r
+{line-numbers=off}
+~~~~~~~~
> third(second(first(x)))
-```
+~~~~~~~~

This nesting is not a natural way to think about a sequence of operations. The `%>%` operator allows you to string operations in a left-to-right fashion, i.e.

-```r
+{line-numbers=off}
+~~~~~~~~
> first(x) %>% second %>% third
-```
+~~~~~~~~

Take the example that we just did in the last section where we computed the mean of `o3` and `no2` within quintiles of `pm25`. There we had to

@@ -493,22 +529,22 @@

That can be done with the following sequence in a single R expression.

-```r
+{line-numbers=off}
+~~~~~~~~
> mutate(chicago, pm25.quint = cut(pm25, qq)) %>%
+ group_by(pm25.quint) %>%
+ summarize(o3 = mean(o3tmean2, na.rm = TRUE),
-+ no2 = mean(no2tmean2, na.rm = TRUE),
-+ .groups = "drop")
-# A tibble: 6 x 3
- pm25.quint o3 no2
- <fct> <dbl> <dbl>
-1 (1.7,8.7] 21.7 18.0
-2 (8.7,12.4] 20.4 22.1
-3 (12.4,16.7] 20.7 24.4
-4 (16.7,22.6] 19.9 27.3
-5 (22.6,61.5] 20.3 29.6
-6 <NA> 18.8 25.8
-```
++ no2 = mean(no2tmean2, na.rm = TRUE))
+# A tibble: 6 × 3
+ pm25.quint o3 no2
+ <fct> <dbl> <dbl>
+1 (1.7,8.7] 21.66401 17.99129
+2 (8.7,12.4] 20.38248 22.13004
+3 (12.4,16.7] 20.66160 24.35708
+4 (16.7,22.6] 19.88122 27.27132
+5 (22.6,61.5] 20.31775 29.64427
+6 NA 18.79044 25.77585
+~~~~~~~~

This way we don't have to create a set of temporary variables along the way or create a massive nested sequence of function calls.

Notice in the above code that I pass the `chicago` data frame to the first call to `mutate()`, but then afterwards I do not have to pass anything to `group_by()` or `summarize()`.
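To see the mechanics in isolation, here is a tiny stand-alone illustration of mine (the `%>%` operator is re-exported by `dplyr`, so it is available once the package is loaded):

{line-numbers=off}
~~~~~~~~
> v <- 1:10
> v %>% sum() ## same as sum(v)
[1] 55
> v %>% range() %>% diff() ## same as diff(range(v))
[1] 9
~~~~~~~~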
Another example might be computing the average pollutant level by month. This could be useful to see if there are any seasonal trends in the data.

@@ -518,29 +554,29 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> mutate(chicago, month = as.POSIXlt(date)$mon + 1) %>%
+ group_by(month) %>%
+ summarize(pm25 = mean(pm25, na.rm = TRUE),
+ o3 = max(o3tmean2, na.rm = TRUE),
-+ no2 = median(no2tmean2, na.rm = TRUE),
-+ .groups = "drop")
-# A tibble: 12 x 4
- month pm25 o3 no2
- <dbl> <dbl> <dbl> <dbl>
- 1 1 17.8 28.2 25.4
- 2 2 20.4 37.4 26.8
- 3 3 17.4 39.0 26.8
- 4 4 13.9 47.9 25.0
- 5 5 14.1 52.8 24.2
- 6 6 15.9 66.6 25.0
- 7 7 16.6 59.5 22.4
- 8 8 16.9 54.0 23.0
- 9 9 15.9 57.5 24.5
-10 10 14.2 47.1 24.2
-11 11 15.2 29.5 23.6
-12 12 17.5 27.7 24.5
-```
++ no2 = median(no2tmean2, na.rm = TRUE))
+# A tibble: 12 × 4
+ month pm25 o3 no2
+ <dbl> <dbl> <dbl> <dbl>
+1 1 17.76996 28.22222 25.35417
+2 2 20.37513 37.37500 26.78034
+3 3 17.40818 39.05000 26.76984
+4 4 13.85879 47.94907 25.03125
+5 5 14.07420 52.75000 24.22222
+6 6 15.86461 66.58750 25.01140
+7 7 16.57087 59.54167 22.38442
+8 8 16.93380 53.96701 22.98333
+9 9 15.91279 57.48864 24.47917
+10 10 14.23557 47.09275 24.15217
+11 11 15.15794 29.45833 23.56537
+12 12 17.52221 27.70833 24.45773
+~~~~~~~~

Here we can see that `o3` tends to be low in the winter months and high in the summer while `no2` is higher in the winter and lower in the summer.

@@ -549,11 +585,11 @@

## Summary

The `dplyr` package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of `group_by()` and `summarize()`.

-Once you learn the `dplyr` grammar there are a few additional benefits
+Once you learn the `dplyr` grammar there are a few additional benefits:

-* `dplyr` can work with other data frame "backends" such as SQL databases. There is an SQL interface for relational databases via the DBI package
+* `dplyr` can work with other data frame "backends", such as SQL databases. There is a SQL interface for relational databases via the DBI package.

-* `dplyr` can be integrated with the `data.table` package for large fast tables
+* `dplyr` can be integrated with the `data.table` package for large fast tables.

-The `dplyr` package is handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!
+The `dplyr` package is a handy way to both simplify and speed up your data frame management code. It's rare that you get such a combination at the same time!

diff --git a/manuscript/example.md b/manuscript/example.md
index 70f8367..b5f0109 100644
--- a/manuscript/example.md
+++ b/manuscript/example.md
@@ -2,7 +2,7 @@

-This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agencies freely available national monitoring data. The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.
+This chapter presents an example data analysis looking at changes in fine particulate matter (PM) air pollution in the United States using the Environmental Protection Agency's freely available national monitoring data.
The purpose of the chapter is to just show how the various tools that we have covered in this book can be used to read, manipulate, and summarize data so that you can develop statistical evidence for relevant real-world questions.

[Watch a video of this chapter](https://youtu.be/VE-6bQvyfTQ)

@@ -14,23 +14,25 @@ In this chapter we aim to describe the changes in fine particle (PM2.5) outdoor

## Loading and Processing the Raw Data

-From the [EPA Air Quality System](http://aqsdr1.epa.gov/aqsweb/aqstmp/airdata/download_files.html) we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.
+From the [EPA's Air Quality System](https://aqs.epa.gov/aqsweb/airdata/download_files.html), we obtained data on fine particulate matter air pollution (PM2.5) that is monitored across the U.S. as part of the nationwide PM monitoring network. We obtained the files for the years 1999 and 2012.

### Reading in the 1999 data

-We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file were fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data.
+We first read in the 1999 data from the raw text file included in the zip archive. The data is a delimited file where fields are delimited with the `|` character and missing values are coded as blank fields. We skip some commented lines in the beginning of the file and initially we do not read the header data.

-```r
+{line-numbers=off}
+~~~~~~~~
> pm0 <- read.table("pm25_data/RD_501_88101_1999-0.txt", comment.char = "#", header = FALSE, sep = "|", na.strings = "")
-```
+~~~~~~~~

After reading in the 1999 data we check the first few rows (there are 117,421 rows in this dataset).

@@ -41,45 +43,55 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> dim(pm0)
[1] 117421 28
> head(pm0[, 1:13])

4 RD I 1 27 1 88101 1 7 105 120 19990112 00:00 8.841
5 RD I 1 27 1 88101 1 7 105 120 19990115 00:00 14.920
6 RD I 1 27 1 88101 1 7 105 120 19990118 00:00 3.878
-```
+~~~~~~~~

We then attach the column headers to the dataset and make sure that they are properly formatted for R data frames.
-```r
+{line-numbers=off}
+~~~~~~~~
> cnames <- readLines("pm25_data/RD_501_88101_1999-0.txt", 1)
> cnames <- strsplit(cnames, "|", fixed = TRUE)
> ## Ensure names are properly formatted
> names(pm0) <- make.names(cnames[[1]])
> head(pm0[, 1:13])
- X..RD Action.Code State.Code County.Code Site.ID Parameter POC Sample.Duration Unit Method Date Start.Time Sample.Value
-1 RD I 1 27 1 88101 1 7 105 120 19990103 00:00 NA
-2 RD I 1 27 1 88101 1 7 105 120 19990106 00:00 NA
-3 RD I 1 27 1 88101 1 7 105 120 19990109 00:00 NA
-4 RD I 1 27 1 88101 1 7 105 120 19990112 00:00 8.841
-5 RD I 1 27 1 88101 1 7 105 120 19990115 00:00 14.920
-6 RD I 1 27 1 88101 1 7 105 120 19990118 00:00 3.878
-```
+ X..RD Action.Code State.Code County.Code Site.ID Parameter POC
+1 RD I 1 27 1 88101 1
+2 RD I 1 27 1 88101 1
+3 RD I 1 27 1 88101 1
+4 RD I 1 27 1 88101 1
+5 RD I 1 27 1 88101 1
+6 RD I 1 27 1 88101 1
+ Sample.Duration Unit Method Date Start.Time Sample.Value
+1 7 105 120 19990103 00:00 NA
+2 7 105 120 19990106 00:00 NA
+3 7 105 120 19990109 00:00 NA
+4 7 105 120 19990112 00:00 8.841
+5 7 105 120 19990115 00:00 14.920
+6 7 105 120 19990118 00:00 3.878
+~~~~~~~~

The column we are interested in is the `Sample.Value` column which contains the PM2.5 measurements. Here we extract that column and print a brief summary.

-```r
+{line-numbers=off}
+~~~~~~~~
> x0 <- pm0$Sample.Value
> summary(x0)
 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
 0.00 7.20 11.50 13.74 17.90 157.10 13217
-```
+~~~~~~~~

Missing values are a common problem with environmental data and so we check to see what proportion of the observations are missing (i.e. coded as `NA`).

-```r
+{line-numbers=off}
+~~~~~~~~
> mean(is.na(x0)) ## Are missing values important here?
[1] 0.1125608
-```
+~~~~~~~~

Because the proportion of missing values is relatively low (0.1125608), we choose to ignore missing values for now.

### Reading in the 2012 data

@@ -90,18 +102,20 @@ We then read in the 2012 data in the same manner in which we read the 1999 data

-```r
+{line-numbers=off}
+~~~~~~~~
> pm1 <- read.table("pm25_data/RD_501_88101_2012-0.txt", comment.char = "#",
+ header = FALSE, sep = "|", na.strings = "", nrow = 1304290)
-```
+~~~~~~~~

We also set the column names (they are the same as the 1999 dataset) and extract the `Sample.Value` column from this dataset.

-```r
+{line-numbers=off}
+~~~~~~~~
> names(pm1) <- make.names(cnames[[1]])
> x1 <- pm1$Sample.Value
-```
+~~~~~~~~

## Results

@@ -110,58 +124,65 @@

In order to show aggregate changes in PM across the entire monitoring network, we can make boxplots of all monitor values in 1999 and 2012. Here, we take the log of the PM values to adjust for the skew in the data.

-```r
+{line-numbers=off}
+~~~~~~~~
> boxplot(log2(x0), log2(x1))
Warning in boxplot.default(log2(x0), log2(x1)): NaNs produced
-Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 1 is not drawn
-Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z$group == : Outlier (-Inf) in boxplot 2 is not drawn
-```
+Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
+$group == : Outlier (-Inf) in boxplot 1 is not drawn
+Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
+$group == : Outlier (-Inf) in boxplot 2 is not drawn
+~~~~~~~~

![plot of chunk unnamed-chunk-5](images/unnamed-chunk-5-1.png)

-```r
+{line-numbers=off}
+~~~~~~~~
> summary(x0)
 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
 0.00 7.20 11.50 13.74 17.90 157.10 13217
> summary(x1)
 Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
- -10.00 4.00 7.63 9.14 12.00 908.97 73133
+ -10.00 4.00 7.63 9.14 12.00 909.00 73133
-```
+~~~~~~~~

Interestingly, from the summary of `x1` it appears there are some negative values of PM, which in general should not occur. We can investigate that somewhat to see if there is anything we should worry about.

-```r
+{line-numbers=off}
+~~~~~~~~
> negative <- x1 < 0
-> length(negative[negative=="TRUE"])/length(x1) # proportion of negative values
-[1] 0.07636893
> mean(negative, na.rm = T)
[1] 0.0215034
-```
+~~~~~~~~

-There is a relatively small proportion of values that are negative (0.076), which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.
+There is a relatively small proportion of values that are negative, which is perhaps reassuring. In order to investigate this a step further we can extract the date of each measurement from the original data frame. The idea here is that perhaps negative values occur more often in some parts of the year than other parts. However, the original data are formatted as character strings so we convert them to R's `Date` format for easier manipulation.

-```r
+{line-numbers=off}
+~~~~~~~~
> dates <- pm1$Date
> dates <- as.Date(as.character(dates), "%Y%m%d")
-```
+~~~~~~~~

We can then extract the month from each of the dates with negative values and attempt to identify when negative values occur most often.

-```r
+{line-numbers=off}
+~~~~~~~~
> missing.months <- month.name[as.POSIXlt(dates)$mon + 1]
> tab <- table(factor(missing.months, levels = month.name))
> round(100 * tab / sum(tab))
- January February March April May June July August September October November December
- 15 13 15 13 14 13 8 6 3 0 0 0
-```
+ January February March April May June July
+ 15 13 15 13 14 13 8
+ August September October November December
+ 6 3 0 0 0
+~~~~~~~~

-From the table above it appears that bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.
+From the table above it appears the bulk of the negative values occur in the first six months of the year (January--June). However, beyond that simple observation, it is not clear why the negative values occur. That said, given the relatively low proportion of negative values, we will ignore them for now.

### Changes in PM levels at an individual monitor

@@ -171,73 +192,85 @@ So far we have examined the change in PM levels on average across the country. O

Our first task is to identify a monitor in New York State that has data in 1999 and 2012 (not all monitors operated during both time periods). First we subset the data frames to only include data from New York (`State.Code == 36`) and only include the `County.Code` and the `Site.ID` (i.e. monitor number) variables.
-```r
+{line-numbers=off}
+~~~~~~~~
> site0 <- unique(subset(pm0, State.Code == 36, c(County.Code, Site.ID)))
> site1 <- unique(subset(pm1, State.Code == 36, c(County.Code, Site.ID)))
-```
+~~~~~~~~

Then we create a new variable that combines the county code and the site ID into a single string.

-```r
+{line-numbers=off}
+~~~~~~~~
> site0 <- paste(site0[,1], site0[,2], sep = ".")
> site1 <- paste(site1[,1], site1[,2], sep = ".")
> str(site0)
- chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" "13.11" "27.1004" "29.2" "29.5" "29.1007" "31.3" "47.11" "47.76" "55.6001" "59.5" "59.8" "59.11" ...
+ chr [1:33] "1.5" "1.12" "5.73" "5.80" "5.83" "5.110" ...
> str(site1)
- chr [1:18] "1.5" "1.12" "5.80" "5.133" "13.11" "29.5" "31.3" "47.122" "55.1007" "61.79" "61.134" "63.2008" "67.1015" "71.2" "81.124" "85.55" "101.3" ...
+ chr [1:18] "1.5" "1.12" "5.80" "5.133" "13.11" "29.5" ...
-```
+~~~~~~~~

Finally, we want the intersection between the sites present in 1999 and 2012 so that we might choose a monitor that has data in both periods.

-```r
+{line-numbers=off}
+~~~~~~~~
> both <- intersect(site0, site1)
> print(both)
- [1] "1.5" "1.12" "5.80" "13.11" "29.5" "31.3" "63.2008" "67.1015" "85.55" "101.3"
+ [1] "1.5" "1.12" "5.80" "13.11" "29.5" "31.3" "63.2008"
+ [8] "67.1015" "85.55" "101.3"
-```
+~~~~~~~~

Here (above) we can see that there are 10 monitors that were operating in both time periods. However, rather than choose one at random, it might be best to choose one that had a reasonable amount of data in each year.

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Find how many observations available at each monitor
> pm0$county.site <- with(pm0, paste(County.Code, Site.ID, sep = "."))
> pm1$county.site <- with(pm1, paste(County.Code, Site.ID, sep = "."))
> cnt0 <- subset(pm0, State.Code == 36 & county.site %in% both)
> cnt1 <- subset(pm1, State.Code == 36 & county.site %in% both)
-```
+~~~~~~~~

Now that we have subsetted the original data frames to only include the data from the monitors that overlap between 1999 and 2012, we can split the data frames and count the number of observations at each monitor to see which ones have the most observations.

-```r
+{line-numbers=off}
+~~~~~~~~
> ## 1999
> sapply(split(cnt0, cnt0$county.site), nrow)
- 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
- 61 122 152 61 61 183 61 122 122 7
+ 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015
+ 61 122 152 61 61 183 61 122 122
+ 85.55
+ 7
> ## 2012
> sapply(split(cnt1, cnt1$county.site), nrow)
- 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015 85.55
- 31 64 31 31 33 15 31 30 31 31
+ 1.12 1.5 101.3 13.11 29.5 31.3 5.80 63.2008 67.1015
+ 31 64 31 31 33 15 31 30 31
+ 85.55
+ 31
-```
+~~~~~~~~

A number of monitors seem suitable from the output, but we will focus here on County 63 and site ID 2008.

-```r
+{line-numbers=off}
+~~~~~~~~
> both.county <- 63
> both.id <- 2008
> 
> ## Choose county 63 and site ID 2008
> pm1sub <- subset(pm1, State.Code == 36 & County.Code == both.county & Site.ID == both.id)
> pm0sub <- subset(pm0, State.Code == 36 & County.Code == both.county & Site.ID == both.id)
-```
+~~~~~~~~

Now we plot the time series data of PM for the monitor in both years.
-```r
+{line-numbers=off}
+~~~~~~~~
> dates1 <- as.Date(as.character(pm1sub$Date), "%Y%m%d")
> x1sub <- pm1sub$Sample.Value
> dates0 <- as.Date(as.character(pm0sub$Date), "%Y%m%d")
> x0sub <- pm0sub$Sample.Value
> 
> ## Find global range
> rng <- range(x0sub, x1sub, na.rm = T)
> par(mfrow = c(1, 2), mar = c(4, 5, 2, 1))
> plot(dates0, x0sub, pch = 20, ylim = rng, xlab = "", ylab = expression(PM[2.5] * " (" * mu * g/m^3 * ")"))
> abline(h = median(x0sub, na.rm = T))
> plot(dates1, x1sub, pch = 20, ylim = rng, xlab = "", ylab = expression(PM[2.5] * " (" * mu * g/m^3 * ")"))
> abline(h = median(x1sub, na.rm = T))
-```
+~~~~~~~~

![plot of chunk unnamed-chunk-12](images/unnamed-chunk-12-1.png)

-From the plot above, we can that median levels of PM (horizontal solid line) have decreased a little from 10.45 in 1999 to 8.29 in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggest that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.
+From the plot above, we can see that median levels of PM (horizontal solid line) have decreased a little from 10.45 in 1999 to 8.29 in 2012. However, perhaps more interesting is that the variation (spread) in the PM values in 2012 is much smaller than it was in 1999. This suggests that not only are median levels of PM lower in 2012, but that there are fewer large spikes from day to day. One issue with the data here is that the 1999 data are from July through December while the 2012 data are recorded in January through April. It would have been better if we'd had full-year data for both years as there could be some seasonal confounding going on.

### Changes in state-wide PM levels

-Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that that the are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.
+Although ambient air quality standards are set at the federal level in the U.S. and hence affect the entire country, the actual reduction and management of PM is left to the individual states. States that are not "in attainment" have to develop a plan to reduce PM so that they are in attainment (eventually). Therefore, it might be useful to examine changes in PM at the state level. This analysis falls somewhere in between looking at the entire country all at once and looking at an individual monitor.

What we do here is calculate the mean of PM for each state in 1999 and 2012.

@@ -281,19 +315,20 @@

-```r
+{line-numbers=off}
+~~~~~~~~
> ## 1999
> mn0 <- with(pm0, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))
> ## 2012
> mn1 <- with(pm1, tapply(Sample.Value, State.Code, mean, na.rm = TRUE))

4 12 11.137139 8.239690
5 13 19.943240 11.321364
6 15 4.861821 8.749336
-```
+~~~~~~~~
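As an additional numerical check of my own (relying on the same column positions that the plotting code below uses, since the merged data frame's column names are not shown here), we could count how many states had a lower mean PM2.5 in 2012 than in 1999:

{line-numbers=off}
~~~~~~~~
> ## States whose mean PM2.5 decreased from 1999 (column 2) to 2012 (column 3)
> sum(mrg[, 2] > mrg[, 3], na.rm = TRUE)
~~~~~~~~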
Now make a plot that shows the 1999 state-wide means in one "column" and the 2012 state-wide means in another column. We then draw a line connecting the means for each year in the same state to highlight the trend.

-```r
+{line-numbers=off}
+~~~~~~~~
> par(mfrow = c(1, 1))
> rng <- range(mrg[,2], mrg[,3])
> with(mrg, plot(rep(1, 52), mrg[, 2], xlim = c(.5, 2.5), ylim = rng, xaxt = "n", xlab = "", ylab = "State-wide Mean PM"))
> with(mrg, points(rep(2, 52), mrg[, 3]))
> segments(rep(1, 52), mrg[, 2], rep(2, 52), mrg[, 3])
> axis(1, c(1, 2), c("1999", "2012"))
-```
+~~~~~~~~

![plot of chunk unnamed-chunk-14](images/unnamed-chunk-14-1.png)

diff --git a/manuscript/functions.md b/manuscript/functions.md
index 072ed1c..9c12963 100644
--- a/manuscript/functions.md
+++ b/manuscript/functions.md
@@ -27,7 +27,7 @@ treated much like any other R object. Importantly,

- Functions can be passed as arguments to other functions. This is very handy for the various apply functions, like `lapply()` and `sapply()`.
- Functions can be nested, so that you can define a function inside of
- another function
+ another function.

If you're familiar with common languages like C, these features might appear a bit strange. However, they are really important in R and can be useful for data analysis.

@@ -40,7 +40,8 @@ objects of class "function".

Here's a simple function that takes no arguments and does nothing.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function() {
+ ## This is an empty function
+ }
> ## Functions have their own class
> class(f)
[1] "function"
> ## Execute this function
> f()
NULL
-```
+~~~~~~~~

@@ -50,27 +51,29 @@

Not very interesting, but it's a start. The next thing we can do is create a function that actually has a non-trivial *function body*.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function() {
+ cat("Hello, world!\n")
+ }
> f()
Hello, world!
-```
+~~~~~~~~

The last aspect of a basic function is the *function arguments*. These are the options that you can specify to the user that the user may
-explicity set. For this basic function, we can add an argument that
+explicitly set. For this basic function, we can add an argument that
determines how many times "Hello, world!" is printed to the console.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function(num) {
+ for(i in seq_len(num)) {
+ cat("Hello, world!\n")
+ }
+ }
> f(3)
Hello, world!
Hello, world!
Hello, world!
-```
+~~~~~~~~

@@ -80,9 +83,9 @@

-Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times the need to see "Hello, world!".
+Obviously, we could have just cut-and-pasted the `cat("Hello, world!\n")` code three times to achieve the same effect, but then we wouldn't be programming, would we? Also, it would be un-neighborly of you to give your code to someone else and force them to cut-and-paste the code however many times they need to see "Hello, world!".

> In general, if you find yourself doing a lot of cutting and pasting, that's usually a good sign that you might need to write a function.

@@ -91,7 +94,8 @@ Finally, the function above doesn't *return* anything. It just prints "Hello, wo

This next function returns the total number of characters printed to the console.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function(num) {
+ hello <- "Hello, world!\n"
+ for(i in seq_len(num)) {
+ cat(hello)
+ }
+ chars <- nchar(hello) * num
+ chars
+ }
> meaningoflife <- f(3)
Hello, world!
Hello, world!
Hello, world!
> print(meaningoflife)
[1] 42
-```
+~~~~~~~~

In the above function, we didn't have to indicate anything special in order for the function to return the number of characters. In R, the return value of a function is always the very last expression that is evaluated. Because the `chars` variable is the last expression that is evaluated in this function, that becomes the return value of the function.

-Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter).
+Note that there is a `return()` function that can be used to return an explicit value from a function, but it is rarely used in R (we will discuss it a bit later in this chapter).

Finally, in the above function, the user must specify the value of the argument `num`. If it is not specified by the user, R will throw an error.

-```r
+{line-numbers=off}
+~~~~~~~~
> f()
Error in f(): argument "num" is missing, with no default
-```
+~~~~~~~~

We can modify this behavior by setting a *default value* for the argument `num`. Any function argument can have a default value, if you wish to specify it. Sometimes, argument values are rarely modified (except in special cases) and it makes sense to set a default value for that argument. This relieves the user from having to specify the value of that argument every single time the function is called.

Here, for example, we could set the default value for `num` to be 1, so that if the function is called without the `num` argument being explicitly specified, then it will print "Hello, world!" to the console once.

-```r
+{line-numbers=off}
+~~~~~~~~
> f <- function(num = 1) {
+         hello <- "Hello, world!\n"
+         for(i in seq_len(num)) {
@@ -141,7 +147,7 @@ Hello, world!
Hello, world!
Hello, world!
[1] 28
-```
+~~~~~~~~

Remember that the function still returns the number of characters printed to the console.

@@ -157,80 +163,87 @@ At this point, we have written a function that

Functions have _named arguments_ which can optionally have default values. Because all function arguments have names, they can be specified using their name.

-```r
+{line-numbers=off}
+~~~~~~~~
> f(num = 2)
Hello, world!
Hello, world!
[1] 28
-```
+~~~~~~~~

Specifying an argument by its name is sometimes useful if a function has many arguments and it may not always be clear which argument is being specified. Here, our function only has one argument so there's no confusion.

## Argument Matching

-Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handy when doing interactive work at the command line. R functions arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to second argument, etc. So in the following call to `rnorm()`
+Calling an R function with arguments can be done in a variety of ways. This may be confusing at first, but it's really handy when doing interactive work at the command line. R function arguments can be matched *positionally* or by name. Positional matching just means that R assigns the first value to the first argument, the second value to the second argument, etc. 
So in the following call to `rnorm()` -```r +{line-numbers=off} +~~~~~~~~ > str(rnorm) function (n, mean = 0, sd = 1) > mydata <- rnorm(100, 2, 1) ## Generate some data -``` +~~~~~~~~ 100 is assigned to the `n` argument, 2 is assigned to the `mean` argument, and 1 is assigned to the `sd` argument, all by positional matching. The following calls to the `sd()` function (which computes the empirical standard deviation of a vector of numbers) are all equivalent. Note that `sd()` has two arguments: `x` indicates the vector of numbers and `na.rm` is a logical indicating whether missing values should be removed or not. -```r +{line-numbers=off} +~~~~~~~~ > ## Positional match first argument, default for 'na.rm' > sd(mydata) -[1] 0.873495 +[1] 0.8707092 > ## Specify 'x' argument by name, default for 'na.rm' > sd(x = mydata) -[1] 0.873495 +[1] 0.8707092 > ## Specify both arguments by name > sd(x = mydata, na.rm = FALSE) -[1] 0.873495 -``` +[1] 0.8707092 +~~~~~~~~ When specifying the function arguments by name, it doesn't matter in what order you specify them. In the example below, we specify the `na.rm` argument first, followed by `x`, even though `x` is the first argument defined in the function definition. -```r +{line-numbers=off} +~~~~~~~~ > ## Specify both arguments by name > sd(na.rm = FALSE, x = mydata) -[1] 0.873495 -``` +[1] 0.8707092 +~~~~~~~~ You can mix positional matching with matching by name. When an argument is matched by name, it is “taken out” of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition. -```r +{line-numbers=off} +~~~~~~~~ > sd(na.rm = FALSE, mydata) -[1] 0.873495 -``` +[1] 0.8707092 +~~~~~~~~ Here, the `mydata` object is assigned to the `x` argument, because it's the only argument not yet specified. Below is the argument list for the `lm()` function, which fits linear models to a dataset. -```r +{line-numbers=off} +~~~~~~~~ > args(lm) function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...) NULL -``` +~~~~~~~~ The following two calls are equivalent. -```r +{line-numbers=off} +~~~~~~~~ lm(data = mydata, y ~ x, model = FALSE, 1:100) lm(y ~ x, mydata, 1:100, model = FALSE) -``` +~~~~~~~~ Even though it’s legal, I don’t recommend messing around with the order of the arguments too much, since it can lead to some confusion. @@ -247,11 +260,12 @@ Partial matching should be avoided when writing longer code or programs, because In addition to not specifying a default value, you can also set an argument value to `NULL`. -```r +{line-numbers=off} +~~~~~~~~ f <- function(a, b = 1, c = 2, d = NULL) { } -``` +~~~~~~~~ You can check to see whether an R object is `NULL` with the `is.null()` function. It is sometimes useful to allow an argument to take the `NULL` value, which might indicate that the function should take some specific action. @@ -263,20 +277,22 @@ Arguments to functions are evaluated _lazily_, so they are evaluated only as nee In this example, the function `f()` has two arguments: `a` and `b`. -```r +{line-numbers=off} +~~~~~~~~ > f <- function(a, b) { + a^2 + } > f(2) [1] 4 -``` +~~~~~~~~ This function never actually uses the argument `b`, so calling `f(2)` will not produce an error because the 2 gets positionally matched to `a`. This behavior can be good or bad. It's common to write a function that doesn't use an argument and not notice it simply because R never throws an error. 
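A small sketch of my own makes the point concrete: because `b` is never evaluated, even an argument that would signal an error goes unnoticed (the `stop()` call here is purely illustrative).

{line-numbers=off}
~~~~~~~~
> f <- function(a, b) {
+         a^2
+ }
> ## 'b' is bound to an expression that would fail if evaluated,
> ## but f() never touches 'b', so no error is raised
> f(2, stop("never evaluated"))
[1] 4
~~~~~~~~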
This example also shows lazy evaluation at work, but does eventually result in an error. -```r +{line-numbers=off} +~~~~~~~~ > f <- function(a, b) { + print(a) + print(b) @@ -284,7 +300,7 @@ This example also shows lazy evaluation at work, but does eventually result in a > f(45) [1] 45 Error in print(b): argument "b" is missing, with no default -``` +~~~~~~~~ Notice that "45" got printed first before the error was triggered. This is because `b` did not have to be evaluated until after `print(a)`. Once the function tried to evaluate `print(b)` the function had to throw an error. @@ -295,37 +311,40 @@ There is a special argument in R known as the `...` argument, which indicate a v For example, a custom plotting function may want to make use of the default `plot()` function along with its entire argument list. The function below changes the default for the `type` argument to the value `type = "l"` (the original default was `type = "p"`). -```r +{line-numbers=off} +~~~~~~~~ myplot <- function(x, y, type = "l", ...) { plot(x, y, type = type, ...) ## Pass '...' to 'plot' function } -``` +~~~~~~~~ Generic functions use `...` so that extra arguments can be passed to methods. -```r +{line-numbers=off} +~~~~~~~~ > mean function (x, ...) UseMethod("mean") - + -``` +~~~~~~~~ The `...` argument is necessary when the number of arguments passed to the function cannot be known in advance. This is clear in functions like `paste()` and `cat()`. -```r +{line-numbers=off} +~~~~~~~~ > args(paste) -function (..., sep = " ", collapse = NULL, recycle0 = FALSE) +function (..., sep = " ", collapse = NULL) NULL > args(cat) function (..., file = "", sep = " ", fill = FALSE, labels = NULL, append = FALSE) NULL -``` +~~~~~~~~ Because both `paste()` and `cat()` print out text to the console by combining multiple character vectors together, it is impossible for those functions to know in advance how many character vectors will be passed to the function by the user. So the first argument to either function is `...`. @@ -336,37 +355,44 @@ One catch with `...` is that any arguments that appear _after_ `...` on the argu Take a look at the arguments to the `paste()` function. -```r +{line-numbers=off} +~~~~~~~~ > args(paste) -function (..., sep = " ", collapse = NULL, recycle0 = FALSE) +function (..., sep = " ", collapse = NULL) NULL -``` +~~~~~~~~ With the `paste()` function, the arguments `sep` and `collapse` must be named explicitly and in full if the default values are not going to be used. Here I specify that I want "a" and "b" to be pasted together and separated by a colon. -```r +{line-numbers=off} +~~~~~~~~ > paste("a", "b", sep = ":") [1] "a:b" -``` +~~~~~~~~ If I don't specify the `sep` argument in full and attempt to rely on partial matching, I don't get the expected result. -```r +{line-numbers=off} +~~~~~~~~ > paste("a", "b", se = ":") [1] "a b :" -``` +~~~~~~~~ ## Summary -* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object. -* Functions can be defined with named arguments; these function arguments can have default values. -* Functions arguments can be specified by name or by position in the argument list. -* Functions always return the last expression evaluated in the function body. 
+* Functions can be defined using the `function()` directive and are assigned to R objects just like any other R object.
+
+* Functions can be defined with named arguments; these function arguments can have default values.
+
+* Function arguments can be specified by name or by position in the argument list.
+
+* Functions always return the last expression evaluated in the function body.
+
* A variable number of arguments can be specified using the special `...` argument in a function definition.

diff --git a/manuscript/gettingstarted.md b/manuscript/gettingstarted.md
index 5f7e91d..31b199f 100644
--- a/manuscript/gettingstarted.md
+++ b/manuscript/gettingstarted.md
@@ -16,7 +16,7 @@ There is also an integrated development environment available for R
that is built by RStudio. I really like this IDE---it has a nice
editor with syntax highlighting, there is an R object viewer, and
there are a number of other nice features that are integrated. You can
-see how to install RStudio here:
+see how to install RStudio here:

- [Installing RStudio](https://youtu.be/bM7Sfz-LADM)

@@ -28,7 +28,7 @@ site](http://rstudio.com).
After you install R you will need to launch it and start writing R
code. Before we get to exactly how to write R code, it's useful to
get a sense of how the system is organized. In these two videos I talk
-about where to write code and how to set your working directory, which
+about where to write code and how to set your working directory, which
lets R know where to find all of your files.

- [Writing code and setting your working directory on the Mac](https://youtu.be/8xT3hmJQskU)

diff --git a/manuscript/nutsbolts.md b/manuscript/nutsbolts.md
index 9e37824..2e5d1b4 100644
--- a/manuscript/nutsbolts.md
+++ b/manuscript/nutsbolts.md
@@ -10,21 +10,23 @@ At the R prompt we type expressions. The `<-` symbol
is the assignment operator.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 1
> print(x)
[1] 1
> x
[1] 1
> msg <- "hello"
-```
+~~~~~~~~

The grammar of the language determines whether an expression is
complete or not.

-```r
+{line-numbers=off}
+~~~~~~~~
x <-  ## Incomplete expression
-```
+~~~~~~~~

The # character indicates a comment. Anything to the right of the #
(including the # itself) is ignored. This is the only comment
@@ -39,13 +41,14 @@ and the result of the evaluated expression is returned. The result may be
*auto-printed*.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 5  ## nothing printed
> x  ## auto-printing occurs
[1] 5
> print(x)  ## explicit printing
[1] 5
-```
+~~~~~~~~

The `[1]` shown in the output indicates that `x` is a vector and `5`
is its first element.

@@ -64,12 +67,13 @@ see this integer sequence of length 20.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 11:30
> x
 [1] 11 12 13 14 15 16 17 18 19 20 21 22
[13] 23 24 25 26 27 28 29 30
-```
+~~~~~~~~

@@ -135,7 +139,7 @@ used in ordinary calculations; e.g. `1 / Inf` is 0.

The value `NaN` represents an undefined value ("not a number");
e.g. 0 / 0; `NaN` can also be thought of as a missing value (more on that
-later).
+later).

## Attributes

@@ -170,14 +174,15 @@ The `c()` function can be used to create vectors of objects by
concatenating things together.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- c(0.5, 0.6)       ## numeric
> x <- c(TRUE, FALSE)    ## logical
> x <- c(T, F)           ## logical
> x <- c("a", "b", "c")  ## character
> x <- 9:29              ## integer
> x <- c(1+0i, 2+4i)     ## complex
-```
+~~~~~~~~

Note that in the above example, `T` and `F` are short-hand ways to
specify `TRUE` and `FALSE`.
However, in general one should try to use
the explicit `TRUE` and `FALSE` values when indicating logical
values. The `T` and `F` values are primarily there for when you're
feeling lazy.

You can also use the `vector()` function to initialize vectors.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- vector("numeric", length = 10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0
-```
+~~~~~~~~

## Mixing Objects

There are occasions when different classes of R objects get mixed
together. Sometimes this happens by accident but it can also happen on
purpose. So what happens with the following code?

-```r
+{line-numbers=off}
+~~~~~~~~
> y <- c(1.7, "a")   ## character
> y <- c(TRUE, 2)    ## numeric
> y <- c("a", TRUE)  ## character
-```
+~~~~~~~~

In each case above, we are mixing objects of two different classes in
a vector. But remember that the only rule about vectors says this is
@@ -226,7 +233,8 @@ Objects can be explicitly coerced from one class to another using the
`as.*` functions, if available.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"
-```
+~~~~~~~~

Sometimes, R can't figure out how to coerce an object and this can
result in `NA`s being produced.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- c("a", "b", "c")
> as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
> as.logical(x)
[1] NA NA NA
> as.complex(x)
Warning: NAs introduced by coercion
[1] NA NA NA
-```
+~~~~~~~~

When nonsensical coercion takes place, you will usually get a warning
from R.

## Matrices

Matrices are vectors with a _dimension_ attribute. The dimension
attribute is itself an integer vector of length 2 (number of rows,
-number of columns).
+number of columns).

-```r
+{line-numbers=off}
+~~~~~~~~
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> attributes(m)
$dim
[1] 2 3
-```
+~~~~~~~~

Matrices are constructed _column-wise_, so entries can be thought of
starting in the "upper left" corner and running down the columns.

-```r
+{line-numbers=off}
+~~~~~~~~
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
-```
+~~~~~~~~

Matrices can also be created directly from vectors by adding a
dimension attribute.

-```r
+{line-numbers=off}
+~~~~~~~~
> m <- 1:10
> m
 [1]  1  2  3  4  5  6  7  8  9 10
> dim(m) <- c(2, 5)
> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
-```
+~~~~~~~~

Matrices can be created by _column-binding_ or _row-binding_ with the
`cbind()` and `rbind()` functions.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
-```
+~~~~~~~~

## Lists

Lists can be explicitly created using the `list()` function, which
takes an arbitrary number of arguments.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i
-```
+~~~~~~~~

We can also create an empty list of a prespecified length with the
-`vector()` function.
+`vector()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- vector("list", length = 5)
> x
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

[[5]]
NULL
-```
+~~~~~~~~

## Factors

@@ -380,7 +395,7 @@ Factors are used to represent categorical data and can be unordered
or ordered. One can think of a factor as an integer vector where each
integer has a _label_.
Factors are important in statistical modeling
-and are treated specially by modeling functions like `lm()` and
+and are treated specially by modeling functions like `lm()` and
`glm()`.

Using factors with labels is _better_ than using integers because
factors are self-describing. Having a variable that has values "Male"
and "Female" is better than a variable that has values 1 and 2.

Factor objects can be created with the `factor()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x
[1] yes yes no  yes no 
Levels: no yes
> table(x)
@@ -404,7 +420,7 @@ x
[1] 2 2 1 2 1
attr(,"levels")
[1] "no"  "yes"
-```
+~~~~~~~~

Often factors will be automatically created for you when you read a
dataset in using a function like `read.table()`. Those functions often

argument to `factor()`. This can be important in linear modeling
because the first level is used as the baseline level.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- factor(c("yes", "yes", "no", "yes", "no"))
> x  ## Levels are put in alphabetical order
[1] yes yes no  yes no 
Levels: no yes
> x
[1] yes yes no  yes no 
Levels: yes no
-```
+~~~~~~~~

## Missing Values

-Missing values are denoted by `NA` or `NaN` for undefined
+Missing values are denoted by `NA` or `NaN` for undefined
mathematical operations.

- `is.na()` is used to test objects if they are `NA`

@@ -444,7 +461,8 @@ mathematical operations.

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Create a vector with NAs in it
> x <- c(1, 2, NA, 10, 3)
> ## Return a logical vector indicating which elements are NA
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> ## Return a logical vector indicating which elements are NaN
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
-```
+~~~~~~~~

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Now create a vector with both NA and NaN values
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE FALSE  TRUE FALSE FALSE
-```
+~~~~~~~~

## Data Frames

Data frames are used to store tabular data in R. They are an
important type of object in R and are used in a variety of statistical
modeling applications. Hadley Wickham's package
-[dplyr](https://github.com/tidyverse/dplyr) has an optimized set of
+[dplyr](https://github.com/tidyverse/dplyr) has an optimized set of
functions designed to work efficiently with data frames.

Data frames are represented as a special type of list where every
@@ -497,7 +516,8 @@ should be used to coerce a data frame to a matrix, almost always, what
you want is the result of `data.matrix()`.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2
-```
+~~~~~~~~

## Names

R objects can have names, which is very useful for writing readable
code and self-describing objects. Here is an example of assigning
names to an integer vector.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- 1:3
> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los Angeles")
> x
   New York     Seattle Los Angeles 
          1           2           3 
> names(x)
[1] "New York"    "Seattle"     "Los Angeles"
-```
+~~~~~~~~

Lists can also have names, which is often very useful.

-```r
+{line-numbers=off}
+~~~~~~~~
> x <- list("Los Angeles" = 1, Boston = 2, London = 3)
> x
$`Los Angeles`
[1] 1

$Boston
[1] 2

$London
[1] 3
> names(x)
[1] "Los Angeles" "Boston"      "London"     
-```
+~~~~~~~~

Matrices can have both column and row names.
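Both sets of names can also be supplied up front via the `dimnames` argument of `matrix()`; this small sketch is equivalent to the `dimnames<-` assignment shown next.

{line-numbers=off}
~~~~~~~~
> ## Row names and column names given at creation time
> m <- matrix(1:4, nrow = 2, ncol = 2, dimnames = list(c("a", "b"), c("c", "d")))
> m
  c d
a 1 3
b 2 4
~~~~~~~~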
-```r
+{line-numbers=off}
+~~~~~~~~
> m <- matrix(1:4, nrow = 2, ncol = 2)
> dimnames(m) <- list(c("a", "b"), c("c", "d"))
> m
  c d
a 1 3
b 2 4
-```
+~~~~~~~~

Column names and row names can be set separately using the
`colnames()` and `rownames()` functions.

-```r
+{line-numbers=off}
+~~~~~~~~
> colnames(m) <- c("h", "f")
> rownames(m) <- c("x", "z")
> m
  h f
x 1 3
z 2 4
-```
+~~~~~~~~

Note that for data frames, there is a separate function for setting
the row names, the `row.names()` function. Also, data frames do not
@@ -587,8 +611,8 @@ know its confusing. Here's a quick summary:

## Summary

-There are a variety of different built-in data types in R. In this
-chapter we have reviewed the following:
+There are a variety of different built-in data types in R. In this
+chapter we have reviewed the following:

- atomic classes: numeric, logical, character, integer, complex

diff --git a/manuscript/overview.md b/manuscript/overview.md
index 82ec124..cf15008 100644
--- a/manuscript/overview.md
+++ b/manuscript/overview.md
@@ -56,7 +56,7 @@ figuring out how to make data analysis easier, first for themselves,
and then eventually for others.

In [Stages in the Evolution of
-S](https://web.archive.org/web/20150305201743/http://www.stat.bell-labs.com/S/history.html), John Chambers
+S](https://web.archive.org/web/20150305201743/http://www.stat.bell-labs.com/S/history.html), John Chambers
writes:

> “[W]e wanted users to be able to begin in an interactive environment,

@@ -67,7 +67,7 @@ writes:

The key part here was the transition from *user* to *developer*. They
wanted to build a language that could easily service both
-"people". More technically, they needed to build a language that would
+"people". More technically, they needed to build a language that would
be suitable for interactive data analysis (more command-line based) as
well as for writing longer programs (more traditional programming
language-like).

@@ -133,7 +133,7 @@ beginning and has generally been better than competing packages. Today,
with many more visualization packages available than before, that
trend continues. R's base graphics system allows for very fine
control over essentially every aspect of a plot or graph. Other
-newer graphics systems, like lattice and ggplot2, allow for complex and
+newer graphics systems, like lattice and ggplot2, allow for complex and
sophisticated visualizations of high-dimensional data.

R has maintained the original S philosophy, which is that it provides a

@@ -196,9 +196,9 @@ functionality of R. The R system is divided into 2 conceptual parts:

1. The "base" R system that you download from CRAN:
-[Linux](http://cran.r-project.org/bin/linux/),
-[Windows](http://cran.r-project.org/bin/windows/),
-[Mac](http://cran.r-project.org/bin/macosx/), [Source
+[Linux](http://cran.r-project.org/bin/linux/),
+[Windows](http://cran.r-project.org/bin/windows/),
+[Mac](http://cran.r-project.org/bin/macosx/), [Source
Code](http://cran.r-project.org/src/base/R-3/R-3.1.3.tar.gz)

2. Everything else.

@@ -221,7 +221,7 @@ When you download a fresh installation of R from CRAN, you get all of
the above, which represents a substantial amount of functionality. However,
there are many other packages available:

-- There are over 4,000 packages on CRAN that have been developed by
+- There are over 4,000 packages on CRAN that have been developed by
  users and programmers around the world.

- There are also many packages associated with the [Bioconductor

@@ -305,7 +305,7 @@ this book. 
Also, available from [CRAN](http://cran.r-project.org) are

- [R Installation and
  Administration](http://cran.r-project.org/doc/manuals/r-release/R-admin.html):
-  This is mostly for building R from the source code
+  This is mostly for building R from the source code

- [R
  Internals](http://cran.r-project.org/doc/manuals/r-release/R-ints.html):

diff --git a/manuscript/profiler.md b/manuscript/profiler.md
index 8ca9fdb..1414e39 100644
--- a/manuscript/profiler.md
+++ b/manuscript/profiler.md
@@ -7,9 +7,9 @@

-R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing.
+R comes with a profiler to help you optimize your code and improve its performance. In general, it's usually a bad idea to focus on optimizing your code at the very beginning of development. Rather, in the beginning it's better to focus on translating your ideas into code and writing code that's coherent and readable. The problem is that heavily optimized code tends to be obscure and difficult to read, making it harder to debug and revise. Better to get all the bugs out first, then focus on optimizing.

-Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly you should optimize the parts of your code that are running slowly, but how do we know what parts those are?
+Of course, when it comes to optimizing code, the question is what should you optimize? Well, clearly you should optimize the parts of your code that are running slowly, but how do we know what parts those are?

This is what the profiler is for. Profiling is a systematic way to examine how much time is spent in different parts of a program.

@@ -24,9 +24,9 @@ the code spends most of its time. This cannot be done without some sort of rigor

The basic principles of optimizing your code are:

-* Design first, then optimize.
+* Design first, then optimize.

-* Remember: Premature optimization is the root of all evil.
+* Remember: Premature optimization is the root of all evil.

* Measure (collect data), don’t guess.

@@ -35,29 +35,31 @@

## Using `system.time()`

-The `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. The function returns an object of class `proc_time` which contains two useful bits of information:
+The `system.time()` function takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression. The `system.time()` function computes the time (in seconds) needed to execute an expression and if there’s an error, gives the time until the error occurred. 
The function returns an object of class `proc_time` which contains two useful bits of information:

- *user time*: time charged to the CPU(s) for this expression

- *elapsed time*: "wall clock" time, the amount of time that passes for *you* as you're sitting there

-Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection).
+Usually, the user time and elapsed time are relatively close, for straight computing tasks. But there are a few situations where the two can diverge, sometimes dramatically. The elapsed time may be *greater than* the user time if the CPU spends a lot of time waiting around. This commonly happens if your R expression involves some input or output, which depends on the activity of the file system and the disk (or the Internet, if using a network connection).

-The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time.
+The elapsed time may be *smaller than* the user time if your machine has multiple cores/processors (and is capable of using them). For example, multi-threaded BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL) can greatly speed up linear algebra calculations and are commonly installed on even desktop systems these days. Also, parallel processing done via something like the `parallel` package can make the elapsed time smaller than the user time. When you have multiple processors/cores/machines working in parallel, the amount of time that the collection of CPUs spends working on a problem is the same as with a single CPU, but because they are operating in parallel, there is a savings in elapsed time.

Here's an example of where the elapsed time is greater than the user time.

-```r
+{line-numbers=off}
+~~~~~~~~
## Elapsed time > user time
system.time(readLines("http://www.jhsph.edu"))
   user  system elapsed 
  0.004   0.002   0.431 
-```
+~~~~~~~~

Most of the time in this expression is spent waiting for the connection to the web server and waiting for the data to travel back to my computer. This doesn't involve the CPU and so the CPU simply waits around for things to get done. Hence, the user time is small.

In this example, the elapsed time is smaller than the user time.

-```r
+{line-numbers=off}
+~~~~~~~~
## Elapsed time < user time
> hilbert <- function(n) { 
+         i <- 1:n
+         1 / outer(i - 1, i, "+")
+ }
> x <- hilbert(1000)
> system.time(svd(x))
   user  system elapsed 
  1.035   0.255   0.462 
-```
+~~~~~~~~
Because my computer is able to split the work across multiple processors, the elapsed time is about half the user time. @@ -77,7 +79,8 @@ In this case I ran a singular value decomposition on the matrix in `x`, which is You can time longer expressions by wrapping them in curly braces within the call to `system.time()`. -```r +{line-numbers=off} +~~~~~~~~ > system.time({ + n <- 1000 + r <- numeric(n) @@ -87,8 +90,8 @@ You can time longer expressions by wrapping them in curly braces within the call + } + }) user system elapsed - 0.06 0.00 0.06 -``` + 0.105 0.002 0.116 +~~~~~~~~ If your expression is getting pretty long (more than 2 or 3 lines), it might be better to either break it into smaller pieces or to use the profiler. The problem is that if the expression is too long, you won't be able to identify which part of the code is causing the bottleneck. @@ -96,19 +99,20 @@ If your expression is getting pretty long (more than 2 or 3 lines), it might be [Watch a video of this section](https://youtu.be/BZVcMPtlJ4A) -Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on that piece of code. What if you don’t know where to start? +Using `system.time()` allows you to test certain functions or code blocks to see if they are taking excessive amounts of time. However, this approach assumes that you already know where the problem is and can call `system.time()` on it that piece of code. What if you don’t know where to start? This is where the profiler comes in handy. The `Rprof()` function starts the profiler in R. Note that R must be compiled with profiler support (but this is usually the case). In conjunction with `Rprof()`, we will use the `summaryRprof()` function which summarizes the output from `Rprof()` (otherwise it’s not really readable). Note that you should NOT use `system.time()` and `Rprof()` together, or you will be sad. -`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But if your code runs that fast, you probably don't need the profiler. +`Rprof()` keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent inside each function. By default, the profiler samples the function call stack every 0.02 seconds. This means that if your code runs very quickly (say, under 0.02 seconds), the profiler is not useful. But of your code runs that fast, you probably don't need the profiler. The profiler is started by calling the `Rprof()` function. -```r +{line-numbers=off} +~~~~~~~~ > Rprof() ## Turn on the profiler -``` +~~~~~~~~ You don't need any other arguments. By default it will write its output to a file called `Rprof.out`. You can specify the name of the output file if you don't want to use this default. @@ -117,13 +121,15 @@ Once you call the `Rprof()` function, everything that you do from then on will b The profiler can be turned off by passing `NULL` to `Rprof()`. -```r +{line-numbers=off} +~~~~~~~~ > Rprof(NULL) ## Turn off the profiler -``` +~~~~~~~~ The raw output from the profiler looks something like this. Here I'm calling the `lm()` function on some data with the profiler running. 
-```r
+{line-numbers=off}
+~~~~~~~~
## lm(y ~ x)

sample.interval=10000
"lm.fit" "lm" 
"lm.fit" "lm" 
"lm.fit" "lm" 
-```
+~~~~~~~~

At each line of the output, the profiler writes out the function call stack. For example, on the very first line of the output you can see that the code is 8 levels deep in the call stack. This is where you need the `summaryRprof()` function to help you interpret this data.

@@ -155,7 +161,8 @@ The `summaryRprof()` function tabulates the R profiler output and calculates how

Here is what `summaryRprof()` reports in the "by.total" output.

-```r
+{line-numbers=off}
+~~~~~~~~
$by.total
                        total.time total.pct self.time self.pct
"lm"                          7.41    100.00      0.30     4.05
@@ -170,13 +177,14 @@ $by.total
"["                           1.03     13.90      0.00     0.00
"as.list.data.frame"          0.82     11.07      0.82    11.07
"as.list"                     0.82     11.07      0.00     0.00
-```
+~~~~~~~~

Because `lm()` is the function that I called from the command line, of course 100% of the time is spent somewhere in that function. However, what this doesn't show is that if `lm()` immediately calls another function (like `lm.fit()`, which does most of the heavy lifting), then in reality, most of the time is spent in *that* function, rather than in the top-level `lm()` function.

The "by.self" output corrects for this discrepancy.

-```r
+{line-numbers=off}
+~~~~~~~~
$by.self
                        self.time self.pct total.time total.pct
"lm.fit"                     2.99    40.35       3.50     47.23
@@ -191,31 +199,32 @@ $by.self
"as.character"               0.18     2.43       0.18      2.43
"model.frame.default"        0.12     1.62       2.24     30.23
"anyDuplicated.default"      0.02     0.27       0.02      0.27
-```
+~~~~~~~~

Now you can see that only about 4% of the runtime is spent in the actual `lm()` function, whereas over 40% of the time is spent in `lm.fit()`. In this case, this is no surprise since the `lm.fit()` function is the function that actually fits the linear model.

-You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of preprocessing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking.
+You can see that a reasonable amount of time is spent in functions not necessarily associated with linear modeling (i.e. `as.list.data.frame`, `[.data.frame`). This is because the `lm()` function does a bit of preprocessing and checking before it actually fits the model. This is common with modeling functions---the preprocessing and checking is useful to see if there are any errors. But those two functions take up over 1.5 seconds of runtime. What if you want to fit this model 10,000 times? You're going to be spending a lot of time in preprocessing and checking.

The final bit of output that `summaryRprof()` provides is the sampling interval and the total runtime.

-```r
+{line-numbers=off}
+~~~~~~~~
$sample.interval
[1] 0.02

$sampling.time
[1] 7.41
-```
+~~~~~~~~

## Summary

-* `Rprof()` runs the profiler for performance analysis of R code.
+* `Rprof()` runs the profiler for performance analysis of R code.

* `summaryRprof()` summarizes the output of `Rprof()` and gives
  percent of time spent in each function (with two types of
-  normalization).
+  normalization).

* Good to break your code into functions so that the profiler can give
-  useful information about where time is being spent.
+  useful information about where time is being spent.

-* C or Fortran code is not profiled.
+* C or Fortran code is not profiled.

diff --git a/manuscript/readwritedata.md b/manuscript/readwritedata.md
index 162ac42..9e9fb17 100644
--- a/manuscript/readwritedata.md
+++ b/manuscript/readwritedata.md
@@ -62,26 +62,25 @@ The `read.table()` function has a few important arguments:
  your file, it's worth setting this to be the empty string `""`.
* `skip`, the number of lines to skip from the beginning
* `stringsAsFactors`, should character variables be coded as factors?
-  This defaults to `FALSE` as of R 4.0.0. In 2020,
-  [the default was changed](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/)
-  from `TRUE` to `FALSE` due to reproducibility and to stay consistent
-  with modern alternatives to data frames. Now we have lots of
+  This defaults to `FALSE` as of R 4.0.0. In 2020,
+  [the default was changed](https://developer.r-project.org/Blog/public/2020/02/16/stringsasfactors/)
+  from `TRUE` to `FALSE` due to reproducibility and to stay consistent
+  with modern alternatives to data frames. Now we have lots of
  data that is text data and they don't always represent categorical
-  variables. So setting it as `FALSE` makes sense in those
-  cases. With older versions of R, if you *always* want this to be
-  `FALSE`, you can set a global option via
-  `options(stringsAsFactors = FALSE)`. I've never seen so much heat
-  generated on discussion forums about an R function argument than the
-  `stringsAsFactors` argument. Seriously.
+  variables. So setting it as `FALSE` makes sense in those
+  cases. With older versions of R, if you *always* want this to be
+  `FALSE`, you can set a global option via
+  `options(stringsAsFactors = FALSE)`. I've never seen so much heat
+  generated on discussion forums about an R function argument than the
+  `stringsAsFactors` argument. Seriously.

For small to moderately sized datasets, you can usually call read.table without specifying any other arguments

-```r
+{line-numbers=off}
+~~~~~~~~
> data <- read.table("foo.txt")
-```
+~~~~~~~~

In this case, R will automatically

argument).

With much larger datasets, there are a few things that you can do
that will make your life easier and will prevent R from choking.

-* Read the help page for read.table, which contains many hints.
+* Read the help page for read.table, which contains many hints.
+
* Make a rough calculation of the memory required to store your
  dataset (see the next section for an example of how to do this). If
  the dataset is larger than the amount of RAM on your computer, you

  following:

-```r
+{line-numbers=off}
+~~~~~~~~
> initial <- read.table("datatable.txt", nrows = 100)
> classes <- sapply(initial, class)
> tabAll <- read.table("datatable.txt", colClasses = classes)
-```
+~~~~~~~~

* Set `nrows`. This doesn’t make R run faster but it helps with memory
  usage. A mild overestimate is okay. You can use the Unix tool `wc`

know a few things about your system.

* How much memory is available on your system?
* What other applications are in use? Can you close any of them?
* Are there other users logged into the same system?
-* What operating system are you using? Some operating systems can limit
-  the amount of memory a single process can access.
+* What operating system are you using? 
Some operating systems can limit
+  the amount of memory a single process can access.

## Calculating Memory Requirements for R Objects

required to store this data frame? Well, on most modern
computers [double precision floating point
numbers](http://en.wikipedia.org/wiki/Double-precision_floating-point_format)
are stored using 64 bits of memory, or 8 bytes. Given that
-information, you can do the following calculation:
+information, you can do the following calculation:
+

-```
-> 1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes
-> 1,440,000,000 / 2^20 bytes/MB = 1,373.29 MB = 1.34 GB
-```
+| 1,500,000 × 120 × 8 bytes/numeric | = 1,440,000,000 bytes            |
+|                                   | = 1,440,000,000 / 2^20^ bytes/MB |
+|                                   | = 1,373.29 MB                    |
+|                                   | = 1.34 GB                        |

So the dataset would require about 1.34 GB of RAM. Most computers
these days have at least that much RAM. However, you need to be aware

Reading in a large dataset for which you do not have enough RAM is one
easy way to freeze up your computer (or at least your R session). This
is usually an unpleasant experience that usually requires you to kill
the R process, in the best case scenario, or reboot your computer, in
-the worst case. So make sure to do a rough calculation of memory
+the worst case. So make sure to do a rough calculation of memory
requirements before reading in a large dataset. You'll thank me later.

# Using the `readr` Package

-The [readr](https://github.com/tidyverse/readr) package is recently
-developed by Hadley Wickham to deal with reading in large flat files
-quickly. The package provides replacements for functions like
-`read.table()` and `read.csv()`. The analogous functions in `readr`
-are `read_table()` and `read_csv()`. These functions are often
-*much* faster than their base R analogues and provide a few other
-nice features such as progress meters.
+The [readr](https://github.com/tidyverse/readr) package was recently
+developed by Hadley Wickham to deal with reading in large flat files
+quickly. The package provides replacements for functions like
+`read.table()` and `read.csv()`. The analogous functions in `readr`
+are `read_table()` and `read_csv()`. These functions are often
+*much* faster than their base R analogues and provide a few other
+nice features such as progress meters.

For the most part, you can use `read_table()` and `read_csv()`
pretty much anywhere you might use `read.table()` and `read.csv()`. In

specifying column types.

A typical call to `read_csv` will look as follows.

-```r
+{line-numbers=off}
+~~~~~~~~
> library(readr)
> teams <- read_csv("data/team_standings.csv")
-Rows: 32 Columns: 2
--- Column specification ------------------------------------------------------------------------------------------------------------------------------------
-Delimiter: ","
-chr (1): Team
-dbl (1): Standing
-
-i Use `spec()` to retrieve the full column specification for this data.
-i Specify the column types or set `show_col_types = FALSE` to quiet this message.
+Parsed with column specification:
+cols(
+  Standing = col_integer(),
+  Team = col_character()
+)
> teams
-# A tibble: 32 x 2
-   Standing Team
-
- 1        1 Spain
- 2        2 Netherlands
- 3        3 Germany
- 4        4 Uruguay
- 5        5 Argentina
- 6        6 Brazil
- 7        7 Ghana
- 8        8 Paraguay
- 9        9 Japan
-10       10 Chile
+# A tibble: 32 × 2
+   Standing        Team
+
+1         1       Spain
+2         2 Netherlands
+3         3     Germany
+4         4     Uruguay
+5         5   Argentina
+6         6      Brazil
+7         7       Ghana
+8         8    Paraguay
+9         9       Japan
+10       10       Chile
# ... 
with 22 more rows -``` +~~~~~~~~ By default, `read_csv` will open a CSV file and read it in line-by-line. It will also (by default), read in the first few rows of the table in order to figure out the type of each column (i.e. integer, character, etc.). From the `read_csv` help page: @@ -243,77 +243,88 @@ You can specify the type of each column with the `col_types` argument. In general, it's a good idea to specify the column types explicitly. This rules out any possible guessing errors on the part of `read_csv`. Also, specifying the column types explicitly provides a useful safety check in case anything about the dataset should change without you knowing about it. -```r +{line-numbers=off} +~~~~~~~~ > teams <- read_csv("data/team_standings.csv", col_types = "cc") -``` +~~~~~~~~ Note that the `col_types` argument accepts a compact representation. Here `"cc"` indicates that the first column is `character` and the second column is `character` (there are only two columns). Using the `col_types` argument is useful because often it is not easy to automatically figure out the type of a column by looking at a few rows (especially if a column has many missing values). The `read_csv` function will also read compressed files automatically. There is no need to decompress the file first or use the `gzfile` connection function. The following call reads a gzip-compressed CSV file containing download logs from the RStudio CRAN mirror. -```r +{line-numbers=off} +~~~~~~~~ > logs <- read_csv("data/2016-07-19.csv.bz2", n_max = 10) -Rows: 10 Columns: 10 --- Column specification ------------------------------------------------------------------------------------------------------------------------------------ -Delimiter: "," -chr (6): r_version, r_arch, r_os, package, version, country -dbl (2): size, ip_id -date (1): date -time (1): time - -i Use `spec()` to retrieve the full column specification for this data. -i Specify the column types or set `show_col_types = FALSE` to quiet this message. -``` +Parsed with column specification: +cols( + date = col_date(format = ""), + time = col_time(format = ""), + size = col_integer(), + r_version = col_character(), + r_arch = col_character(), + r_os = col_character(), + package = col_character(), + version = col_character(), + country = col_character(), + ip_id = col_integer() +) +~~~~~~~~ Note that the warnings indicate that `read_csv` may have had some difficulty identifying the type of each column. This can be solved by using the `col_types` argument. 
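For reference (a summary of my own, under the assumption that the compact form is used): each letter in the compact string stands for one column, in order, with "c" for character, "i" for integer, "d" for double, "l" for logical, "D" for date, and "_" to skip a column. A hypothetical two-column file could therefore be read as:

{line-numbers=off}
~~~~~~~~
> ## First column character, second column integer (hypothetical file name)
> dat <- read_csv("data/example.csv", col_types = "ci")
~~~~~~~~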
-```r
+{line-numbers=off}
+~~~~~~~~
> logs <- read_csv("data/2016-07-19.csv.bz2", col_types = "ccicccccci", n_max = 10)
> logs
-# A tibble: 10 x 10
-   date       time     size r_version r_arch r_os         package    version country ip_id
-
- 1 2016-07-19 22:00:00  1887881 3.3.0 x86_64 mingw32      data.table 1.9.6  US  1
- 2 2016-07-19 22:00:05    45436 3.3.1 x86_64 mingw32      assertthat 0.1    US  2
- 3 2016-07-19 22:00:03 14259016 3.3.1 x86_64 mingw32      stringi    1.1.1  DE  3
- 4 2016-07-19 22:00:05  1887881 3.3.1 x86_64 mingw32      data.table 1.9.6  US  4
- 5 2016-07-19 22:00:06   389615 3.3.1 x86_64 mingw32      foreach    1.4.3  US  4
- 6 2016-07-19 22:00:08    48842 3.3.1 x86_64 linux-gnu    tree       1.0-37 CO  5
- 7 2016-07-19 22:00:12      525 3.3.1 x86_64 darwin13.4.0 survival   2.39-5 US  6
- 8 2016-07-19 22:00:08  3225980 3.3.1 x86_64 mingw32      Rcpp       0.12.5 US  2
- 9 2016-07-19 22:00:09   556091 3.3.1 x86_64 mingw32      tibble     1.1    US  2
-10 2016-07-19 22:00:10   151527 3.3.1 x86_64 mingw32      magrittr   1.5    US  2
+# A tibble: 10 × 10
+         date     time     size r_version r_arch         r_os    package
+
+1  2016-07-19 22:00:00  1887881     3.3.0 x86_64      mingw32 data.table
+2  2016-07-19 22:00:05    45436     3.3.1 x86_64      mingw32 assertthat
+3  2016-07-19 22:00:03 14259016     3.3.1 x86_64      mingw32    stringi
+4  2016-07-19 22:00:05  1887881     3.3.1 x86_64      mingw32 data.table
+5  2016-07-19 22:00:06   389615     3.3.1 x86_64      mingw32    foreach
+6  2016-07-19 22:00:08    48842     3.3.1 x86_64    linux-gnu       tree
+7  2016-07-19 22:00:12      525     3.3.1 x86_64 darwin13.4.0   survival
+8  2016-07-19 22:00:08  3225980     3.3.1 x86_64      mingw32       Rcpp
+9  2016-07-19 22:00:09   556091     3.3.1 x86_64      mingw32     tibble
+10 2016-07-19 22:00:10   151527     3.3.1 x86_64      mingw32   magrittr
+# ... with 3 more variables: version , country , ip_id 
-```
+~~~~~~~~

You can specify the column type in a more detailed fashion by using the various `col_*` functions. For example, in the log data above, the first column is actually a date, so it might make more sense to read it in as a Date variable. If we wanted to just read in that first column, we could do

-```r
+{line-numbers=off}
+~~~~~~~~
> logdates <- read_csv("data/2016-07-19.csv.bz2", 
+                       col_types = cols_only(date = col_date()),
+                       n_max = 10)
> logdates
-# A tibble: 10 x 1
-   date
-
- 1 2016-07-19
- 2 2016-07-19
- 3 2016-07-19
- 4 2016-07-19
- 5 2016-07-19
- 6 2016-07-19
- 7 2016-07-19
- 8 2016-07-19
- 9 2016-07-19
+# A tibble: 10 × 1
+         date
+
+1  2016-07-19
+2  2016-07-19
+3  2016-07-19
+4  2016-07-19
+5  2016-07-19
+6  2016-07-19
+7  2016-07-19
+8  2016-07-19
+9  2016-07-19
10 2016-07-19
-```
+~~~~~~~~

Now the `date` column is stored as a `Date` object which can be used for relevant date-related computations (for example, see the `lubridate` package).

A> The `read_csv` function has a `progress` option that defaults to TRUE. This option provides a nice progress meter while the CSV file is being read. However, if you are using `read_csv` in a function, or perhaps embedding it in a loop, it's probably best to set `progress = FALSE`.



# Using Textual and Binary Formats for Storing Data

[Watch a video of this chapter](https://youtu.be/5mIPigbNDfk)

@@ -357,14 +368,15 @@ One way to pass data around is by deparsing the R object with `dput()`
and reading it back in (parsing it) using `dget()`. 
-```r +{line-numbers=off} +~~~~~~~~ > ## Create a data frame > y <- data.frame(a = 1, b = "a") > ## Print 'dput' output to console > dput(y) -structure(list(a = 1, b = "a"), class = "data.frame", row.names = c(NA, --1L)) -``` +structure(list(a = 1, b = structure(1L, .Label = "a", class = "factor")), .Names = c("a", +"b"), row.names = c(NA, -1L), class = "data.frame") +~~~~~~~~ Notice that the `dput()` output is in the form of R code and that it preserves metadata like the class of the object, the row names, and @@ -373,7 +385,8 @@ the column names. The output of `dput()` can also be saved directly to a file. -```r +{line-numbers=off} +~~~~~~~~ > ## Send 'dput' output to a file > dput(y, file = "y.R") > ## Read in 'dput' output from a file @@ -381,39 +394,41 @@ The output of `dput()` can also be saved directly to a file. > new.y a b 1 1 a -``` +~~~~~~~~ Multiple objects can be deparsed at once using the dump function and read back in using `source`. -```r +{line-numbers=off} +~~~~~~~~ > x <- "foo" > y <- data.frame(a = 1L, b = "a") -``` +~~~~~~~~ We can `dump()` R objects to a file by passing a character vector of their names. -```r +{line-numbers=off} +~~~~~~~~ > dump(c("x", "y"), file = "data.R") > rm(x, y) -``` +~~~~~~~~ The inverse of `dump()` is `source()`. -```r +{line-numbers=off} +~~~~~~~~ > source("data.R") > str(y) 'data.frame': 1 obs. of 2 variables: $ a: int 1 - $ b: chr "a" + $ b: Factor w/ 1 level "a": 1 > x [1] "foo" -``` - +~~~~~~~~ ## Binary Formats @@ -428,7 +443,8 @@ The key functions for converting R objects into a binary format are be saved to a file using the `save()` function. -```r +{line-numbers=off} +~~~~~~~~ > a <- data.frame(x = rnorm(100), y = runif(100)) > b <- c(3, 4.4, 1 / 3) > @@ -437,19 +453,20 @@ be saved to a file using the `save()` function. > > ## Load 'a' and 'b' into your workspace > load("mydata.rda") -``` +~~~~~~~~ If you have a lot of objects that you want to save to a file, you can save all objects in your workspace using the `save.image()` function. -```r +{line-numbers=off} +~~~~~~~~ > ## Save everything to a file > save.image(file = "mydata.RData") > > ## load all objects in this file > load("mydata.RData") -``` +~~~~~~~~ Notice that I've used the `.rda` extension when using `save()` and the `.RData` extension when using `save.image()`. This is just my personal @@ -467,12 +484,15 @@ When you call `serialize()` on an R object, the output will be a raw vector coded in hexadecimal format. -```r +{line-numbers=off} +~~~~~~~~ > x <- list(1, 2, 3) > serialize(x, NULL) - [1] 58 0a 00 00 00 03 00 04 01 01 00 03 05 00 00 00 00 06 43 50 31 32 35 32 00 00 00 13 00 00 00 03 00 00 00 0e 00 00 00 01 3f f0 00 00 00 00 00 00 00 00 -[51] 00 0e 00 00 00 01 40 00 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 40 08 00 00 00 00 00 00 -``` + [1] 58 0a 00 00 00 02 00 03 03 02 00 02 03 00 00 00 00 13 00 00 00 03 00 +[24] 00 00 0e 00 00 00 01 3f f0 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 +[47] 40 00 00 00 00 00 00 00 00 00 00 0e 00 00 00 01 40 08 00 00 00 00 00 +[70] 00 +~~~~~~~~ If you want, this can be sent to a file, but in that case you are better off using something like `save()`. @@ -483,6 +503,8 @@ losing precision or any metadata. If that is what you need, then `serialize()` is the function for you. + + # Interfaces to the Outside World [Watch a video of this chapter](https://youtu.be/Pb01WoJRUtY) @@ -499,7 +521,7 @@ made to files (most common) or to other more exotic things. 
In general, connections are powerful tools that let you navigate files
or other external objects. Connections can be thought of as a
translator that lets you talk to objects that are outside of R. Those
-outside objects could be anything from a database, a simple text
+outside objects could be anything from a database, a simple text
file, or a web service API. Connections allow R functions to talk
to all these different external objects without you having to write
custom code for each object.

@@ -510,10 +532,12 @@

Connections to text files can be created with the `file()` function.

-```r
+{line-numbers=off}
+~~~~~~~~
> str(file)
-function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), raw = FALSE, method = getOption("url.method", "default"))
+function (description = "", open = "", blocking = TRUE, encoding = getOption("encoding"), 
+    raw = FALSE, method = getOption("url.method", "default"))
-```
+~~~~~~~~

The `file()` function has a number of arguments that are common to
many other connection functions so it's worth going into a little
detail here.

The `open` argument allows for the following options:

-- "r", open file in read only mode
-- "w", open a file for writing (and initializing a new file)
-- "a", open a file for appending
-- "rb", "wb", "ab", reading, writing, or appending in binary mode (Windows)
+- "r", open file in read only mode
+- "w", open a file for writing (and initializing a new file)
+- "a", open a file for appending
+- "rb", "wb", "ab", reading, writing, or appending in binary mode (Windows)

In practice, we often don't need to deal with the connection interface

For example, if one were to explicitly use connections to read a CSV
file in to R, it might look like this,

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Create a connection to 'foo.txt'
> con <- file("foo.txt")
> 
> 
> ## Close the connection
> close(con)
-```
+~~~~~~~~

which is the same as

-```r
+{line-numbers=off}
+~~~~~~~~
> data <- read.csv("foo.txt")
-```
+~~~~~~~~

In the background, `read.csv()` opens a connection to the file
`foo.txt`, reads from it, and closes the connection when it's done.

The above example shows the basic approach to using
-connections. Connections must be opened, then they are read from or
+connections. Connections must be opened, then they are read from or
written to, and then they are closed.

function. This function is useful for reading text files that may be
unstructured or contain non-standard data.

-```r
+{line-numbers=off}
+~~~~~~~~
> ## Open connection to gz-compressed text file
> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
-[1] "1080"     "10-point" "10th"     "11-point" "12-point" "16-point" "18-point" "1st"      "2"        "20-point"
+ [1] "1080"     "10-point" "10th"     "11-point" "12-point" "16-point"
+ [7] "18-point" "1st"      "2"        "20-point"
-```
+~~~~~~~~

-For more structured text data, like CSV files or tab-delimited files,
+For more structured text data, like CSV files or tab-delimited files,
there are other functions like `read.csv()` or `read.table()`.

The above example used the `gzfile()` function which is used to create

time to a text file.

## Reading From a URL Connection

The `readLines()` function can be useful for reading in lines of
-web pages. 
## Reading Lines of a Text File

Text files can be read line by line using the `readLines()` function. This function is useful for reading text files that may be unstructured or contain non-standard data.

{line-numbers=off}
~~~~~~~~
> ## Open connection to gz-compressed text file
> con <- gzfile("words.gz")
> x <- readLines(con, 10)
> x
 [1] "1080"     "10-point" "10th"     "11-point" "12-point" "16-point"
 [7] "18-point" "1st"      "2"        "20-point"
~~~~~~~~

For more structured text data like CSV files or tab-delimited files, there are other functions like `read.csv()` or `read.table()`.

The above example used the `gzfile()` function which is used to create

@@ -598,7 +626,7 @@ time to a text file.

## Reading From a URL Connection

The `readLines()` function can be useful for reading in lines of web pages. Since web pages are basically text files that are stored on a remote server, there is conceptually not much difference between a web page and a local text file. However, we need R to negotiate the communication between your computer and the web server. This is what the `url()` function can do for you, by creating a connection to a web server. This code might take time depending on your connection speed.

{line-numbers=off}
~~~~~~~~
> ## Open a URL connection for reading
> con <- url("http://www.jhsph.edu", "r")
>
> ## Read the web page
> x <- readLines(con)
>
> ## Print out the first few lines
> head(x)
[1] ""
[2] ""
[3] ""
[4] ""
[5] ""
[6] "<title>Johns Hopkins Bloomberg School of Public Health</title>"
~~~~~~~~

While reading in a simple web page is sometimes useful, particularly if data are embedded in the web page somewhere, more commonly we can use URL connections to read in specific data files that are stored on web servers.

Using URL connections can be useful for producing a reproducible analysis, because the code essentially documents where the data came from and how they were obtained. This approach is preferable to opening a web browser and downloading a dataset by hand. Of course, the code you write with connections may not be executable at a later date if things on the server side are changed or reorganized.

diff --git a/manuscript/regex.md b/manuscript/regex.md
index 6303cae..51297d9 100644
--- a/manuscript/regex.md
+++ b/manuscript/regex.md
@@ -18,20 +18,21 @@ If you want a very quick introduction to the general notion of regular expressio

The primary R functions for dealing with regular expressions are

- `grep()`, `grepl()`: These functions search for matches of a regular expression/pattern in a character vector. `grep()` returns the indices into the character vector that contain a match or the specific strings that happen to have the match. `grepl()` returns a `TRUE`/`FALSE` vector indicating which elements of the character vector contain a match.

- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices of the string where the match begins and the length of the match.

- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string.

- `regexec()`: This function searches a character vector for a regular expression, much like `regexpr()`, but it will additionally return the locations of any parenthesized sub-expressions. Probably easier to explain through demonstration.
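Before diving in, here is a quick illustration of a few of these functions; the character vector here is made up purely for this sketch, while the running example below uses real data.

{line-numbers=off}
~~~~~~~~
> ## Toy vector, invented for illustration
> x <- c("apple", "banana", "pear")
> grep("an", x)        ## index of the matching element
[1] 2
> grepl("an", x)       ## logical vector of matches
[1] FALSE  TRUE FALSE
> sub("an", "AN", x)   ## replace the first match in each element
[1] "apple"  "bANana" "pear"
~~~~~~~~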
For this chapter, we will use a running example using data from homicides in Baltimore City. The Baltimore Sun newspaper collects information on all homicides that occur in the city (it also reports on many of them). That data is collected and presented in a [map that is publicly available](http://data.baltimoresun.com/bing-maps/homicides/). I encourage you to go look at the website/map to get a sense of what kinds of data are presented there. Unfortunately, the data on the website are not particularly amenable to analysis, so I've scraped the data and put it in a separate file. The data in this file contain data from January 2007 to October 2013.

Here is an excerpt of the Baltimore City homicides dataset:

{line-numbers=off}
~~~~~~~~
> homicides <- readLines("homicides.txt")
>
> ## Total number of events recorded
> length(homicides)
[1] 1571
> homicides[1]
[1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
> homicides[1000]
[1] "39.33626300000, -76.55553990000, icon_homicide_shooting, 'p1200', '<dl><dt>Davon Diggs</dt><dd class=\"address\">4100 Parkwood Ave<br />Baltimore, MD 21206</dd><dd>Race: Black<br/>Gender: male<br/>Age: 21 years old</dd><dd>Found on November 5, 2011</dd><dd>Victim died at Johns Hopkins Bayview Medical Center</dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Originally reported in 5000 Belair Road; later determined to be rear alley of 4100 block Parkwood</p></dd></dl>'"
~~~~~~~~

The dataset is formatted so that each homicide is presented on a single line of text. So when we read the data in with `readLines()`, each element of the character vector represents one homicide event. Notice that the data are riddled with HTML tags because they were scraped directly from the website.

A few interesting features stand out: We have the latitude and longitude of where the victim was found; then there's the street address; the age, race, and gender of the victim; the date on which the victim was found; in which hospital the victim ultimately died; the cause of death.

## `grep()`

Suppose we wanted to identify the records for all the victims of shootings (as opposed to other causes). How could we do that? From the map we know that for each cause of death there is a different icon/flag placed on the map. In particular, they are different colors. You can see that is indicated in the dataset for shooting deaths with an `iconHomicideShooting` label. Perhaps we can use this aspect of the data to identify all of the shootings.

Here I use `grep()` to match the literal `iconHomicideShooting` in the character vector of homicides.

{line-numbers=off}
~~~~~~~~
> g <- grep("iconHomicideShooting", homicides)
> length(g)
[1] 228
~~~~~~~~

Using this approach I get 228 shooting deaths. However, I notice that for some of the entries, the indicator for the homicide "flag" is noted as `icon_homicide_shooting`. It's not uncommon over time for website maintainers to change the names of files or update files. What happens if we now `grep()` on both icon names using the `|` operator?

{line-numbers=off}
~~~~~~~~
> g <- grep("iconHomicideShooting|icon_homicide_shooting", homicides)
> length(g)
[1] 1263
~~~~~~~~

Now we have 1,263 shooting deaths, which is quite a bit more. In fact, the vast majority of homicides in Baltimore are shooting deaths.
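As a quick aside, the `|` operator in a regular expression matches either of its alternatives; a minimal sketch on made-up strings:

{line-numbers=off}
~~~~~~~~
> ## '|' matches either alternative
> grep("gray|grey", c("gray", "grey", "green"))
[1] 1 2
~~~~~~~~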
Another possible way to do this is to `grep()` on the cause of death field, which seems to have the format `Cause: shooting`. We can `grep()` on this literally and get

{line-numbers=off}
~~~~~~~~
> g <- grep("Cause: shooting", homicides)
> length(g)
[1] 228
~~~~~~~~

Notice that we seem to be undercounting again. This is because for some of the entries, the word "shooting" uses a capital "S" while other entries use a lower case "s". We can handle this variation by using a character class in our regular expression.

{line-numbers=off}
~~~~~~~~
> g <- grep("Cause: [Ss]hooting", homicides)
> length(g)
[1] 1263
~~~~~~~~

One thing you have to be careful of when processing text data is to not `grep()` things out of context. For example, suppose we just `grep()`-ed on the expression `[Ss]hooting`.

{line-numbers=off}
~~~~~~~~
> g <- grep("[Ss]hooting", homicides)
> length(g)
[1] 1265
~~~~~~~~

Notice that we seem to pick up 2 extra homicides this way. We can figure out which ones they are by comparing the results of the two expressions.

First we can get the indices for the first expression match.

{line-numbers=off}
~~~~~~~~
> i <- grep("[cC]ause: [Ss]hooting", homicides)
> str(i)
 int [1:1263] 1 2 6 7 8 9 10 11 12 13 ...
~~~~~~~~

Then we can get the indices for just matching on `[Ss]hooting`.

{line-numbers=off}
~~~~~~~~
> j <- grep("[Ss]hooting", homicides)
> str(j)
 int [1:1265] 1 2 6 7 8 9 10 11 12 13 ...
~~~~~~~~

Now we just need to identify which entries the vectors `i` and `j` do *not* have in common.

{line-numbers=off}
~~~~~~~~
> setdiff(i, j)
integer(0)
> setdiff(j, i)
[1] 318 859
~~~~~~~~

Here we can see that the index vector `j` has two entries that are not in `i`: entries 318, 859. We can take a look at these entries directly to see what makes them different.

{line-numbers=off}
~~~~~~~~
> homicides[859]
[1] "39.33743900000, -76.66316500000, icon_homicide_bluntforce, 'p914', '<dl><dt>Steven Harris</dt><dd class=\"address\">4200 Pimlico Road<br />Baltimore, MD 21215</dd><dd>Race: Black<br/>Gender: male<br/>Age: 38 years old</dd><dd>Found on July 29, 2010</dd><dd>Victim died at Scene</dd><dd>Cause: Blunt Force</dd><dd class=\"popup-note\"><p>Harris was found dead July 22 and ruled a shooting victim; an autopsy subsequently showed that he had not been shot,...</p></dd></dl>'"
~~~~~~~~

Here we can see that the word "shooting" appears in the narrative text that accompanies the data, but the ultimate cause of death was in fact blunt force.

@@ -145,62 +155,73 @@ A> When developing a regular expression to extract entries from a large dataset,

Sometimes we want to identify elements of a character vector that match a pattern, but instead of returning their indices we want the actual values that satisfy the match. For example, we may want to identify all of the states in the United States whose names start with "New".

{line-numbers=off}
~~~~~~~~
> grep("^New", state.name)
[1] 29 30 31 32
~~~~~~~~

This gives us the indices into the `state.name` variable that match, but setting `value = TRUE` returns the actual elements of the character vector that match.

{line-numbers=off}
~~~~~~~~
> grep("^New", state.name, value = TRUE)
[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"     
~~~~~~~~

## `grepl()`

The function `grepl()` works much like `grep()` except that it differs in its return value. `grepl()` returns a logical vector indicating which element of a character vector contains the match. For example, suppose we want to know which states in the United States begin with the word "New".

{line-numbers=off}
~~~~~~~~
> g <- grepl("^New", state.name)
> g
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE
> state.name[g]
[1] "New Hampshire" "New Jersey"    "New Mexico"    "New York"     
~~~~~~~~

Here, we can see that `grepl()` returns a logical vector that can be used to subset the original `state.name` vector.

## `regexpr()`

Both the `grep()` and the `grepl()` functions have some limitations. In particular, both functions tell you which strings in a character vector match a certain pattern but they don't tell you exactly where the match occurs or what the match is for a more complicated regular expression.

The `regexpr()` function gives you the (a) index into each string where the match begins and the (b) length of the match for that string. `regexpr()` only gives you the *first* match of the string (reading left to right). `gregexpr()` will give you *all* of the matches in a given string if there is more than one match.
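While `gregexpr()` is not used in the running example below, a minimal sketch on a toy string shows the difference:

{line-numbers=off}
~~~~~~~~
> ## regexpr() returns only the first match; gregexpr() returns all of them
> regexpr("an", "banana")
[1] 2
attr(,"match.length")
[1] 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] FALSE
> gregexpr("an", "banana")
[[1]]
[1] 2 4
attr(,"match.length")
[1] 2 2
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] FALSE
~~~~~~~~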
In our Baltimore City homicides dataset, we might be interested in finding the date on which each victim was found. Taking a look at the dataset

{line-numbers=off}
~~~~~~~~
> homicides[1]
[1] "39.311024, -76.674227, iconHomicideShooting, 'p2', '<dl><dt>Leon Nelson</dt><dd class=\"address\">3400 Clifton Ave.<br />Baltimore, MD 21216</dd><dd>black male, 17 years old</dd><dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd></dl>'"
~~~~~~~~

it seems that we might be able to just `grep()` on the word "Found". However, the word "found" may be found elsewhere in the entry, such as in this entry, where the word "found" appears in the narrative text at the end.

{line-numbers=off}
~~~~~~~~
> homicides[954]
[1] "39.30677400000, -76.59891100000, icon_homicide_shooting, 'p816', '<dl><dt>Kenly Wheeler</dt><dd class=\"address\">1400 N Caroline St<br />Baltimore, MD 21213</dd><dd>Race: Black<br/>Gender: male<br/>Age: 29 years old</dd><dd>Found on March 3, 2010</dd><dd>Victim died at Scene</dd><dd>Cause: Shooting</dd><dd class=\"popup-note\"><p>Wheeler\\'s body was found on the grounds of Dr. Bernard Harris Sr. Elementary School</p></dd></dl>'"
~~~~~~~~

But we can see that the date is typically preceded by "Found on" and is surrounded by `<dd>` tags, so let's use the pattern `<dd>[F|f]ound(.*)</dd>` and see what it brings up.
{line-numbers=off}
~~~~~~~~
> regexpr("<dd>[F|f]ound(.*)</dd>", homicides[1:10])
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 93 89 90 89 89 61 89 85 84 88
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
~~~~~~~~

We can use the `substr()` function to extract the first match in the first string.

{line-numbers=off}
~~~~~~~~
> substr(homicides[1], 177, 177 + 93 - 1)
[1] "<dd>Found on January 1, 2007</dd><dd>Victim died at Shock Trauma</dd><dd>Cause: shooting</dd>"
~~~~~~~~

Immediately, we can see that the regular expression picked up too much information. This is because the previous pattern was too greedy and matched too much of the string. We need to use the `?` metacharacter to make the regular expression "lazy" so that it stops at the *first* `</dd>` tag.

{line-numbers=off}
~~~~~~~~
> regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:10])
 [1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
 [1] 33 33 33 33 33 33 33 33 33 33
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
~~~~~~~~

Now when we look at the substrings indicated by the `regexpr()` output, we get

{line-numbers=off}
~~~~~~~~
> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"
~~~~~~~~

While it's straightforward to take the output of `regexpr()` and feed it into `substr()` to get the matches out of the original data, one handy function is `regmatches()` which extracts the matches in the strings for you without you having to use `substr()`.

{line-numbers=off}
~~~~~~~~
> r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
~~~~~~~~

## `sub()` and `gsub()`

Sometimes we need to clean things up or modify strings by matching a pattern and replacing it with something else. For example, how can we extract the date from this string?

{line-numbers=off}
~~~~~~~~
> x <- substr(homicides[1], 177, 177 + 33 - 1)
> x
[1] "<dd>Found on January 1, 2007</dd>"
~~~~~~~~

We want to strip out the stuff surrounding the "January 1, 2007" portion. We can do that by matching on the text that comes before and after it using the `|` operator and then replacing it with the empty string.

{line-numbers=off}
~~~~~~~~
> sub("<dd>[F|f]ound on |</dd>", "", x)
[1] "January 1, 2007</dd>"
~~~~~~~~

Notice that the `sub()` function found the first match (at the beginning of the string) and replaced it and then stopped. However, there was another match at the end of the string that we also wanted to replace. To get both matches, we need the `gsub()` function.

{line-numbers=off}
~~~~~~~~
> gsub("<dd>[F|f]ound on |</dd>", "", x)
[1] "January 1, 2007"
~~~~~~~~

The `sub()` and `gsub()` functions can take vector arguments so we don't have to process each string one by one.

{line-numbers=off}
~~~~~~~~
> r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
> m <- regmatches(homicides[1:5], r)
> m
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
> d <- gsub("<dd>[F|f]ound on |</dd>", "", m)
>
> ## Nice and clean
> d
[1] "January 1, 2007" "January 2, 2007" "January 2, 2007" "January 3, 2007"
[5] "January 5, 2007"
~~~~~~~~

Finally, it may be useful to convert these strings to the `Date` class so that we can do some date-related computations.

{line-numbers=off}
~~~~~~~~
> as.Date(d, "%B %d, %Y")
[1] "2007-01-01" "2007-01-02" "2007-01-02" "2007-01-03" "2007-01-05"
~~~~~~~~
## `regexec()`

The `regexec()` function works like `regexpr()` except it gives you the indices for parenthesized sub-expressions. For example, take a look at the following expression.

{line-numbers=off}
~~~~~~~~
> regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
[[1]]
[1] 177 190
attr(,"match.length")
[1] 33 15
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
~~~~~~~~

Notice first that the regular expression itself has a portion in parentheses `()`. That is the portion of the expression that I presume will contain the date. In the output, you'll notice that there are two indices and two "match.length" values. The first index tells you where the overall match begins (character 177) and the second index tells you where the expression in the parentheses begins (character 190).

By contrast, if we only use the `regexpr()` function, we get

{line-numbers=off}
~~~~~~~~
> regexec("<dd>[F|f]ound on .*?</dd>", homicides[1])
[[1]]
[1] 177
attr(,"match.length")
[1] 33
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
~~~~~~~~

We can use the `substr()` function to demonstrate which parts of the strings are matched by the `regexec()` function.

Here's the output for `regexec()`.

{line-numbers=off}
~~~~~~~~
> regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1])
[[1]]
[1] 177 190
attr(,"match.length")
[1] 33 15
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
~~~~~~~~

Here's the overall expression match.

{line-numbers=off}
~~~~~~~~
> substr(homicides[1], 177, 177 + 33 - 1)
[1] "<dd>Found on January 1, 2007</dd>"
~~~~~~~~

And here's the parenthesized sub-expression.

{line-numbers=off}
~~~~~~~~
> substr(homicides[1], 190, 190 + 15 - 1)
[1] "January 1, 2007"
~~~~~~~~

All this can be done much more easily with the `regmatches()` function.

{line-numbers=off}
~~~~~~~~
> r <- regexec("<dd>[F|f]ound on (.*?)</dd>", homicides[1:2])
> regmatches(homicides[1:2], r)
[[1]]
[1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"

[[2]]
[1] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"
~~~~~~~~

Notice that `regmatches()` returns a list in this case, where each element of the list contains two strings: the overall match and the parenthesized sub-expression.

As an example, we can make a plot of monthly homicide counts. First we need a regular expression to capture the dates.

{line-numbers=off}
~~~~~~~~
> r <- regexec("<dd>[F|f]ound on (.*?)</dd>", homicides)
> m <- regmatches(homicides, r)
~~~~~~~~

Then we can loop through the list returned by `regmatches()` and extract the second element of each (the parenthesized sub-expression).

{line-numbers=off}
~~~~~~~~
> dates <- sapply(m, function(x) x[2])
~~~~~~~~

Finally, we can convert the date strings into the `Date` class and make a histogram of the counts.

{line-numbers=off}
~~~~~~~~
> dates <- as.Date(dates, "%B %d, %Y")
> hist(dates, "month", freq = TRUE, main = "Monthly Homicides in Baltimore")
~~~~~~~~

![plot of chunk unnamed-chunk-35](images/regex-unnamed-chunk-35-1.png)

@@ -417,36 +459,40 @@ We can see from the picture that homicides do not occur uniformly throughout the

## The `stringr` Package

The `stringr` package is part of the [tidyverse](https://www.tidyverse.org) collection of packages and wraps the underlying `stringi` package in a series of convenience functions. Some of the complexity of using the base R regular expression functions is usefully hidden by the `stringr` functions. In addition, the `stringr` functions provide a more rational interface to regular expressions with more consistency in the arguments and argument ordering.

Given what we have discussed so far, there is a fairly straightforward mapping from the base R functions to the `stringr` functions. In general, for the `stringr` functions, the data are the first argument and the regular expression is the second argument, with optional arguments afterwards.

`str_subset()` is much like `grep(value = TRUE)` and returns a character vector of strings that contain a given match.

{line-numbers=off}
~~~~~~~~
> library(stringr)
> g <- str_subset(homicides, "iconHomicideShooting")
> length(g)
[1] 228
~~~~~~~~

`str_detect()` is essentially equivalent to `grepl()`.

`str_extract()` plays the role of `regexpr()` and `regmatches()`, extracting the matches from the output.

{line-numbers=off}
~~~~~~~~
> str_extract(homicides[1:10], "<dd>[F|f]ound(.*?)</dd>")
 [1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
 [3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
 [5] "<dd>Found on January 5, 2007</dd>" "<dd>Found on January 5, 2007</dd>"
 [7] "<dd>Found on January 5, 2007</dd>" "<dd>Found on January 7, 2007</dd>"
 [9] "<dd>Found on January 8, 2007</dd>" "<dd>Found on January 8, 2007</dd>"
~~~~~~~~

Finally, `str_match()` does the job of `regexec()` by providing a matrix containing the parenthesized sub-expressions.

{line-numbers=off}
~~~~~~~~
> str_match(homicides[1:5], "<dd>[F|f]ound on (.*?)</dd>")
     [,1]                                [,2]             
[1,] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"
[2,] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"
[3,] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"
[4,] "<dd>Found on January 3, 2007</dd>" "January 3, 2007"
[5,] "<dd>Found on January 5, 2007</dd>" "January 5, 2007"
~~~~~~~~

Note how the second column of the output contains the values of the parenthesized sub-expressions. We could now obtain these values by extracting the second column of the matrix. If there had been more parenthesized sub-expressions, there would have been more columns in the output matrix.
## Summary

The primary R functions for dealing with regular expressions are

- `grep()`, `grepl()`: Search for matches of a regular expression/pattern in a character vector
- `regexpr()`, `gregexpr()`: Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with `regmatches()`
- `sub()`, `gsub()`: Search a character vector for regular expression matches and replace that match with another string
- `regexec()`: Gives you indices of parenthesized sub-expressions

diff --git a/manuscript/scoping.md b/manuscript/scoping.md
index 9eb2691..df00ea3 100644
--- a/manuscript/scoping.md
+++ b/manuscript/scoping.md
@@ -11,27 +11,30 @@

How does R know which value to assign to which symbol? When I type

{line-numbers=off}
~~~~~~~~
> lm <- function(x) { x * x }
> lm
function(x) { x * x }
~~~~~~~~

how does R know what value to assign to the symbol `lm`? Why doesn't it give it the value of `lm` that is in the `stats` package?

When R tries to bind a value to a symbol, it searches through a series of `environments` to find the appropriate value. When you are working on the command line and need to retrieve the value of an R object, the order in which things occur is roughly

1. Search the global environment (i.e. your workspace) for a symbol name matching the one requested.
2. Search the namespaces of each of the packages on the search list.

The search list can be found by using the `search()` function.

{line-numbers=off}
~~~~~~~~
> search()
[1] ".GlobalEnv"        "package:knitr"     "package:stats"    
[4] "package:graphics"  "package:grDevices" "package:utils"    
[7] "package:datasets"  "Autoloads"         "package:base"     
~~~~~~~~

The _global environment_ or the user's workspace is always the first element of the search list and the `base` package is always the last. For better or for worse, the order of the packages on the search list matters, particularly if there are multiple objects with the same name in different packages.

@@ -44,18 +47,20 @@ Note that R has separate namespaces for functions and non-functions so it’s po

The scoping rules for R are the main feature that make it different from the original S language (in case you care about that). This may seem like an esoteric aspect of R, but it's one of its more interesting and useful features.

The scoping rules of a language determine how a value is associated with a *free variable* in a function. R uses [_lexical scoping_](http://en.wikipedia.org/wiki/Scope_(computer_science)#Lexical_scope_vs._dynamic_scope) or _static scoping_. An alternative to lexical scoping is _dynamic scoping_ which is implemented by some languages. Lexical scoping turns out to be particularly useful for simplifying statistical computations.

Related to the scoping rules is how R uses the *search list* to bind a value to a symbol.

Consider the following function.

{line-numbers=off}
~~~~~~~~
> f <- function(x, y) {
+         x^2 + y / z
+ }
~~~~~~~~

This function has 2 formal arguments `x` and `y`. In the body of the function there is another symbol `z`. In this case `z` is called a _free variable_.

@@ -75,7 +80,7 @@ A function, together with an environment, makes up what is called a _closure_ or

How do we associate a value to a free variable? There is a search process that occurs that goes as follows:

- If the value of a symbol is not found in the environment in which a function was defined, then the search is continued in the _parent environment_.
- The search continues down the sequence of parent environments until we hit the _top-level environment_; this is usually the global environment (workspace) or the namespace of a package.
- After the top-level environment, the search continues down the search list until we hit the _empty environment_. If a value for a given symbol cannot be found once the empty environment is arrived at, then an error is thrown.

@@ -91,59 +96,64 @@ Typically, a function is defined in the global environment, so that the values o

Here is an example of a function that returns another function as its return value. Remember, in R functions are treated like any other object and so this is perfectly valid.

{line-numbers=off}
~~~~~~~~
> make.power <- function(n) {
+         pow <- function(x) {
+                 x^n 
+         }
+         pow 
+ }
~~~~~~~~

The `make.power()` function is a kind of "constructor function" that can be used to construct other functions.

{line-numbers=off}
~~~~~~~~
> cube <- make.power(3)
> square <- make.power(2)
> cube(3)
[1] 27
> square(3)
[1] 9
~~~~~~~~

Let's take a look at the `cube()` function's code.

{line-numbers=off}
~~~~~~~~
> cube
function(x) {
        x^n 
}
~~~~~~~~

Notice that `cube()` has a free variable `n`. What is the value of `n` here? Well, its value is taken from the environment where the function was defined. When I defined the `cube()` function it was when I called `make.power(3)`, so the value of `n` at that time was 3.

We can explore the environment of a function to see what objects are there and their values.

{line-numbers=off}
~~~~~~~~
> ls(environment(cube))
[1] "n"   "pow"
> get("n", environment(cube))
[1] 3
~~~~~~~~

We can also take a look at the `square()` function.

{line-numbers=off}
~~~~~~~~
> ls(environment(square))
[1] "n"   "pow"
> get("n", environment(square))
[1] 2
~~~~~~~~
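Since the enclosing environment is just an ordinary environment, you can even modify it after the fact. A side sketch (not something you would normally do, but it makes the mechanism concrete):

{line-numbers=off}
~~~~~~~~
> ## Reassign the enclosed value of 'n' in cube()'s environment
> assign("n", 5, environment(cube))
> cube(2)    ## now computes 2^5
[1] 32
~~~~~~~~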
## Lexical vs. Dynamic Scoping

We can use the following example to demonstrate the difference between lexical and dynamic scoping rules.

{line-numbers=off}
~~~~~~~~
> y <- 10
> 
> f <- function(x) {
+         y <- 2
+         y^2 + g(x)
+ }
> 
> g <- function(x) { 
+         x*y
+ }
~~~~~~~~

What is the value of the following expression?

{line-numbers=off}
~~~~~~~~
f(3)
~~~~~~~~

With lexical scoping the value of `y` in the function `g` is looked up in the environment in which the function was defined, in this case the global environment, so the value of `y` is 10. With dynamic scoping, the value of `y` is looked up in the environment from which the function was _called_ (sometimes referred to as the _calling environment_). In R the calling environment is known as the _parent frame_. In this case, the value of `y` would be 2.

Consider this example.

{line-numbers=off}
~~~~~~~~
> g <- function(x) { 
+         a <- 3
+         x+a+y
+ }
> g(2)
Error in g(2): object 'y' not found
> y <- 3
> g(2)
[1] 8
~~~~~~~~

Here, `y` is defined in the global environment, which also happens to be where the function `g()` is defined.

@@ -217,7 +230,8 @@ Optimization routines in R like `optim()`, `nlm()`, and `optimize()` require you

Here is an example of a "constructor" function that creates a negative log-likelihood function that can be minimized to find maximum likelihood estimates in a statistical model.

{line-numbers=off}
~~~~~~~~
> make.NegLogLik <- function(data, fixed = c(FALSE, FALSE)) {
+         params <- fixed
+         function(p) {
+                 params[!fixed] <- p
+                 mu <- params[1]
+                 sigma <- params[2]
+                 
+                 a <- -0.5*length(data)*log(2*pi*sigma^2)
+                 b <- -0.5*sum((data-mu)^2) / (sigma^2)
+                 -(a + b)
+         }
+ }
~~~~~~~~

**Note**: Optimization functions in R _minimize_ functions, so you need to use the negative log-likelihood.

Now we can generate some data and then construct our negative log-likelihood.

{line-numbers=off}
~~~~~~~~
> set.seed(1)
> normals <- rnorm(100, 1, 2)
> nLL <- make.NegLogLik(normals)
> nLL
function(p) {
        params[!fixed] <- p
        mu <- params[1]
        sigma <- params[2]
        
        a <- -0.5*length(data)*log(2*pi*sigma^2)
        b <- -0.5*sum((data-mu)^2) / (sigma^2)
        -(a + b)
}
> 
> ## What's in the function environment?
> ls(environment(nLL))
[1] "data"   "fixed"  "params"
~~~~~~~~

Now that we have our `nLL()` function, we can try to minimize it with `optim()` to estimate the parameters.

{line-numbers=off}
~~~~~~~~
> optim(c(mu = 0, sigma = 1), nLL)$par
      mu    sigma 
1.218239 1.787343 
~~~~~~~~

You can see that the algorithm converged and obtained an estimate of `mu` and `sigma`.

We can also try to estimate one parameter while holding another parameter fixed. Here we fix `sigma` to be equal to 2.

{line-numbers=off}
~~~~~~~~
> nLL <- make.NegLogLik(normals, c(FALSE, 2))
> optimize(nLL, c(-1, 3))$minimum
[1] 1.217775
~~~~~~~~

Because we now have a one-dimensional problem, we can use the simpler `optimize()` function rather than `optim()`.

We can also try to estimate `sigma` while holding `mu` fixed at 1.

{line-numbers=off}
~~~~~~~~
> nLL <- make.NegLogLik(normals, c(1, FALSE))
> optimize(nLL, c(1e-6, 10))$minimum
[1] 1.800596
~~~~~~~~

## Plotting the Likelihood

Another nice feature that you can take advantage of is plotting the negative log-likelihood to see how peaked or flat it is.

Here is the function when `mu` is fixed.
{line-numbers=off}
~~~~~~~~
> ## Fix 'mu' to be equal to 1
> nLL <- make.NegLogLik(normals, c(1, FALSE))
> x <- seq(1.7, 1.9, len = 100)
>
> ## Evaluate 'nLL()' at every point in 'x'
> y <- sapply(x, nLL)
> plot(x, exp(-(y - min(y))), type = "l")
~~~~~~~~

![plot of chunk nLLFixMu](images/nLLFixMu-1.png)

Here is the function when `sigma` is fixed.

{line-numbers=off}
~~~~~~~~
> ## Fix 'sigma' to be equal to 2
> nLL <- make.NegLogLik(normals, c(FALSE, 2))
> x <- seq(0.5, 1.5, len = 100)
>
> ## Evaluate 'nLL()' at every point in 'x'
> y <- sapply(x, nLL)
> plot(x, exp(-(y - min(y))), type = "l")
~~~~~~~~

![plot of chunk nLLFixSigma](images/nLLFixSigma-1.png)

## Summary

- Objective functions can be "built" which contain all of the necessary data for evaluating the function
- No need to carry around long argument lists; useful for interactive and exploratory work
- Code can be simplified and cleaned up
- Reference: Robert Gentleman and Ross Ihaka (2000). "Lexical Scope and Statistical Computing," _JCGS_, 9, 491–508.

diff --git a/manuscript/simulation.md b/manuscript/simulation.md
index 72898b8..8fec973 100644
--- a/manuscript/simulation.md
+++ b/manuscript/simulation.md
@@ -9,14 +9,14 @@

Simulation is an important (and big) topic for both statistics and for a variety of other areas where there is a need to introduce randomness. Sometimes you want to implement a statistical procedure that requires random number generation or sampling (i.e. Markov chain Monte Carlo, the bootstrap, random forests, bagging) and sometimes you want to simulate a system and random number generators can be used to model random inputs.

R comes with a set of pseudo-random number generators that allow you to simulate from well-known probability distributions like the Normal, Poisson, and binomial. Some example functions for probability distributions in R:

- `rnorm`: generate random Normal variates with a given mean and standard deviation
- `dnorm`: evaluate the Normal probability density (with a given mean/SD) at a point (or vector of points)
- `pnorm`: evaluate the cumulative distribution function for a Normal distribution
- `rpois`: generate random Poisson variates with a given rate

For each probability distribution there are typically four functions available that start with an "r", "d", "p", and "q". The "r" function is the one that actually simulates random numbers from that distribution. The other functions are prefixed with a

- `d` for density
- `r` for random number generation
- `p` for cumulative distribution
- `q` for quantile function
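For example, for the Normal distribution the four functions are `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()`. A small sketch of the non-"r" members of the family:

{line-numbers=off}
~~~~~~~~
> dnorm(0)      ## density of the standard Normal at 0
[1] 0.3989423
> pnorm(0)      ## P(Z <= 0)
[1] 0.5
> qnorm(0.5)    ## the quantile function inverts pnorm()
[1] 0
~~~~~~~~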
If you're only interested in simulating random numbers, then you will likely only need the "r" functions and not the others. However, if you intend to simulate from arbitrary probability distributions using something like rejection sampling, then you will need the other functions too.

Probably the most common probability distribution to work with is the Normal distribution (also known as the Gaussian). Working with the Normal distribution requires using these four functions:

{line-numbers=off}
~~~~~~~~
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
~~~~~~~~

Here we simulate 10 standard Normal random numbers with mean 0 and standard deviation 1.

{line-numbers=off}
~~~~~~~~
> ## Simulate standard Normal random numbers
> x <- rnorm(10)
> x
 [1]  0.01874617 -0.18425254 -1.37133055 -0.59916772  0.29454513
 [6]  0.38979430 -1.20807618 -0.36367602 -1.62667268 -0.25647839
~~~~~~~~

We can modify the default parameters to simulate 10 numbers with mean 20 and standard deviation 2.

{line-numbers=off}
~~~~~~~~
> x <- rnorm(10, 20, 2)
> x
 [1] 22.20356 21.51156 19.52353 21.97489 21.48278 20.17869 18.09011
 [8] 19.60970 21.85104 20.96596
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.09   19.75   21.22   20.74   21.77   22.20 
~~~~~~~~

If you wanted to know what was the probability of a random Normal variable being less than, say, 2, you could use the `pnorm()` function to do that calculation.

{line-numbers=off}
~~~~~~~~
> pnorm(2)
[1] 0.9772499
~~~~~~~~

You never know when that calculation will come in handy.

@@ -73,27 +79,30 @@ When simulating any random numbers it is essential to set the *random number see

For example, I can generate 5 Normal random numbers with `rnorm()`.

{line-numbers=off}
~~~~~~~~
> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
~~~~~~~~

Note that if I call `rnorm()` again I will of course get a different set of 5 random numbers.

{line-numbers=off}
~~~~~~~~
> rnorm(5)
[1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884
~~~~~~~~

If I want to reproduce the original set of random numbers, I can just reset the seed with `set.seed()`.

{line-numbers=off}
~~~~~~~~
> set.seed(1)
> rnorm(5)    ## Same as before
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
~~~~~~~~

In general, you should **always set the random number seed when conducting a simulation!** Otherwise, you will not be able to reconstruct the exact numbers that you produced in an analysis.
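One quick way to convince yourself of this reproducibility (a minimal sketch):

{line-numbers=off}
~~~~~~~~
> ## The same seed always reproduces the same draws
> set.seed(42)
> a <- rnorm(3)
> set.seed(42)
> b <- rnorm(3)
> identical(a, b)
[1] TRUE
~~~~~~~~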
It is possible to generate random numbers from other probability distributions like the Poisson.

{line-numbers=off}
~~~~~~~~
> rpois(10, 1)    ## Counts with a mean of 1
 [1] 0 0 1 1 2 1 1 4 1 2
> rpois(10, 2)    ## Counts with a mean of 2
 [1] 4 1 2 0 1 1 0 1 4 1
> rpois(10, 20)   ## Counts with a mean of 20
 [1] 19 19 24 23 22 24 23 20 11 22
~~~~~~~~

## Simulating a Linear Model

Simulating random numbers is useful but sometimes we want to simulate values that come from a specific *model*. For that we need to specify the model and then simulate from it using the functions described above.

Suppose we want to simulate from the following linear model

{$$}
y = \beta_0 + \beta_1 x + \varepsilon
{/$$}

where {$$}\varepsilon\sim\mathcal{N}(0, 2^2){/$$}. Assume {$$}x\sim\mathcal{N}(0, 1^2){/$$}, {$$}\beta_0 = 0.5{/$$} and {$$}\beta_1 = 2{/$$}. The variable `x` might represent an important predictor of the outcome `y`. Here's how we could do that in R.

{line-numbers=off}
~~~~~~~~
> ## Always set your seed!
> set.seed(20)
>
> ## Simulate predictor variable
> x <- rnorm(100)
>
> ## Simulate the error term
> e <- rnorm(100, 0, 2)
>
> ## Compute the outcome via the model
> y <- 0.5 + 2 * x + e
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.4080 -1.5400  0.6789  0.6893  2.9300  6.5050 
~~~~~~~~

We can plot the results of the model simulation.

{line-numbers=off}
~~~~~~~~
> plot(x, y)
~~~~~~~~

![plot of chunk Linear Model](images/Linear Model-1.png)

What if we wanted to simulate a predictor variable `x` that is binary instead of having a Normal distribution? We can use the `rbinom()` function to simulate binary random variables.

{line-numbers=off}
~~~~~~~~
> set.seed(10)
> x <- rbinom(100, 1, 0.5)
> str(x)    ## 'x' is now 0s and 1s
 int [1:100] 1 0 0 1 0 0 0 0 1 0 ...
~~~~~~~~

Then we can proceed with the rest of the model as before.

{line-numbers=off}
~~~~~~~~
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> plot(x, y)
~~~~~~~~

![plot of chunk Linear Model Binary](images/Linear Model Binary-1.png)
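Since `x` is binary here, one quick check (a sketch; output omitted since it depends on the simulated draws) is to compare the two group means, which should differ by roughly {$$}\beta_1 = 2{/$$}:

{line-numbers=off}
~~~~~~~~
> ## Mean of y within the x == 0 and x == 1 groups
> tapply(y, x, mean)
~~~~~~~~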
We can also simulate from a *generalized linear model* where the errors are no longer from a Normal distribution but come from some other distribution. For example, suppose we want to simulate from a Poisson log-linear model where

{$$}
Y \sim Poisson(\mu)
{/$$}

{$$}
\log \mu = \beta_0 + \beta_1 x
{/$$}

and {$$}\beta_0 = 0.5{/$$} and {$$}\beta_1 = 0.3{/$$}. We need to use the `rpois()` function for this

{line-numbers=off}
~~~~~~~~
> set.seed(1)
>
> ## Simulate the predictor variable as before
> x <- rnorm(100)
~~~~~~~~

Now we need to compute the log mean of the model and then exponentiate it to get the mean to pass to `rpois()`.

{line-numbers=off}
~~~~~~~~
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    1.00    1.55    2.00    6.00 
> plot(x, y)
~~~~~~~~

![plot of chunk Poisson Log-Linear Model](images/Poisson Log-Linear Model-1.png)

@@ -218,34 +234,36 @@ You can build arbitrarily complex models like this by simulating more predictors

The `sample()` function draws randomly from a specified set of (scalar) objects, allowing you to sample from arbitrary distributions of numbers.

{line-numbers=off}
~~~~~~~~
> set.seed(1)
> sample(1:10, 4)
[1] 3 4 5 7
> sample(1:10, 4)
[1] 3 9 8 5
>
> ## Doesn't have to be numbers
> sample(letters, 5)
[1] "q" "b" "e" "x" "p"
>
> ## Do a random permutation
> sample(1:10)
 [1]  4  7 10  6  9  2  8  3  1  5
> sample(1:10)
 [1]  2  3  4  1  9  5 10  8  6  7
>
> ## Sample w/ replacement
> sample(1:10, replace = TRUE)
 [1] 2 9 7 8 2 8 5 9 7 8
~~~~~~~~

To sample more complicated things, such as rows from a data frame or a list, you can sample the indices into an object rather than the elements of the object itself.

Here's how you can sample rows from a data frame.

{line-numbers=off}
~~~~~~~~
> library(datasets)
> data(airquality)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
~~~~~~~~

Now we just need to create the index vector indexing the rows of the data frame and sample directly from that index vector.

{line-numbers=off}
~~~~~~~~
> set.seed(20)
>
> ## Create index vector
> idx <- seq_len(nrow(airquality))
>
> ## Sample from the index vector
> samp <- sample(idx, 6)
> airquality[samp, ]
    Ozone Solar.R Wind Temp Month Day
135    21     259 15.5   76     9  12
117   168     238  3.4   81     8  25
43     NA     250  9.2   92     6  12
80     79     187  5.1   87     7  19
144    13     238 12.6   64     9  21
146    36     139 10.3   81     9  23
~~~~~~~~

Other more complex objects can be sampled in this way, as long as there's a way to index the sub-elements of the object.

## Summary

- Drawing samples from specific probability distributions can be done with "r" functions
- Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc.
- The `sample()` function can be used to draw random samples from arbitrary vectors
- Setting the random number generator seed via `set.seed()` is critical for reproducibility

diff --git a/manuscript/vectorized.md b/manuscript/vectorized.md
index 56ac57a..8c47b31 100644
--- a/manuscript/vectorized.md
+++ b/manuscript/vectorized.md
@@ -12,25 +12,27 @@ languages.

The simplest example is when adding two vectors together.

{line-numbers=off}
~~~~~~~~
> x <- 1:4
> y <- 6:9
> z <- x + y
> z
[1]  7  9 11 13
~~~~~~~~

Natural, right? Without vectorization, you'd have to do something like

{line-numbers=off}
~~~~~~~~
z <- numeric(length(x))
for(i in seq_along(x)) {
        z[i] <- x[i] + y[i]
}
z
[1] 7 9 11 13
~~~~~~~~

If you had to do that every time you wanted to add two vectors, your hands would get very tired from all the typing.
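To double-check that the loop and the vectorized `+` really agree, a quick sketch:

{line-numbers=off}
~~~~~~~~
> ## The loop result matches the vectorized sum element for element
> z.loop <- numeric(length(x))
> for(i in seq_along(x)) {
+         z.loop[i] <- x[i] + y[i]
+ }
> all(z.loop == x + y)
[1] TRUE
~~~~~~~~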
@@ -40,24 +42,26 @@ comparisons.

So suppose you wanted to know which elements of a vector were greater than 2. You could do the following.

{line-numbers=off}
~~~~~~~~
> x
[1] 1 2 3 4
> x > 2
[1] FALSE FALSE  TRUE  TRUE
~~~~~~~~

Here are other vectorized logical operations.

{line-numbers=off}
~~~~~~~~
> x >= 2
[1] FALSE  TRUE  TRUE  TRUE
> x < 3
[1]  TRUE  TRUE FALSE FALSE
> y == 8
[1] FALSE FALSE  TRUE FALSE
~~~~~~~~

Notice that these logical operations return a logical vector of `TRUE` and `FALSE`.

Of course, subtraction, multiplication and division are also vectorized.

{line-numbers=off}
~~~~~~~~
> x - y
[1] -5 -5 -5 -5
> x * y
[1]  6 14 24 36
> x / y
[1] 0.1666667 0.2857143 0.3750000 0.4444444
~~~~~~~~

## Vectorized Matrix Operations

Matrix operations are also vectorized, making for nicely compact notation. This way, we can do element-by-element operations on matrices without having to loop over every element.

{line-numbers=off}
~~~~~~~~
> x <- matrix(1:4, 2, 2)
> y <- matrix(rep(10, 4), 2, 2)
>
@@ -103,6 +109,6 @@ matrices without having to loop over every element.
     [,1] [,2]
[1,]   40   40
[2,]   60   60
~~~~~~~~
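Returning to vectors for a moment: one common use of vectorized comparisons, not shown above, is subsetting with the logical vector they produce. A quick sketch:

{line-numbers=off}
~~~~~~~~
> ## Keep only the elements where the comparison is TRUE
> x <- 1:4
> x[x > 2]
[1] 3 4
~~~~~~~~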