Commit 3e1a842

add missing values section
1 parent 3925745 commit 3e1a842

1 file changed: +40 -0 lines changed

03-wrangling.qmd

@@ -348,6 +348,46 @@ summary_temp <- weather %>%

:::

## Missing values {#sec-missing}

In @sec-summarize we saw that missing values in a data frame are coded as `NA`. Almost any operation involving an `NA` also returns `NA`, because a calculation that depends on an unknown value is itself unknown. When calculating summary statistics, we can include the argument `na.rm = TRUE` to remove missing values from the calculation.
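
As a quick reminder, here is a minimal sketch of that behavior. The `temps` vector is made up for illustration and is not part of the `weather` data frame.

```{r}
# Hypothetical temperature readings with one missing value
temps <- c(39.0, NA, 41.2)

# Arithmetic with an unknown value is unknown
NA + 1
#> [1] NA

# A single missing reading makes the mean unknown
mean(temps)
#> [1] NA

# na.rm = TRUE drops the NA before averaging the remaining two readings
mean_temp <- mean(temps, na.rm = TRUE)
mean_temp
#> [1] 40.1
```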

In other situations, however, it may be necessary to remove entire observations from a data frame when one or more of their values are missing. This requires extra care. Before removing observations, it is important to consider *why* the data are missing and whether it is appropriate to exclude those observations from the analysis.

There are multiple ways to remove observations with missing values in R. Two common approaches are using `filter()` with `is.na()` and using `drop_na()`. Both accomplish the same goal but differ in readability and flexibility.

### `is.na()` function

One way to remove observations with missing values is to filter them out with `filter()`. To do this, we need the `is.na()` function, which returns `TRUE` if a value is missing and `FALSE` otherwise.

Consider the `weather` data frame and the variable `temp` again. If we use `is.na(temp)` inside `filter()`, we *keep* only the rows where `temp` is missing. This is the opposite of what we want. Instead, we use `!is.na(temp)` to keep only the observations where `temp` is *not* missing.

```{r}
#| eval: false

weather %>%
  filter(!is.na(temp))
```
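
To see the difference between the two conditions on a small scale, the sketch below uses a hypothetical four-row data frame, `toy_weather`, as a stand-in for `weather`. The column values are invented for illustration.

```{r}
library(dplyr)

# Hypothetical stand-in for `weather`: four hourly readings, two missing
toy_weather <- data.frame(
  hour = 1:4,
  temp = c(39.0, NA, 41.2, NA)
)

# is.na() checks each value in turn
is.na(toy_weather$temp)
#> [1] FALSE  TRUE FALSE  TRUE

# Keeps only the rows where temp IS missing (hours 2 and 4)
missing_rows <- toy_weather %>%
  filter(is.na(temp))

# Negating with ! keeps only the rows where temp is NOT missing (hours 1 and 3)
complete_rows <- toy_weather %>%
  filter(!is.na(temp))
```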
372+
373+
### `drop_na()` function
374+
375+
Another way to remove missing values is with the `drop_na()` function from the `tidyr` package.
376+
377+
```{r}
378+
#| eval: false
379+
380+
library(tidyr)
381+
382+
weather %>%
383+
drop_na(temp)
384+
385+
```
386+
387+
The `drop_na()` function removes rows where any of the specified variables contain missing values. It is very important to explicitly list the variables of interest. If no variables are specified, `drop_na()` will remove *all* rows that contain *any* missing values in the data frame. This can be dangerous, as it may unintentionally remove observations that are missing values in variables unrelated to your analysis.
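
The sketch below illustrates how much the result can change. It assumes a hypothetical data frame, `toy_weather`, with two invented columns whose missing values fall in different rows.

```{r}
library(tidyr)

# Hypothetical data: temp is missing in row 2, wind_gust in rows 1 and 3
toy_weather <- data.frame(
  temp      = c(39.0, NA, 41.2, 40.1),
  wind_gust = c(NA, 20.7, NA, 18.4)
)

# Only rows with a missing temp are dropped
temp_only <- drop_na(toy_weather, temp)
nrow(temp_only)
#> [1] 3

# No variables listed: any row with a missing value in any column is dropped
all_vars <- drop_na(toy_weather)
nrow(all_vars)
#> [1] 1
```

If the analysis only involves `temp`, the second call discards two rows that were perfectly usable.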
388+
389+
In general, `filter(!is.na(var))` offers greater control and allows for more complex filtering conditions, while `drop_na(var)` is a quick and readable way to remove missing values.
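
As one possible illustration of that extra control, the sketch below combines `is.na()` with other conditions. The data frame and its values are hypothetical.

```{r}
library(dplyr)

# Hypothetical data: row 4 is missing both measurements
toy_weather <- data.frame(
  temp      = c(39.0, NA, 41.2, NA),
  wind_gust = c(NA, 20.7, 18.4, NA)
)

# Keep rows where at least one of the two measurements is known (rows 1 to 3)
either_known <- toy_weather %>%
  filter(!is.na(temp) | !is.na(wind_gust))

# Combine the missing-value check with a condition on the value itself (row 3)
warm_known <- toy_weather %>%
  filter(!is.na(temp), temp > 40)
```

By contrast, `drop_na(temp, wind_gust)` would keep only the rows where *both* variables are present.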
390+
351391
## `group_by()` rows {#sec-groupby}

![Group by and summarize diagram from Data Wrangling with dplyr and tidyr cheatsheet](images/group_summary.png)
