R-for-Geospatial/day2.qmd at main · ccmothes/R-for-Geospatial · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
---
title: "Lesson 2"
format:
  html:
    message: false
    warning: false
    df-print: paged
---

# Spatial Data Analysis

In Lesson 1 you were exposed to spatial data types and various databases you can pull spatial data from. Today we are going to use those data sets to perform a range of spatial analyses.

We briefly used the `sf` and `terra` packages yesterday, but today we will be exploring them much more in depth using a range of spatial operations they provide.

We have to start by reading in the packages we need for today. Some are repeats from Lesson 1, but we also have a couple new ones. Namely `ggplot2` and `tmap` to do some quick visualizations of our spatial outputs. In Lesson 3 we will be using these packages for more advanced data visualizations.

```{r}
source("packageLoad.R")

packageLoad(c("dplyr",
              "readr",
              "sf",
              "terra",
              "tmap",
              "ggplot2"))

```

## Distance Calculations

We're going to start off today with some distance calculations. Using our species occurrence data, say we want to know which species is often found closest to rivers and/or roads. We can answer this by finding each species average distance (across all occurrences) to our rivers and roads shapefiles.

First we have to read in the data. Our occurrences were saved as a csv file with lat/long. We can convert a non-spatial object (like a csv file) to a spatial `sf` object using the `st_as_sf()` function, specifying the CRS and lat/long columns.

```{r}
occ <- read_csv("data/species_occ.csv") %>%
  st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
```

Throughout today we are going to be mapping our spatial data to quickly inspect it and get a visual of the data's extent and characteristics.

`tmap` is a great R package for spatial data visualization. It allows for both static ("plot" mode) and interactive ("view" mode) mapping options, which you can set using the function `tmap_mode()` . For today we will be making quick interactive plots. Once you set the mode with `tmap_mode()`, every plot call to `tmap` after that produces a plot in that mode.

```{r}
tmap_mode("view")
```

Quick view of all our points, colored by species:

```{r}
tm_shape(occ) +
  tm_symbols(col = "Species", size = 0.5)
```

Another way to make a quick map is with `tmap`'s `qtm()` function, which stands for "Quick Thematic Map"

```{r}
qtm(occ, symbols.col = "Species")
```

Now, for each species we want to find their average distance to rivers and roads. This involves point to line distance calculations, which we can perform with the `sf` package.

First let's read in our rivers and roads shapefiles. You can read in shapefiles with the `st_read()` function from `sf`, specifying the '.shp' extension, which will work as long as all other accompanying spatial files (.shx, .dpf, .prj) are within the same folder.

```{r}
rivers <- st_read("data/rivers.shp")

roads <- st_read("data/roads.shp")
```

Before performing any spatial operations, all of our spatial objects must be in the same CRS. We can see our spatial objects' CRS when we print the object to the console, or we can get the full CRS details with the `st_crs()` function.

```{r}
st_crs(rivers)
st_crs(roads)
st_crs(occ)
```

So our line shapefiles are in NAD83 and the occurrences are in WGS84. It generally doesn't matter which CRS you choose to use, but here we are going to transform our occurrences to NAD83 so we only have to transform one object instead of two. Here we use the `st_transform()` function, specifying we want the new CRS for our occurrences to be that of the rivers shapefile.

```{r}
occ <- st_transform(occ, crs = st_crs(rivers))
```

Also, our occurrence data set covers all of Colorado, but rivers and roads are only for Larimer County. We have to first filter our points to the extent of the rivers and roads objects. However, the extent of these is a square bounding box, not the exact boundary of Larimer County. We can subset Larimer County from our Colorado counties object, and use `st_filter()` to filter points the are found within the Larimer County polygon.

```{r}
counties <- st_read("data/CO_counties.shp")

occ_larimer <- st_filter(occ, counties[counties$NAME == "Larimer",])

qtm(occ_larimer)
```

Great, now we just have species occurrences within Larimer County.

Now for each point we want to calculate its distance to the nearest river and road. Let's start with rivers and then do the same for roads. The most efficient way is to first find the nearest line feature for each point. We can do this with the `st_nearest_feature()` function.

This function returns the index values (row number) of the river feature in the `rivers` spatial data frame that is closest in distance to each point. Here we are saving these index values in a new column of our Larimer occurrences that we will use later to calculate distances.

```{r}
occ_larimer$nearest_river <- st_nearest_feature(occ_larimer, rivers)
```

Now, for each point we can use the `st_distance()` function to calculate the distance to the nearest river feature, using the index value in our new "nearest_river" column. Adding `by_element = TRUE` is necessary to tell the function to perform the distance calculations by element (row), which we will fill into a new column "river_dist_m".

```{r}
occ_larimer$river_dist_m <- st_distance(occ_larimer, rivers[occ_larimer$nearest_river,], by_element = TRUE)
```

Notice that the new column is more than just a numeric class, but a "units" class, specifying that the values are in meters.

```{r}
str(occ_larimer)
```

Cool, now we have the distance to the nearest river (in meters) for each individual species occurrence. Now say we want the average distance for each species. We can do some data wrangling to get these values using the `dplyr()` package. Using the pipe `%>%` operator again, we perform a chain of operations on the data frame. `group_by()` specifies that all following operations should be performed individually by a grouping variable, in this case we want to apply operations on each individual species. Next `summarise()` calculates a new column "river_dist" that is the average (per species) river distance, where we have to convert our 'units' column to numeric to perform the `mean()` function.

```{r}
#| message: false
#| warning: false
occ_larimer %>%
  group_by(Species) %>%
  summarise(river_dist = (mean(as.numeric(river_dist_m))))

```

Let's make a quick bar plot to visually compare. We can pipe this output into a `ggplot()` object, specifying which variables go on the x and y axes within `aes()` and then we want to fill the bar color by species. `geom_col()` returns a barplot where the heights of the bars represent values in the data (in our case species average distance to a river).

```{r}
occ_larimer %>%
  group_by(Species) %>%
  summarise(river_dist = (mean(as.numeric(river_dist_m)))) %>%
  ggplot(aes(Species, river_dist, fill = Species)) +
  geom_col()
```

A thing to note about `ggplot2` is that it uses `+` instead of `%>%` to add additional elements.

Let's mess around with this plot a little bit further to make it look more aesthetic.

```{r}
occ_larimer %>%
  group_by(Species) %>%
  summarise(river_dist = mean(as.numeric(river_dist_m))) %>%
  ggplot(aes(Species, river_dist, fill = Species)) +
  geom_col() +
  labs(y = "Average distance to nearest river (m)") +
  theme(legend.position = "none") #removes the legend


```

Now lets do the same thing, but calculate average distance to the nearest road.

```{r}
occ_larimer$nearest_road <- st_nearest_feature(occ_larimer, roads)

occ_larimer$road_dist_m <- st_distance(occ_larimer, roads[occ_larimer$nearest_road,], by_element = TRUE)

occ_larimer %>%
  group_by(Species) %>%
  summarise(road_dist = mean(as.numeric(road_dist_m))) %>%

  ggplot(aes(Species, road_dist, fill = Species)) +
  geom_col() +
  labs(y = "Average distance to nearest road (m)") +
  theme(legend.position = "none")
```

## Buffers

Alternatively, say you want to know what percentage of species' occurrences (points) were found within a certain distance of a river or a road (calculated buffer).

To do this we could add a buffer around our line features and filter the points that fall within that buffer zone. For this example let's say we are interested in the 100 m buffer zone around rivers and roads. However, if you try this you'll notice this operation takes quite a while.

```{r}
#| eval: false
river_buffer <- st_buffer(rivers, dist = 100)
```

Instead, a more efficient way would be to make a 100 m buffer around each point, and see how many intersect with a river or road.

```{r}
occ_buffer <- st_buffer(occ_larimer, dist = 100)

```

Still takes a little bit of run time, but much faster than buffering each line feature. Our `occ_buffer` object is now a spatial polygon data frame, where each feature is an occurrence buffer with 100 m radius.

## Spatial Intersect

We can conduct spatial intersect operations using the function `st_intersects()`. This function checks if each individual buffer intersects with a river, and if so it returns an index value (row number) for each river feature it intersects. This function returns a list object for each buffer polygon, that will be empty if there are no intersections. We will add this as a column to our buffer data set, and then create a binary yes/no river intersection column based on those results.

```{r}
river_intersections <- st_intersects(occ_buffer, rivers)
```

If we inspect this object, we see it is a list of the same length as our `occ_buffer` object, where each list element is either empty (no intersections) or a list of index numbers for the river features that do intersect that buffer.

We want to create a new column that returns TRUE/FALSE if the buffer intersects with a river. We do this by testing if the length of each element is greater than 0 (if not the element is empty and returns FALSE since there are no river intersections).

```{r}
occ_buffer$river_100m <- lengths(river_intersections) > 0
```

Now we can find out what percentage of occurrences are within 100 m of a river for each species using similar `dplyr` operations as before.

```{r}
occ_buffer %>%
  st_drop_geometry() %>% #we use this function to treat the object as a dataframe
  group_by(Species) %>%
  summarise(total_occ = n(), percent_river = (sum(river_100m == TRUE)/total_occ)*100)
```

Now lets do another type of spatial intersection. Say we want to know what percent of each county is defined as 'urban area', using our urban areas polygons.

Let's read in our urban areas polygon shapefile.

```{r}
urban <- st_read("data/urban_areas.shp")
```

Now since we are wanting to calculate the actual area of intersection (not just whether or not an intersection exists as before) we can use the `st_intersection()` function.

```{r}
urban_intersect <- st_intersection(counties, urban)
```

This function returns the urban areas polygons as new polygons that intersect within each county, tied to the county information it lies within. We see that there are more rows in this data set than the original counties data set, meaning some urban areas cross multiple counties.

To clean this data to get the results we're interested in (percentage of each county that is covered by urban areas) we will first calculate the area of each urban area intersection, sum the total intersecting areas per county if there are multiple, and divide that sum by total county area. We are using a new function here from the `dplyr` package, `mutate()`, which creates new columns based on specified values/calculations.

```{r}
intersect_area <- urban_intersect %>%
  mutate(intersect_area = st_area(.)) %>% #create a new column with shape area
  dplyr::select(NAME, intersect_area) %>% #reduce to just county name and intersect area columns
  group_by(NAME) %>% # group by county
  summarise(intersect_area = sum(intersect_area)) %>% #new column that sums intersect area per county
  st_drop_geometry() #drop the geometry to treat as a dataframe
```

We return this as a data frame since we just want the intersection area values per county. We can then join this data frame to our counties shapefile to calculate percent urban area coverage. We are going to create a new counties object called `counties_attr` that we will be adding a lot of other county-level attributes to. We are using another new function here `left_join()`, which is a member of `dplyr`'s join functions where in this case the output contains all rows in `x`, the first element, and joins `y`, the second element, by a matching column name/variable, in this case county name.

```{r}
counties_attr <- counties %>%
  mutate(county_area = st_area(.)) %>% #calculate county area
  left_join(intersect_area, by = "NAME") %>% #join county to intersect area data
  mutate(urban_coverage = as.numeric(intersect_area / county_area)) # calculate a new column that is the proportion of urban area coverage
```

Now let's visualize urban coverage by county

```{r}
qtm(counties_attr, fill = "urban_coverage")

```

We can also look at total urban area instead of a percentage.

```{r}
qtm(counties_attr, fill = "intersect_area")
```

A little more variation here but maybe nothing too exciting if you are familiar with Colorado metro areas. But at least you learned some cool new tools related to spatial intersections!

## Spatial/Nonspatial Joins

We already used this tool a little already, but we can use `left_join()` to add a bunch more attributes to our counties shapefile. First lets add our census data, which was collected at the county level and saved as a csv.

```{r}
census <- read_csv("data/census_data.csv") %>%
  dplyr::select(-NAME)
```

We first want to remove the 'NAME' column, because it is slightly different than the 'NAME' column in our counties data and therefore will not join properly. We can instead join by matching 'GEOID', which is a unique numeric ID given to each county.

```{r}
counties_attr <- counties_attr %>%
  left_join(census, by = "GEOID")
```

Lastly, lets join our species occurrence data set. Say we want to know how many species occurrences are found in each county. Here we go back to the `st_intersects()` function we used before, which returns a list of all spatial elements that intersect with the spatial features of interest, in this case how many occurrence points intersect with each county polygon. We then nest this within `lengths()` which will return the number of intersecting features (occurrences) for each county, and we put those values as a new column called "species_count".

```{r}
counties_attr$species_count <- lengths(st_intersects(counties_attr, occ))
```

Now we have a bunch of information tied to our county shapefile, we can explore it spatially and comparatively.

```{r}
qtm(counties_attr, fill = "species_count")
```

Use `ggplot2` to visualize the relationship between different county attributes.

```{r}
counties_attr %>%
  st_drop_geometry() %>%
  ggplot(aes(total_pop, species_count)) +
  geom_point() +
  stat_smooth()
```

```{r}
# urban area related to co_born
counties_attr %>%
  st_drop_geometry() %>%
  ggplot(aes(urban_coverage, species_count)) +
  geom_point() +
  stat_smooth()
```

## Raster Reclassification

So far we've dealt with a bunch of vector data and associated analyses with the `sf` package. Now lets work through some raster data analysis using the `terra` package.

Lets read in our land cover and elevation raster files. These are files you downloaded yesterday, but were already processed. You can read more about these data sets and how they were processed [here](getdata.qmd). Land cover data comes from the National Land Cover Database (NLCD) and elevation data comes from the AWS Open Data Terrain Tiles. You can read in a raster file with the `rast()` function from `terra`.

```{r}
landcover <- terra::rast("data/NLCD_CO.tif")

elevation <- terra::rast("data/elevation_1km.tif")

```

```{r}
qtm(landcover)
```

This land cover data set includes attributes (land cover classes) associated with raster values. We can quickly view the frequency of each land cover type with the `freq()` function.

```{r}
freq(landcover)
```

Let's view this as a bar chart.

```{r}
freq(landcover) %>%
  ggplot(aes(reorder(value, count), count)) +
  labs(x = "") +
  geom_col() +
  coord_flip() # switch the axes to better view land cover class names
```

Say we want to explore some habitat characteristics of our species of interest, and we are specifically interested in forest cover. Our first step is to create a new raster layer from our land cover layer representing percent forest cover. This will involve multiple operations, including raster reclassification and focal statistics. Specifically, say we want to calculate the average percentage of forest cover and urbanization within a 9x9 pixel moving window.

First lets reclassify our land cover raster, creating a new raster representing just forest/non-forest pixels.

Since rasters are technically matrices, we can index and change values using matrix operations. Given this particular raster has attributes (land cover class names) associated with values, we can index by those names.

```{r}
#first assign landcover to a new object name so we can manipulate it while keeping the origian
forest <- landcover

#where the raster equals any of the forest categories, set that value to 1
forest[forest %in% c("Deciduous Forest", "Evergreen Forest", "Mixed Forest")] <- 1

#SPELLING IS IMPORTANT

#now set all non forest pixels to NA
forest[forest != 1] <- NA
```

Lets look at our new forest layer

```{r}
plot(forest)
```

## Focal Statistics

Now we are going to perform focal statistics, which is a spatial operation that calculates new values for each cell based on a specified moving window. For this example we are going to calculate within a 9x9km moving window (since our pixel resolution is 1km). We supply this to the `w =` argument as a matrix, where the first value is the weight of each pixel, and the second two are the number of rows and columns. Second we use the "sum" function, since each forest pixel has a value of 1 we will get the total number of forest pixels within the moving window, and then later divide the values by the total number of pixels in the window (81) to get the percentage. The final raster values will represent for each pixel the surrounding forest percentage (within \~4.5 km radius).

*Note: We are just making up these distance numbers to demonstrate the use of* `terra` *functions.* *In reality when working on your own research questions you should spend time thinking about the most appropriate values to use for your study species/system.*

```{r}
forest_pct <- terra::focal(forest, w=matrix(1,9,9), fun = "sum", na.rm = TRUE)
```

```{r}
forest_pct <- forest_pct/81
```

```{r}
plot(forest_pct)
```

Next, we wanted to know the percent forest cover associated with each species occurrence. Since we are now working with multiple spatial objects, we have to first check they are all in the same CRS and if not transform the data before any spatial operations.

```{r}
crs(forest_pct)

st_crs(occ)
```

Looks like the raster layer is in a different CRS. Let's reproject this so we can use it with our vector data (which are all in NAD83). We can project raster data to a new CRS with the `project()` function from `terra`.

One thing to note is that while `terra` does work with vector data, it wants them to be in a special format called a `SpatVector` , instead of an `sf` object. Luckily they have made it quick and easy to convert between the data formats using the `vect()` function, so we just need to remember to nest our `sf` objects within that function when using them in `terra` functions.

```{r}
forest_pct <- project(forest_pct, vect(occ))
```

## Raster Extract

Now we can use the `extract()` function to extract the raster pixel value at each occurrence.

```{r}
terra::extract(forest_pct, vect(occ))
```

Notice that this returns a 2 column data frame, with an ID for each feature (occurrence) and the extracted raster value in the second column. We want to add these raster values as a new column to our occurrence data set, so we need to index just this second column of the `extract()` output.

```{r}
occ$forest_pct <- terra::extract(forest_pct, vect(occ))[,2]
```

Now let's do the same extract method with our elevation raster, pulling the elevation value at each species occurrence.

Again, we need to first project the raster to the CRS of the occurrence data set.

```{r}
elevation <-  terra::project(elevation, vect(occ))
```

Now we can do the same extraction as with the forest raster, putting the values in a new 'elevation' column

```{r}
occ$elevation <- terra::extract(elevation, vect(occ))[,2]
```

Let's do some data frame operations to calculate and compare average forest cover across species

```{r}
occ %>%
  group_by(Species) %>%
  summarise(avg_forest_pct = mean(forest_pct, na.rm = TRUE))
```

```{r}
occ %>%
  group_by(Species) %>%
  summarise(avg_forest_pct = mean(forest_pct, na.rm = TRUE)) %>%
  ggplot(aes(Species, avg_forest_pct, fill = Species))+
  geom_col() +
  theme(legend.position = "bottom")
```

Looks like elk are associated with the most forested habitats. We can also view the spread of values with a box plot.

```{r}
ggplot(occ, aes(Species, forest_pct)) +
  geom_boxplot()

```

How about elevation?

```{r}
ggplot(occ, aes(Species, elevation)) +
  geom_boxplot()
```

Yellow-bellied marmots are found at the highest elevations, on average.

That's one way to use the `extract()` function. We can also extract raster values within polygons, and supply a function to summarize those raster values.

Say we wanted to know the most common land cover type in each county. We can use `extract()` with the function 'modal' to return the most common type for each polygon, and we will add this as a new column.

Again we need to project our land cover raster.

```{r}
landcover_prj <- project(landcover, vect(counties))
```

```{r}
terra::extract(landcover_prj, vect(counties), fun = "modal")

```

This works similar to what we ran with the occurrence dataset...however we notice that it returns the raw raster values instead of the named land cover classes (which we want). This raster behaves such that values are named factors (as long as we have the land cover attribute file in the same folder), and we can pull this metadata with the `cats()` function.

```{r}
cats(landcover)[[1]]
```

This returns a single element list that is a data frame with a bunch of information. We just want to keep 'value' and 'NLCD Land Cover Class", and remove all the empty classes. We have to first index the first element of the list to operate on just the data frame, then we can apply some `dplyr` functions.

```{r}
nlcd_classes <- cats(landcover)[[1]] %>%
  dplyr::select("value", nlcd_class = "NLCD Land Cover Class") %>%
  filter(nlcd_class != "")
```

Cool, now we can tie this to our counties data frame once we have the most common land cover value calculated with `extract`

```{r}
counties$common_landcover <- terra::extract(landcover_prj, vect(counties), fun = "modal")[,2]
```

This join is different from our others because the variable we want to join by has a different column name in each data set (even though the values match), but we can specify this within the `by =` argument like this:

```{r}
counties <- counties %>%
  left_join(nlcd_classes, by = c("common_landcover" = "value"))
```

We can now get some summary statistics on the most common land cover types per county.

```{r}
counties %>%
  st_drop_geometry() %>% #for ggplot, don't need geometry
  group_by(nlcd_class) %>%
  summarise(n = n()) %>%
  ggplot(aes(nlcd_class, n, fill = nlcd_class)) +
  geom_col()+
  theme(legend.position = "none")+
  coord_flip()
```

And map it out by county:

```{r}
tm_shape(counties)+
  tm_polygons("nlcd_class")
```

## Save Data

We've added new columns to some of our data frames and created new ones. Let's save these objects so we can use them in Lesson 3 without having to re-run all this code.

Yesterday we learned how to save shapefiles with `st_write()`. Another way to save data in R is by saving them as R objects, which is beneficial as it reduces file size. You can save individual R objects as .RDS files, or multiple R objects as .RData files. Since we have multiple objects we want to save and load back into our environment tomorrow, we will write them to a single .RData file.

Before writing to file, let's get rid of some of the extra columns we don't need. `counties_attr` is one of the main objects we will be using for visualizations tomorrow. Here we can de-select a range of columns by putting a `-` in front.

```{r}
counties_attr <- counties_attr %>%
  dplyr::select(-c(NAMELSAD:INTPTLON))
```

Now we use the `save()` function, listing all the objects in our current environment that we want to save, and the file name and location with the .RData extension.

```{r}
#| eval: false
save(counties_attr, occ_buffer, occ_larimer, occ, file = "data/objects.RData")
```

```{r}
#| eval: false
#| echo: false
# for testing/building content
save(counties_attr, occ_buffer, occ_larimer, occ, file = "data_test/objects.RData")
```

We will start Lesson 3 by reading these objects back into our environment (so you can close out of this R Studio session at the end of the day if you want).