HackingLimnoWorkshop/model_walkthrough.Rmd at main · SimonTopp/HackingLimnoWorkshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
---
title: "Hacking Limnology Workshop"
date: "7/13/2021"
output: html_document
---

```{r setup, include=FALSE}
library(tidyverse)
library(feather)
library(mapview)
library(sf)
library(xgboost)
library(Hmisc)
library(leaflet)
library(leafgl)
library(kableExtra)
library(Metrics)
library(lubridate)

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

### First lets load in AquaSat (available at <https://figshare.com/articles/dataset/AquaSat/8139383>).

It's important to know that AquaSat takes a very *inclusive* approach to data availability. This is in order to give end users more control over the various filtering choices they want to make. However, it also means that we need to be careful and apply some reasonable data quality control measures. That's what we'll do in this first chunk. Future version's of AquaSat will have tiered data, basically a flag that does this step for you.

Additionally, we'll add some band ratios that we know will be helpful in our modelling efforts.

```{r}
## Bring in some utility functions that we'll use throughout the tutorial.
source('utils.R')

## Read in AquaSat
as <- read_csv('data/sr_wq_rs_join.csv')

## Do some QC and filter the data to just water clarity in lakes.
as.munged <- as %>%
  filter(!is.na(secchi),
         type =='Lake',
         pwater > 90,
         pixelCount > 10,
         clouds < 10,
         secchi > 0,
         secchi < 10,
         across(c(blue, green, red, nir, swir1, swir2), ~ .x > 0 & .x < 1000)) %>%
  select(blue, green, red, nir, swir1, swir2, sat, date, secchi, lat, long) %>%
  mutate(dWL = fui.hue(red, green, blue),
         NR = nir/red,
         ndvi = (nir-red)/(nir+red),
         gndvi = (nir-green)/(nir + green),
         ndti = (red-green)/(red + green),
         id = row_number()) %>%
  left_join(fui.lookup) %>%
  filter(!is.na(fui))
```

### Build a model to predict water clarity.

Now we're going to use AquaSat to build a model for predicting water clarity in lakes. Here, we're going to build what I'll refer to as a *naive* model. By naive I am referring to the fact that it's a proof of concept model and there are a number of steps we could take to improve it's performance and generalizability. We'll discuss some of these below.

```{r}

# Define our target and input variables
target <- 'secchi'
feats <- c('blue', 'red', 'green', 'nir', 'NR', 'gndvi', 'ndti', 'ndvi', 'dWL')

# Set aside training and testing data
set.seed(2423)
train <- as.munged %>% sample_frac(.7)
test <- as.munged %>% filter(!id %in% train$id)

dtrain <- xgb.DMatrix(data = as.matrix(train[feats]), label = train[target][[1]])
dtest <- xgb.DMatrix(data = as.matrix(test[feats]), label = test[target][[1]])

# Set our parameters, note that we won't be hypertuning our model for this example.
params <- list(booster = "gbtree", objective = "reg:squarederror", eval_metric = 'mae', eta=0.1, gamma=0, max_depth=3, min_child_weight=1, subsample=1, colsample_bytree=1)

## Train our model with early stopping rounds on our test data to avoid overfitting.
xgb.naive <- xgb.train(params = params, data = dtrain, nrounds = 1000, watchlist = list(train = dtrain, val = dtest), print_every_n = 25, early_stopping_rounds = 10, maximize = F)

preds <- test %>% mutate(predicted = predict(xgb.naive, dtest))

ggplot(preds, aes(x = secchi, y = predicted)) +
  geom_hex() +
  scale_fill_viridis_c(trans = 'log10') +
  geom_abline(color = 'red') +
  ggpmisc::stat_poly_eq(aes(label =  paste(stat(adj.rr.label))),
               formula = y~x, parse = TRUE,
               label.y = Inf, vjust = 1.3)

preds %>%
  summarise(rmse = rmse(secchi, predicted),
            mae = mae(secchi, predicted),
            mape = mape(secchi, predicted),
            bias = bias(secchi, predicted),
            p.bias = percent_bias(secchi, predicted),
            smape = smape(secchi, predicted)) %>%
  kable(digits = 3) %>% kable_styling()
```

#### So, using purely optical inputs, we still get an R\^2 of \~.5 and MAE of `r round(xgb.naive$best_score, 3)` meters. Not bad!

#### If we wanted to maximize this models predictive capability we could:

-   Add in lake and landscape variables (depth, surface area, land cover, etc)
-   Apply some minor standardization procedures across our landsat sensors
-   Hypertune our model to optimize performance.

#### If we wanted to maximize the generalizability of our model we could:

-   Apply leave-location-leave time out cross-validation
-   Be more intential in terms of our train/test splits
-   Apply model weights to focus on data sparse regions

### Now lets actually apply our model to a lake to explore it's clarity dynamics.

Here, were going to use lake Mendota as an example, but the map below shows all the lakes that are in [LimnoSat-US](https://doi.org/10.5281/zenodo.4139694).

```{r}
## This chunk requires the LimnoSat database available at
# https://doi.org/10.5281/zenodo.4139694. It is not necessary for the workshop itself.

## Read in the data
# Lakes
lakes <- st_read('../Walkthroughs/data/LimnoSat/HydroLakes_DP.shp') %>%
  st_centroid()

#LimnoSat
#ls <- read_feather('../Walkthroughs/data/LimnoSat/LimnoSat_20200628.feather')

#Find your lake of interest, click on the a lake to get it's Hylak_id
leaflet() %>%
  addTiles() %>%
  addGlPoints(lakes %>% filter(type == 'dp'), popup = 'Hylak_id')

```

#### First just look at the distribution of observations that we have for it

```{r Mendota Explorer}
# Filter to Mendota and add some useful variables
#Mendota <- ls %>% filter(Hylak_id == 9086)
Mendota <- read_csv('data/LimnoSat_Mendota.csv') %>%
  mutate(month = month(date, label = T),
         doy = yday(date),
         period = cut(year, 12, dig.lab = 4))

# Yearly observations
ggplot(Mendota, aes(x = year)) + geom_bar() + labs(y = 'Number of Observations', title = 'Yearly Observations') + theme_bw()

# Monthly observations
Mendota %>% mutate(month = month(date, label = T)) %>%
  ggplot(., aes(x = month)) + geom_bar() + labs(y = 'Number of Observations', title = 'Monthly Observations') + theme_bw()
```

### For Mendota, looks like we have a total of `r nrow(Mendota)` observations. LimnoSat-US contains all the Landsat Reflectance values as well as the dominant wavelength, a metric of color. We'll use dominant wavelength to explore the data because it's an intuitive way to examine lake systems.

```{r}
Mendota <- Mendota %>% left_join(fui.lookup)

# Overall Color Distribution
Mendota %>% group_by(dWL) %>%
  summarise(count = n()) %>%
  left_join(fui.lookup) %>%
  ggplot(., aes(x = dWL, y = count, fill = color.code)) +
  geom_col() +
  scale_fill_identity() +
  labs(x = 'Wavelength (nm)', title = 'Overall Color Distribution') +
  theme_bw() +
  theme(legend.position = 'none')

# Monthly Climatology
ggplot(Mendota, aes(x = month, y = dWL)) +
  #geom_violin(draw_quantiles = .5) +
  geom_boxplot(outlier.colour = 'transparent') +
  geom_jitter(aes(color = color.code), size = 2, position = position_jitter(.2)) +
  scale_color_identity() +
  labs(y = 'Wavelength (nm)', title = 'Monthly Climatology') +
  theme_bw() +
  theme(legend.position = 'none')

# Summer color observations over time
Mendota %>%
  filter(month %in% c('Jun', 'Jul', 'Aug')) %>%
  ggplot(., aes(x = date, y = dWL)) +
  geom_point(aes(color = color.code), size = 3) +
  geom_smooth(se = T, method = 'lm') +
  scale_color_identity() +
  labs(y = 'Wavelenght (nm)', x = 'Year', title = 'Summer (JJA) Lake Color Over Time') +
  theme_bw() +
  theme(legend.position = 'none')
```

### Now lets apply our model and look at clarity and some of it's key dynamics. Note the points below are still colored by their dominant wavelength allowing us to see how lake color changes with predicted clarity.

```{r}
Mendota <- Mendota %>%
  rename_at(vars(Blue, Green, Red, Nir, Swir1, Swir2), tolower) %>%
  mutate(dWL = fui.hue(red, green, blue),
         NR = nir/red,
         ndvi = (nir-red)/(nir+red),
         gndvi = (nir-green)/(nir + green),
         ndti = (red-green)/(red + green)) %>%
  left_join(fui.lookup)

Mendota$sdd <- predict(xgb.naive, as.matrix(Mendota[,feats]))

# Monthly Climatology
ggplot(Mendota, aes(x = month, y = sdd)) +
  #geom_violin(draw_quantiles = .5) +
  geom_boxplot(outlier.colour = 'transparent') +
  geom_jitter(aes(color = color.code), size = 2, position = position_jitter(.2)) +
  scale_color_identity() +
  labs(y = 'Predicted Water Clarity (m)', title = 'Monthly Climatology') +
  theme_bw() +
  theme(legend.position = 'none')

# Summer color observations over time
Mendota %>%
  filter(month %in% c('Jun', 'Jul', 'Aug')) %>%
  ggplot(., aes(x = date, y = sdd)) +
  geom_point(aes(color = color.code), size = 3) +
  geom_smooth(se = T, method = 'lm') +
  scale_color_identity() +
  labs(y = 'Clarity (m)', x = 'Year', title = 'Summer (JJA) Lake Color Over Time') +
  theme_bw() +
  theme(legend.position = 'none')
```

# That's the gist of it! Now go do some cool remote sensing work!

If you have any questions feel free to reach out at [sntopp\@live.unc.edu](mailto:sntopp@live.unc.edu){.email} ![](https://media.giphy.com/media/3oEhn6803hZKJNeMhy/giphy.gif)

### Bonus: Model Explainability

#### Now lets look at some interpretability metrics, specifically SHAP values and feature importance.

SHAP values show the distribution of feature effects across all observations. A more detailed description can be found at <https://christophm.github.io/interpretable-ml-book/shap.html>

```{r}
import <- xgb.importance(feats, xgb.naive)
xgb.plot.importance(import)

xgb.plot.shap(data = as.matrix(as.munged[,feats]), top_n = 6, n_col = 3, model = xgb.naive)
```