---
title: "Boosting as a Species Distribution Model in R"
author: "Elke Windschitl"
format: html
editor: visual
date: 2022-12-02
---
## Introduction
In this post, I attempt to reproduce work by [Elith et al. 2008](https://besjournals.onlinelibrary.wiley.com/doi/10.1111/j.1365-2656.2008.01390.x)^1^ to model the species distribution of short-finned eel (*Anguilla australis*) using Boosted Regression Trees. This analysis was created for an assignment for EDS 232 Machine Learning in Environmental Science -- a course in UCSB's Master of Environmental Data Science curriculum taught by Mateo Robbins.
Boosting is a popular machine learning algorithm that builds models sequentially, each one learning from the errors of the previous model^2^. Here, decision trees are built in sequence using extreme gradient boosting to classify presence or absence of short-finned eel at a given location from environmental parameters such as temperature, slope, rainy days, etc. Elith et al. used the *gbm* package in R, whereas I use a tidymodels approach in R.
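The sequential logic can be sketched in a few lines of base R. This is a toy illustration of gradient boosting with squared-error loss and hand-rolled depth-one "stumps" on synthetic data -- a sketch of the idea only, not the xgboost classification algorithm fit below:

```r
# Toy sketch of boosting: each stump is fit to the residuals of the
# current ensemble, and its prediction is added with shrinkage.
fit_stump <- function(x, r) {
  # pick the best single split from a small grid of candidate cut points
  cuts <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
  best <- NULL
  best_sse <- Inf
  for (cut in cuts) {
    left  <- mean(r[x <  cut])
    right <- mean(r[x >= cut])
    sse <- sum((r - ifelse(x < cut, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = left, right = right)
    }
  }
  best
}

set.seed(123)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

pred <- rep(mean(y), length(y)) # start from a constant model
eta  <- 0.1                     # learning rate (shrinkage)
for (m in 1:200) {
  r <- y - pred                 # pseudo-residuals under squared-error loss
  s <- fit_stump(x, r)          # weak learner fit to the residuals
  pred <- pred + eta * ifelse(x < s$cut, s$left, s$right) # shrunken update
}
mean((y - pred)^2) # training error falls as more stumps are added
```

The learning rate `eta` plays the same role as `learn_rate` in the xgboost models tuned below: smaller values shrink each tree's contribution, requiring more trees but usually generalizing better.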
## Data
Data labeled "model.data.csv" were retrieved from the supplementary information of Elith et al. (2008) and include the following variables:

```{r}
#| include: false
library(readr)
eel_data_raw <- read_csv("/Users/elkewindschitl/Documents/MEDS/eds-232/labs/eds-232-labs/model.data.csv")
```
```{r}
#| warning: false
# Load libraries
library(tidyverse)
library(tidymodels)
library(sjPlot)
library(pROC)
library(RColorBrewer)
library(knitr)
eel_data <- eel_data_raw %>%
  select(-Site) # remove site number from data frame
eel_data$Angaus <- as.factor(eel_data$Angaus) # set outcome variable as a factor
eel_data %>% slice_head(n = 5) %>% kable()
```
## Methods
### Split and Resample
I split the data from above into training and test sets (70/30), stratified by the outcome Angaus. I then resampled the training set with 10-fold cross-validation, also stratified by Angaus.
```{r}
#| warning: false
# Stratified sampling with the rsample package
set.seed(123) # Set a seed for reproducibility
split <- initial_split(data = eel_data,
                       prop = 0.7,
                       strata = "Angaus")
eel_train <- training(split) # Grab training data
eel_test <- testing(split)   # Grab test data

# Set up cross validation stratified on Angaus
cv_folds <- eel_train %>%
  vfold_cv(v = 10, strata = "Angaus")
```
### Preprocessing
I created a recipe to prepare the data for the XGBoost model. I was interested in predicting the binary outcome variable Angaus which indicates presence or absence of the eel species *Anguilla australis*.
```{r}
#| warning: false
# Set up a recipe
eel_rec <- recipe(Angaus ~ ., data = eel_train) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep(training = eel_train, retain = TRUE)

# Bake to check the recipe
baked_eel <- bake(eel_rec, eel_train)
```
### Tuning XGBoost
### Tune Learning Rate
First, I tuned only the learn_rate parameter.
```{r}
#| warning: false
eel_spec <- parsnip::boost_tree(mode = "classification",
                                engine = "xgboost",
                                trees = 1000,
                                learn_rate = tune())
```
I set up a grid to tune my model by using a range of learning rate parameter values.
```{r}
#| warning: false
# Set up tuning grid
eel_grid <- expand.grid(learn_rate = seq(0.0001, 0.5, length.out = 50))

# Set up workflow
wf_eel_tune <- workflow() %>%
  add_recipe(eel_rec) %>%
  add_model(eel_spec)

doParallel::registerDoParallel(cores = 3)
set.seed(123)

# Tune (the workflow already carries the recipe, so no formula is needed)
eel_rs <- tune_grid(
  wf_eel_tune,
  resamples = cv_folds,
  grid = eel_grid
)
```
```{r}
#| warning: false
# Identify best values from the tuning process
eel_rs %>%
  tune::show_best(metric = "accuracy") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the learning rate parameter values.")

eel_best_learn <- eel_rs %>%
  tune::select_best(metric = "accuracy")
### Tune Tree Parameters
I created a new specification where I set the learning rate and tuned the tree parameters.
```{r}
#| warning: false
eel_spec2 <- parsnip::boost_tree(mode = "classification",
                                 engine = "xgboost",
                                 trees = 1000,
                                 learn_rate = eel_best_learn$learn_rate,
                                 min_n = tune(),
                                 tree_depth = tune(),
                                 loss_reduction = tune())
```
I set up a tuning grid using grid_max_entropy() to get a representative sampling of the parameter space.
```{r}
#| warning: false
# Define parameters to be tuned
eel_params <- dials::parameters(
  min_n(),
  tree_depth(),
  loss_reduction()
)

# Set up grid
eel_grid2 <- dials::grid_max_entropy(eel_params, size = 50)

# Set up workflow
wf_eel_tune2 <- workflow() %>%
  add_recipe(eel_rec) %>%
  add_model(eel_spec2)

set.seed(123)
doParallel::registerDoParallel(cores = 3)

# Tune
eel_rs2 <- tune_grid(
  wf_eel_tune2,
  resamples = cv_folds,
  grid = eel_grid2
)
```
```{r}
#| warning: false
# Identify best values from the tuning process
eel_rs2 %>%
  tune::show_best(metric = "accuracy") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the tree parameter values.")

eel_best_trees <- eel_rs2 %>%
  tune::select_best(metric = "accuracy")
```
### Tune Stochastic Parameters
I created another new specification where I set the learning rate and tree parameters and tuned the stochastic parameters.
```{r}
#| warning: false
eel_spec3 <- parsnip::boost_tree(mode = "classification",
                                 engine = "xgboost",
                                 trees = 1000,
                                 learn_rate = eel_best_learn$learn_rate,
                                 min_n = eel_best_trees$min_n,
                                 tree_depth = eel_best_trees$tree_depth,
                                 mtry = tune(),
                                 loss_reduction = eel_best_trees$loss_reduction,
                                 sample_size = tune(),
                                 stop_iter = tune())
```
I set up a tuning grid using grid_max_entropy() again.
```{r}
#| warning: false
# Define parameters to be tuned
eel_params2 <- dials::parameters(
  finalize(mtry(), select(baked_eel, -Angaus)),
  sample_size = sample_prop(c(0.4, 0.9)),
  stop_iter()
)

# Set up grid
eel_grid3 <- dials::grid_max_entropy(eel_params2, size = 50)

# Set up workflow
wf_eel_tune3 <- workflow() %>%
  add_recipe(eel_rec) %>%
  add_model(eel_spec3)

set.seed(123)
doParallel::registerDoParallel(cores = 3)

# Tune
eel_rs3 <- tune_grid(
  wf_eel_tune3,
  resamples = cv_folds,
  grid = eel_grid3
)
```
```{r}
#| warning: false
# Identify best values from the tuning process
eel_rs3 %>%
  tune::show_best(metric = "accuracy") %>%
  slice_head(n = 5) %>%
  kable(caption = "Performance of the best models and the associated estimates for the stochastic parameter values.")

eel_best_stoch <- eel_rs3 %>%
  tune::select_best(metric = "accuracy")
eel_model3 <- eel_spec3 %>%
  finalize_model(eel_best_stoch)
```
### Finalize workflow
I assembled my final workflow with all of my optimized parameters and did a final fit.
```{r}
#| warning: false
eel_final_spec <- parsnip::boost_tree(mode = "classification",
                                      engine = "xgboost",
                                      trees = 1000,
                                      learn_rate = eel_best_learn$learn_rate,
                                      min_n = eel_best_trees$min_n,
                                      tree_depth = eel_best_trees$tree_depth,
                                      mtry = eel_best_stoch$mtry,
                                      loss_reduction = eel_best_trees$loss_reduction,
                                      stop_iter = eel_best_stoch$stop_iter,
                                      sample_size = eel_best_stoch$sample_size)

# Set up workflow
wf_eel_final <- workflow() %>%
  add_recipe(eel_rec) %>%
  add_model(eel_final_spec)

final_simple_fit <- wf_eel_final %>% # fit to just the training data (needed later)
  fit(data = eel_train)
final_eel_fit <- last_fit(eel_final_spec, Angaus ~ ., split) # fits on training, then predicts on test

# Show predictions
final_pred_tab <- as.data.frame(final_eel_fit$.predictions)
head(final_pred_tab) %>%
  kable(caption = "Predictions of Angaus presence on test data.")

final_met_tab <- final_eel_fit$.metrics # Store metrics
head(final_met_tab) %>%
  kable(caption = "Accuracy and area under the receiver operating characteristic curve of the final fit.")

# Bind predictions and original data
eel_test_rs <- cbind(eel_test, final_eel_fit$.predictions)
eel_test_rs <- eel_test_rs[, -1] # Remove duplicate column

# Compute a confusion matrix
cm <- eel_test_rs %>% yardstick::conf_mat(truth = Angaus, estimate = .pred_class)
autoplot(cm, type = "heatmap") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title = element_text(size = 14),
        panel.background = element_rect(fill = "#F8F8F8"),
        plot.background = element_rect(fill = "#F8F8F8")) +
  labs(title = "Figure 2: Confusion matrix of predictions on test data.")

final_metrics <- final_eel_fit %>% collect_metrics()
final_eel_accuracy <- final_metrics %>%
  filter(.metric == "accuracy") %>%
  pull(.estimate)
final_eel_auc <- final_metrics %>%
  filter(.metric == "roc_auc") %>%
  pull(.estimate)
```
The model had an accuracy of `r round(final_eel_accuracy,2)`. The ROC area under the curve was `r round(final_eel_auc, 2)`. The rate of false negatives was `r round(cm$table[3]/nrow(eel_test), 2)`, and the rate of false positives was `r round(cm$table[2]/nrow(eel_test),2)`.
## Results
I then fit my final model to the evaluation data set labeled "eel.eval.data.csv" and compared performances.
```{r}
#| include: false
eval_dat_raw <- read_csv("/Users/elkewindschitl/Documents/MEDS/eds-232/labs/eds-232-labs/eel.eval.data.csv")
```
```{r}
#| warning: false
# Read in eval data
eval_dat <- eval_dat_raw %>%
  rename(Angaus = Angaus_obs) %>% # rename to match previous data
  mutate(Angaus = as_factor(Angaus)) # make outcome a factor

# Generate predictions
prediction <- final_simple_fit %>% predict(new_data = eval_dat)
eval_dat_pred <- cbind(eval_dat, prediction)

# Compare predicted classes to actual classes
correct_predictions <- sum(eval_dat_pred$.pred_class == eval_dat_pred$Angaus)

# Calculate accuracy
accuracy <- correct_predictions / nrow(eval_dat_pred)

# Calculate AUC (note: computed from hard class predictions, so it
# understates the AUC attainable from predicted probabilities)
eval_dat_pred$pred_num <- as.numeric(eval_dat_pred$.pred_class)
auc <- auc(eval_dat_pred$Angaus, eval_dat_pred$pred_num)
```
How did my model perform on this data?
The model had an accuracy of `r round(accuracy, 2)` on these data, which is not quite as good as its accuracy on the test data. However, the difference is not extreme and seems reasonable given that a dummy classifier would score 0.744. The model had an AUC of `r round(auc[1], 2)`, which is not great.
How did my results compare to those of Elith et al.?
The model here does not do as well as the model in Elith et al., which reported an AUC of 0.858. My AUC of `r round(auc[1], 2)` is disappointingly far off. This could be because the data have imbalanced classes. Elith et al. did more tuning to find optimal values and used larger ranges of trees, whereas I was limited by the computing power of my laptop. They also could have tuned the classification threshold, which I did not do here. I also used a tidymodels approach, whereas the *gbm* package may offer more power or flexibility when tuning.
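One likely contributor to the low AUC above is that it was computed from hard class labels rather than predicted probabilities, which discards the model's ranking information. A sketch of a probability-based alternative, assuming the `final_simple_fit` and `eval_dat` objects from above (yardstick treats the first factor level, "0", as the event by default):

```r
# Sketch: AUC from predicted probabilities instead of hard class labels
prob_pred <- final_simple_fit %>%
  predict(new_data = eval_dat, type = "prob") # returns .pred_0 and .pred_1
eval_dat_probs <- dplyr::bind_cols(eval_dat, prob_pred)
eval_dat_probs %>%
  yardstick::roc_auc(truth = Angaus, .pred_0) # first level ("0") is the event by default
```

This would give a fairer comparison to Elith et al., whose AUC was computed from continuous predicted values.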
## References
^1^Elith, J., Leathwick, J.R. and Hastie, T. (2008), A working guide to boosted regression trees. Journal of Animal Ecology, 77: 802-813. https://doi.org/10.1111/j.1365-2656.2008.01390.x
^2^Boehmke, Bradley, and Brandon Greenwell. “Chapter 12 Gradient Boosting.” Hands-On Machine Learning with R, 2020, https://bradleyboehmke.github.io/HOML/gbm.html. Accessed 3 Oct. 2023.