---
title: "L02 More Models"
subtitle: "Data Science 3 with R (STAT 301-3)"
author: "YOUR NAME"

format:
  html:
    toc: true
    embed-resources: true
    code-fold: show
    link-external-newwindow: true

execute:
  warning: false

from: markdown+emoji
---

## Github Repo Link

::: {.callout-important}

To link to your GitHub **repo**sitory, edit the example link below: replace `https://your-github-repo-url` with your GitHub repo URL. Verify that the link works before submitting.

[https://your-github-repo-url](https://your-github-repo-url)

:::

## Overview

The main goals of this lab are (1) review and practice the application of machine learning within the `tidymodels` framework and (2) introduce and explore a few new model types.

## Dataset

We will use the `wildfires.csv` dataset in the **data** subdirectory. Students should begin with `wildfires_codebook.html`, which provides a quick overview of the data.

## Instructions

The park from which this data was collected contains a wildlife protection area, and we want to predict whether a wildfire will reach it (`wlf`). Use all the other variables in the dataset as predictors except `burned`, which we will use as a target variable in the future.

Using the `tidymodels` framework, pick the best model from the following candidate models:

1.  Elastic net

    -   tune `mixture` and `penalty`
    -   include all two-way interactions

2.  Nearest neighbors

    -   tune number of `neighbors`

3.  Random forest

    -   tune `mtry` and `min_n`

4.  Boosted tree

    -   tune `mtry`, `min_n`, and `learn_rate`

5.  Support vector machine (polynomial)

    -   tune `cost`, `degree`, and `scale_factor` (default values are sufficient; feel free to change them if you want)

6.  Support vector machine (radial basis function)

    -   tune `cost` and `rbf_sigma` (default values are sufficient; feel free to change them if you want)

7.  Single Layer Neural Network (multilayer perceptron --- mlp)

    -   tune `hidden_units` and `penalty` (default values are sufficient; feel free to change them if you want)
    -   `nnet` will be the easiest engine. Alternatively, you might want to try `keras` if you can get it installed ([Keras Installation](https://tensorflow.rstudio.com/guide/keras/)).

8. Multivariate adaptive regression splines (MARS)

    - tune `num_terms` (you need to supply an upper bound) and `prod_degree` (the default works here)
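
As a rough illustration of the candidate list above, the sketch below shows how the tuning flags might be written with `parsnip`, and how a recipe could add all two-way interactions for the elastic net. It is a minimal sketch, not the course solution. Several things are assumptions: every engine except `nnet` (`glmnet`, `kknn`, `ranger`, `xgboost`, `kernlab`, `earth`), the upper bound of 5 for `num_terms`, and the toy data, in which only `wlf` and `burned` are names taken from this lab while `x1`, `x2`, and `season` are made up.

```r
library(tidymodels)

set.seed(301)

# Hypothetical stand-in for the wildfires data. Only `wlf` and `burned` come
# from the lab; `x1`, `x2`, and `season` are invented for illustration.
toy <- tibble(
  x1     = rnorm(100),
  x2     = rnorm(100),
  season = factor(sample(c("dry", "wet"), 100, replace = TRUE)),
  burned = runif(100),
  wlf    = factor(sample(c("no", "yes"), 100, replace = TRUE))
)

# Model specifications: tune() marks each hyperparameter named in the list.
# Engines other than "nnet" are assumptions, not requirements of the lab.
en_spec <- logistic_reg(mode = "classification", penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")
knn_spec <- nearest_neighbor(mode = "classification", neighbors = tune()) |>
  set_engine("kknn")
rf_spec <- rand_forest(mode = "classification", mtry = tune(), min_n = tune()) |>
  set_engine("ranger")
bt_spec <- boost_tree(mode = "classification", mtry = tune(), min_n = tune(),
                      learn_rate = tune()) |>
  set_engine("xgboost")
svm_poly_spec <- svm_poly(mode = "classification", cost = tune(), degree = tune(),
                          scale_factor = tune()) |>
  set_engine("kernlab")
svm_rbf_spec <- svm_rbf(mode = "classification", cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab")
mlp_spec <- mlp(mode = "classification", hidden_units = tune(), penalty = tune()) |>
  set_engine("nnet")
mars_spec <- mars(mode = "classification", num_terms = tune(), prod_degree = tune()) |>
  set_engine("earth")

# num_terms has no default upper bound, so one is supplied here.
# The value 5 is an arbitrary placeholder.
mars_params <- extract_parameter_set_dials(mars_spec) |>
  update(num_terms = num_terms(range = c(1L, 5L)))

# Recipe for the elastic net: drop `burned`, dummy-encode factors, add every
# two-way interaction, remove constant columns, then normalize.
en_recipe <- recipe(wlf ~ ., data = toy) |>
  step_rm(burned) |>
  step_dummy(all_nominal_predictors()) |>
  step_interact(~ all_predictors():all_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Inspect the processed columns; interaction columns carry "_x_" in their names.
baked <- bake(prep(en_recipe), new_data = NULL)
names(baked)
```

The tree-based models would normally use a recipe without the interaction step, since the lab only asks for interactions in the elastic net.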

Some general notes:

- For tuning, we suggest using 5 folds and 3 repeats.
- Make sure you specify which performance measure you are using to pick the best model.
- Almost all work should be done in R scripts; the write-up only reports the results (we will see the R scripts in your repo).
- We suggest using jobs.
- A suggested basic layout is provided.
- We also want to record how long the tuning process takes for each model type. We can use the `tictoc` package; code is provided in `template_tune.R`.
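
The resampling and timing notes above could look roughly like the following. This is a hedged sketch and not the contents of `template_tune.R`, which is not reproduced here. The toy data (`x1`, `x2`), the choice of `roc_auc` as the performance measure, and the grid size of 4 are all assumptions made so the example runs quickly on its own.

```r
library(tidymodels)
library(tictoc)

set.seed(301)

# Hypothetical stand-in data; the real scripts would read the wildfires data.
toy <- tibble(
  x1  = rnorm(200),
  x2  = rnorm(200),
  wlf = factor(ifelse(x1 + x2 + rnorm(200) > 0, "yes", "no"))
)

# 5 folds repeated 3 times gives 15 resamples, stratified on the outcome.
folds <- vfold_cv(toy, v = 5, repeats = 3, strata = wlf)

# Single layer neural network with the nnet engine, as suggested in the lab.
mlp_wflow <- workflow() |>
  add_recipe(
    recipe(wlf ~ ., data = toy) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(
    mlp(mode = "classification", hidden_units = tune(), penalty = tune()) |>
      set_engine("nnet")
  )

# Time the tuning step. The metric is stated explicitly so it is clear
# which performance measure drives model selection.
tic("mlp")
mlp_tuned <- tune_grid(
  mlp_wflow,
  resamples = folds,
  grid      = 4,                    # small illustrative grid size
  metrics   = metric_set(roc_auc)
)
time_log <- toc(quiet = TRUE)

# Store the elapsed time in seconds so it can be reported in a table later.
runtime <- tibble(
  model       = time_log$msg,
  runtime_sec = unname(time_log$toc - time_log$tic)
)

best_mlp <- show_best(mlp_tuned, metric = "roc_auc", n = 1)
```

Saving `runtime` alongside the tuned object in each model's script keeps the write-up limited to loading and reporting results.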

## What should be turned in

A short write-up that includes:

1. A nicely formatted table that lists the 8 general model types and the best performance each achieved.

2. A nicely formatted table that lists the run time for the tuning process for the 8 model types (could be combined with first table).

3. Final selection, training, and evaluation of the best model.
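
For the third item, one possible shape of the final selection, training, and evaluation step is sketched below. It assumes an initial train/test split (which this document does not spell out), uses hypothetical toy data with made-up predictors `x1` and `x2`, and picks `roc_auc` as the metric purely for illustration. In the lab, the winning model type would be whichever performed best across the eight candidates.

```r
library(tidymodels)

set.seed(301)

# Hypothetical stand-in data; replace with the wildfires data in practice.
toy <- tibble(
  x1  = rnorm(200),
  x2  = rnorm(200),
  wlf = factor(ifelse(x1 + x2 + rnorm(200) > 0, "yes", "no"))
)

# Assumed 80/20 split; resamples are drawn from the training portion only.
toy_split <- initial_split(toy, prop = 0.8, strata = wlf)
toy_folds <- vfold_cv(training(toy_split), v = 5, repeats = 3, strata = wlf)

mlp_wflow <- workflow() |>
  add_formula(wlf ~ .) |>
  add_model(
    mlp(mode = "classification", hidden_units = tune(), penalty = tune()) |>
      set_engine("nnet")
  )

mlp_tuned <- tune_grid(
  mlp_wflow,
  resamples = toy_folds,
  grid      = 4,
  metrics   = metric_set(roc_auc)
)

# Final selection: take the best hyperparameters by the chosen metric.
best_params <- select_best(mlp_tuned, metric = "roc_auc")
final_wflow <- finalize_workflow(mlp_wflow, best_params)

# Training and evaluation: fit on the training set, score on the test set.
final_fit    <- last_fit(final_wflow, split = toy_split, metrics = metric_set(roc_auc))
test_metrics <- collect_metrics(final_fit)
test_metrics
```

The tibble returned by `collect_metrics()` could then be passed to a table function such as `knitr::kable()` for the write-up.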

## Github Repo Link

[YOUR GITHUB URL](YOUR GITHUB URL){target="_blank"}