---
title: "Predicting Oscar Nominations Between 2000 and 2017 (Executive Summary)"
author: "Grace Park, Susan Tran, Nomi Tannenbaum, Jamie Lee, Alex Chang"
date: "Spring 2021"
output:
  html_document:
    toc: yes
    toc_float: yes
    theme: paper
    code_folding: hide
subtitle: Data Science III (STAT 301-3)
editor_options:
  chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# load packages
library(tidyverse)
library(tidymodels)
# load data
potential_data <- read_csv("data/unprocessed/BigML_Dataset_5f50a62c2fb31c516d000176.csv")
# load objects
oscar_recipe <- read_rds("model-info/oscar_recipe.rds")
oscar_train <- read_rds("data/processed/oscar_train.rds")
oscar_test <- read_rds("data/processed/oscar_test.rds")
oscar_folds <- read_rds("data/processed/oscar_folds.rds")
rf_tuned <- read_rds("model-info/tuned/rf_tuned_best.rds")
nn_tuned <- read_rds("model-info/tuned/nn_tuned_best.rds")
bt_tuned <- read_rds("model-info/tuned/bt_tuned_best.rds")
mlp_tuned <- read_rds("model-info/tuned/mlp_tuned_best.rds")
en_tuned <- read_rds("model-info/tuned/en_tuned_best.rds")
# set seed
set.seed(101)
```
# *Introduction*
Our data set, found on the BigML platform, contains IMDb scrapes for 1,183 films released between 2000 and 2017, including award nominations, ratings, and award wins. We aim to predict whether or not a movie received any Oscar nominations. Our outcome variable, `Oscar_nominated`, is a categorical variable that we converted into a factor with two levels: `0` represents movies that received no Oscar nominations, and `1` represents movies that were nominated in at least one Oscar category.
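As a sketch of that recoding (not run here; `oscar_data` is a placeholder for our cleaned data set, and the raw column's original coding may differ), the conversion is a single `mutate()`:

```{r, eval=FALSE}
# Hypothetical sketch: recode the outcome as a two-level factor
# so that `0` = no Oscar nominations and `1` = at least one.
oscar_data <- oscar_data %>%
  mutate(Oscar_nominated = factor(Oscar_nominated, levels = c("0", "1")))
```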
<br>
# *Exploratory Data Analysis*
To start our EDA, we can look at the relationship between a film's total number of award nominations and its total award wins. In the plot below, a film's total number of award nominations is positively correlated with its total wins, and films nominated for the Oscars Best Picture award tend to have more total nominations; this makes sense given that great films will likely be nominated for other prestigious awards as well.
```{r}
potential_data %>% ggplot(aes(awards_nominations, awards_wins)) +
geom_jitter(aes(color = Oscar_Best_Picture_nominated), alpha = 0.5) +
scale_color_manual("Nominated for Oscars Best Picture?",
values = c("#008f69", "#2f84de"),
labels = c("NO", "YES")) +
geom_smooth(color = "black") +
labs(title = "Cumulative Nominations and Awards",
subtitle = "Films Released during 2000–2017",
x = "Total Nominations",
y = "Total Wins",
caption = "For The Academy Awards, Golden Globes, BAFTA, and other prestigious
film critic organizations.") +
theme_classic() +
theme(legend.position = "top",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
```
The bar plot below portrays a bimodal distribution of Oscar Best Picture winners by month of release, with notable peaks in January and the last few months of the year, although there is some missing data to be cognizant of. We speculate that the excitement of the new year might contribute to the January peak, while the peaks in October, November, and December might reflect the fact that those films are more memorable and have more "buzz" around the cutoff date of December 31st.
```{r}
potential_data %>% count(release_date.month, Oscar_Best_Picture_won) %>% drop_na() %>%
filter(Oscar_Best_Picture_won == "Yes") %>%
  select(release_date.month, n) %>%
ggplot(aes(release_date.month, n)) +
geom_col(aes(fill = release_date.month)) +
scale_x_continuous(breaks = seq(1, 12, 1)) +
labs(title = "Total Number of Oscar Best Picture Winners, By Month of Release",
subtitle = "Films Released during 2000–2017",
x = "Calendar Month",
y = NULL) +
theme_classic() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
```
<br>
# *Resampling and Feature Engineering*
Before beginning the feature engineering process, we split the data using 70% for the training set (`oscar_train`) and 30% for the testing set (`oscar_test`). We then resampled our training data set (`oscar_folds`) using V-fold cross-validation with five folds and three repeats, stratified by our outcome variable, `Oscar_nominated`.
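The splitting and resampling steps described above can be sketched as follows (a minimal outline, not run here; `oscar_data` is a placeholder for our cleaned data set):

```{r, eval=FALSE}
# Stratified 70/30 initial split, then 5-fold CV with 3 repeats,
# both stratified by the outcome variable.
oscar_split <- initial_split(oscar_data, prop = 0.70, strata = Oscar_nominated)
oscar_train <- training(oscar_split)
oscar_test  <- testing(oscar_split)
oscar_folds <- vfold_cv(oscar_train, v = 5, repeats = 3, strata = Oscar_nominated)
```

Stratifying both the split and the folds keeps the proportion of nominated films roughly constant across training, testing, and resamples.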
For our recipe, we included 10 predictors to predict our outcome variable: `duration`, `votes`, `rate`, `metascore`, `gross`, `awards_wins`, `awards_nominations`, `Golden_Globes_nominated`, `Critics_Choice_nominated`, and `BAFTA_nominated`. Additional context on our variables can be found in our codebook (`data/codebook/cleaned_data_codebook.Rd`).
```{r}
oscar_recipe <- recipe(
Oscar_nominated ~ duration + votes + rate + metascore + gross
+ awards_wins + awards_nominations + Golden_Globes_nominated +
Critics_Choice_nominated + BAFTA_nominated,
data = oscar_train
) %>%
step_impute_bag(metascore, gross) %>%
step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
step_normalize(all_numeric())
```
<br>
# *Model Assessment*
We tried five different models: boosted tree (`bt_model`), elastic net (`en_model`), K-nearest neighbors (`nn_model`), multilayer perceptron (`mlp_model`), and random forest (`rf_model`). Our random forest model was the most successful, with a cross-validated accuracy of 0.875 prior to fitting the model on the entire training set or evaluating it on the testing set.
```{r}
tune_results <- tibble(
model_type = c("rf_tuned", "nn_tuned", "mlp_tuned", "en_tuned", "bt_tuned"),
tune_info = list(rf_tuned, nn_tuned, mlp_tuned, en_tuned, bt_tuned),
assessment_info = map(tune_info, collect_metrics),
best_model = map(tune_info, ~select_best(.x, metric = "accuracy")))
tune_results %>%
select(model_type, assessment_info) %>%
unnest(assessment_info) %>%
filter(.metric == "accuracy") %>%
group_by(model_type) %>%
summarise(accuracy = max(mean)) %>%
arrange(desc(accuracy))
```
<br>
# *Model Fit and Results*
After fitting our winning random forest model on the full training set and predicting the testing set, we obtained the following results for ROC AUC and accuracy. Although accuracy was a bit lower, both metrics indicate strong model performance.
```{r}
# define-model
rf_model <- rand_forest(mode = "classification",
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger")
# define-tuning-grid
rf_params <- parameters(rf_model) %>%
update(mtry = mtry(c(1, 10)))
rf_grid <- grid_regular(rf_params, levels = 5)
# workflow
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(oscar_recipe)
# tuned-workflow
rf_workflow_tuned <- rf_workflow %>%
finalize_workflow(select_best(rf_tuned, metric = "accuracy"))
# fit
rf_fit_results <- fit(rf_workflow_tuned, oscar_train)
# results
rf_accuracy <- rf_fit_results %>%
predict(oscar_test) %>%
bind_cols(truth = oscar_test$Oscar_nominated) %>%
accuracy(truth = truth, estimate = .pred_class)
rf_fit_results %>%
predict(new_data = oscar_test, type = "prob") %>%
bind_cols(truth = oscar_test$Oscar_nominated) %>%
roc_auc(truth = truth, .pred_0) %>% # roc-auc
  bind_rows(rf_accuracy) %>% # random forest accuracy (stored above)
mutate(
Metric = .metric, # rename
Estimate = .estimate
) %>%
select(Metric, Estimate) # select
rf_fit_results %>%
predict(new_data = oscar_test, type = "prob") %>%
bind_cols(truth = oscar_test$Oscar_nominated) %>%
roc_curve(truth = truth, .pred_0) %>%
autoplot()
```
<br>
# *Conclusion*
Wrapping up our analysis, our model performed quite strongly in predicting whether a film received any Oscar nominations. Next steps could include seeking more context on the meaning of some variables by gaining access to a codebook from the original data source, which we did not have as a resource.
<br>
# *GitHub Repo Link*
[https://github.com/susantmtran/301final](https://github.com/susantmtran/301final){target="_blank"}