---
title: "Movie Review Appendix"
output: html_document
---
```{r include=FALSE}
library(ggplot2)
source("analyze.R")
projection_Rdata_file <- "data/movie_movie_projection.Rdata"
load(projection_Rdata_file)
descriptive_Rdata_file <- "data/projection_descriptions.Rdata"
load(descriptive_Rdata_file)
```
# 1998 - 2000 Analysis
## Node and Edge Counts
```{r echo = F}
summary(projection1998)
```
## Degree Distribution
```{r echo = F}
hist(description_1998$degree_dist,
     main = "Degree Distribution",
     xlab = "Degree",
     breaks = 100,
     col = "blue")
```
The degree distribution shows that most movies share reviewers with only a small number of other movies.
## Degree Strength Distribution
```{r echo = F}
hist(description_1998$degree_strength,
     main = "Degree Strength Distribution",
     xlab = "Degree Strength",
     breaks = 100,
     col = "blue")
```
The degree strength distribution shows that most movies receive few reviews from reviewers who also reviewed other movies.
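
To make the difference between degree and degree strength concrete, here is a minimal base R sketch on a made-up four-movie projection. The matrix `adj` and the movie names are hypothetical; the report's own metrics come from `analyze.R`, which is not shown here and may compute them differently.

```r
# Hypothetical movie-by-movie projection: entry [i, j] is the number of
# reviewers shared by movies i and j (toy data, not from the report).
movie_ids <- paste0("m", 1:4)
adj <- matrix(c(0, 2, 1, 0,
                2, 0, 0, 0,
                1, 0, 0, 0,
                0, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(movie_ids, movie_ids))

# Degree: number of distinct neighbouring movies, ignoring edge weights.
movie_degree <- rowSums(adj > 0)

# Degree strength: sum of edge weights, i.e. total shared reviewers.
movie_strength <- rowSums(adj)
```

In this toy example `m1` has degree 2 but strength 3, because it shares two reviewers with `m2` and one with `m3`.
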
## Eigenvector Centrality Distribution
```{r echo = F}
hist(description_1998$evcentrality,
     main = "Eigenvector Centrality Distribution",
     xlab = "Eigenvector Centrality",
     breaks = 100,
     col = "blue")
```
The eigenvector centrality distribution shows that few movies are connected by reviewers to several other highly connected movies. A small spike marks the few movies that break the trend and share reviewers with several other well-connected movies.
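
As a rough illustration of what eigenvector centrality measures, the sketch below takes the leading eigenvector of a small, made-up adjacency matrix using base R's `eigen()`. The graph is hypothetical, and scaling the scores so the largest equals 1 is an assumed convention, not necessarily what `analyze.R` does.

```r
# Hypothetical connected toy graph: m1 is linked to every other movie,
# m2 and m3 are also linked to each other, m4 hangs off m1 only.
movie_ids <- paste0("m", 1:4)
adj <- matrix(c(0, 1, 1, 1,
                1, 0, 1, 0,
                1, 1, 0, 0,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(movie_ids, movie_ids))

# For a symmetric matrix eigen() returns eigenvalues in decreasing order,
# so the first column of $vectors is the leading eigenvector.
decomp <- eigen(adj, symmetric = TRUE)

# The sign of an eigenvector is arbitrary; take absolute values, then
# scale so the most central movie scores 1 (assumed convention).
leading <- abs(decomp$vectors[, 1])
ev_centrality <- leading / max(leading)
names(ev_centrality) <- movie_ids
```

A movie scores highly when its neighbours score highly, which is why `m4`, attached only to `m1`, still ends up well below `m2` and `m3`.
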
## Transitivity Distribution
```{r echo = F}
hist(description_1998$transitivity,
     main = "Transitivity Distribution",
     xlab = "Transitivity",
     breaks = 100,
     col = "blue")
```
The spike in the transitivity distribution at 1 shows that, for most movies, the neighboring movies are also tightly connected to one another. Because the degree of the network is low, these cliques are fairly small.
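
The link between low degree and a spike at 1 can be seen in a small base R sketch of local transitivity (the share of a movie's neighbour pairs that are themselves connected). The toy graph is hypothetical and the calculation is a generic one, not taken from `analyze.R`.

```r
# Hypothetical toy graph: m1-m2-m3 form a triangle, m4 is linked to m1 only.
adj <- matrix(c(0, 1, 1, 1,
                1, 0, 1, 0,
                1, 1, 0, 0,
                1, 0, 0, 0),
              nrow = 4, byrow = TRUE)

k <- rowSums(adj)                        # degree of each movie
neighbour_pairs <- k * (k - 1) / 2       # possible links among neighbours

# diag(A %*% A %*% A) counts closed walks of length 3; each triangle
# through a node is counted twice (once per direction).
triangles <- diag(adj %*% adj %*% adj) / 2

# Local transitivity; NaN where a movie has fewer than two neighbours.
local_transitivity <- triangles / neighbour_pairs
```

A movie with only two neighbours reaches a transitivity of exactly 1 as soon as those two neighbours share a reviewer, so a low-degree network can produce many values of 1 from very small cliques.
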
# 2000 Analysis
## Node and Edge Counts
```{r echo = F}
summary(projection2000)
```
## Degree Distribution
```{r echo = F}
hist(description_2000$degree_dist,
     main = "Degree Distribution",
     xlab = "Degree",
     breaks = 100,
     col = "blue")
```
The degree distribution shows that most movies share reviewers with only a small number of other movies.
## Degree Strength Distribution
```{r echo = F}
hist(description_2000$degree_strength,
     main = "Degree Strength Distribution",
     xlab = "Degree Strength",
     breaks = 100,
     col = "blue")
```
The degree strength distribution shows that most movies receive few reviews from reviewers who also reviewed other movies.
## Eigenvector Centrality Distribution
```{r echo = F}
hist(description_2000$evcentrality,
     main = "Eigenvector Centrality Distribution",
     xlab = "Eigenvector Centrality",
     breaks = 100,
     col = "blue")
```
The eigenvector centrality distribution shows that few movies are connected by reviewers to several other highly connected movies. There are more highly connected movies in this data set than in the 1998-2000 set.
## Transitivity Distribution
```{r echo = F}
hist(description_2000$transitivity,
     main = "Transitivity Distribution",
     xlab = "Transitivity",
     breaks = 100,
     col = "blue")
```
The spike in the transitivity distribution at 1 shows that, for most movies, the neighboring movies are also tightly connected to one another.
# Comparison of 1998-1999 vs First 2000
## Degree Distribution Comparison
```{r echo = F}
# Stack both data sets into one data frame, labelled by starting year.
degree_dist_join <- rbind(
  data.frame(year = 1998, degree = description_1998$degree_dist),
  data.frame(year = 2000, degree = description_2000$degree_dist)
)
degree_dist_join$year <- as.factor(degree_dist_join$year)
# after_stat(density) replaces the deprecated ..density.. notation
# (needs ggplot2 >= 3.3.0).
ggplot(data = degree_dist_join,
       aes(x = log(degree), y = after_stat(density), fill = year)) +
  geom_histogram(position = "identity",
                 colour = "black",
                 binwidth = 0.5,
                 alpha = 0.4) +
  xlab("Log(Degree)") +
  ylab("Density") +
  ggtitle("Degree Density Comparison")
```
The degree distribution of the 2000 data set has a much longer, fatter tail, meaning that more movies have a high degree.
## Degree Strength Distribution Comparison
```{r echo = F}
# Stack both data sets into one data frame, labelled by starting year.
degree_strength_join <- rbind(
  data.frame(year = 1998, strength = description_1998$degree_strength),
  data.frame(year = 2000, strength = description_2000$degree_strength)
)
degree_strength_join$year <- as.factor(degree_strength_join$year)
# after_stat(density) replaces the deprecated ..density.. notation
# (needs ggplot2 >= 3.3.0).
ggplot(data = degree_strength_join,
       aes(x = log(strength), y = after_stat(density), fill = year)) +
  geom_histogram(position = "identity",
                 colour = "black",
                 binwidth = 0.5,
                 alpha = 0.4) +
  xlab("Log(Degree Strength)") +
  ylab("Density") +
  ggtitle("Degree Strength Density Comparison")
```
The degree strength distribution of the 2000 data set has a much longer, fatter tail, meaning that more movies have a large number of shared reviewers.
## Eigenvector Centrality Distribution Comparison
```{r echo = F}
# Stack both data sets into one data frame, labelled by starting year.
evcent_dist_join <- rbind(
  data.frame(year = 1998, evcent = description_1998$evcentrality),
  data.frame(year = 2000, evcent = description_2000$evcentrality)
)
evcent_dist_join$year <- as.factor(evcent_dist_join$year)
# after_stat(density) replaces the deprecated ..density.. notation
# (needs ggplot2 >= 3.3.0).
ggplot(data = evcent_dist_join,
       aes(x = log(evcent), y = after_stat(density), fill = year)) +
  geom_histogram(position = "identity",
                 colour = "black",
                 binwidth = 0.5,
                 alpha = 0.4) +
  xlab("Log(Eigenvector Centrality)") +
  ylab("Density") +
  ggtitle("Eigenvector Centrality Density Comparison")
```
The eigenvector centrality distribution of the 2000 data set is shifted higher than that of the 1998-2000 data set. This is in line with the increased degree of this data set.
## Transitivity Distribution Comparison
```{r echo = F}
# Stack both data sets into one data frame, labelled by starting year.
transitivity_dist_join <- rbind(
  data.frame(year = 1998, transitivity = description_1998$transitivity),
  data.frame(year = 2000, transitivity = description_2000$transitivity)
)
transitivity_dist_join$year <- as.factor(transitivity_dist_join$year)
# Transitivity is plotted on its raw scale (no log), so the axis label says so.
# after_stat(density) replaces the deprecated ..density.. notation
# (needs ggplot2 >= 3.3.0).
ggplot(data = transitivity_dist_join,
       aes(x = transitivity, y = after_stat(density), fill = year)) +
  geom_histogram(position = "identity",
                 colour = "black",
                 binwidth = 0.1,
                 alpha = 0.4) +
  xlab("Transitivity") +
  ylab("Density") +
  ggtitle("Transitivity Density Comparison")
```
The transitivity of the 2000 data set is shifted slightly higher, meaning that the network is forming more tightly connected groups.
# Network Plots
## Production Quality Plots
### 1998 - 2000 Projection of Movie Reviews
Movie nodes are sized by betweenness; edges represent reviewers shared between movies during the years 1998 and 1999. Movies that could not be connected to other movies by reviewers were not considered. Movie data had to be deduplicated to eliminate variations of the same content (DVD/VHS ...).
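
The movie-to-movie projection described above can be sketched in base R as a matrix product of a reviewer-by-movie incidence matrix with itself. The `reviews` table, its column names, and the IDs are all hypothetical; the report's actual projection is loaded from `data/movie_movie_projection.Rdata` and may have been built differently.

```r
# Hypothetical review table: one row per (reviewer, movie) review.
reviews <- data.frame(
  reviewer = c("r1", "r1", "r2", "r2", "r2", "r3"),
  movie    = c("A",  "B",  "A",  "B",  "C",  "D"),
  stringsAsFactors = FALSE
)

reviewers <- sort(unique(reviews$reviewer))
movies <- sort(unique(reviews$movie))

# Reviewer-by-movie incidence matrix: 1 if the reviewer reviewed the movie.
incidence <- matrix(0, nrow = length(reviewers), ncol = length(movies),
                    dimnames = list(reviewers, movies))
incidence[cbind(reviews$reviewer, reviews$movie)] <- 1

# t(incidence) %*% incidence counts shared reviewers for each movie pair.
projection <- crossprod(incidence)
diag(projection) <- 0                    # drop self-links

# Remove isolates: movies that share no reviewer with any other movie.
connected <- rowSums(projection) > 0
projection <- projection[connected, connected]
```

In this toy data movie `D` is reviewed only by `r3`, who reviewed nothing else, so it has no edges and is dropped, while `A` and `B` end up joined by an edge of weight 2.
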

### Sub-networks of 1998 - 2000 Movie Reviews
Breakout sub-graphs of the 1998 - 2000 data. Groups are formed by modularity. Sizing and filtering are the same as in the previous plot.
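
The report does not say which modularity-based grouping algorithm was used, so the sketch below only shows the modularity score itself, computed in base R for one assumed grouping of a made-up graph. Community-detection methods search for the grouping that makes this score large.

```r
# Hypothetical toy graph: two triangles (1-2-3 and 4-5-6) joined by edge 3-4.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
adj <- matrix(0, nrow = 6, ncol = 6)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1                   # mirror edges: undirected graph

membership <- c(1, 1, 1, 2, 2, 2)        # assumed grouping: one per triangle

k <- rowSums(adj)                        # degree of each node
two_m <- sum(adj)                        # twice the number of edges
same_group <- outer(membership, membership, "==")

# Q = (1 / 2m) * sum over same-group pairs of (A_ij - k_i * k_j / 2m)
modularity_q <- sum((adj - outer(k, k) / two_m) * same_group) / two_m
```

For this grouping the score is 5/14, roughly 0.36; putting all six nodes in a single group would give 0.
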

### Comparison between 1998 - 2000 & 2000
Plot of the 1998 - 2000 projection next to the projection of movie reviews taken from the year 2000. Isolates are removed. Notice the large increase in movies and edges.

## Exploration Plots
### Plots with Movie IDs
Due to the large number of movies, in the early exploration we looked at graphs like these and then looked up the movie for each ID.

**1998 - 2000 Sized & Colored by # of Triangles Formed**
Image of the movies in the final dataset, colored and sized by number of triangles. A spline scaling method was used for the node sizing because there was a huge difference between the small nodes (barely visible) and the large nodes. The spline allowed the clusters to form nicely while still showing the very small relationships through auxiliary nodes.
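
The report does not specify which spline was used for node sizing. As one possible reading, the sketch below maps raw triangle counts to node sizes through a monotone spline built with `stats::splinefun()`. The counts, the knot positions, and the size range are all hypothetical choices for illustration.

```r
# Hypothetical triangle counts with a very heavy tail.
triangles <- c(0, 1, 1, 2, 3, 5, 8, 40, 400, 5000)

# Assumed control points: raw value -> plotted node size. Small counts get
# most of the size range so that they stay visible next to the giants.
knot_values <- c(0, 10, 100, 5000)
knot_sizes  <- c(2, 8, 14, 20)

# "monoH.FC" is a monotone Hermite spline, so a larger count never
# produces a smaller node.
size_fun <- splinefun(knot_values, knot_sizes, method = "monoH.FC")
node_size <- size_fun(triangles)
```

With these assumed knots the largest node is only ten times the size of the smallest, even though the raw counts span more than three orders of magnitude.
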

**1998 - 2000 Sized & Colored by Degree Strength**
An early exploratory visualization of the graph, with nodes colored and sized by degree. This was one of the last visualizations made before modularity was used for coloring, which revealed the real groupings.

**1998 - 2000 Sized & Colored by Closeness**
Movies are colored and sized by closeness; the same spline scaling was used for this visualization as well. In this cluster the top-right movies were all horror and the middle-bottom movies were greatest hits. With this visualization we really knew that the clustering was working and that there was further knowledge to be gained about the movies' relationships.

### Clusters formed by Duplicated Movie Data
This graph covers the year 1998 only. We found that the clusters were caused by movies being duplicated in the data with the same reviews.
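
One way to detect this kind of duplication is to treat movies with an identical set of reviewers as variants of the same title. The base R sketch below follows that assumption; the `reviews` table and its IDs are made up, and the report's actual deduplication rule is not described, so this is only an illustration of the idea.

```r
# Hypothetical review table in which "A_dvd" and "A_vhs" carry the same reviews.
reviews <- data.frame(
  movie    = c("A_dvd", "A_dvd", "A_vhs", "A_vhs", "B",  "B"),
  reviewer = c("r1",    "r2",    "r2",    "r1",    "r1", "r3"),
  stringsAsFactors = FALSE
)

# Signature of each movie: its sorted, unique reviewer IDs pasted together.
reviewer_sets <- split(reviews$reviewer, reviews$movie)
signature <- vapply(reviewer_sets,
                    function(r) paste(sort(unique(r)), collapse = "|"),
                    character(1))

# Map every movie to the first movie that has the same signature.
canonical <- names(signature)[match(signature, signature)]
names(canonical) <- names(signature)

# Keep only the reviews that belong to a canonical movie.
deduped <- reviews[canonical[reviews$movie] == reviews$movie, ]
```

Here `A_vhs` maps to `A_dvd`, so its two reviews are dropped and the duplicate pair no longer forms an artificial cluster in the projection.
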
**Layout 1**
Nodes are colored by modularity and sized by centrality. Part of the original dataset during investigation. The bottom green cluster served as the starting point for further investigation into the relationships within groups.

**Initial Movie visualizations**
Nodes are colored by centrality and laid out with edges made more prominent. The clusters of larger nodes were pushed toward the extremities to highlight the smaller graphs in an attempt to find some structure.
