---
title: "sentiment_analysis"
output: html_document
---

```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(textdata)
library(tidytext)
library(SnowballC)
source("pre-processing.R")
```

```{r message=FALSE, warning=FALSE}
load_data() # creates two dataframes in the global environment

clean_data() # cleans the data

sale_listings_imputed <- impute_data() # imputes size values and returns a new dataframe
```

Emma's sentiment analysis of the `sale_listings` data frame:
```{r load afinn sentiment lexicon, cache = TRUE}
afinn_lex <- get_sentiments("afinn")
```

```{r data frame sentiment analysis, cache = TRUE}
sentiments <- sale_listings %>%
  select(id, listing_description) %>% # Keep the listing id so scores can be joined back later
  unnest_tokens(output = word,
                input = listing_description) %>% # One row per word in listing_description
  anti_join(stop_words, by = "word") %>% # Remove stop words
  inner_join(afinn_lex, by = "word") # Keep AFINN-scored words; values range from -5 to +5

sentiment_summary <- sentiments %>%
  group_by(id) %>%
  summarise(score = sum(value)) # One row per listing: the sum of its AFINN word values

# Next up is adding this to our model and/or making a visualization
```
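
A later chunk refers to `sentiments_join`, which this file does not define. As a rough sketch only, and assuming it is meant to be the listings joined to their summed scores, the join could look like the following. The toy tibbles and their values are hypothetical stand-ins for `sale_listings` and `sentiment_summary`, and treating unscored listings as 0 is also an assumption.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for sale_listings and sentiment_summary
toy_listings <- tibble(
  id = c(1, 2, 3),
  listing_description = c("lovely quiet home",
                          "spacious and beautiful",
                          "two bedroom unit")
)
toy_summary <- tibble(id = c(1, 2), score = c(3, 3))

# Listings with no scored words are missing from the summary.
# left_join() keeps them with an NA score; coalesce() then replaces
# that NA with 0 (an assumed "neutral" default).
toy_join <- toy_listings %>%
  left_join(toy_summary, by = "id") %>%
  mutate(score = coalesce(score, 0))
```

A `left_join()` is used here so that no listing is dropped; an `inner_join()` would instead keep only listings that contain at least one scored word.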

Lauren's sentiment analysis of a single description:
```{r manually reading in a single description as a vector}
text <- c("
SPACIOUS and GRACIOUS. Perfect full-time home or pied-a-terre. Enjoy all the conveniences of East 57th Street living in this exceptionally large and quiet one bedroom apartment with a windowed dining room that can also be converted into an office or nursery if desired. As an added bonus, the maintenance includes electricity and other utilities. This lovely home features an enormous open living/dining area, a spacious kitchen with plenty of cabinets and counter space, an extra-large master bedroom, extremely generous closets, and beautiful built-ins. The apartment is also extremely peaceful and quiet, as it faces north, with plenty of natural light and open city views. Set in a perfect Midtown location with easy access to multiple subway and bus lines, a Whole Foods across the street, Bloomingdale's around the corner and Central Park within walking distance. Easy access to the E train and Long Island City too! The Harridge House has an attentive full-time staff, a beautiful marble and glass lobby, renovated hallways, a roof deck, two newly renovated laundry rooms, bike storage, a computerized package tracking system and an on-site garage with direct building access. Please note the building does not allow dogs."
)
```

```{r finding most common word in text}
# Put the text in a one-row tibble (data_frame() is deprecated)
text_df <- tibble(long_string = text)

# Give each word its own row
word_col <- text_df %>%
  unnest_tokens(output = word, input = long_string)
word_col

# # removes stop words like "the", "and", "before", "after", "such", "as", etc.
# no_stops  <- word_col  %>%
#   anti_join(stop_words)
# no_stops
```

```{r}
# Keep only AFINN-scored words; score repeats the description's total on every row
afinn <- word_col %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  mutate(score = sum(value))
afinn
```
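
The chunk above uses `mutate(score = sum(value))`, whereas the data frame analysis uses `summarise()`. A small sketch of the difference, using made-up words and values rather than real AFINN entries:

```r
library(dplyr)
library(tibble)

# Hypothetical word scores, not real AFINN values
toy_words <- tibble(word = c("alpha", "beta", "gamma"),
                    value = c(2, -1, 3))

# mutate() keeps one row per word and repeats the total on each row
per_word <- toy_words %>% mutate(score = sum(value))

# summarise() collapses everything to a single row holding the total
total <- toy_words %>% summarise(score = sum(value))
```

`mutate()` is convenient when the individual words should stay visible next to the total; `summarise()` is the better fit when only one score per description is needed.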

```{r make variables for number of words and characters}
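# Requires sentiments_join, which is not created in this file;
# run chunks 6 and 16 to get sentiments_join
sentiment_wc <- sentiments_join %>%
  mutate(char_count = str_length(listing_description),
         # Count runs of non-whitespace so repeated spaces do not inflate the total
         word_count = str_count(listing_description, "\\S+"))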
```

```{r scatter of words vs sentiment score}
sentiment_linear <- lm(score ~ word_count, data = sentiment_wc)
summary(sentiment_linear)

ggplot(sentiment_wc, aes(x = word_count, y = score)) +
  geom_point() +
  geom_smooth(method = "lm")
```