---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

# StreetEasy
*SDS 410 Capstone in Statistical & Data Sciences, Spring 2021, Smith College*

Semester-long project in partnership with StreetEasy on predicting
New York real estate prices using natural language processing and machine learning algorithms.

* **Team:** Lauren Low, Dayana Meza, Emma Scott, Xian (Elaine) Ye, Yanwan Zhu
* **Project Partner:** Yipeng Lai @ StreetEasy
* **Faculty Mentor:** Prof. Ben Baumer

### Files
* **data** - Create this folder in the working directory to store the data .csv files
* **paper_MDPI** - Folder containing final paper **paper_MDPI.Rmd** and related files like **figures.Rmd** and **mybibfile.bib**
* **pre-processing.R** - Data cleaning script to prepare sale_listings for modeling: filter out unreasonable values, join with `zipcodeR` data, remove duplicate listings, impute NA values, split data into training and test sets
* **random_forest.Rmd** - Random forest machine learning model using existing variables and text-based variables created from `listing_description`; also includes visualizations of model error
* **text_processing.R** - Natural language processing script that creates binary keyword variables and performs `AFINN` sentiment analysis
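
As a rough illustration of the text-based variables described above, here is a minimal base-R sketch of binary keyword flags and a lexicon-based sentiment score built from `listing_description`. The example listings, the keyword list, and `toy_lexicon` (a tiny stand-in for the `AFINN` lexicon) are assumptions made for illustration and are not taken from the project scripts.

```r
# Hypothetical example listings; the real data lives in data/sale_listings.csv.
listings <- data.frame(
  listing_description = c(
    "Sunny renovated loft with a doorman",
    "Dark, cramped studio; needs work",
    "Beautiful garden apartment, newly renovated"
  ),
  stringsAsFactors = FALSE
)

# Illustrative keywords, not the project's actual list.
keywords <- c("renovated", "doorman", "garden")

# One 0/1 column per keyword: 1 if the description mentions it.
for (kw in keywords) {
  listings[[paste0("kw_", kw)]] <- as.integer(
    grepl(kw, listings$listing_description, ignore.case = TRUE)
  )
}

# Toy stand-in for the AFINN lexicon (word -> integer score).
toy_lexicon <- c(sunny = 2, beautiful = 3, dark = -1, cramped = -2)

# Sum the scores of every lexicon word found in a description.
score_sentiment <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  sum(lexicon[words[words %in% names(lexicon)]])
}

listings$sentiment <- vapply(
  listings$listing_description, score_sentiment, numeric(1),
  lexicon = toy_lexicon, USE.NAMES = FALSE
)
```

The resulting `kw_*` and `sentiment` columns could then sit alongside the existing variables as model inputs.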

### Instructions
1. In the working directory, create a **data** folder containing **amenities.csv**, **documentation - amenities.csv**, **documentation - sale_listings.csv**, and **sale_listings.csv**
2. Run **pre-processing.R** to load the script's functions into the global environment
3. Run **text_processing.R** to load the script's functions into the global environment
4. Run **random_forest.Rmd** to fit the random forest model and generate the corresponding model error visualizations
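
Before running the scripts, it may help to confirm that step 1 is complete. The sketch below checks the **data** folder for the four files listed above; `required_files` and `missing_data_files()` are hypothetical helper names and are not part of the repository.

```r
# Files the instructions expect inside the data folder.
required_files <- c(
  "amenities.csv",
  "documentation - amenities.csv",
  "documentation - sale_listings.csv",
  "sale_listings.csv"
)

# Return the names of required files that are absent from `data_dir`.
missing_data_files <- function(data_dir = "data", files = required_files) {
  files[!file.exists(file.path(data_dir, files))]
}

# Report anything missing from the default data/ folder.
missing <- missing_data_files()
if (length(missing) > 0) {
  message("Missing from data/: ", paste(missing, collapse = ", "))
}
```

If nothing is reported missing, the `.R` scripts can be run with `source()` from the working directory.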