---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
# StreetEasy
*SDS 410 Capstone in Statistical & Data Sciences, Spring 2021, Smith College*
Semester-long project in partnership with StreetEasy on predicting New York real estate prices
using natural language processing and machine learning algorithms.
* **Team:** Lauren Low, Dayana Meza, Emma Scott, Xian (Elaine) Ye, Yanwan Zhu
* **Project Partner:** Yipeng Lai @ StreetEasy
* **Faculty Mentor:** Prof. Ben Baumer
### Files
* **data** - Create this folder within the working directory to store the .csv data files
* **paper_MDPI** - Folder containing final paper **paper_MDPI.Rmd** and related files like **figures.Rmd** and **mybibfile.bib**
* **pre-processing.R** - Data cleaning script that prepares `sale_listings` for modeling: filters out unreasonable values, joins with `zipcodeR` data, removes duplicate listings, imputes NA values, and splits the data into training and test sets
* **random_forest.Rmd** - Random forest machine learning model using existing variables and text-based variables created from `listing_description`; also includes visualizations of model error
* **text_processing.R** - Natural language processing script that creates binary keyword variables and performs `AFINN` sentiment analysis
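As a rough illustration of two steps named above (binary keyword variables built from `listing_description`, and a training/test split), here is a minimal base R sketch. The helper names `add_keyword_flags` and `split_train_test`, the `has_` column prefix, the keyword list, the 80/20 split, the seed, and the toy data are all assumptions made for this example; they are not taken from the project scripts, which may work differently.

```r
# Hypothetical helper (not from the project scripts): add one 0/1 column per
# keyword, set to 1 when the keyword appears in the listing description.
# Matching is case-insensitive and literal; missing descriptions yield 0.
add_keyword_flags <- function(listings, keywords,
                              text_col = "listing_description") {
  text <- tolower(listings[[text_col]])
  for (kw in keywords) {
    col_name <- paste0("has_", gsub("[^a-z0-9]+", "_", tolower(kw)))
    listings[[col_name]] <- as.integer(grepl(tolower(kw), text, fixed = TRUE))
  }
  listings
}

# Hypothetical helper: reproducible random split into training and test sets.
# The 80/20 fraction and the seed are assumptions for illustration.
split_train_test <- function(listings, train_frac = 0.8, seed = 410) {
  set.seed(seed)
  n <- nrow(listings)
  train_idx <- sample(seq_len(n), size = floor(train_frac * n))
  test_idx <- setdiff(seq_len(n), train_idx)
  list(
    train = listings[train_idx, , drop = FALSE],
    test = listings[test_idx, , drop = FALSE]
  )
}

# Toy data invented for this example only
toy <- data.frame(
  listing_description = c(
    "Sunny loft with DOORMAN",
    "Cozy studio near park",
    "Renovated kitchen, doorman building",
    NA,
    "Garden-level one bedroom"
  ),
  stringsAsFactors = FALSE
)

flagged <- add_keyword_flags(toy, c("doorman", "renovated"))
parts <- split_train_test(flagged)
```

Literal matching with `fixed = TRUE` keeps keywords from being interpreted as regular expressions; a real keyword list would likely need word-boundary handling as well.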
### Instructions
1. In the working directory, create a **data** folder containing **amenities.csv**, **documentation - amenities.csv**, **documentation - sale_listings.csv**, and **sale_listings.csv**
2. Run **pre-processing.R** to load its functions into the global environment
3. Run **text_preprocessing.R** to load its functions into the global environment
4. Run **random_forest.Rmd** to generate the random forest model and the corresponding model error visualizations
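Before running the scripts, it may help to confirm that the **data** folder from step 1 is complete. The check below is a small sketch in base R; the helper name `missing_data_files` is hypothetical and is not part of the project, while the file names are the ones listed in step 1.

```r
# File names expected in the data folder, as listed in step 1
required_files <- c(
  "amenities.csv",
  "documentation - amenities.csv",
  "documentation - sale_listings.csv",
  "sale_listings.csv"
)

# Hypothetical helper: return the required files that are NOT present
# in data_dir (an empty character vector means everything is in place).
missing_data_files <- function(data_dir = "data", files = required_files) {
  files[!file.exists(file.path(data_dir, files))]
}

not_found <- missing_data_files()
if (length(not_found) > 0) {
  message("Missing from data/: ", paste(not_found, collapse = ", "))
}
```

The call assumes the R session's working directory is the project root, so that `data` resolves to the folder created in step 1.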