Home

Welcome to the streeteasy wiki!

- Notes for the team
- Sprint 2

Notes for the team

Google doc with initial questions for Yipeng & notes taken from the meeting

Data-related: Unit type dictionary

NONE         = ' '
CONDO        = 'D'
COOP         = 'P'
TOWNHOUSE    = 'T'
LOFT         = 'L'
CONDOP       = 'N'
HOUSE        = 'H'
MULTIFAMILY  = 'M'
RENTAL       = 'R'
UNKNOWN      = '?'
LAND         = 'A'
COMMERCIAL   = 'C'
MOBILE_HOME  = 'E'
BUILDING     = 'B'
ESTATE       = 'S'
APARTMENT    = 'F' # F is for Flat. Because EVERY OTHER LETTER IN APARTMENT WAS ALREADY TAKEN!
UNCLASSIFIED = 'U'
ANYHOUSE     = 'X'
AUCTION      = 'Z'
FRACTIONAL   = 'Y'
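
A minimal sketch of how the codes above might be decoded in Python when cleaning the data. The dictionary values are copied from the list above; the function name `decode_unit_type` and the choice to fall back to `UNKNOWN` for unrecognized codes are assumptions for illustration, not part of the original data dictionary.

```python
# Code -> name lookup, copied from the unit type dictionary above.
UNIT_TYPES = {
    " ": "NONE",
    "D": "CONDO",
    "P": "COOP",
    "T": "TOWNHOUSE",
    "L": "LOFT",
    "N": "CONDOP",
    "H": "HOUSE",
    "M": "MULTIFAMILY",
    "R": "RENTAL",
    "?": "UNKNOWN",
    "A": "LAND",
    "C": "COMMERCIAL",
    "E": "MOBILE_HOME",
    "B": "BUILDING",
    "S": "ESTATE",
    "F": "APARTMENT",
    "U": "UNCLASSIFIED",
    "X": "ANYHOUSE",
    "Z": "AUCTION",
    "Y": "FRACTIONAL",
}


def decode_unit_type(code):
    """Return the readable name for a one-character unit type code.

    Assumption: codes that are not in the table are reported as UNKNOWN.
    """
    return UNIT_TYPES.get(code, "UNKNOWN")
```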

Different ways for featurizing text data:

Define a target feature/tag such as “whether the apartment has been recently renovated” and infer its value from the text using regular expressions
Use a bag-of-words method, counting how often each word appears in each document
Use pre-trained word embeddings or train our own word embeddings
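
The first two options above can be sketched with the Python standard library. This is only an illustration: the regular expression, the function names, and the sample descriptions are made up for the example, and a real "renovated" tag would likely need a broader pattern (and handling of phrases like "needs renovation").

```python
import re
from collections import Counter

# Hypothetical pattern: matches "renovated", "renovation", "renovations".
RENOVATED_RE = re.compile(r"\brenovat(?:ed|ion|ions)\b", re.IGNORECASE)


def is_renovated(description):
    """Regex-based tag: 1 if the description mentions a renovation, else 0."""
    return int(bool(RENOVATED_RE.search(description)))


def bag_of_words(description):
    """Bag-of-words: count how often each lowercase word appears."""
    tokens = re.findall(r"[a-z']+", description.lower())
    return Counter(tokens)


# Made-up listing descriptions, for illustration only.
docs = [
    "Newly renovated one bedroom with a renovated kitchen.",
    "Sunny studio near the park.",
]
tags = [is_renovated(d) for d in docs]
counts = [bag_of_words(d) for d in docs]
```

Here `tags` holds one 0/1 value per listing and `counts` holds one word-frequency table per listing; either could be joined back to the listing data as new columns.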

Sprint 2

Goal:

Create a pre-processing script to get everyone on the same page in data cleaning and wrangling
Create a linear regression model and a decision tree model that incorporate the same set of variables.
Create 1 or 2 text-based variables.

Notes from Mar 19 meeting

From Yipeng:

size_sqft: We have a lot of NA and 0 values for this feature because the information is not provided by the agents (and is not listed in the official documents for co-op apartments). For imputation, you can use the building, neighborhood, zipcode, or borough median, controlling for the number of rooms.
anyrooms = # bedroom + # bathroom + # living room + # dining room + # office + # any other type of rooms. The feature should be highly correlated with the sum of the bedroom and bathroom counts, so it may be redundant.
time_to_subway is the inferred raw time to the nearest subway station, and may be large for listings in New Jersey. You can decide whether to keep this feature or use other geospatial features instead.
Optional geospatial feature engineering: I just recalled that someone on my team has done research on geolocations and showed how rent is correlated with properties’ distance from Union Square. In addition to Union Square, you can also consider landmarks such as Central Park and Prospect Park if you want more geo features.
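
One way the landmark-distance idea could look in code is the haversine (great-circle) distance from each listing to each landmark. This is a sketch under assumptions: the landmark coordinates are approximate values typed in by hand, and the feature names (`dist_union_square_km`, etc.) are hypothetical.

```python
import math

# Approximate (latitude, longitude) pairs; assumed values for illustration.
LANDMARKS = {
    "union_square": (40.7359, -73.9911),
    "central_park": (40.7829, -73.9654),
    "prospect_park": (40.6602, -73.9690),
}

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def landmark_distances(lat, lon):
    """Return one distance feature per landmark for a single listing."""
    return {
        f"dist_{name}_km": haversine_km(lat, lon, *coords)
        for name, coords in LANDMARKS.items()
    }
```

Calling `landmark_distances(lat, lon)` with a listing's coordinates gives a small dictionary of distances that could be added as extra columns alongside (or instead of) time_to_subway.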

Notes from Mar 15 class

Imputation for missing data
To improve decision tree: pruning parameter, random forest?
here() or data folder: data folder seems to be fine for our purpose!
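
For the imputation point, here is a small sketch of group-median imputation for size_sqft, in the spirit of the "building median controlling for the number of rooms" suggestion. The column names `building_id` and `bedrooms`, the function name, and the sample rows are assumptions; only `size_sqft` comes from the data.

```python
from collections import defaultdict
from statistics import median


def impute_size_sqft(listings, group_keys=("building_id", "bedrooms")):
    """Fill missing (None) or zero size_sqft with the median of its group.

    Rows whose group has no known size are left unchanged, so a coarser
    grouping (e.g. neighborhood or zipcode) could be tried in a second pass.
    Returns new row dicts; the input rows are not modified.
    """
    def is_missing(value):
        return value is None or value == 0

    # Collect known sizes per group.
    known = defaultdict(list)
    for row in listings:
        if not is_missing(row["size_sqft"]):
            known[tuple(row[k] for k in group_keys)].append(row["size_sqft"])

    imputed = []
    for row in listings:
        new_row = dict(row)
        key = tuple(row[k] for k in group_keys)
        if is_missing(row["size_sqft"]) and key in known:
            new_row["size_sqft"] = median(known[key])
        imputed.append(new_row)
    return imputed


# Made-up rows for illustration.
sample = [
    {"building_id": 1, "bedrooms": 1, "size_sqft": 600},
    {"building_id": 1, "bedrooms": 1, "size_sqft": 800},
    {"building_id": 1, "bedrooms": 1, "size_sqft": 0},
    {"building_id": 2, "bedrooms": 2, "size_sqft": None},
]
filled = impute_size_sqft(sample)
```

In this sample the third row takes the median of the two known sizes in its building, while the fourth row stays missing because nothing else in its group has a known size.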

Notes from Mar 12 meeting

How to deal with outliers & multiple listings?

Set a threshold (the 99% or 95% quantile) and cap values that exceed it at the threshold
Exclude listings that are clearly not in NY/NJ (based on long/lat)
Use the latest listing if there are duplicates; listings with larger IDs are newer
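
The three rules above could be sketched as follows. Everything named here is an assumption for illustration: the bounding box is a rough rectangle around New York and New Jersey, the quantile uses a simple nearest-rank definition, and the field names `property_id` and `listing_id` are hypothetical.

```python
import math


def cap_at_quantile(values, q=0.99):
    """Set values above the q-quantile to the quantile value itself.

    Uses the nearest-rank quantile (an assumption; other definitions
    interpolate between neighboring values).
    """
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    threshold = ordered[idx]
    return [min(v, threshold) for v in values]


# Rough bounding box around NY and NJ; assumed values, not exact borders.
LAT_MIN, LAT_MAX = 38.9, 45.1
LON_MIN, LON_MAX = -79.8, -71.8


def in_ny_nj_box(lat, lon):
    """True if the point falls inside the rough NY/NJ rectangle."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def keep_latest(listings, key="property_id", id_field="listing_id"):
    """Keep one row per property: the one with the largest listing ID."""
    latest = {}
    for row in listings:
        k = row[key]
        if k not in latest or row[id_field] > latest[k][id_field]:
            latest[k] = row
    return list(latest.values())
```

A bounding box is deliberately crude: it only removes listings that are clearly elsewhere, and it will still admit points in neighboring states that fall inside the rectangle.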

Other:

Decide on our own if we want to use amenities or not
Amenity information was entered by agents
Possible to use same information from properties in the same building

Lingering questions:

Decision trees: separate models for numerical & categorical variables right now, creating new binary variables? (Lauren)
Organizing the project: changing the names? add everyone's "playground" files to .gitignore?

Notes from Sprint 1 retro

I wish I had spent more time checking the GitHub repo and everyone’s work before I started +
I wish I had spent more time fetching/merging/pulling changes before deciding what my next steps were
I wish I had used user stories more often to decide my next tasks
What if we varied the people who review/merge pull requests? +

Things to change

I wish we had one file for text analysis, one for EDA, one for decision trees, etc. ++
What if we made the issues/user stories on the project board even smaller/more specific, so that they’re more digestible? +++
What if we had more 30 min sessions where we got together and worked throughout the week?
What if we not only updated our code but also gave a one-sentence summary of what we want the team to know? +
