Home
yanwanz edited this page Mar 19, 2021 · 9 revisions
Google doc with initial questions for Yipeng & notes taken from the meeting
Data-related: Unit type dictionary
NONE = ' '
CONDO = 'D'
COOP = 'P'
TOWNHOUSE = 'T'
LOFT = 'L'
CONDOP = 'N'
HOUSE = 'H'
MULTIFAMILY = 'M'
RENTAL = 'R'
UNKNOWN = '?'
LAND = 'A'
COMMERCIAL = 'C'
MOBILE_HOME = 'E'
BUILDING = 'B'
ESTATE = 'S'
APARTMENT = 'F' # F is for Flat. Because EVERY OTHER LETTER IN APARTMENT WAS ALREADY TAKEN!
UNCLASSIFIED = 'U'
ANYHOUSE = 'X'
AUCTION = 'Z'
FRACTIONAL = 'Y'
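A minimal sketch of how these codes could be decoded in Python, assuming the unit type is stored as a one-character string. The class name `UnitType` and the helper `decode_unit_type` are hypothetical and not part of the project code; unmapped codes fall back to UNKNOWN.

```python
from enum import Enum


class UnitType(Enum):
    """One-character unit type codes, copied from the dictionary above."""
    NONE = " "
    CONDO = "D"
    COOP = "P"
    TOWNHOUSE = "T"
    LOFT = "L"
    CONDOP = "N"
    HOUSE = "H"
    MULTIFAMILY = "M"
    RENTAL = "R"
    UNKNOWN = "?"
    LAND = "A"
    COMMERCIAL = "C"
    MOBILE_HOME = "E"
    BUILDING = "B"
    ESTATE = "S"
    APARTMENT = "F"
    UNCLASSIFIED = "U"
    ANYHOUSE = "X"
    AUCTION = "Z"
    FRACTIONAL = "Y"


def decode_unit_type(code):
    """Return the unit type name for a code, or 'UNKNOWN' if the code is not in the dictionary."""
    try:
        return UnitType(code).name
    except ValueError:
        return UnitType.UNKNOWN.name
```

For example, `decode_unit_type("F")` gives the name `APARTMENT`, which is easier to read in plots and tables than the raw code.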
Different ways of featurizing text data:
- Define a target feature/tag such as “whether the apartment has been recently renovated” and infer the value from text using regular expressions
- Use a bag-of-words method by counting the frequency of words observed in each document
- Use pre-trained word embeddings or train our own word embeddings
- Create a pre-processing script to get everyone on the same page in data cleaning and wrangling
- Create a linear regression model and a decision tree model that incorporate the same set of variables.
- Create 1 or 2 text-based variables.
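The first two featurization options (a regex-based tag and bag-of-words counts) can be sketched as follows. This is only an illustration: the pattern in `RENOVATED_PATTERN` and both function names are assumptions, and the wording agents actually use in listing descriptions would need to be checked against the data.

```python
import re
from collections import Counter

# Hypothetical pattern: phrases that might signal a recent renovation.
RENOVATED_PATTERN = re.compile(
    r"\b(?:newly|recently|gut|fully)[\s-]+renovated\b", re.IGNORECASE
)


def is_recently_renovated(description):
    """Binary tag: True if the description matches the renovation pattern."""
    return bool(RENOVATED_PATTERN.search(description or ""))


def bag_of_words(description):
    """Count how often each lowercase word appears in one description."""
    tokens = re.findall(r"[a-z']+", (description or "").lower())
    return Counter(tokens)
```

The tag can go into the linear regression and decision tree models directly as a binary variable, while the word counts would typically be restricted to a small vocabulary first.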
From Yipeng:
- size_sqft: We have a lot of NA and 0 values for this feature because the information is not provided by the agents (and is not listed in the official documents for co-op apartments). You can use the building, neighborhood, zipcode, or borough median, controlling for the number of rooms, for imputation.
- anyrooms = # bedroom + # bathroom + # living room + # dining room + # office + # any other type of rooms. The feature should be highly correlated with the sum of the bedroom and bathroom counts, so it may be redundant.
- time_to_subway is the inferred raw time to the nearest subway station, and may be large for listings in New Jersey. You can decide whether to keep this feature or use other geospatial features.
- Optional geospatial feature engineering: I just recalled that someone on my team has done research on geolocations and showed how rent is correlated with properties’ distance from Union Square. In addition to Union Square, you can also consider landmarks such as Central Park and Prospect Park if you want more geo features.
- Imputation for missing data
- To improve decision tree: pruning parameter, random forest?
- here() or data folder: the data folder seems to be fine for our purpose!
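Two of Yipeng's suggestions, median imputation of size_sqft and a distance-to-landmark feature, might look roughly like this. It is a sketch under assumptions: listings are plain dicts, missing sizes are stored as `None` or 0, the column names `neighborhood` and `bedrooms` are placeholders for whatever grouping we choose, and the Union Square coordinates are approximate.

```python
import math
from collections import defaultdict
from statistics import median

# Approximate latitude/longitude of Union Square (an assumption, not from the data).
UNION_SQUARE = (40.7359, -73.9911)


def impute_size_sqft(listings, group_key="neighborhood", rooms_key="bedrooms"):
    """Fill missing/zero size_sqft with the median of listings sharing group and room count."""
    observed = defaultdict(list)
    for row in listings:
        size = row.get("size_sqft")
        if size:  # skips None and 0, which both mean "not provided"
            observed[(row[group_key], row[rooms_key])].append(size)
    medians = {key: median(values) for key, values in observed.items()}

    imputed = []
    for row in listings:
        new_row = dict(row)  # copy so the input rows are not modified
        if not new_row.get("size_sqft"):
            # Stays None when no comparable listing reports a size.
            new_row["size_sqft"] = medians.get((row[group_key], row[rooms_key]))
        imputed.append(new_row)
    return imputed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))
```

A distance feature for a listing would then be `haversine_km(lat, lon, *UNION_SQUARE)`, and the same function works for Central Park or Prospect Park with different coordinates. Listings with no comparable group could fall back to a wider level such as borough.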
How to deal with outliers & multiple listings?
- Set a threshold (99% or 95% quantile) and set the exceeding values to this number
- Exclude listings that are clearly not in NY/NJ (long/lat)
- Use the latest listing if there are duplicates. Listings with larger IDs are newer
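The three rules above could be sketched like this. Everything named here is an assumption for illustration: the bounding box is a rough rectangle around NY/NJ rather than an exact boundary, the field names (`id`, `latitude`, `longitude`, `building_id`, `unit`) are placeholders, and the quantile uses a simple nearest-rank rule.

```python
import math

# Rough, hypothetical bounding box around New York and New Jersey.
NY_NJ_BOUNDS = {"lat": (38.9, 45.1), "lon": (-79.8, -71.8)}


def cap_at_quantile(values, q=0.99):
    """Set values above the q-th quantile (nearest rank) to that quantile."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    threshold = ordered[index]
    return [min(v, threshold) for v in values]


def in_ny_nj(row, bounds=NY_NJ_BOUNDS):
    """True if the listing's coordinates fall inside the bounding box."""
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= row["latitude"] <= lat_max and lon_min <= row["longitude"] <= lon_max


def keep_latest(listings, key_fields=("building_id", "unit")):
    """For listings sharing the same key, keep only the one with the largest id."""
    latest = {}
    for row in listings:
        key = tuple(row[field] for field in key_fields)
        if key not in latest or row["id"] > latest[key]["id"]:
            latest[key] = row
    return list(latest.values())
```

Capping keeps every row in the data, unlike dropping outliers, so the row count stays the same across everyone's models.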
Other:
- Decide on our own if we want to use amenities or not
- Amenity information was entered by agents
- Possible to use same information from properties in the same building
Lingering questions:
- Decision trees: separate models for numerical & categorical variables right now, creating new binary variables? (Lauren)
- Organizing the project: changing the names? add everyone's "playground" files to .gitignore?
- I wish I had spent more time checking the GitHub repo and everyone’s work before I started +
- I wish I had spent more time fetching/merging/pulling changes before deciding what my next steps were
- I wish I had used user stories more often to decide my next tasks
- What if we varied the people who review/merge pull requests? +
- I wish we had one file for text analysis, one for EDA, one for decision trees, etc. ++
- What if we made the issues/user stories on the project board even smaller/more specific, so that they’re more digestible? +++
- What if we had more 30 min sessions where we got together and worked throughout the week?
- What if we not only updated our code but also gave a one-sentence summary of what we want the team to know? +