In recent years, there has been a massive rise in the usage of dating apps to find love. Many of these apps use sophisticated data science techniques to recommend possible matches to users and to optimize the user experience. These apps give us access to a wealth of information that we've never had before about how different people experience romance.
In this capstone, you will analyze some data from OKCupid, an app that focuses on using multiple choice and short answers to match users.
The dataset provided has the following columns of multiple-choice data:
- body_type
- diet
- drinks
- drugs
- education
- ethnicity
- height
- income
- job
- offspring
- orientation
- pets
- religion
- sex
- sign
- smokes
- speaks
- status
And a set of open short-answer responses to :
- essay0 - My self summary
- essay1 - What I’m doing with my life
- essay2 - I’m really good at
- essay3 - The first thing people usually notice about me
- essay4 - Favorite books, movies, show, music, and food
- essay5 - The six things I could never do without
- essay6 - I spend a lot of time thinking about
- essay7 - On a typical Friday night I am
- essay8 - The most private thing I am willing to admit
- essay9 - You should message me if…
In this capstone, you will create a presentation about your findings in this OkCupid dataset.
The purpose of this capstone is to practice formulating questions and implementing Machine Learning techniques to answer those questions. We will give you guidance about the kinds of questions we asked, and the kinds of methods we used to answer those questions. But the questions you ask and how you answer them are entirely up to you. We're excited to see what kinds of different things you explore. Compared to the other projects you have completed this far, we are requiring few restrictions on how you structure your code. The project is far more open-ended, and you should use your creativity. In addition, much of the code you write for later parts of this project will depend on how you decided to implement earlier parts. Therefore, we strongly encourage you to read through the entire assignment before writing any code.
The data is stored in profiles.csv. We can start to work with it in dating.py by using Pandas, which we have imported for you with the line:
import pandas as pd
and then loading the csv into a DataFrame:
df = pd.read_csv("profiles.csv")
Let's make sure we understand what these columns represent!
Pick some columns and call .head() on them to see the first five rows of data. For example, we were curious about job, so we called:
df.job.head()
You can also call value_counts() on a column to figure out what possible responses there are, and how many of each response there was.
We can start to build graphs from the data by first importing Matplotlib:
from matplotlib import pyplot as plt
and then making some plots!
For example, we were curious about the distribution of ages on the site, so we made a histogram of the age column:
plt.hist(df.age, bins=20)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.xlim(16, 80)
plt.show()
Try this code in your own file and take a look at the histogram it produces!
As we started to look at this data, we started to get more and more curious about Zodiac signs. First, we looked at all of the possible values for Zodiac signs:
df.sign.value_counts()
We started to wonder if there was a way to predict a user's Zodiac sign from the information in their profile. Thinking about the columns we had already explored, we thought that maybe we could classify Zodiac signs using drinking, smoking, drugs, and essays as our features.
In order to answer the question you've formulated, you will probably need to create some new columns in the DataFrame. This is especially true because so much of our data here is categorical (i.e. diet consists of the options vegan, vegetarian, anything, etc. instead of numerical values).
Categorical data is great to use as labels, but we want to create some numerical data as well to use for features.
For our question about Zodiac signs, we wanted to transform the drinks column into numerical data. We used:
drink_mapping = {"not at all": 0, "rarely": 1, "socially": 2, "often": 3, "very often": 4, "desperately": 5}
all_data["drinks_code"] = all_data.drinks.map(drink_mapping)
These lines of code created a new column called 'drinks_code' that mapped the following drinks values to these numbers:
| drinks | drinks_code |
|---|---|
| not at all | 0 |
| rarely | 1 |
| socially | 2 |
| often | 3 |
| very often | 4 |
| desperately | 5 |
We did the same for smokes and drugs.
We also wanted some numerical data about the short answer essays. We combined them all into one string, took out the NaNs, and then created a new column called essay_len:
essay_cols = ["essay0","essay1","essay2","essay3","essay4","essay5","essay6","essay7","essay8","essay9"]
# Removing the NaNs
all_essays = all_data[essay_cols].replace(np.nan, '', regex=True)
# Combining the essays
all_essays = all_essays[essay_cols].apply(lambda x: ' '.join(x), axis=1)
all_data["essay_len"] = all_essays.apply(lambda x: len(x)))
We also created a column with average word length and a column with the frequency of the words "I" or "me" appearing in the essays.
In order to get accurate results, we should make sure our numerical data all has the same weight.
For our Zodiac features, we used:
feature_data = all_data[['smokes_code', 'drinks_code', 'drugs_code', 'essay_len', 'avg_word_length']]
x = feature_data.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
feature_data = pd.DataFrame(x_scaled, columns=feature_data.columns)
We have learned how to perform classification in a few different ways.
- We learned about K-Nearest Neighbors by exploring IMDB ratings of popular movies
- We learned about Support Vector Machines by exploring baseball statistics
- We learned about Naive Bayes by exploring Amazon Reviews
Some questions we used classification to tackle were:
- Can we predict sex with education level and income??
- Can we predict education level with essay text word counts?
We have learned how to perform Multiple Linear Regression by playing with StreetEasy apartment data. Is there a way we can apply the techniques we learned to this dataset?
Some questions we used regression to tackle were:
- Predict income with length of essays and average word length?
- Predict age with the frequency of "I" or "me" in essays?
We also learned about K-Nearest Neighbors Regression. Which form of regression works better to answer your question?
After you have trained your model and run it, you will probably be curious about how well it did.
Find the accuracy, precision, and recall of each model you used, and create graphs showing how they changed.
For our question of classifying Zodiac signs, one graph we produced showed classification accuracy versus k (for K-Nearest Neighbors):
The accuracy we would expect from predicting a Zodiac sign by randomly selecting one would be 1/12, or 0.0833. Our model did not significantly outperform this number. We were unimpressed.
We want to see:
- at least two graphs containing exploration of the dataset
- a statement of your question (or questions!) and how you arrived there
- the explanation of at least two new columns you created and how you did it
- the comparison between two classification approaches, including a qualitative discussion of simplicity, time to run the model, and accuracy, precision, and/or recall
- the comparison between two regression approaches, including a qualitative discussion of simplicity, time to run the model, and accuracy, precision, and/or recall
- an overall conclusion, with a preliminary answer to your initial question(s), next steps, and what other data you would like to have in order to better answer your question(s)
Good luck!
