Students will walk away with introductory & practical knowledge of the Python Data Science stack:
- pandas
- matplotlib
- scikit-learn
- Go to tmpnb.org.
- Select `New > Python 2` to create a Python notebook.
- Follow along with me:
>>> import pandas
>>> import sklearn
>>> import matplotlib
- Python Warm Up
- Intro to Pandas
- Intro to Data Science with scikit-learn
- Visualization with Matplotlib
- More Practice
This problem will give us a review of lists, for loops, and lambda functions.
Given the following list,
names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]
- print out the names that contain the letter "l"
- turn all of the names lowercase
- sort the list of names alphabetically using the built-in `sorted` function (HINT: Use Google)
- sort the list of names by length using the built-in `sorted` function
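One possible solution to the warm-up, as a minimal sketch. The variable names `with_l`, `lower_names`, `alphabetical` and `by_length` are made up for illustration; try the exercise yourself before reading on.

```python
names = ["Michael Fassbender", "Karlie Kloss", "Taylor Swift", "Justin Bieber"]

# 1. Collect and print the names that contain a lowercase "l".
#    The check is case-sensitive, so an uppercase "L" would not match.
with_l = []
for name in names:
    if "l" in name:
        with_l.append(name)
        print(name)

# 2. Lowercase every name with a list comprehension.
lower_names = [name.lower() for name in names]

# 3. sorted returns a new alphabetical list; names itself is unchanged.
alphabetical = sorted(names)

# 4. key= takes a function that is applied to each item before comparing.
#    Names of equal length keep their original relative order.
by_length = sorted(names, key=lambda name: len(name))

print(lower_names)
print(alphabetical)
print(by_length)
```

Here `key=len` would do the same job as the lambda; the lambda form is used because the warm-up is meant to review lambda functions.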
Question: What is pandas?
Here is our first data set. Let's download it and upload it to the datasets folder within the notebook.
With your partner,
- Download and open the dataset
- What is this dataset about?
- What are some questions you might ask about the data?
- Let's read in the data.
- How do we see what columns are available?
- How do we look at just the head or tail of the dataset?
- How do we look at only a few rows?
- How do we only look at certain columns?
- How do we pull out a column and look at it as a series?
- How do we look at only those rows that have Status = won?
- Exercise: How many accounts have a price greater than $12,000?
- How do we get the maximum value of a certain column?
- Exercise: What is the minimum account price? The mean? The sum? The standard deviation?
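The questions above can be sketched on a tiny stand-in table. The workshop dataset is not reproduced here, so the rows below are made up and the `Name` and `Quantity` columns are assumptions; only `Status` and a price column are implied by the questions. In the notebook you would load the real file from the datasets folder with a reader such as `pd.read_csv` instead of building the frame by hand.

```python
import pandas as pd

# Made-up stand-in rows; replace with the real dataset in the notebook.
df = pd.DataFrame({
    "Name": ["Acme", "Birch", "Cedar", "Dune", "Elm"],
    "Quantity": [1, 2, 1, 3, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["won", "pending", "pending", "declined", "won"],
})

print(df.columns)              # which columns are available
print(df.head(2))              # first rows; df.tail(2) shows the last rows
print(df[1:3])                 # a few rows, selected by position
print(df[["Name", "Price"]])   # only certain columns (still a DataFrame)

prices = df["Price"]           # a single column comes back as a Series

# A boolean mask keeps only the rows where the condition is True.
won = df[df["Status"] == "won"]

# Summing a boolean Series counts the True values.
n_expensive = (df["Price"] > 12000).sum()

max_price = prices.max()
summary = (prices.min(), prices.mean(), prices.sum(), prices.std())
print(n_expensive, max_price, summary)
```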
LUNCH
What is the total dollar amount pending?
- How do we add columns?
- Let's add a column called Amount that is equal to Quantity * Price
- Exercise: Let's select just those rows where status is pending and sum up those amounts.
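A minimal sketch of adding the Amount column and totalling the pending rows. The four rows are made up for illustration, so the total printed here is not the answer for the real dataset.

```python
import pandas as pd

# Made-up rows using the column names from the steps above.
df = pd.DataFrame({
    "Quantity": [1, 2, 1, 3],
    "Price": [30000, 10000, 5000, 35000],
    "Status": ["won", "pending", "pending", "declined"],
})

# Assigning to a new column name adds the column.
# The multiplication is done row by row.
df["Amount"] = df["Quantity"] * df["Price"]

# Keep only the pending rows, then sum their amounts.
pending = df[df["Status"] == "pending"]
pending_total = pending["Amount"].sum()
print(pending_total)
```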
Question: What are pivot tables? Why are they useful?
Let's take a look at the documentation here.
- Let's pivot using one index.
- Let's pivot on multiple indexes
- Let's reverse those indexes
- Let's specify which values we care about
- Let's specify which columns we want broken down
- Let's specify how we want the values to be aggregated (`aggfunc`)
- Let's fill N/A values
- Let's get subtotals
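The pivot steps above map onto arguments of `pd.pivot_table`. The sketch below uses a made-up table; the `Manager`, `Rep` and `Product` column names are assumptions chosen for illustration and may not match the workshop dataset.

```python
import pandas as pd

# Made-up rows; column names other than Quantity and Price are hypothetical.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred", "Fred", "Fred"],
    "Rep": ["Craig", "Craig", "Dana", "Cedric", "Cedric", "Wendy"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Software", "CPU"],
    "Quantity": [1, 1, 2, 1, 3, 2],
    "Price": [30000, 10000, 35000, 30000, 10000, 65000],
})

# One index: the listed value columns are averaged per manager
# (mean is the default aggregation).
by_manager = pd.pivot_table(df, index=["Manager"], values=["Price", "Quantity"])

# Multiple indexes: the order decides the nesting, so reversing it regroups rows.
by_manager_rep = pd.pivot_table(df, index=["Manager", "Rep"], values="Price")
by_rep_manager = pd.pivot_table(df, index=["Rep", "Manager"], values="Price")

# values picks what to aggregate, columns breaks it down sideways,
# aggfunc chooses the aggregation, fill_value replaces missing combinations,
# and margins=True appends an "All" row and column holding subtotals.
full = pd.pivot_table(
    df,
    index=["Manager", "Rep"],
    values="Price",
    columns=["Product"],
    aggfunc="sum",
    fill_value=0,
    margins=True,
)
print(by_manager)
print(full)
```

Passing `values` explicitly keeps text columns out of the aggregation, which avoids errors in recent pandas versions when the default mean is applied to non-numeric data.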
(Creative) Exercise: with a partner, use pivot tables to play around with the data. What pivots do you find particularly interesting or useful for this dataset?
- Read in this dataset
- What is this dataset about?
- Let's delete the Unnamed: 0 column.
- Let's compute the duration by turning starttime and stoptime into datetime objects and computing their difference.
- What is the average trip time? What is the minimum and maximum trip time? What is the standard deviation?
- What is the average trip time by station? (Hint: Use pivot tables)
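The cleanup and duration steps can be sketched on a few made-up trips. The station column name `start station name` and the stop column spelling `stoptime` are assumptions; adjust both to whatever the real file uses.

```python
import pandas as pd

# Made-up trip rows; timestamps arrive as plain strings, as they would from a CSV.
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "starttime": ["2015-06-01 08:00:00", "2015-06-01 08:05:00",
                  "2015-06-01 09:00:00", "2015-06-01 09:30:00"],
    "stoptime": ["2015-06-01 08:10:00", "2015-06-01 08:35:00",
                 "2015-06-01 09:20:00", "2015-06-01 09:40:00"],
    "start station name": ["A St", "A St", "B Ave", "B Ave"],
})

# drop returns a new frame without the leftover index column.
df = df.drop(columns=["Unnamed: 0"])

# Convert the strings to datetimes; subtracting two datetime columns
# gives a column of timedeltas.
df["starttime"] = pd.to_datetime(df["starttime"])
df["stoptime"] = pd.to_datetime(df["stoptime"])
df["duration"] = df["stoptime"] - df["starttime"]

# Plain minutes are easier to summarize and pivot than timedeltas.
df["minutes"] = df["duration"].dt.total_seconds() / 60
print(df["minutes"].mean(), df["minutes"].min(),
      df["minutes"].max(), df["minutes"].std())

# Average trip time by station.
by_station = pd.pivot_table(
    df, index="start station name", values="minutes", aggfunc="mean"
)
print(by_station)
```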
STRETCH/BIO BREAK
Think/Pair/Share: What is data science? What are some examples of data science?
- features
- target
Examples:
- spam
- netflix
With a partner,
- Read the data description
- Discuss the data and what we could use this data for
- Upload to datasets/ in the notebook and read in the data with pandas.
Together, (190)
- Let's use pandas's built-in descriptive statistics method to get a statistical summary of the data.
- Let's plot CRIM against MEDV
- By yourself, generate the remaining 12 plots (ZN against MEDV, ..., LSTAT against MEDV)
- Which feature looks to be most predictive?
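A minimal sketch of the summary and the first scatter plot. The numbers are synthetic, generated only so the code runs on its own; in the notebook `df` would be the dataset you read in, and the relationship you see will differ.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in with two of the column names used above.
rng = np.random.RandomState(0)
crim = rng.uniform(0, 10, size=50)
df = pd.DataFrame({
    "CRIM": crim,
    "MEDV": 30 - 2 * crim + rng.normal(0, 2, size=50),
})

# describe gives count, mean, std, min, quartiles and max per numeric column.
stats = df.describe()
print(stats)

# One feature on the x axis, the target on the y axis.
fig, ax = plt.subplots()
ax.scatter(df["CRIM"], df["MEDV"])
ax.set_xlabel("CRIM")
ax.set_ylabel("MEDV")
ax.set_title("CRIM against MEDV")
```

For the remaining plots, the same four plotting lines can go inside a for loop over the other feature column names.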
Think/Pair/Share: What is linear regression? (It's machine learning!) (210)
- cross validation
- training set
- validation set
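To make these three terms concrete, here is a small sketch of how cross validation carves rows into training and validation sets. It assumes a recent scikit-learn, where `KFold` lives in `sklearn.model_selection`; the six rows are made up.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six made-up rows with two features each.
X = np.arange(12).reshape(6, 2)

# With 3 splits, each third of the rows takes one turn as the validation set
# while the remaining rows form the training set for that round.
kfold = KFold(n_splits=3)
splits = list(kfold.split(X))

for train_idx, val_idx in splits:
    print("train rows:", train_idx, "validation rows:", val_idx)
```

Every row ends up in a validation set exactly once, which is why the scores from the rounds can be averaged into one estimate.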
Together, (215)
- Let's separate the data into feature and target.
- Let's separate the feature and target into training and validation set.
- Let's fit the linear regression model using 3 columns.
- Let's plot the linear regression model.
- Let's plot the predictions.
- Let's measure the accuracy.
- Let's see which columns were most predictive.
- Let's use `cross_val_predict` as a shortcut to get the predicted values.
- Let's use `cross_val_score` as a shortcut for the R^2 values. What does `cv` do?
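The regression steps above can be sketched end to end on synthetic data. The three feature columns and the target below are generated, not taken from the workshop dataset, and the imports assume a recent scikit-learn where these helpers live in `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import (
    cross_val_predict,
    cross_val_score,
    train_test_split,
)

# Synthetic data: 3 feature columns; the target depends mostly on the first.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out a quarter of the rows as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_val)
r2 = model.score(X_val, y_val)   # R^2 on rows the model did not train on
print("R^2:", r2)

# Larger absolute coefficients suggest more predictive columns,
# provided the features are on similar scales.
print("coefficients:", model.coef_)

# Shortcuts: cv=5 repeats the split/fit/predict cycle over 5 folds.
cv_predictions = cross_val_predict(LinearRegression(), X, y, cv=5)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("R^2 per fold:", cv_scores)
```

For a regressor, `cross_val_score` reports R^2 by default, one value per fold.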
On your own, (230)
- Run the regression using all of the feature columns.
- How does the model improve/worsen?
- regression
- classification
Question: What are some more examples of regression applications? classification applications?
- RandomForest
- Logistic Regression (poorly named, I know!)
- Support Vector Machines
- Neural Networks (Deep Learning...)
- k Nearest Neighbors
- simple model
- works well when there aren't too many different features
- works well when the scale of each feature is similar (why?). We'll see this in our example.
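A small numeric sketch of why scale matters for kNN, which picks neighbors by distance. The two features (age in years, income in dollars) and the three points are made up, and dividing each column by its maximum is just one simple way to rescale.

```python
import numpy as np

# Made-up points: [age in years, income in dollars]
a = np.array([25.0, 50000.0])
b = np.array([60.0, 50500.0])   # very different age, similar income
c = np.array([26.0, 52000.0])   # nearly the same age, different income


def distance(p, q):
    """Euclidean distance between two points."""
    return np.sqrt(np.sum((p - q) ** 2))


# On the raw values, income differences swamp age differences,
# so b looks like the nearer neighbor of a.
print(distance(a, b), distance(a, c))

# Rescale each column by its maximum so both features span similar ranges.
points = np.array([a, b, c])
scaled = points / points.max(axis=0)

# After rescaling, c becomes the nearer neighbor of a.
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```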
By yourself, take 5 minutes to do the following:
- Read the dataset description. What is this dataset about?
- Upload the dataset to datasets/ in our notebook and read the dataset into pandas
- Separate into feature and target
- Use cross val to run `KNeighborsClassifier`
- Plot these values of n_neighbors 2, 3, 4, 5, 10 against accuracy score. How did it do?
- Let's describe the data.
- Let's normalize the data using `normalize`
- Try KNeighbors again for the different values of n_neighbors. How did it do? Which value of n_neighbors was best?
- Let's manually use `train_test_split` and compare the predicted values with the true values in the test set to more concretely see the output of the model.
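The kNN steps above can be sketched as follows. Several things here are assumptions: the wine dataset bundled with scikit-learn stands in for the workshop file, `normalize` is taken to be `sklearn.preprocessing.normalize`, and `axis=0` is chosen so that each feature column is rescaled (the default rescales each row instead). The helper `knn_scores` is made up for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

# Stand-in dataset bundled with scikit-learn; the workshop file may differ.
X, y = load_wine(return_X_y=True)


def knn_scores(features, target, neighbor_counts=(2, 3, 4, 5, 10)):
    """Mean cross-validated accuracy for each value of n_neighbors."""
    scores = {}
    for k in neighbor_counts:
        model = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(model, features, target, cv=5).mean()
    return scores


raw_scores = knn_scores(X, y)

# axis=0 rescales each feature column to unit length, so no single
# large-valued column dominates the distance calculation.
X_normalized = normalize(X, axis=0)
normalized_scores = knn_scores(X_normalized, y)

print("raw:", raw_scores)
print("normalized:", normalized_scores)

# Manual split: fit on the training rows, then compare predictions
# with the true labels in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, test_size=0.25, random_state=0
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
predicted = model.predict(X_test)
print(list(zip(predicted[:10], y_test[:10])))
```

The two score dictionaries can be plotted with matplotlib by putting the keys on the x axis and the accuracies on the y axis.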
- Every data science model (algorithm) has parameters you can tune to improve the accuracy of the model.
- For kNN, what can/did we tune?
- Download your notebook
- Open it up in a text editor
- Copy all the text
- Paste it into a gist
- Create a secret gist
- Copy the browser url
- Go here and paste that url
- Voila!
- Build a Django app
- Run through more pandas tutorials
- Run through some scikit-learn tutorials and examples
- Take the GA Data Science part time course
- To up your pure Python fluency, do tons of Project Euler problems