Course materials for General Assembly's Data Science course in San Francisco, CA (10/4/16 - 12/13/16).
Instructors: Sinan Ozdemir
Teaching Assistants: George McIntire and Cari Levay
Course Times
Tuesday/Thursday: 6:30pm - 9:30pm
Office hours:
Tue/Thurs: 5:30pm - 6:30pm (right before class)
Wed: 6pm - 8pm
Sat: 10am - 12pm
All courses / office hours will be held at GA, 225 Bush Street
| Tuesday | Thursday | Project Milestone | HW |
|---|---|---|---|
| 10/4: Introduction / Expectations / Intro to Data Science | 10/6: Pandas | ||
| 10/11: APIs / Web Scraping 101 | 10/13: Intro to Machine Learning / KNN | HW 1 Assigned (Th) | |
| 10/18: Model Evaluation / Linear Regression Part 1 | 10/20: Linear Regression Part 2 / Logistic Regression | Three Potential Project Ideas (Th) | |
| 10/25: Natural Language Processing | 10/27: NLP continued | HW 1 Due (Th) | |
| 11/1: Naive Bayes Classification | 11/3: Advanced Sklearn (Pipeline and Feature Unions) / Review | ||
| 11/8: Decision Trees | 11/10: Ensembling Techniques | HW 2 Assigned (Th) | |
| 11/15: Dimension Reduction | 11/17: Clustering / Topic Modelling | First Draft Due (Th) | |
| 11/22: Stochastic Gradient Descent | 11/24: No Class (Thanksgiving) | Peer Review Due (T) | |
| 11/29: Neural Networks / Deep Learning | 12/1: Recommendation Engines | HW 2 Due (Th) | |
| 12/6: Web Development with Flask | 12/8: Projects | ||
| 12/13: Projects | | | |
- Install the Anaconda distribution of Python 2.7.x.
- Setup a conda virtual environment
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "SFDAT28 team" and add your photo!
- PEP 8 - Style Guide for Python
- Learn How to Think Like a Computer Scientist
- Potential book for course? :)
## Introduction / Expectations / Intro to Data Science
Agenda
- Introduction to General Assembly slides
- Course overview: our philosophy and expectations (slides)
- Ice Breaker
Break -- Command Line Tutorial
- Figure out office hours
- Intro to Data Science: slides
Homework
- Setup a conda virtual environment
- Install Git and create a GitHub account.
- Read my intro to Git and be sure to come back on Thursday with your very own repository called "sfdat28-lastname"
- Once you receive an email invitation from Slack, join our "SFDAT28 team" and add your photo!
- An introductory tutorial on how to read and write IPython notebooks
#### Goals
- Feel comfortable importing, manipulating, and graphing data using Python's Pandas
- Be able to find missing values and begin to have a sense of how to deal with them
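The second goal can be sketched with a tiny, made-up DataFrame (the column names and values are illustrative, not from the class data):

```python
import pandas as pd
import numpy as np

# Hypothetical toy DataFrame with a few missing entries
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "city": ["SF", "LA", None, "SF"],
})

# Count missing values per column
missing_counts = df.isnull().sum()

# One common strategy: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Another: drop rows that still contain any missing value
clean = df.dropna()
```

Which strategy is right depends on the data; we will discuss the trade-offs in class.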
#### Agenda
- Don't forget to `git pull` in the sfdat28 repo in your command line
- Intro to Pandas walkthrough here
- Pandas Lab 2 Solutions here
#### Homework
- Go through the python class/lab work and finish any exercise you weren't able to in class
- Make sure you have all of the repos cloned and ready to go
- You should have both "sfdat28" and "sfdat28-lastname"
- Read Greg Reda's Intro to Pandas
- Take a look at Kaggle's Titanic competition
- I will be using a module called `tweepy` next time.
  - To install, please type into your console: `pip install tweepy`
- Another Git tutorial here
- In depth Git/Github tutorial series made by a GA_DC Data Science Instructor here
- Another Intro to Pandas (Written by Wes McKinney and is adapted from his book)
- Here is a video of Wes McKinney going through his ipython notebook!
- Examples of joins in Pandas
- For more on Pandas plotting, read the visualization page from the official Pandas documentation.
- Maria finds out that Sancho has been cheating on her with her... mother!
- We will use Python to programmatically obtain data via open sources on the internet
  - We will be scraping the National UFO Reporting Center
  - We will be collecting tweets regarding Donald Trump and Hillary Clinton
  - We will be examining what people are really looking for in a data scientist
- We will continue to use pandas to investigate missing values in data and develop a sense of how to deal with them
#### Agenda
- To install tweepy, please type into your console: `pip install tweepy`
- Slides on Getting Data here
- Intro to Regular Expressions here
- Getting Data from the open web here
- Getting Data from an API here
- LAB on getting data here
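As a taste of the regular expressions material, here is a minimal sketch using Python's built-in `re` module; the sample tweet and patterns are made up for illustration:

```python
import re

# A made-up tweet mixing an email address, a handle, and a hashtag
tweet = "Contact us at data@example.com or follow @sfdat28 #datascience"

# Find @-handles: an @ not preceded by a word character,
# so the @ inside the email address is excluded
handles = re.findall(r"(?<!\w)@(\w+)", tweet)

# Find hashtags: a # followed by word characters
hashtags = re.findall(r"#(\w+)", tweet)

print(handles)   # ['sfdat28']
print(hashtags)  # ['datascience']
```

Patterns like these are the building blocks for cleaning the scraped and API data we collect in this class.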
#### Homework
- The first homework will be assigned by Friday morning (in a homework folder) and it is due two Thursdays from now
  - It is a combo of pandas questions with a bit of API/scraping
- Please push your completed work to your sfdat28_work repo for grading
- Your first project milestone is due next Thursday. It is the first three ideas you have for your project. Think about potential interesting sources of data you would like to work with. This can come from work, hobby, or elsewhere!
#### Resources:
- Mashape allows you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- The Data Science Toolkit is a collection of location-based and text-related APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application. Web Scraping Resources:
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- import.io and Kimono claim to allow you to scrape websites without writing any code. It's alrighhhtttttt
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
#### Agenda
- Iris pre-work code
- Intro to Machine Learning and KNN slides
  - Supervised vs. Unsupervised Learning
  - Regression vs. Classification
- Lab to use KNN models to investigate accelerometer data
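To get a feel for the KNN API before the lab, here is a minimal scikit-learn sketch on a toy two-class dataset (not the lab's accelerometer data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy dataset: two features per sample, two well-separated classes
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a KNN classifier with K=3 and predict a new point
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict([[0.5, 0.5]])  # its 3 nearest neighbors are all class 0
print(pred)  # [0]
```

Try varying `n_neighbors` and watch how the decision boundary changes; that intuition feeds directly into the bias-variance reading below.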
#### Homework
- The one page project milestone as well as the pandas homework! See requirements here
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it in class on Tuesday. (You can ignore sections 4.2 and 4.3.) Here are some questions to think about while you read:
- In the Party Registration example, what are the features? What is the response? Is this a regression or classification problem?
- In the interactive visualization, try using different values for K across different sets of training data. What value of K do you think is "best"? How do you define "best"?
- In the visualization, what do the lighter colors versus the darker colors mean? How is the darkness calculated?
- How does the choice of K affect model bias? How about variance?
- As you experiment with K and generate new training data, how can you "see" high versus low variance? How can you "see" high versus low bias?
- Why should we care about variance at all? Shouldn't we just minimize bias and ignore variance?
- Does a high value for K cause over-fitting or under-fitting?
- For our talk on linear regression, read:
Resources:
- For a more in-depth look at machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- Stackoverflow article on the difference between generative and discriminative models here
Agenda
- Model evaluation procedures (slides, code)
- Linear regression (notebook)
- To run this, I use a module called "seaborn"
  - To install, go anywhere in your terminal (git bash) and type in `sudo pip install seaborn`
- In-depth slides here
- LAB -- Yelp dataset here with the Yelp reviews data. It is not required but your next homework will involve this dataset so it would be helpful to take a look now!
- Discuss the article on the bias-variance tradeoff
- Look at some code on the bias-variance tradeoff
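To tie model evaluation and linear regression together, here is a hedged sketch on synthetic data (the true slope of 3 and the noise level are invented for illustration, not taken from the class datasets):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: y is roughly 3x plus Gaussian noise
rng = np.random.RandomState(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)

# Hold out a test set so evaluation is on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))

print(model.coef_[0])  # should recover a slope near 3
print(mse)             # roughly the noise variance
```

The key habit here is the train/test split: scoring on the training data alone rewards overfitting, which is exactly the bias-variance discussion on the agenda.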
Homework:
- Please upload your three potential ideas for your final project to your personal sfdat28_work repo
- Watch these videos on probability and odds (8 minutes) if you're not familiar with either of those terms.
- Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln).
- Homework 1 is due in 10 days!
Resources:
- Correlation does not imply Causation
- P-values can't always be trusted
- Setosa has an excellent interactive visualization of linear regression.
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
- A major scientific journal recently banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
- An article on "P Hacking" the idea that you can alter data in order to achieve good p values
- Here's a great 30-second explanation of overfitting.
- For more on today's topics, these videos from Hastie and Tibshirani are useful: overfitting and train/test split (14 minutes), cross-validation (14 minutes). (Note that they use the terminology "validation set" instead of "test set".)
- Alternatively, read section 5.1 (12 pages) of An Introduction to Statistical Learning, which covers the same content as the videos.
- This video from Caltech's machine learning course presents an excellent, simple example of the bias-variance tradeoff (15 minutes) that may help you to visualize bias and variance.
#### Agenda
- Discuss your three potential ideas with the people at your table.
  - Try to figure out which kinds of machine learning would be appropriate
    - supervised
    - unsupervised
- Linear regression (Continued) notebook
- Logistic regression notebook and slides
- Confusion matrix slides
- LAB -- Exercise with Titanic data instructions
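A minimal sketch of logistic regression plus a confusion matrix, using a toy one-feature dataset rather than the Titanic data from the lab:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Toy binary problem: one feature, low values are class 0, high are class 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
preds = clf.predict(X)

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y, preds)
print(cm)
```

Reading the four cells of `cm` gives you every metric in the confusion matrix slides: accuracy, sensitivity, specificity, and precision are all ratios of these counts.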
Homework:
- Homework due in 7 days!
- If you aren't yet comfortable with all of the confusion matrix terminology, watch Rahul Patwari's videos on Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes).
Resources:
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
- For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
- Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
- This simple guide to confusion matrix terminology may be useful to you as a reference.
## Class 7: Natural Language Processing
pre-work
- Download all of the NLTK collections.
- In Python, use the following commands to bring up the download menu:
  - `import nltk` then `nltk.download()`
  - Choose "all".
  - Alternatively, just type `nltk.download('all')`
- Install two new packages: `yahoo_finance` and `textblob`.
  - Open a terminal or command prompt.
  - Type `pip install yahoo_finance` and `pip install textblob`.
Agenda
- Quick recap of what we've done so far
- Logistic regression (con't) notebook and slides
  - Confusion matrix slides
- Natural Language Processing is the science of turning words and sentences into data and numbers. Today we will be exploring techniques in this field
- Code showing topics in NLP
- Lab analyzing tweets about the stock market
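The core NLP idea above, turning words into numbers, can be sketched with scikit-learn's bag-of-words vectorizer on two made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two invented documents, loosely in the spirit of the stock-market lab
docs = [
    "the stock market is up",
    "the market is down",
]

# Bag of words: each row is a document, each column counts one word
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(sorted(vect.vocabulary_))  # the learned vocabulary
print(X.toarray())               # the document-term count matrix
```

Once text is a numeric matrix like `X`, every model we've seen so far (KNN, logistic regression, and soon Naive Bayes) can be trained on it.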
Homework:
- Read Paul Graham's A Plan for Spam and be prepared to discuss it in class on Thursday! Here are some questions to think about while you read:
- Should a spam filter optimize for sensitivity or specificity, in Paul's opinion?
- Before he tried the "statistical approach" to spam filtering, what was his approach?
- How exactly does his statistical filtering system work?
- What did Paul say were some of the benefits of the statistical approach?
- How good was his prediction of the "spam of the future"?
- Below are the foundational topics upon which Tuesday's class will depend. Please review these materials before class:
- Confusion matrix: a good guide
- Sensitivity and specificity: Rahul Patwari has an excellent video (9 minutes).
- Basics of probability: These introductory slides (from the OpenIntro Statistics textbook) are quite good and include integrated quizzes. Pay specific attention to these terms: probability, sample space, mutually exclusive, independent.
## Class 9: Naive Bayes Classifier
Today we are going over advanced metrics for classification models and learning a brand new classification model called Naive Bayes!
Agenda
- Are you smart enough to work at Facebook?
- Learn about Naive Bayes and ROC/AUC curves
- Work on Homework / previous labs
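In the spirit of the Paul Graham reading, here is a minimal Naive Bayes spam-filter sketch on a tiny invented corpus (the texts and labels are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus: 1 = spam, 0 = ham
texts = [
    "win money now",
    "cheap money win",
    "meeting at noon",
    "lunch meeting today",
]
labels = [1, 1, 0, 0]

# Turn text into word counts, then fit Naive Bayes on the counts
vect = CountVectorizer()
X = vect.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

# Classify a new message built from spammy words
pred = nb.predict(vect.transform(["win cheap money"]))
print(pred)  # [1]
```

This is essentially Graham's approach at miniature scale: per-word probabilities under each class, combined with Bayes' theorem.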
Resources
- Bayes Theorem as applied to Monty Hall here and here
- Video on ROC Curves (12 minutes).
- My good buddy's blog post about the ROC video includes the complete transcript and screenshots, in case you learn better by reading instead of watching.
- Accuracy vs AUC discussions here and here
## Class 10: Advanced Sklearn Modules / Review
Agenda
Today we are going to talk about four major things related to advanced sklearn features and modules:
- We will use sklearn's Pipeline feature to chain together multiple sklearn modules
- We will look at the Feature Selection module to automatically find the most effective features in our dataset
- We can use Feature Unions to combine several feature extraction techniques
- More on StandardScaler as well
- Find the notebook here!
- Review on the board
- Review part deux (notebook)
- Decision trees (notebook)
- Bonus content deals with the algorithm behind building trees
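The four Pipeline/FeatureUnion ideas above can be sketched in a few lines; this uses scikit-learn's built-in iris data as a stand-in for the class dataset, and the step names are arbitrary labels:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# FeatureUnion runs two feature-extraction steps side by side
# and concatenates their outputs into one feature matrix
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=2)),
])

# Pipeline chains scaling, the feature union, and a final classifier;
# calling fit() runs every step in order
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=200)),
])

pipe.fit(X, y)
score = pipe.score(X, y)
print(score)
```

The payoff is that the whole chain behaves like a single estimator: you can cross-validate it or grid-search over any step's parameters without leaking preprocessing into your test folds.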