@Author: DJ Rajdev
@LastUpdate: Sept 3, 2019
@Purpose: slides for WWC presentation and supporting code stubs
- Check the sources in the PDF slides; many of them have detailed tutorials
- Refer to Speaker notes section below for some more talking points around the slides
- Most code is in snippet form or uses academic sample data. I encourage you to expand it with real-world data and send me a PR if you do
I can be reached by
- Twitter: @DivjyotiRajdev
- Email: dj@djrajdev.com
- Biweekly Data Science Meetup link
- Monthly Data Viz Meetup link
Link: some of the above topics may be covered in a different but related presentation I gave as a Data Scientist at TIBCO Software.
- black swans: events that don't have enough data to be predictable, e.g. natural calamities. The recent bitcoin surge was a self-fulfilling prophecy
- right problem: if a customer is unhappy about service disruptions due to failures, you have 2 solutions: 1) fix the product to have fewer failures, 2) create backups to reduce disruption. #2 is better but doesn't need data science
- ideal solution: how will it be used? real time vs offline, dashboard vs spreadsheet vs script, core or edge. This affects model selection
- prediction or insight: what are you trying to find? e.g. whether the stock market goes up or down? by how much? to what accuracy?
- representative data: if the target is time-dependent, do you have time in the dataset? is the data correct from a domain perspective (e.g. pH > 14 is impossible)? check to ensure there is no bias in collection
- goal post for success: if your goal is to increase conversion rate, for how long? 2 weeks, 30 days, 6 months?
- data summary: look through measures like % missing, invariant columns, extreme variation, entropy, (visual) distribution, mean/median; see the sketch below
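A minimal profiling sketch, assuming pandas; the toy DataFrame and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric, one categorical, one invariant column
df = pd.DataFrame({
    "age":   [25, 32, np.nan, 41, 38],
    "plan":  ["basic", "basic", "pro", np.nan, "pro"],
    "const": [1, 1, 1, 1, 1],
})

summary = pd.DataFrame({
    "pct_missing": df.isna().mean() * 100,  # % missing per column
    "n_unique":    df.nunique(),            # 1 unique value => invariant
})
print(summary)
print(df.describe(include="all"))           # mean, median (50%), spread
```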
- data relationships: remove redundant columns, use this for feature selection.
- featurization: transformations, scaling, one-hot encoding, frequency encoding, aggregation (e.g. Amazon collects data as 1 row per transaction but the models usually work on 1 row per customer); see the sketch below
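A sketch of the encoding and aggregation steps in pandas; the transaction table and its column names are invented:

```python
import pandas as pd

# Hypothetical transaction-level data (1 row per transaction)
tx = pd.DataFrame({
    "customer": ["a", "a", "b", "c", "c"],
    "category": ["shoes", "books", "shoes", "toys", "toys"],
    "amount":   [40.0, 12.5, 55.0, 8.0, 9.5],
})

# One-hot encoding of a categorical column
onehot = pd.get_dummies(tx["category"], prefix="cat")

# Frequency encoding: replace each category by how often it occurs
tx["category_freq"] = tx["category"].map(tx["category"].value_counts(normalize=True))

# Aggregation: roll up to 1 row per customer for modeling
per_customer = tx.groupby("customer").agg(
    n_tx=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
).reset_index()
print(per_customer)
```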
- outliers: filter them? be careful: in fraud detection the outliers are the signal
- imputing missing values: choose the unbiased estimator (often the median, but also previous value, next value, or kNN-based imputation); see the sketch below
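A small sketch of the imputation options mentioned above, assuming pandas and scikit-learn; the series and matrix are toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.fillna(s.median()))  # median: often the less biased choice for skewed data
print(s.ffill())             # previous value (common for time series)
print(s.bfill())             # next value

X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
print(KNNImputer(n_neighbors=2).fit_transform(X))  # kNN-based imputation
```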
- decompose time series: pollution example. How would you forecast the given series? or find its trend? or correlate it with another series? see the sketch below
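A decomposition sketch with statsmodels; the synthetic daily series below stands in for the pollution data from the slides:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2019-01-01", periods=365, freq="D")
y = pd.Series(
    10 + 0.01 * np.arange(365)                       # slow trend
    + 2 * np.sin(2 * np.pi * np.arange(365) / 7)     # weekly seasonality
    + np.random.default_rng(0).normal(0, 0.5, 365),  # noise
    index=idx,
)

parts = seasonal_decompose(y, model="additive", period=7)
print(parts.trend.dropna().head())  # forecast the trend component
print(parts.seasonal.head())        # inspect the repeating pattern
```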
- baseline model: accuracy vs interpretability. Interpretability is in focus post-GDPR. What is the cost of an incorrect / large-error prediction?
- error metric: the default is MSE. If your estimator is better suited to the median than the mean, MAE is better. Similarly, for categorical targets consider precision, recall, etc.; see the sketch below
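A quick comparison sketch using scikit-learn metrics; the numbers are made up to show how one extreme error dominates MSE but not MAE:

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             precision_score, recall_score)

y_true = [10, 12, 11, 100]  # one extreme target value
y_pred = [11, 11, 12, 20]
print(mean_squared_error(y_true, y_pred))   # blown up by the single large error
print(mean_absolute_error(y_true, y_pred))  # closer to the typical error

# Categorical case: precision and recall tell different stories
yt = [0, 0, 1, 1, 1]
yp = [0, 1, 1, 1, 0]
print(precision_score(yt, yp), recall_score(yt, yp))
```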
- sampling: prototype on 5% for big data, oversample for an unbalanced class, 0/1 sampling (e.g. only customers who bought pants); see the sketch below
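A sampling sketch with pandas; the `bought_pants` flag is an invented stand-in for a rare positive class:

```python
import pandas as pd

df = pd.DataFrame({
    "x": range(1000),
    "bought_pants": [1 if i % 50 == 0 else 0 for i in range(1000)],  # ~2% positive
})

proto = df.sample(frac=0.05, random_state=0)  # 5% prototype sample for big data

# Naive oversampling of the minority class
minority = df[df["bought_pants"] == 1]
majority = df[df["bought_pants"] == 0]
balanced = pd.concat([
    majority,
    minority.sample(len(majority), replace=True, random_state=0),
])
print(balanced["bought_pants"].mean())  # now ~0.5
```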
- algo family: many techniques accomplish the same result, e.g. customer segmentation via clustering or via multiclass classification. Neural nets have holes in their parameter space; RF does not
- model breaking point: linear models assume linear dependence. RF doesn't work well with sparse data. SVMs need centering + scaling; see the sketch below
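A sketch of why centering + scaling matters for SVMs, using scikit-learn's bundled breast-cancer data as a stand-in:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

raw = SVC().fit(Xtr, ytr)                                      # unscaled features
scaled = make_pipeline(StandardScaler(), SVC()).fit(Xtr, ytr)  # centered + scaled
print(raw.score(Xte, yte), scaled.score(Xte, yte))             # scaling usually wins
```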
- domain methods: often specialized for a particular type of problem that isn't trivially solvable by the usual ensemble / NN methods, e.g. A/B tests for experimentation, graph methods, linear programming for multi-constraint optimization, autoencoders for unsupervised anomaly detection
- stats intuition: CNNs are used for image classification, but for detecting damage you could either train a classifier or, more simply, apply a Sobel filter (an intermediate step a CNN effectively learns). Browse the 'intuition' tag on stats.stackexchange; see the sketch below
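A Sobel-filter sketch with SciPy on a toy image; it shows edges can be found with no training at all:

```python
import numpy as np
from scipy import ndimage

img = np.zeros((8, 8))
img[:, 4:] = 1.0                 # toy image: a vertical edge in the middle

gx = ndimage.sobel(img, axis=1)  # horizontal gradient
gy = ndimage.sobel(img, axis=0)  # vertical gradient
edges = np.hypot(gx, gy)         # gradient magnitude
print(edges.round(1))            # strong response only along the edge
```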
- var imp: relative influence of each variable; look for bias
- var imp with noise: add a pure-noise feature and see where the significance drops off; see the sketch below
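A noise-benchmark sketch: append a random column, refit, and distrust any real feature that ranks at or below it. Assumes scikit-learn's random forest and its bundled sample data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X_noise = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # add 1 noise column

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_noise, y)
noise_imp = rf.feature_importances_[-1]
weak = (rf.feature_importances_[:-1] <= noise_imp).sum()
print(f"noise importance={noise_imp:.4f}; {weak} real features rank at or below it")
```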
- cheating algo: is the date the most important feature? does one feature explain 90% of the variation? suspect leakage
- time split: random sampling looks into the future; split by time instead (sketch below)
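A split sketch contrasting a time-ordered cut with scikit-learn's TimeSeriesSplit; the tiny frame is illustrative only:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({
    "ts": pd.date_range("2019-01-01", periods=10, freq="D"),
    "y": range(10),
}).sort_values("ts")

cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]  # train never sees the future

for tr_idx, te_idx in TimeSeriesSplit(n_splits=3).split(df):
    print(tr_idx.max(), "<", te_idx.min())  # each fold trains strictly before it tests
```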
- local interpretation: fit weak learners to understand model behavior at the row level, e.g. H2O's k-LIME
- sensitivity analysis: for each important variable, how much perturbation is too much? see the sketch below
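A perturbation sketch: nudge the most important feature by a few percent and watch how much the predicted probabilities drift. The model and data choices here are placeholders:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

base = rf.predict_proba(X)[:, 1]
col = int(np.argmax(rf.feature_importances_))  # most important feature
for pct in (0.01, 0.05, 0.10, 0.25):
    Xp = X.copy()
    Xp[:, col] *= 1 + pct                      # perturb that feature by +pct
    drift = np.abs(rf.predict_proba(Xp)[:, 1] - base).mean()
    print(f"+{pct:.0%} perturbation -> mean prob drift = {drift:.4f}")
```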
- accessible: how will you distribute your model or results? scheduled job writing to a table, API for real-time queries, streaming analytics
- testing: as in software engineering, cover functional and security testing, especially for APIs
- rule based: most fraud models begin with rules that were derived using complex ML models, because rules are easy to deploy and fast. The same holds for control charts
- right audience: technical deep dive & filters vs OKRs for executives
- model retrain: based on residual drift for machine data, seasonally for human data
- side effects: boosting sales of Chanel Noir might draw customers away from Chanel No. 5; think cannibalisation and survival analysis