Welcome to the Data Science Study Guide! This comprehensive topic covers the essential tools and statistical methods for modern data analysis. You'll learn how to manipulate, visualize, and draw meaningful conclusions from data using industry-standard Python libraries.
Data Science combines:
- Data manipulation tools (NumPy, Pandas) for handling structured data
- Visualization techniques (Matplotlib, Seaborn) for exploring patterns
- Statistical inference (scipy, statsmodels) for drawing valid conclusions
- Practical applications through hands-on projects
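As a small taste of how these pieces fit together, here is a minimal sketch using NumPy and Pandas on synthetic data (the groups and values below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic example data (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(10, 2, 50), rng.normal(12, 2, 50)]),
})

# Pandas handles the structured-data side: summarize each group
summary = df.groupby("group")["value"].agg(["mean", "std", "count"])
print(summary)
```

Visualization and inference layer on top of exactly this kind of grouped summary, which is why the lessons start with data manipulation.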
This topic is designed to take you from basic data manipulation through exploratory data analysis (EDA) to rigorous statistical inference and advanced modeling techniques.
The 29 lessons follow a structured progression:

1. Master the fundamental libraries for data manipulation and preprocessing.
2. Learn to visualize, summarize, and explore data patterns.
3. Critical transition: understand when and why to move beyond descriptive statistics to formal statistical testing.
4. Build probability foundations and learn hypothesis testing frameworks.
5. Master specialized techniques: ANOVA, regression, Bayesian methods, time series, and experimental design.
6. Apply everything in comprehensive real-world projects.
7. Explore advanced topics: Bayesian computation, causal inference, survival analysis, and modern data tools.
| Lesson | Title | Difficulty | Topics |
|---|---|---|---|
| 01 | NumPy Basics | ⭐ | Arrays, indexing, broadcasting, basic operations |
| 02 | NumPy Advanced | ⭐⭐ | Vectorization, linear algebra, random sampling |
| 03 | Pandas Basics | ⭐ | Series, DataFrames, reading/writing data |
| 04 | Pandas Data Manipulation | ⭐⭐ | Filtering, groupby, merging, reshaping |
| 05 | Pandas Advanced | ⭐⭐⭐ | MultiIndex, time series, categorical data |
| 06 | Data Preprocessing | ⭐⭐ | Missing data, outliers, scaling, encoding |
| 07 | Descriptive Statistics & EDA | ⭐⭐ | Summary statistics, distributions, correlation |
| 08 | Data Visualization Basics | ⭐⭐ | Matplotlib fundamentals, plot types |
| 09 | Data Visualization Advanced | ⭐⭐⭐ | Seaborn, complex plots, interactive viz |
| 10 | From EDA to Inference | ⭐⭐ | Bridge lesson: population vs sample, statistical thinking, choosing tests |
| 11 | Probability Review | ⭐⭐ | Random variables, distributions, expectation |
| 12 | Sampling and Estimation | ⭐⭐ | Sampling methods, point estimation, bias/variance |
| 13 | Confidence Intervals | ⭐⭐⭐ | CI construction, interpretation, margin of error |
| 14 | Hypothesis Testing Advanced | ⭐⭐⭐ | p-values, Type I/II errors, power analysis |
| 15 | ANOVA | ⭐⭐⭐ | One-way, two-way, post-hoc tests |
| 16 | Regression Analysis Advanced | ⭐⭐⭐ | Multiple regression, diagnostics, regularization |
| 17 | Generalized Linear Models | ⭐⭐⭐⭐ | Logistic regression, Poisson regression, GLM theory |
| 18 | Bayesian Statistics Basics | ⭐⭐⭐ | Bayes theorem, prior/posterior, conjugacy |
| 19 | Bayesian Inference | ⭐⭐⭐⭐ | MCMC, PyMC, credible intervals |
| 20 | Time Series Basics | ⭐⭐⭐ | Trends, seasonality, decomposition |
| 21 | Time Series Models | ⭐⭐⭐⭐ | ARIMA, SARIMA, forecasting, diagnostics |
| 22 | Multivariate Analysis | ⭐⭐⭐ | PCA, factor analysis, clustering |
| 23 | Nonparametric Statistics | ⭐⭐⭐ | Rank tests, bootstrap, permutation tests |
| 24 | Experimental Design | ⭐⭐⭐ | A/B testing, randomization, DOE principles |
| 25 | Practical Projects | ⭐⭐⭐⭐ | End-to-end data science projects |
| 26 | Bayesian Advanced | ⭐⭐⭐⭐ | HMC/NUTS, variational inference, hierarchical models, model comparison |
| 27 | Causal Inference | ⭐⭐⭐⭐ | DAGs, propensity score, DID, RDD, instrumental variables |
| 28 | Survival Analysis | ⭐⭐⭐⭐ | Kaplan-Meier, Cox PH, parametric models, competing risks |
| 29 | Modern Data Tools | ⭐⭐⭐ | Polars, DuckDB, Arrow interoperability, performance benchmarks |
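To give a flavor of where the early lessons begin, here is a minimal sketch of NumPy broadcasting (Lesson 01 territory; the prices and discounts are invented for illustration):

```python
import numpy as np

# Broadcasting: arithmetic on arrays of different shapes
prices = np.array([10.0, 20.0, 30.0])    # shape (3,)
discounts = np.array([[0.1], [0.2]])     # shape (2, 1)

# NumPy virtually stretches both arrays to shape (2, 3),
# computing every price at every discount without explicit loops
discounted = prices * (1 - discounts)
print(discounted)
```

By Lesson 24 the same mindset scales up to designing and analyzing full experiments.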
- Python Basics: Variables, functions, loops, conditionals
- Basic Math: Algebra, basic calculus (helpful but not required)
- Curiosity: Willingness to ask "why?" and "how can I test this?"
- Familiarity with Jupyter notebooks
- Basic understanding of scientific notation
- Experience with any programming language
Install the required libraries using pip:
# Core data science stack
pip install numpy pandas matplotlib seaborn
# Statistical libraries
pip install scipy statsmodels
# Optional: Bayesian inference
pip install pymc arviz
# Optional: Machine learning integration
pip install scikit-learn
# Optional: Interactive visualization
pip install plotly

Run this Python snippet to verify all libraries are installed:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
import statsmodels
import statsmodels.api as sm

print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("SciPy version:", scipy.__version__)
print("Statsmodels version:", statsmodels.__version__)

- Jupyter Notebook or JupyterLab: Best for exploratory analysis
- VS Code with Python extension: Good for script development
- Google Colab: Free cloud environment (no installation needed)
This topic connects closely with other areas in the study guide:
- Python: Learn Python fundamentals first
- Programming: Core programming concepts
- Machine Learning: Predictive modeling with scikit-learn
- Deep Learning: Neural networks with PyTorch
- Data Engineering: Large-scale data pipelines
- MLOps: Deploying models to production
- Start with L01-L06 to build data manipulation skills
- Practice with the provided exercises and datasets
- Move to L07-L09 for visualization
- Don't skip L10! It's the critical bridge to inference
- Progress through inference topics (L11-L24) at your own pace
- Skim L01-L09 if you know NumPy/Pandas
- Study L10 carefully to solidify your statistical thinking
- Focus on inference topics (L11-L24) based on your interests
- Complete L25 projects to integrate knowledge
- Use as a reference for specific techniques
- Review L10 for decision frameworks on choosing tests
- Dive into advanced topics (Bayesian, GLM, time series)
- Adapt L25 projects to your domain
- Code along: Don't just read—run every code example
- Modify examples: Change parameters, try different datasets
- Ask "what if?": Test edge cases and assumptions
Use these built-in datasets for practice:
import seaborn as sns
# Load sample datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
diamonds = sns.load_dataset('diamonds')

- Always visualize before running statistical tests
- Check assumptions (normality, independence, etc.)
- Report effect sizes, not just p-values
- Document your reasoning in comments/markdown
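The workflow above can be sketched in a few lines. This is a hedged example on synthetic data (the two samples are invented for illustration): check an assumption, run the test, and report an effect size alongside the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two synthetic samples (illustrative only)
control = rng.normal(100, 15, 80)
treatment = rng.normal(107, 15, 80)

# 1. Check the normality assumption before reaching for a t-test
_, p_norm = stats.shapiro(control)

# 2. Run the test
t_stat, p_val = stats.ttest_ind(control, treatment)

# 3. Report an effect size (Cohen's d), not just the p-value
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print(f"normality p = {p_norm:.3f}, test p = {p_val:.4f}, d = {cohens_d:.2f}")
```

Lessons 14 and 23 cover what to do when assumption checks fail (power analysis, rank tests, bootstrap).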
Each lesson includes:
- Exercises: Practice problems with solutions
- Conceptual questions: Test your understanding
- Code challenges: Apply techniques to new scenarios
The final lesson includes complete projects:
- Retail Sales Analysis: Time series forecasting
- A/B Test Evaluation: Hypothesis testing workflow
- Survey Data Analysis: Multivariate techniques
- Predictive Modeling: Regression and classification
- "Python for Data Analysis" by Wes McKinney (Pandas creator)
- "The Art of Statistics" by David Spiegelhalter
- "Statistical Rethinking" by Richard McElreath
- Kaggle Learn: Free interactive tutorials
- StatQuest: Video explanations of statistics
- Seeing Theory: Visual probability/statistics
- Check official documentation first
- Use the `help()` function or `?` in Jupyter
- Search Stack Overflow for pandas/numpy questions
- Ask on Cross Validated for statistics questions
- ImportError: Reinstall the library with `pip install --upgrade <library>`
- DeprecationWarning: Check library versions for compatibility
- MemoryError: Use smaller samples or chunking for large datasets
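For the chunking approach, a minimal sketch with `pandas.read_csv` and its `chunksize` parameter (the CSV here is simulated with an in-memory buffer so the example is self-contained):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (simulated in memory)
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

# Process the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total += chunk["value"].sum()

print(total)  # same result as loading the whole file
```

Each iteration holds only 250 rows in memory, so the peak footprint stays bounded regardless of file size.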
We aim to:
- Build intuition first: Visual and conceptual understanding before formulas
- Connect theory to practice: Every concept with code examples
- Emphasize critical thinking: Know when to use techniques, not just how
Lesson 10 is the heart of this guide. Most courses treat EDA and inference as separate topics. We emphasize the transition:
- EDA generates questions → Inference answers them rigorously
- Visualization suggests patterns → Tests confirm them with controlled error
- Descriptive stats describe your sample → Inference generalizes to populations
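The EDA-to-inference handoff can be made concrete with a permutation test (Lesson 23), sketched here on synthetic data (both samples are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# EDA step: a sample where group B looks larger than group A
a = rng.normal(5.0, 1.0, 40)
b = rng.normal(5.6, 1.0, 40)
observed = b.mean() - a.mean()  # the pattern the plots suggested

# Inference step: how often does pure chance produce a gap this large?
pooled = np.concatenate([a, b])
perm_diffs = np.empty(5000)
for i in range(5000):
    rng.shuffle(pooled)  # break any real group structure
    perm_diffs[i] = pooled[40:].mean() - pooled[:40].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"observed diff = {observed:.2f}, permutation p = {p_value:.4f}")
```

The observed difference came from EDA; the permutation distribution turns it into a statement about populations with a controlled error rate.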
- Start here: 01_NumPy_Basics
- Critical bridge: 10_From_EDA_to_Inference
- Final projects: 25_Practical_Projects
Ready to begin your data science journey? Let's start with NumPy fundamentals!