Welcome to the Data Science Study Guide! This comprehensive topic covers the essential tools and statistical methods for modern data analysis. You'll learn how to manipulate, visualize, and draw meaningful conclusions from data using industry-standard Python libraries.
Data Science combines:
- Data manipulation tools (NumPy, Pandas) for handling structured data
- Visualization techniques (Matplotlib, Seaborn) for exploring patterns
- Statistical inference (scipy, statsmodels) for drawing valid conclusions
- Practical applications through hands-on projects
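As a small taste of how these pieces fit together, here is a minimal sketch using NumPy and Pandas on synthetic data (the groups and values below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic example data (made up for illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 50),
    "value": np.concatenate([rng.normal(10, 2, 50), rng.normal(12, 2, 50)]),
})

# Pandas handles the structured-data side: summarize each group
summary = df.groupby("group")["value"].agg(["mean", "std", "count"])
print(summary)
```

Visualization and inference layer on top of exactly this kind of grouped summary, which is why the lessons start with data manipulation.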
This topic is designed to take you from basic data manipulation through exploratory data analysis (EDA) to rigorous statistical inference and advanced modeling techniques.
The 29 lessons follow a structured progression:

1. Master the fundamental libraries for data manipulation and preprocessing.
2. Learn to visualize, summarize, and explore data patterns.
3. Critical transition: understand when and why to move beyond descriptive statistics to formal statistical testing.
4. Build probability foundations and learn hypothesis testing frameworks.
5. Master specialized techniques: ANOVA, regression, Bayesian methods, time series, and experimental design.
6. Apply everything in comprehensive real-world projects.
7. Explore advanced topics: Bayesian computation, causal inference, survival analysis, and modern data tools.
| Lesson | Title | Difficulty | Topics |
|---|---|---|---|
| 01 | NumPy Basics | ⭐ | Arrays, indexing, broadcasting, basic operations |
| 02 | NumPy Advanced | ⭐⭐ | Vectorization, linear algebra, random sampling |
| 03 | Pandas Basics | ⭐ | Series, DataFrames, reading/writing data |
| 04 | Pandas Data Manipulation | ⭐⭐ | Filtering, groupby, merging, reshaping |
| 05 | Pandas Advanced | ⭐⭐⭐ | MultiIndex, time series, categorical data |
| 06 | Data Preprocessing | ⭐⭐ | Missing data, outliers, scaling, encoding |
| 07 | Descriptive Statistics & EDA | ⭐⭐ | Summary statistics, distributions, correlation |
| 08 | Data Visualization Basics | ⭐⭐ | Matplotlib fundamentals, plot types |
| 09 | Data Visualization Advanced | ⭐⭐⭐ | Seaborn, complex plots, interactive viz |
| 10 | From EDA to Inference | ⭐⭐ | Bridge lesson: population vs sample, statistical thinking, choosing tests |
| 11 | Probability Review | ⭐⭐ | Random variables, distributions, expectation |
| 12 | Sampling and Estimation | ⭐⭐ | Sampling methods, point estimation, bias/variance |
| 13 | Confidence Intervals | ⭐⭐⭐ | CI construction, interpretation, margin of error |
| 14 | Hypothesis Testing Advanced | ⭐⭐⭐ | p-values, Type I/II errors, power analysis |
| 15 | ANOVA | ⭐⭐⭐ | One-way, two-way, post-hoc tests |
| 16 | Regression Analysis Advanced | ⭐⭐⭐ | Multiple regression, diagnostics, regularization |
| 17 | Generalized Linear Models | ⭐⭐⭐⭐ | Logistic regression, Poisson regression, GLM theory |
| 18 | Bayesian Statistics Basics | ⭐⭐⭐ | Bayes theorem, prior/posterior, conjugacy |
| 19 | Bayesian Inference | ⭐⭐⭐⭐ | MCMC, PyMC, credible intervals |
| 20 | Time Series Basics | ⭐⭐⭐ | Trends, seasonality, decomposition |
| 21 | Time Series Models | ⭐⭐⭐⭐ | ARIMA, SARIMA, forecasting, diagnostics |
| 22 | Multivariate Analysis | ⭐⭐⭐ | PCA, factor analysis, clustering |
| 23 | Nonparametric Statistics | ⭐⭐⭐ | Rank tests, bootstrap, permutation tests |
| 24 | Experimental Design | ⭐⭐⭐ | A/B testing, randomization, DOE principles |
| 25 | Practical Projects | ⭐⭐⭐⭐ | End-to-end data science projects |
| 26 | Bayesian Advanced | ⭐⭐⭐⭐ | HMC/NUTS, variational inference, hierarchical models, model comparison |
| 27 | Causal Inference | ⭐⭐⭐⭐ | DAGs, propensity score, DID, RDD, instrumental variables |
| 28 | Survival Analysis | ⭐⭐⭐⭐ | Kaplan-Meier, Cox PH, parametric models, competing risks |
| 29 | Modern Data Tools | ⭐⭐⭐ | Polars, DuckDB, Arrow interoperability, performance benchmarks |
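To give a flavor of where the early lessons begin, here is a minimal sketch of NumPy broadcasting (Lesson 01 territory; the prices and discounts are invented for illustration):

```python
import numpy as np

# Broadcasting: arithmetic on arrays of different shapes
prices = np.array([10.0, 20.0, 30.0])    # shape (3,)
discounts = np.array([[0.1], [0.2]])     # shape (2, 1)

# NumPy virtually stretches both arrays to shape (2, 3),
# computing every price at every discount without explicit loops
discounted = prices * (1 - discounts)
print(discounted)
```

By Lesson 24 the same mindset scales up to designing and analyzing full experiments.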
- Python Basics: Variables, functions, loops, conditionals
- Basic Math: Algebra, basic calculus (helpful but not required)
- Curiosity: Willingness to ask "why?" and "how can I test this?"
- Familiarity with Jupyter notebooks
- Basic understanding of scientific notation
- Experience with any programming language
Install the required libraries using pip:
# Core data science stack
pip install numpy pandas matplotlib seaborn
# Statistical libraries
pip install scipy statsmodels
# Optional: Bayesian inference
pip install pymc arviz
# Optional: Machine learning integration
pip install scikit-learn
# Optional: Interactive visualization
pip install plotly

Run this Python snippet to verify all libraries are installed:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
import statsmodels
import statsmodels.api as sm

print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
print("Matplotlib version:", matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("SciPy version:", scipy.__version__)
print("Statsmodels version:", statsmodels.__version__)

- Jupyter Notebook or JupyterLab: Best for exploratory analysis
- VS Code with Python extension: Good for script development
- Google Colab: Free cloud environment (no installation needed)
This topic connects closely with other areas in the study guide:
- Python: Learn Python fundamentals first
- Programming: Core programming concepts
- Machine Learning: Predictive modeling with scikit-learn
- Deep Learning: Neural networks with PyTorch
- Data Engineering: Large-scale data pipelines
- MLOps: Deploying models to production
- Start with L01-L06 to build data manipulation skills
- Practice with the provided exercises and datasets
- Move to L07-L09 for visualization
- Don't skip L10! It's the critical bridge to inference
- Progress through inference topics (L11-L24) at your own pace
- Skim L01-L09 if you know NumPy/Pandas
- Study L10 carefully to solidify your statistical thinking
- Focus on inference topics (L11-L24) based on your interests
- Complete L25 projects to integrate knowledge
- Use as a reference for specific techniques
- Review L10 for decision frameworks on choosing tests
- Dive into advanced topics (Bayesian, GLM, time series)
- Adapt L25 projects to your domain
- Code along: Don't just read—run every code example
- Modify examples: Change parameters, try different datasets
- Ask "what if?": Test edge cases and assumptions
Use these built-in datasets for practice:
import seaborn as sns
# Load sample datasets
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')
titanic = sns.load_dataset('titanic')
diamonds = sns.load_dataset('diamonds')

- Always visualize before running statistical tests
- Check assumptions (normality, independence, etc.)
- Report effect sizes, not just p-values
- Document your reasoning in comments/markdown
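The workflow above can be sketched in a few lines. This is a hedged example on synthetic data (the two samples are invented for illustration): check an assumption, run the test, and report an effect size alongside the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two synthetic samples (illustrative only)
control = rng.normal(100, 15, 80)
treatment = rng.normal(107, 15, 80)

# 1. Check the normality assumption before reaching for a t-test
_, p_norm = stats.shapiro(control)

# 2. Run the test
t_stat, p_val = stats.ttest_ind(control, treatment)

# 3. Report an effect size (Cohen's d), not just the p-value
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print(f"normality p = {p_norm:.3f}, test p = {p_val:.4f}, d = {cohens_d:.2f}")
```

Lessons 14 and 23 cover what to do when assumption checks fail (power analysis, rank tests, bootstrap).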
Each lesson includes:
- Exercises: Practice problems with solutions
- Conceptual questions: Test your understanding
- Code challenges: Apply techniques to new scenarios
The final lesson includes complete projects:
- Retail Sales Analysis: Time series forecasting
- A/B Test Evaluation: Hypothesis testing workflow
- Survey Data Analysis: Multivariate techniques
- Predictive Modeling: Regression and classification
- "Python for Data Analysis" by Wes McKinney (Pandas creator)
- "The Art of Statistics" by David Spiegelhalter
- "Statistical Rethinking" by Richard McElreath
- Kaggle Learn: Free interactive tutorials
- StatQuest: Video explanations of statistics
- Seeing Theory: Visual probability/statistics
- Check official documentation first
- Use the `help()` function or `?` in Jupyter
- Search Stack Overflow for pandas/numpy questions
- Ask on Cross Validated for statistics questions
- ImportError: Reinstall the library with `pip install --upgrade <library>`
- DeprecationWarning: Check library versions for compatibility
- MemoryError: Use smaller samples or chunking for large datasets
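For the chunking approach, a minimal sketch with `pandas.read_csv` and its `chunksize` parameter (the CSV here is simulated with an in-memory buffer so the example is self-contained):

```python
import io
import pandas as pd

# Stand-in for a large CSV file on disk (simulated in memory)
csv_data = io.StringIO("value\n" + "\n".join(str(i) for i in range(1000)))

# Process the file in fixed-size chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total += chunk["value"].sum()

print(total)  # same result as loading the whole file
```

Each iteration holds only 250 rows in memory, so the peak footprint stays bounded regardless of file size.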
We aim to:
- Build intuition first: Visual and conceptual understanding before formulas
- Connect theory to practice: Every concept with code examples
- Emphasize critical thinking: Know when to use techniques, not just how
Lesson 10 is the heart of this guide. Most courses treat EDA and inference as separate topics. We emphasize the transition:
- EDA generates questions → Inference answers them rigorously
- Visualization suggests patterns → Tests confirm them with controlled error
- Descriptive stats describe your sample → Inference generalizes to populations
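The EDA-to-inference handoff can be made concrete with a permutation test (Lesson 23), sketched here on synthetic data (both samples are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# EDA step: a sample where group B looks larger than group A
a = rng.normal(5.0, 1.0, 40)
b = rng.normal(5.6, 1.0, 40)
observed = b.mean() - a.mean()  # the pattern the plots suggested

# Inference step: how often does pure chance produce a gap this large?
pooled = np.concatenate([a, b])
perm_diffs = np.empty(5000)
for i in range(5000):
    rng.shuffle(pooled)  # break any real group structure
    perm_diffs[i] = pooled[40:].mean() - pooled[:40].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))
print(f"observed diff = {observed:.2f}, permutation p = {p_value:.4f}")
```

The observed difference came from EDA; the permutation distribution turns it into a statement about populations with a controlled error rate.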
- Start here: 01_NumPy_Basics
- Critical bridge: 10_From_EDA_to_Inference
- Final projects: 25_Practical_Projects
Ready to begin your data science journey? Let's start with NumPy fundamentals!