Interpretable AI studies how to understand, explain, and audit machine learning model decisions. As AI systems increasingly affect hiring, lending, healthcare, and criminal justice, the ability to explain why a model made a prediction is no longer optional — it is a legal, ethical, and engineering requirement. This topic covers advanced interpretability methods beyond the fundamentals introduced in Machine Learning Lesson 16, including gradient-based attribution, concept-based explanations, causal inference for explainability, advanced algorithmic fairness, AI regulation, and the emerging field of mechanistic interpretability.
- ML engineers who need to explain model predictions to stakeholders
- Data scientists working in regulated industries (finance, healthcare, insurance)
- AI researchers interested in model understanding and safety
- Software engineers building production explanation systems
- Policy professionals evaluating AI governance frameworks
- Machine_Learning: Especially Lesson 16 (Model Explainability) — SHAP, LIME, PDP/ICE basics
- Deep_Learning: Neural network architectures, backpropagation, CNNs, Transformers
- Python: Comfortable with PyTorch, NumPy, scikit-learn
Section 1: Foundations
L01 Interpretability Foundations ──► L02 Gradient Attribution
│
Section 2: Deep Learning Explanations ▼
L03 Class Activation Mapping ──► L04 Attention Interpretation ──► L05 Probing & Representations
│
Section 3: Advanced Methods ▼
L06 Advanced SHAP ──► L07 Concept-Based Explanations ──► L08 Counterfactual Explanations
│
Section 4: Causal & Theory ▼
L09 Causal Inference for Interpretability ──► L10 Evaluating Explanations
│
Section 5: Fairness ▼
L11 Advanced Algorithmic Fairness ──► L12 Fairness Mitigation
│
Section 6: Regulation & Production ▼
L13 AI Regulation & Governance ──► L14 Production Interpretability
│
Section 7: Domain & Frontier ▼
L15 Domain-Specific Interpretability ──► L16 Mechanistic Interpretability
| Lesson | File | Difficulty | Description |
|---|---|---|---|
| L01 | 01_Interpretability_Foundations.md | ⭐⭐ | Lipton's taxonomy, explanation desiderata, research landscape |
| L02 | 02_Gradient_Attribution.md | ⭐⭐⭐ | Saliency maps, Integrated Gradients, SmoothGrad, sanity checks |
| L03 | 03_Class_Activation_Mapping.md | ⭐⭐⭐ | CAM, GradCAM, GradCAM++, Score-CAM, Eigen-CAM |
| L04 | 04_Attention_Interpretation.md | ⭐⭐⭐⭐ | BertViz, attention rollout, "Attention is not Explanation" debate |
| L05 | 05_Probing_and_Representation_Analysis.md | ⭐⭐⭐⭐ | Probing classifiers, Network Dissection, logit lens, CKA |
| L06 | 06_Advanced_SHAP.md | ⭐⭐⭐ | DeepSHAP, SHAP interactions, Causal SHAP, optimization |
| L07 | 07_Concept_Based_Explanations.md | ⭐⭐⭐⭐ | TCAV, Concept Bottleneck Models, ACE |
| L08 | 08_Counterfactual_Explanations.md | ⭐⭐⭐ | Wachter formulation, DiCE, actionability constraints |
| L09 | 09_Causal_Inference_for_Interpretability.md | ⭐⭐⭐⭐ | SCMs, do-calculus, causal feature importance, DoWhy |
| L10 | 10_Evaluating_Explanations.md | ⭐⭐⭐ | Faithfulness, stability, ROAR benchmark, human evaluation |
| L11 | 11_Advanced_Algorithmic_Fairness.md | ⭐⭐⭐⭐ | Individual fairness, counterfactual fairness, impossibility theorem |
| L12 | 12_Fairness_Mitigation.md | ⭐⭐⭐⭐ | Pre/in/post-processing, Pareto frontiers, Fairlearn, AIF360 |
| L13 | 13_AI_Regulation_and_Governance.md | ⭐⭐⭐ | EU AI Act, GDPR Art. 22, NIST AI RMF, model cards |
| L14 | 14_Production_Interpretability.md | ⭐⭐⭐⭐ | Explanation serving, caching, drift monitoring, MLOps integration |
| L15 | 15_Domain_Specific_Interpretability.md | ⭐⭐⭐ | Healthcare, finance, NLP, computer vision applications |
| L16 | 16_Mechanistic_Interpretability.md | ⭐⭐⭐⭐ | Superposition, sparse autoencoders, circuit discovery, activation patching |
- ⭐⭐ Builds on ML L16 foundations with deeper taxonomy and frameworks
- ⭐⭐⭐ Requires comfortable PyTorch/math skills; implementation-focused
- ⭐⭐⭐⭐ Research-level methods; requires strong math and DL background
# Core
pip install "torch>=2.0" torchvision "transformers>=4.36"
# Explainability libraries
pip install "shap>=0.43" lime captum
# Visualization
pip install matplotlib seaborn
# Fairness toolkits
pip install fairlearn aif360
# Counterfactual explanations
pip install dice-ml
# Causal inference
pip install dowhy
# Attention visualization
pip install bertviz
# Mechanistic interpretability (L16)
pip install transformer-lens
- Machine_Learning: Lesson 16 covers SHAP/LIME/PDP fundamentals (prerequisite)
- Deep_Learning: CNN and Transformer architectures, the model families interpreted here
- Foundation_Models: Scaling and fine-tuning context for mechanistic interpretability
- MLOps: Production deployment context for explanation serving
- Probability_and_Statistics: Statistical testing used in evaluation and fairness
- Complete ML L16 first — this topic assumes you already understand SHAP, LIME, and PDP basics
- Run the code — interpretability is best understood by visualizing explanations on real models
- Compare methods — apply multiple methods to the same prediction and note where they agree/disagree
- Think about the audience — different stakeholders need different types of explanations
- Stay current — mechanistic interpretability is evolving rapidly; check recent papers
- Practice the fairness math — impossibility theorems are counterintuitive until you work through proofs
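The last tip above can be made concrete with a few lines of arithmetic. The sketch below (toy numbers chosen purely for illustration, not from any lesson) builds two groups whose scores are perfectly calibrated but whose base rates differ, then shows that thresholding cannot equalize false positive and false negative rates across them; this is the tension the impossibility theorems formalize.

```python
import numpy as np

# Two groups with perfectly calibrated scores: among examples with
# score s, a fraction s is truly positive. Base rates differ because
# the groups place different probability mass on each score bucket.
groups = {
    "A": {"scores": np.array([0.9, 0.1]), "mass": np.array([0.5, 0.5])},
    "B": {"scores": np.array([0.9, 0.1]), "mass": np.array([0.2, 0.8])},
}

def error_rates(scores, mass, threshold=0.5):
    """Base rate, FPR, and FNR for a thresholded calibrated scorer."""
    pred_pos = scores >= threshold
    p_pos = (mass * scores).sum()                       # P(y = 1)
    fpr = (mass * (1 - scores) * pred_pos).sum() / (1 - p_pos)
    fnr = (mass * scores * ~pred_pos).sum() / p_pos
    return p_pos, fpr, fnr

for name, g in groups.items():
    base, fpr, fnr = error_rates(g["scores"], g["mass"])
    print(f"group {name}: base rate={base:.2f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```

Both groups are calibrated by construction, yet group B's FNR is roughly three times group A's: with unequal base rates, calibration and equalized odds cannot hold simultaneously for an imperfect classifier.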
After completing this topic, you will be able to:
- Implement gradient-based attribution methods (Integrated Gradients, GradCAM) from scratch in PyTorch
- Critically evaluate whether attention weights constitute valid explanations
- Generate and evaluate counterfactual explanations for tabular and image data
- Apply causal reasoning to distinguish genuine feature effects from spurious correlations
- Audit ML models for fairness using multiple definitions and implement mitigation strategies
- Design production systems that serve explanations alongside predictions
- Navigate AI regulatory frameworks (EU AI Act, GDPR) and create compliant documentation
- Apply mechanistic interpretability techniques to understand neural network internals
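As a taste of the first outcome, here is a minimal, framework-free sketch of Integrated Gradients (pure NumPy rather than PyTorch, and a toy analytically differentiable function standing in for a trained model; L02 covers the full method). It approximates the path integral with a Riemann sum and checks the completeness axiom: attributions sum to f(x) - f(baseline).

```python
import numpy as np

def f(x):
    # Toy "model" with a feature interaction: f(x1, x2) = x1 * x2
    return x[0] * x[1]

def grad_f(x):
    # Analytic gradient of f (a framework would supply this via autodiff)
    return np.array([x[1], x[0]])

def integrated_gradients(x, baseline, grad_fn, steps=200):
    """Midpoint-rule approximation of the straight-line path integral
    from baseline to x, scaled by (x - baseline)."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x, dtype=float)
    for a in alphas:
        total += grad_fn(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([3.0, 2.0])
baseline = np.zeros(2)
attrs = integrated_gradients(x, baseline, grad_f)
# Completeness axiom: attrs.sum() == f(x) - f(baseline)
print(attrs, attrs.sum(), f(x) - f(baseline))
```

Note that IG splits the x1*x2 interaction evenly between the two features, which a plain gradient-at-the-input saliency map would not; comparing the two on the same prediction is exactly the kind of exercise the study tips recommend.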
- Foundation_Models: Explore how interpretability scales to large models
- MLOps: Integrate explanation systems into ML pipelines
- Research: Follow Anthropic's mechanistic interpretability research, Google DeepMind's XAI work
License: CC BY-NC 4.0