Week 1 packet for Amber: T2D phenotyping + EHR + neural networks

This packet is designed for a one-week bridge from applied math into biomedical informatics. The goal is not mastery in one week, but a working mental model of:

  1. what EHR / real-world data are,
  2. what a computable phenotype is,
  3. how T2D is defined clinically vs operationally in data,
  4. why comorbidity and coding systems matter,
  5. how a basic neural-network pipeline works in R.

Suggested effort: 10-12 hours total.

How to use AI productively during the rotation

Use ChatGPT Study Mode or any strong reasoning assistant you prefer as a coach, not as a replacement for reading.

Good uses:

  • “Explain the PheKB T2D algorithm as a decision tree.”
  • “Quiz me on diabetes diagnosis thresholds.”
  • “Turn this notebook into a step-by-step checklist.”
  • “Explain what overfitting would look like in my training curves.”
  • “Give me a plain-English explanation of Charlson comorbidity.”
  • “Compare logistic regression vs a 2-layer MLP for tabular EHR data.”

Less productive uses:

  • asking for final answers without checking sources,
  • pasting code blindly,
  • trusting clinical claims without verification.

Best habit: ask the model to explain, then verify against the assigned source.


What I want you to get out of Week 1

By the end of the week, you should be able to answer:

  • What is the difference between data from routine care and data from an RCT?
  • What is a computable phenotype?
  • Why can't T2D be identified reliably from a single data field?
  • What do age, sex/gender, race/ethnicity, ICD codes, and comorbidities contribute to a model?
  • What are the main moving parts in a basic neural network pipeline: preprocessing, architecture, regularization, training, and evaluation?

Core packet (do these first)

1) EHRs and real-world data

Watch first

  1. The Future of Health Care: Electronic Health Records (HealthIT.gov)

  2. Protecting Health Information Video (ASTP/HealthIT.gov)

Read next

  1. FDA: Real-World Evidence

  2. HHS: HIPAA Privacy Rule (skim)

  3. HHS: De-identification guidance (skim just the overview)


2) Computable phenotyping

Read/watch

  1. PheKB: What is the Phenotype KnowledgeBase?

    • https://phekb.org/
    • Read the front-page description only.
    • Goal: understand that a phenotype algorithm is a transportable definition built from data elements.
  2. PheKB: Type 2 Diabetes Mellitus phenotype

  3. OHDSI 2019 Cohort Definition & Phenotyping Tutorial (Part 1)

  4. OHDSI Cohort Definition and Phenotyping Tutorial (2018)

Write 3 bullets after this section:

  • what a computable phenotype is,
  • why “case” and “control” need explicit rules,
  • one reason phenotype transport across sites is hard.
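
To make "explicit rules" concrete, here is a minimal base-R sketch of a rule-based case/control assignment. This is not the PheKB T2D algorithm: the column names (`has_t1d_code`, `has_t2d_code`, `on_t2d_med`, `max_a1c`), the function `classify_toy`, and the rules themselves are hypothetical and chosen only to show the general shape of such a definition.

```r
# Toy patient-level table; every column name and value here is made up.
patients <- data.frame(
  id           = 1:5,
  has_t1d_code = c(FALSE, FALSE, TRUE,  FALSE, FALSE),
  has_t2d_code = c(TRUE,  FALSE, TRUE,  FALSE, TRUE),
  on_t2d_med   = c(TRUE,  FALSE, FALSE, FALSE, FALSE),
  max_a1c      = c(7.2,   5.1,   8.0,   6.8,   NA)
)

# Hypothetical rules, for illustration only (not a validated phenotype).
classify_toy <- function(df) {
  abnormal_lab <- !is.na(df$max_a1c) & df$max_a1c >= 6.5
  any_evidence <- df$has_t2d_code | df$on_t2d_med | abnormal_lab
  # Case: T2D code plus at least one supporting source, and no T1D code.
  is_case <- !df$has_t1d_code & df$has_t2d_code &
    (df$on_t2d_med | abnormal_lab)
  # Control: no diabetes evidence at all, and an A1C result on record.
  is_control <- !df$has_t1d_code & !any_evidence & !is.na(df$max_a1c)
  ifelse(is_case, "case", ifelse(is_control, "control", "excluded"))
}

patients$label <- classify_toy(patients)
print(patients[, c("id", "label")])
```

Notice that three of the five toy patients end up as neither case nor control. Real phenotype algorithms tend to behave the same way: ambiguous patients are excluded rather than forced into a class.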

3) Type 2 diabetes basics

Read

  1. CDC: Diabetes Basics

  2. NIDDK: Type 2 Diabetes

  3. NIDDK: Diabetes Tests & Diagnosis

  4. NIDDK: The A1C Test

  5. CDC: A1C Test for Diabetes and Prediabetes
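
As a quick self-check after the A1C readings, the commonly cited ranges (below 5.7% normal, 5.7% to 6.4% prediabetes, 6.5% or above diabetes) can be written as a small base-R function. Please confirm the cut-points against the assigned NIDDK and CDC pages rather than trusting this sketch; the helper name `a1c_category` is made up for illustration.

```r
# Hypothetical helper: bin A1C values (percent) into the commonly cited ranges.
a1c_category <- function(a1c) {
  cut(
    a1c,
    breaks = c(-Inf, 5.7, 6.5, Inf),
    labels = c("normal", "prediabetes", "diabetes"),
    right  = FALSE  # intervals are [lower, upper), so 6.5 falls in "diabetes"
  )
}

a1c_values <- c(5.2, 5.7, 6.4, 6.5, 9.1, NA)
result <- a1c_category(a1c_values)
print(data.frame(a1c = a1c_values, category = result))
```

A single lab value in a dataset is evidence, not a diagnosis, which is one reason the operational definition in data differs from the clinical one.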


4) Comorbidity, Charlson, and ICD codes

Read/watch

  1. NIH glossary: Comorbidity

  2. NCI Comorbidity Index Overview

  3. CDC: ICD-10-CM

  4. CMS web training: Diagnosis Coding, Using the ICD-10-CM

  5. AHRQ HCUP Software Tools Tutorial

Why this matters for our project

  • In your modeling dataset, many comorbidity variables are already present as patient-level binary features.
  • Think of these as a compressed summary of disease burden and clinical history.
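
To see how raw ICD-10-CM codes could turn into patient-level binary features, here is a small base-R sketch. The diagnosis table and the three-entry `category_map` are invented for illustration (E11 is the ICD-10-CM category for type 2 diabetes, I50 for heart failure, N18 for chronic kidney disease); it is not the Charlson mapping and not necessarily how our dataset was built.

```r
# Made-up long-format diagnosis table: one row per patient per code.
dx <- data.frame(
  patient_id = c(1, 1, 2, 3, 3, 3),
  icd10      = c("E11.9", "I50.9", "N18.3", "E11.65", "N18.4", "I50.22"),
  stringsAsFactors = FALSE
)

# Illustrative map keyed by the 3-character ICD-10-CM category.
category_map <- c(E11 = "t2d", I50 = "heart_failure", N18 = "ckd")
dx$category <- unname(category_map[substr(dx$icd10, 1, 3)])

# Patient x category table of counts, then collapse to 0/1 flags.
counts <- table(dx$patient_id, dx$category)
flags  <- (counts > 0) * 1L
print(flags)

# A crude, unweighted burden count (Charlson assigns weights instead).
burden <- rowSums(flags)
print(burden)
```

The flags matrix is the "compressed summary" idea in miniature: many codes per patient become a handful of 0/1 columns.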

5) Neural networks: intuition first

Watch in this order

  1. 3Blue1Brown: Neural networks topic page

  2. But what is a Neural Network?

  3. Gradient descent, how neural networks learn

  4. What is backpropagation really doing?

    • Also from the same topic page.
  5. Backpropagation calculus

    • Optional for this week.

What to pay attention to

  • input features,
  • hidden layers,
  • weights and biases,
  • nonlinear activation,
  • loss function,
  • gradient descent,
  • train vs validation performance.
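
If it helps to see every item on that list in one place, here is a minimal from-scratch sketch in base R: a one-hidden-layer network trained with full-batch gradient descent on synthetic data. Everything in it (the data, the layer size `h`, the learning rate `lr`, the number of epochs) is an arbitrary assumption for illustration; in practice you would use torch or keras3 rather than hand-written gradients.

```r
set.seed(1)

# Input features: two synthetic columns with a nonlinear binary label.
n <- 400
X <- matrix(rnorm(n * 2), ncol = 2)
y <- as.numeric(X[, 1] + X[, 2]^2 > 1)

# Hold out 25% of rows for validation.
val_idx <- sample(n, n / 4)
X_tr <- X[-val_idx, ]; y_tr <- y[-val_idx]
X_va <- X[val_idx, ];  y_va <- y[val_idx]

sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: inputs -> hidden layer (tanh activation) -> probability.
forward <- function(X, p) {
  A1 <- tanh(sweep(X %*% p$W1, 2, p$b1, "+"))
  prob <- sigmoid(A1 %*% p$W2 + p$b2)
  list(A1 = A1, prob = as.vector(prob))
}

# Loss function: binary cross-entropy, clipped to avoid log(0).
bce <- function(prob, y) {
  prob <- pmin(pmax(prob, 1e-12), 1 - 1e-12)
  -mean(y * log(prob) + (1 - y) * log(1 - prob))
}

# Weights and biases, randomly initialised.
h <- 8
params <- list(
  W1 = matrix(rnorm(2 * h, sd = 0.5), 2, h), b1 = rep(0, h),
  W2 = matrix(rnorm(h, sd = 0.5), h, 1),     b2 = 0
)

lr <- 0.3
epochs <- 300
history <- data.frame(epoch = 1:epochs, train = NA_real_, val = NA_real_)

for (e in 1:epochs) {
  out <- forward(X_tr, params)
  history$train[e] <- bce(out$prob, y_tr)
  history$val[e]   <- bce(forward(X_va, params)$prob, y_va)

  # Backpropagation: gradients of the mean loss w.r.t. each parameter.
  dz2 <- matrix((out$prob - y_tr) / length(y_tr), ncol = 1)
  dW2 <- t(out$A1) %*% dz2
  db2 <- sum(dz2)
  dZ1 <- (dz2 %*% t(params$W2)) * (1 - out$A1^2)
  dW1 <- t(X_tr) %*% dZ1
  db1 <- colSums(dZ1)

  # Gradient descent step.
  params$W1 <- params$W1 - lr * dW1
  params$b1 <- params$b1 - lr * db1
  params$W2 <- params$W2 - lr * dW2
  params$b2 <- params$b2 - lr * db2
}

print(history[c(1, epochs), ])
```

Plotting `history$train` and `history$val` against `history$epoch` gives the training curves the videos talk about; a validation curve that turns upward while the training curve keeps falling is the usual sign of overfitting.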

6) Neural networks in R

Pick one main path and skim the other.

Path A (recommended): torch for R

  1. torch for R: Start here

  2. Guess the correlation

  3. torch technical tutorial series

Path B (also good): Keras / TensorFlow for R

  1. TensorFlow for R tutorials hub

  2. Basic Regression

  3. Basic Image Classification

  4. Overfit and underfit

  5. Sequential model guide

  6. Training & evaluation with built-in methods


What not to over-focus on this week

Do not spend much time on:

  • obscure clinical edge cases,
  • memorizing code lists,
  • fancy deep architectures,
  • squeezing out benchmark performance.

This week is about building intuition and vocabulary.


Light exploratory data analysis (EDA) on the T2D dataset

When you start EDA, check these first:

  1. sample size and class balance (i.e., prevalence of T2D),
  2. age distribution,
  3. counts by gender / race / ethnicity,
  4. prevalence of each comorbidity feature,
  5. missingness or odd category values,
  6. duplicate rows or identifier issues,
  7. whether any text-like medication column should be dropped for the first model.
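
The first six checks can be sketched in base R roughly as follows. The data frame below is a made-up stand-in, and every column name (`patient_id`, `t2d`, `age`, `gender`, `chf`, `ckd`) is hypothetical, so swap in the real names once you have opened the dataset.

```r
# Made-up stand-in for the modeling dataset; all column names are hypothetical.
df <- data.frame(
  patient_id = c(1, 2, 3, 3, 4, 5),
  t2d        = c(1, 0, 0, 0, 1, 0),
  age        = c(64, 41, 57, 57, 70, NA),
  gender     = c("F", "M", "F", "F", "unknown", "M"),
  chf        = c(1, 0, 0, 0, 1, 0),
  ckd        = c(0, 0, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# 1. Sample size and class balance (T2D prevalence).
n_rows     <- nrow(df)
prevalence <- mean(df$t2d)

# 2-3. Age distribution and demographic counts (keep NAs visible).
print(summary(df$age))
print(table(df$gender, useNA = "ifany"))

# 4. Prevalence of each binary comorbidity feature.
comorbidity_cols <- c("chf", "ckd")
print(colMeans(df[comorbidity_cols]))

# 5. Missing values per column.
missing_per_col <- colSums(is.na(df))

# 6. Fully duplicated rows and repeated identifiers.
n_dup_rows <- sum(duplicated(df))
n_dup_ids  <- sum(duplicated(df$patient_id))

print(list(n = n_rows, prevalence = prevalence,
           missing = missing_per_col,
           dup_rows = n_dup_rows, dup_ids = n_dup_ids))
```

The toy data deliberately includes one missing age, one odd category value ("unknown"), and one duplicated row so each check has something to find.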

For this rotation, demographics are not the hard concept. They are mostly an EDA / encoding issue. The harder concept is how demographics, comorbidities, and phenotype rules interact.
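
As a rough picture of what "encoding" means here, the base-R sketch below one-hot encodes two categorical columns and standardizes age. The column names and category levels are placeholders, and this is one possible approach rather than the project's preprocessing; the recipes package listed below offers a more structured way to do the same thing.

```r
# Made-up demographics; column names and levels are hypothetical.
demo <- data.frame(
  age    = c(64, 41, 57, 70),
  gender = factor(c("F", "M", "F", "M")),
  race   = factor(c("A", "B", "C", "A"))
)

# One-hot (dummy) encode the factors; "- 1" drops the intercept column.
# model.matrix keeps all levels of the first factor and drops one level
# of each later factor to avoid redundant columns.
X <- model.matrix(~ gender + race + age - 1, data = demo)

# Standardize age. In a real pipeline, compute the mean and sd on the
# training split only and reuse them for validation and test rows.
age_mean <- mean(demo$age)
age_sd   <- sd(demo$age)
X[, "age"] <- (X[, "age"] - age_mean) / age_sd

print(X)
```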


Optional R packages to know about later

These are not required for Week 1, but useful later:

  • arrow for parquet files
  • tidyverse for data wrangling and visualization
  • recipes / rsample / yardstick for preprocessing and evaluation
  • torch and luz for neural networks in R
  • keras3 for simple deep learning in R
  • comorbidity for computing Charlson / Elixhauser scores from ICD data