Week 1 packet for Amber: T2D phenotyping + EHR + neural networks

This packet is a one-week bridge from applied math into biomedical informatics. The goal is not mastery but a working mental model of:

what EHR / real-world data are,
what a computable phenotype is,
how T2D is defined clinically vs operationally in data,
why comorbidity and coding systems matter,
how a basic neural-network pipeline works in R.

Suggested effort: 10-12 hours total.

How to use AI productively during the rotation

Use ChatGPT Study Mode or any strong reasoning assistant you prefer as a coach, not as a replacement for reading.

Good uses:

“Explain the PheKB T2D algorithm as a decision tree.”
“Quiz me on diabetes diagnosis thresholds.”
“Turn this notebook into a step-by-step checklist.”
“Explain what overfitting would look like in my training curves.”
“Give me a plain-English explanation of Charlson comorbidity.”
“Compare logistic regression vs a 2-layer MLP for tabular EHR data.”

Less useful uses:

asking for final answers without checking sources,
pasting code blindly,
trusting clinical claims without verification.

Best habit: ask the model to explain, then verify against the assigned source.

What I want you to get out of Week 1

By the end of the week, you should be able to answer:

What is the difference between data from routine care and data from a randomized controlled trial (RCT)?
What is a computable phenotype?
Why can T2D not be identified reliably from a single field alone?
What do age, sex/gender, race/ethnicity, ICD codes, and comorbidities contribute to a model?
What are the main moving parts in a basic neural-network pipeline: preprocessing, architecture, regularization, training, and evaluation?

Core packet (do these first)

1) EHRs and real-world data

Watch first

The Future of Health Care: Electronic Health Records (HealthIT.gov)
- https://www.healthit.gov/video/future-health-care-electronic-health-records
- Goal: get a fast, nontechnical picture of what EHRs are and why they matter.
Protecting Health Information Video (ASTP/HealthIT.gov)
- https://www.healthit.gov/resources/video-protecting-health-information/
- Goal: remember that health data work always sits inside privacy/security constraints.

2) Computable phenotyping

Read/watch

PheKB: What is the Phenotype KnowledgeBase?
- https://phekb.org/
- Read the front-page description only.
- Goal: understand that a phenotype algorithm is a transportable definition built from data elements.
PheKB: Type 2 Diabetes Mellitus phenotype
- https://www.phekb.org/phenotype/type-2-diabetes-mellitus
- Do not try to memorize all the code lists.
- Focus on the logic: diagnoses + meds + labs + exclusions.
OHDSI 2019 Cohort Definition & Phenotyping Tutorial (Part 1)
- https://www.ohdsi.org/2019-tutorials-cohort-definition-phenotyping/
- Watch the introductory video.
- Goal: see how phenotype definitions are treated as explicit cohort logic rather than vague clinical intuition.
OHDSI Cohort Definition and Phenotyping Tutorial (2018)
- https://www.ohdsi.org/past-events/cohort-definitionphenotyping-tutorial/
- Optional this week; useful later for vocabulary basics and phenotype evaluation.

Write 3 bullets after this section:

what a computable phenotype is,
why “case” and “control” need explicit rules,
one reason phenotype transport across sites is hard.
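
To make the "diagnoses + meds + labs + exclusions" idea concrete, here is a minimal base R sketch of a rule-based phenotype. It is a toy illustration, not the PheKB algorithm: the column names (`n_t2d_dx`, `on_t2d_med`, `max_a1c`, `has_t1d_dx`), the data, and the "two of three kinds of evidence" rule are all assumptions made up for this example.

```r
# Toy patient-level table; every column name and value here is hypothetical.
patients <- data.frame(
  id         = 1:5,
  n_t2d_dx   = c(3, 0, 1, 2, 0),                      # count of T2D diagnosis codes
  on_t2d_med = c(TRUE, FALSE, FALSE, TRUE, FALSE),    # any T2D medication on record
  max_a1c    = c(8.1, 5.2, 6.7, 7.4, NA),             # highest recorded A1C (%)
  has_t1d_dx = c(FALSE, FALSE, FALSE, TRUE, FALSE)    # exclusion: type 1 codes
)

# Illustrative rule (an assumption, not a published definition):
#   excluded      = any type 1 diabetes code
#   case          = at least two of {diagnosis, medication, abnormal lab}
#   control       = no evidence at all AND a lab value exists to support that
#   indeterminate = everything else
classify_t2d <- function(df) {
  abnormal_lab <- !is.na(df$max_a1c) & df$max_a1c >= 6.5
  evidence <- (df$n_t2d_dx >= 1) + df$on_t2d_med + abnormal_lab
  ifelse(df$has_t1d_dx, "excluded",
    ifelse(evidence >= 2, "case",
      ifelse(evidence == 0 & !is.na(df$max_a1c), "control", "indeterminate")))
}

patients$status <- classify_t2d(patients)
print(patients[, c("id", "status")])
```

Notice patient 5: no diagnosis, no medication, but also no lab value. Absence of evidence is not evidence of absence, which is one reason controls need rules as explicit as cases.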

3) Type 2 diabetes basics

Read

CDC: Diabetes Basics
- https://www.cdc.gov/diabetes/about/index.html
- Goal: get the broad picture and public-health framing.
NIDDK: Type 2 Diabetes
- https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes/type-2-diabetes
- Focus on: what T2D is, symptoms, causes, and why diagnosis can be delayed.
NIDDK: Diabetes Tests & Diagnosis
- https://www.niddk.nih.gov/health-information/diabetes/overview/tests-diagnosis
- Focus on the thresholds for A1C, fasting plasma glucose, OGTT, and random glucose.
NIDDK: The A1C Test
- https://www.niddk.nih.gov/health-information/diagnostic-tests/a1c-test
- Goal: understand what A1C measures and why it is useful.
CDC: A1C Test for Diabetes and Prediabetes
- https://www.cdc.gov/diabetes/diabetes-testing/prediabetes-a1c-test.html
- Quick reinforcement of thresholds.
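
Once you have read the thresholds, it can help to write them down as code. The sketch below encodes the commonly published cut points for A1C, fasting plasma glucose (FPG), and the 2-hour OGTT; please confirm each number against the NIDDK and CDC pages above rather than trusting this block. The function names are made up for illustration. Random glucose is left out on purpose, because that criterion also depends on symptoms, which a lab table alone does not capture.

```r
# Half-open intervals [lower, upper), so a value exactly at a cut point
# falls into the higher category.
labels3 <- c("normal", "prediabetes", "diabetes")

classify_a1c <- function(a1c_percent) {
  cut(a1c_percent, breaks = c(-Inf, 5.7, 6.5, Inf), labels = labels3, right = FALSE)
}

classify_fpg <- function(fpg_mg_dl) {
  cut(fpg_mg_dl, breaks = c(-Inf, 100, 126, Inf), labels = labels3, right = FALSE)
}

classify_ogtt_2h <- function(glucose_mg_dl) {
  cut(glucose_mg_dl, breaks = c(-Inf, 140, 200, Inf), labels = labels3, right = FALSE)
}

a1c_values <- c(5.6, 5.7, 6.4, 6.5, NA)
print(data.frame(a1c = a1c_values, category = classify_a1c(a1c_values)))
```

This is a study aid for the cut points, not a diagnostic tool: a single lab value in an EHR can be missing, mistimed, or unconfirmed, which is part of why T2D cannot be identified reliably from one field.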

4) Comorbidity, Charlson, and ICD codes

Read/watch

NIH glossary: Comorbidity
- https://clinicalinfo.hiv.gov/en/glossary/comorbidity
- 1-minute definition.
NCI Comorbidity Index Overview
- https://healthcaredelivery.cancer.gov/seermedicare/considerations/comorbidity.html
- Focus on the history of the Charlson index and why ICD-coded data are used to derive comorbidity burden.
CDC: ICD-10-CM
- https://www.cdc.gov/nchs/icd/icd-10-cm/
- Read the “What to know” section only.
CMS web training: Diagnosis Coding, Using the ICD-10-CM
- https://gov-mirror.org/www.cms.gov/Outreach-and-Education/MLN/WBT/MLN6447308-ICD-10-CM/icd10cm/index.html
- Optional this week. Do just Lesson 1 (ICD-10 Basics) if time permits.
AHRQ HCUP Software Tools Tutorial
- https://hcup-us.ahrq.gov/tech_assist/software/508course.jsp
- Optional. Useful later for seeing how ICD diagnosis codes are operationalized into comorbidity systems.

Why this matters for our project

In your modeling dataset, many comorbidity variables are already present as patient-level binary features.
Think of these as a compressed summary of disease burden and clinical history.
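
If it helps to see where such binary features come from, the sketch below turns a long table of ICD-10-CM codes into one 0/1 column per condition. The data and the four-entry prefix map are toy assumptions; a real Charlson or Elixhauser mapping is much longer, version-specific, and (for Charlson) weighted, so treat the unweighted count here as a stand-in only.

```r
# Long-format diagnosis table: one row per (patient, ICD-10-CM code). Toy data.
dx <- data.frame(
  id   = c(1, 1, 2, 3, 3, 3),
  code = c("E11.9", "I50.9", "J44.1", "E11.65", "N18.9", "I10"),
  stringsAsFactors = FALSE
)

# Illustrative 3-character category prefixes (an assumption, not a full mapping).
prefix_map <- c(diabetes_t2 = "E11", heart_failure = "I50",
                copd = "J44", ckd = "N18")

all_ids <- sort(unique(dx$id))
flags <- data.frame(id = all_ids)
for (name in names(prefix_map)) {
  # Patients with at least one code starting with this prefix
  hit_ids <- unique(dx$id[startsWith(dx$code, prefix_map[[name]])])
  flags[[name]] <- as.integer(all_ids %in% hit_ids)
}

# Unweighted count of flagged conditions (NOT the Charlson index)
flags$n_comorbid <- rowSums(flags[names(prefix_map)])
print(flags)
```

Patient 3 also has `I10` (hypertension), which the toy map ignores. What a mapping leaves out is as much a modeling decision as what it includes.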

5) Neural networks: intuition first

Watch in this order

3Blue1Brown: Neural networks topic page
- https://www.3blue1brown.com/topics/neural-networks
But what is a Neural Network?
- https://www.3blue1brown.com/lessons/neural-networks
Gradient descent, how neural networks learn
- https://www.3blue1brown.com/topics/neural-networks
- Use the topic page above to access the lesson.
What is backpropagation really doing?
- Also from the same topic page.
Backpropagation calculus
- Optional for this week.

What to pay attention to

input features,
hidden layers,
weights and biases,
nonlinear activation,
loss function,
gradient descent,
train vs validation performance.
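
After the videos, you may find it useful to see every item on that list in one short script. The following is a from-scratch sketch in base R, written for intuition only: the simulated data, the layer size, the learning rate, and the number of epochs are arbitrary choices, and in practice you would use torch or Keras rather than hand-written gradients.

```r
set.seed(1)

# Input features: two simulated columns with an XOR-like label,
# so a purely linear model cannot separate the classes.
n <- 200
X <- matrix(rnorm(n * 2), ncol = 2)
y <- as.numeric(X[, 1] * X[, 2] > 0)
train <- 1:150
valid <- 151:200

sigmoid <- function(z) 1 / (1 + exp(-z))

# Loss function: binary cross-entropy, clipped to avoid log(0).
bce <- function(p, y) {
  p <- pmin(pmax(p, 1e-7), 1 - 1e-7)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Weights and biases for one hidden layer of 8 units.
h  <- 8
W1 <- matrix(rnorm(2 * h, sd = 0.5), nrow = 2)
b1 <- rep(0, h)
W2 <- matrix(rnorm(h, sd = 0.5), ncol = 1)
b2 <- 0

# Forward pass: linear map -> tanh (nonlinear activation) -> sigmoid output.
forward <- function(X, W1, b1, W2, b2) {
  A1 <- tanh(sweep(X %*% W1, 2, b1, "+"))
  p  <- as.vector(sigmoid(A1 %*% W2 + b2))
  list(A1 = A1, p = p)
}

lr <- 0.2
n_epochs <- 2000
train_loss <- numeric(n_epochs)
valid_loss <- numeric(n_epochs)

for (epoch in seq_len(n_epochs)) {
  out <- forward(X[train, ], W1, b1, W2, b2)
  train_loss[epoch] <- bce(out$p, y[train])
  valid_loss[epoch] <- bce(forward(X[valid, ], W1, b1, W2, b2)$p, y[valid])

  # Backpropagation: gradients of the mean loss for each parameter.
  dZ2 <- matrix(out$p - y[train], ncol = 1) / length(train)
  dW2 <- t(out$A1) %*% dZ2
  db2 <- sum(dZ2)
  dZ1 <- (dZ2 %*% t(W2)) * (1 - out$A1^2)   # tanh'(z) = 1 - tanh(z)^2
  dW1 <- t(X[train, ]) %*% dZ1
  db1 <- colSums(dZ1)

  # Gradient descent step.
  W1 <- W1 - lr * dW1; b1 <- b1 - lr * db1
  W2 <- W2 - lr * dW2; b2 <- b2 - lr * db2
}

cat("train loss: first", round(train_loss[1], 3),
    "last", round(train_loss[n_epochs], 3), "\n")
cat("valid loss: first", round(valid_loss[1], 3),
    "last", round(valid_loss[n_epochs], 3), "\n")
```

Try plotting `train_loss` and `valid_loss` against epoch. The gap between the two curves is the "train vs validation performance" item from the list.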

6) Neural networks in R

Pick one main path and skim the other.

Path A (recommended): torch for R

torch for R: Start here
- https://torch.mlverse.org/start/
Guess the correlation
- https://torch.mlverse.org/start/guess_the_correlation/
- End-to-end example with data loading, model definition, training, and evaluation.
torch technical tutorial series
- https://torch.mlverse.org/technical/
- Great for tensors, autograd, modules, and optimizers.

Path B (also good): Keras / TensorFlow for R

TensorFlow for R tutorials hub
- https://tensorflow.rstudio.com/tutorials/
Basic Regression
- https://tensorflow.rstudio.com/tutorials/keras/regression
- Good first example for tabular features.
Basic Image Classification
- https://tensorflow.rstudio.com/tutorials/keras/classification.html
- Good for learning the standard training pattern.
Overfit and underfit
- https://tensorflow.rstudio.com/tutorials/keras/overfit_and_underfit.html
- Important for dropout, L2 regularization, early stopping, and why bigger is not always better.
Sequential model guide
- https://tensorflow.rstudio.com/guides/keras/sequential_model.html
- Good reference for how to build a simple stack of dense layers.
Training & evaluation with built-in methods
- https://tensorflow.rstudio.com/guides/keras/training_with_built_in_methods.html
- Useful when you start changing callbacks and learning-rate schedules.
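
The early-stopping idea from the overfit/underfit tutorial is simple enough to sketch without any deep learning library. The function below is a hypothetical helper, not the Keras or torch implementation: it scans a vector of validation losses and reports where training would have stopped under a given `patience`. The loss values are invented for the example.

```r
# Stop when validation loss has not improved by more than `min_delta`
# for `patience` consecutive epochs. Returns the stopping epoch and the
# epoch with the best validation loss seen so far.
early_stop_epoch <- function(val_loss, patience = 3, min_delta = 0) {
  best <- Inf
  best_epoch <- 0
  waited <- 0
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best - min_delta) {
      best <- val_loss[epoch]
      best_epoch <- epoch
      waited <- 0
    } else {
      waited <- waited + 1
      if (waited >= patience) {
        return(list(stopped_at = epoch, best_epoch = best_epoch))
      }
    }
  }
  # Ran out of epochs without triggering the stop rule
  list(stopped_at = length(val_loss), best_epoch = best_epoch)
}

# Made-up validation curve: improves for four epochs, then drifts upward.
val_loss <- c(0.70, 0.61, 0.55, 0.52, 0.53, 0.54, 0.56, 0.60)
res <- early_stop_epoch(val_loss, patience = 3)
print(res)
```

Here training would stop at epoch 7, and the weights you would want to keep are those from epoch 4. Validation loss rising while training loss keeps falling is the classic overfitting signature.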

What not to over-focus on this week

Do not spend much time on:

obscure clinical edge cases,
memorizing code lists,
fancy deep architectures,
squeezing out benchmark performance.

This week is about building intuition and vocabulary.

Light exploratory data analysis (EDA) on the T2D dataset

When you start EDA, check these first:

sample size and class balance (i.e., prevalence of T2D),
age distribution,
counts by gender / race / ethnicity,
prevalence of each comorbidity feature,
missingness or odd category values,
duplicate rows or identifier issues,
whether any text-like medication column should be dropped for the first model.
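
Most of these checks are one-liners in base R. The sketch below runs them on a tiny made-up data frame; the column names (`patient_id`, `t2d`, `age`, `gender`, `htn`, `ckd`) are placeholders, so swap in whatever the real dataset uses.

```r
# Toy stand-in for the project dataset; all names and values are hypothetical.
df <- data.frame(
  patient_id = c(1, 2, 3, 4, 4, 6),
  t2d        = c(1, 0, 0, 1, 1, 0),
  age        = c(54, 37, NA, 71, 71, 45),
  gender     = c("F", "M", "F", "M", "M", "unknown"),
  htn        = c(1, 0, 1, 1, 1, 0),
  ckd        = c(0, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

cat("rows:", nrow(df), "\n")                   # sample size
prevalence <- mean(df$t2d)                     # class balance
cat("T2D prevalence:", prevalence, "\n")

print(summary(df$age))                         # age distribution (reports NAs)
print(table(df$gender, useNA = "ifany"))       # counts, including odd categories

comorbid_prev <- colMeans(df[c("htn", "ckd")]) # prevalence of each 0/1 feature
print(comorbid_prev)

missing_per_col <- colSums(is.na(df))          # missingness by column
print(missing_per_col)

n_dup_rows <- sum(duplicated(df))              # fully duplicated rows
n_dup_ids  <- sum(duplicated(df$patient_id))   # repeated identifiers
cat("duplicate rows:", n_dup_rows, " duplicate ids:", n_dup_ids, "\n")
```

In this toy table the duplicated patient is a T2D case, so the prevalence is inflated until the duplicate is removed. That is why the duplicate check belongs before any modeling.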

For this rotation, demographics are not the hard concept. They are mostly an EDA / encoding issue. The harder concept is how demographics, comorbidities, and phenotype rules interact.

Optional R packages to know about later

These are not required for Week 1, but useful later:

arrow for parquet files
tidyverse for data wrangling and visualization
recipes / rsample / yardstick for preprocessing and evaluation
torch and luz for neural networks in R
keras3 for simple deep learning in R
comorbidity for computing Charlson / Elixhauser scores from ICD data
