This packet is designed for a one-week bridge from applied math into biomedical informatics. The goal is not mastery in one week. The goal is to build a working mental model of:
- what EHR / real-world data are,
- what a computable phenotype is,
- how T2D is defined clinically vs operationally in data,
- why comorbidity and coding systems matter,
- how a basic neural-network pipeline works in R.
Suggested effort: 10-12 hours total.
Use ChatGPT Study Mode or any strong reasoning assistant you prefer as a coach, not as a replacement for reading.
Good uses:
- “Explain the PheKB T2D algorithm as a decision tree.”
- “Quiz me on diabetes diagnosis thresholds.”
- “Turn this notebook into a step-by-step checklist.”
- “Explain what overfitting would look like in my training curves.”
- “Give me a plain-English explanation of Charlson comorbidity.”
- “Compare logistic regression vs a 2-layer MLP for tabular EHR data.”
Less useful:
- asking for final answers without checking sources,
- pasting code blindly,
- trusting clinical claims without verification.
Best habit: ask the model to explain, then verify against the assigned source.
By the end of the week, you should be able to answer:
- What is the difference between data from routine care and data from an RCT?
- What is a computable phenotype?
- Why can T2D not be identified reliably from a single field alone?
- What do age, sex/gender, race/ethnicity, ICD codes, and comorbidities contribute to a model?
- What are the main moving parts in a basic neural network pipeline: preprocessing, architecture, regularization, training, and evaluation?
- The Future of Health Care: Electronic Health Records (HealthIT.gov)
  - https://www.healthit.gov/video/future-health-care-electronic-health-records
  - Goal: get a fast, nontechnical picture of what EHRs are and why they matter.
- Protecting Health Information Video (ASTP/HealthIT.gov)
  - https://www.healthit.gov/resources/video-protecting-health-information/
  - Goal: remember that health data work always sits inside privacy/security constraints.
- FDA: Real-World Evidence
  - https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence
  - Read only the definitions and examples.
  - Focus question: what counts as real-world data, and how is it different from evidence from a randomized trial?
- HHS: HIPAA Privacy Rule (skim)
  - https://www.hhs.gov/hipaa/for-professionals/privacy/index.html
  - Focus question: what is protected health information (PHI), in plain language?
- HHS: De-identification guidance (skim just the overview)
  - https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
  - Focus question: why is “de-identified” not the same as “zero risk”?
- PheKB: What is the Phenotype KnowledgeBase?
  - https://phekb.org/
  - Read the front-page description only.
  - Goal: understand that a phenotype algorithm is a transportable definition built from data elements.
- PheKB: Type 2 Diabetes Mellitus phenotype
  - https://www.phekb.org/phenotype/type-2-diabetes-mellitus
  - Do not try to memorize all the code lists.
  - Focus on the logic: diagnoses + meds + labs + exclusions.
- OHDSI 2019 Cohort Definition & Phenotyping Tutorial (Part 1)
  - https://www.ohdsi.org/2019-tutorials-cohort-definition-phenotyping/
  - Watch the introductory video.
  - Goal: see how phenotype definitions are treated as explicit cohort logic rather than vague clinical intuition.
- OHDSI Cohort Definition and Phenotyping Tutorial (2018)
  - https://www.ohdsi.org/past-events/cohort-definitionphenotyping-tutorial/
  - Optional this week; useful later for vocabulary basics and phenotype evaluation.
Write 3 bullets after this section:
- what a computable phenotype is,
- why “case” and “control” need explicit rules,
- one reason phenotype transport across sites is hard.
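As a rough illustration of the "diagnoses + meds + labs + exclusions" logic, the sketch below applies a toy case/control rule to a made-up patient table in base R. It is not the PheKB algorithm: the column names (has_t2d_dx, has_t1d_dx, on_t2d_med, max_a1c), the rule, and the data are all hypothetical, chosen only to show what "explicit rules" look like in code.

```r
# Toy patient-level table; every column name and value here is hypothetical.
patients <- data.frame(
  patient_id = 1:5,
  has_t2d_dx = c(TRUE,  TRUE,  FALSE, FALSE, TRUE),   # any T2D diagnosis code
  has_t1d_dx = c(FALSE, TRUE,  FALSE, FALSE, FALSE),  # any T1D diagnosis code (exclusion)
  on_t2d_med = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),  # any T2D medication
  max_a1c    = c(7.2,   8.1,   6.9,   5.4,   NA)      # highest recorded A1C (%)
)

# Illustrative rule, NOT the PheKB algorithm:
#   case    = no T1D code AND at least two of {T2D code, T2D med, A1C >= 6.5}
#   control = no diabetes code, no T2D med, and a recorded A1C below 5.7
#   unknown = everyone else
classify_toy_t2d <- function(df) {
  abnormal_lab <- !is.na(df$max_a1c) & df$max_a1c >= 6.5
  normal_lab   <- !is.na(df$max_a1c) & df$max_a1c < 5.7
  n_evidence   <- df$has_t2d_dx + df$on_t2d_med + abnormal_lab

  status <- rep("unknown", nrow(df))
  status[!df$has_t1d_dx & n_evidence >= 2] <- "case"
  status[!df$has_t2d_dx & !df$has_t1d_dx & !df$on_t2d_med & normal_lab] <- "control"
  status
}

patients$status <- classify_toy_t2d(patients)
print(patients[, c("patient_id", "status")])
```

Note that some patients land in "unknown": with explicit rules, not everyone who fails the case definition automatically qualifies as a control.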
- CDC: Diabetes Basics
  - https://www.cdc.gov/diabetes/about/index.html
  - Goal: get the broad picture and public-health framing.
- NIDDK: Type 2 Diabetes
  - https://www.niddk.nih.gov/health-information/diabetes/overview/what-is-diabetes/type-2-diabetes
  - Focus on: what T2D is, symptoms, causes, and why diagnosis can be delayed.
- NIDDK: Diabetes Tests & Diagnosis
  - https://www.niddk.nih.gov/health-information/diabetes/overview/tests-diagnosis
  - Focus on the thresholds for A1C, fasting plasma glucose, OGTT, and random glucose.
- NIDDK: The A1C Test
  - https://www.niddk.nih.gov/health-information/diagnostic-tests/a1c-test
  - Goal: understand what A1C measures and why it is useful.
- CDC: A1C Test for Diabetes and Prediabetes
  - https://www.cdc.gov/diabetes/diabetes-testing/prediabetes-a1c-test.html
  - Quick reinforcement of thresholds.
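The thresholds can be easier to retain once written as code. Below is a minimal base-R sketch using the commonly cited A1C and fasting plasma glucose cut points; check them against the NIDDK Tests & Diagnosis page as you read. The function names classify_a1c and classify_fpg are made up for this example, and a single value is a study aid here, not a diagnosis.

```r
# Hypothetical helper names; cut points are the commonly cited ones
# (A1C in percent, fasting plasma glucose in mg/dL).
classify_a1c <- function(a1c) {
  cut(a1c,
      breaks = c(-Inf, 5.7, 6.5, Inf),
      labels = c("normal", "prediabetes", "diabetes"),
      right  = FALSE)  # intervals are [low, high): 5.7 -> prediabetes, 6.5 -> diabetes
}

classify_fpg <- function(fpg) {
  cut(fpg,
      breaks = c(-Inf, 100, 126, Inf),
      labels = c("normal", "prediabetes", "diabetes"),
      right  = FALSE)  # 100 -> prediabetes, 126 -> diabetes
}

a1c_values <- c(5.2, 5.7, 6.4, 6.5)
fpg_values <- c(92, 100, 125, 126)
a1c_result <- classify_a1c(a1c_values)
fpg_result <- classify_fpg(fpg_values)

print(data.frame(a1c = a1c_values, a1c_class = a1c_result))
print(data.frame(fpg = fpg_values, fpg_class = fpg_result))
```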
- NIH glossary: Comorbidity
  - https://clinicalinfo.hiv.gov/en/glossary/comorbidity
  - 1-minute definition.
- NCI Comorbidity Index Overview
  - https://healthcaredelivery.cancer.gov/seermedicare/considerations/comorbidity.html
  - Focus on the history of the Charlson index and why ICD-coded data are used to derive comorbidity burden.
- CDC: ICD-10-CM
  - https://www.cdc.gov/nchs/icd/icd-10-cm/
  - Read the “What to know” section only.
- CMS web training: Diagnosis Coding, Using the ICD-10-CM
  - https://gov-mirror.org/www.cms.gov/Outreach-and-Education/MLN/WBT/MLN6447308-ICD-10-CM/icd10cm/index.html
  - Optional this week. Do just Lesson 1 (ICD-10 Basics) if time permits.
- AHRQ HCUP Software Tools Tutorial
  - https://hcup-us.ahrq.gov/tech_assist/software/508course.jsp
  - Optional. Useful later for seeing how ICD diagnosis codes are operationalized into comorbidity systems.
Why this matters for our project
- In your modeling dataset, many comorbidity variables are already present as patient-level binary features.
- Think of these as a compressed summary of disease burden and clinical history.
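To make "patient-level binary features" concrete, here is a hedged base-R sketch of how long-format diagnosis codes might be collapsed into one flag per patient. The table, the column names, and the three-entry code map are invented for illustration; real comorbidity systems such as Charlson or Elixhauser rely on much longer, validated code lists.

```r
# Long-format diagnoses: one row per patient-code pair (made-up data).
dx <- data.frame(
  patient_id = c(1, 1, 2, 3, 3, 3),
  icd10      = c("E11.9", "I10", "I10", "E11.65", "N18.3", "J45.909"),
  stringsAsFactors = FALSE
)

# Tiny illustrative map from ICD-10-CM category (first 3 characters) to a feature name.
code_map <- c(E11 = "t2d", I10 = "hypertension", N18 = "ckd")

# Look up each code's category; unmapped categories (e.g., J45) become NA.
dx$feature <- unname(code_map[substr(dx$icd10, 1, 3)])
dx_mapped  <- dx[!is.na(dx$feature), ]
dx_mapped$feature <- factor(dx_mapped$feature, levels = unname(code_map))

# Cross-tabulate to one row per patient, then reduce counts to 0/1 flags.
flags <- unclass(table(dx_mapped$patient_id, dx_mapped$feature)) > 0
flags <- as.data.frame(flags * 1L)
flags$n_comorbid <- rowSums(flags)  # crude burden: number of flagged conditions
print(flags)
```

One caveat of this sketch: a patient with no mapped codes disappears from the table entirely and would need to be added back with all-zero flags before modeling.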
- 3Blue1Brown: Neural networks topic page
- But what is a Neural Network?
- Gradient descent, how neural networks learn
  - https://www.3blue1brown.com/topics/neural-networks
  - Use the topic page above to access the lesson.
- What is backpropagation really doing?
  - Also from the same topic page.
- Backpropagation calculus
  - Optional for this week.
What to pay attention to
- input features,
- hidden layers,
- weights and biases,
- nonlinear activation,
- loss function,
- gradient descent,
- train vs validation performance.
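These pieces fit together in a few dozen lines. The sketch below trains a one-hidden-layer network on synthetic data with plain gradient descent in base R, so no deep-learning package is needed. Everything in it (the data, layer size, learning rate, and number of epochs) is an arbitrary illustrative choice; in practice a package such as torch or keras computes the gradients for you.

```r
set.seed(1)

# Synthetic binary-classification data: 2 input features, XOR-like label.
n <- 400
X <- matrix(rnorm(n * 2), ncol = 2)
y <- as.numeric(X[, 1] * X[, 2] > 0)

# Train / validation split.
train_idx <- 1:300
X_tr <- X[train_idx, ];  y_tr <- y[train_idx]
X_va <- X[-train_idx, ]; y_va <- y[-train_idx]

sigmoid <- function(z) 1 / (1 + exp(-z))

# Weights and biases for: input (2) -> hidden (8, tanh) -> output (1, sigmoid).
h  <- 8
W1 <- matrix(rnorm(2 * h, sd = 0.5), nrow = 2); b1 <- rep(0, h)
W2 <- matrix(rnorm(h, sd = 0.5), nrow = h);     b2 <- 0

forward <- function(X) {
  A1 <- tanh(sweep(X %*% W1, 2, b1, "+"))  # hidden layer with nonlinear activation
  p  <- sigmoid(A1 %*% W2 + b2)            # predicted probability
  list(A1 = A1, p = as.vector(p))
}

# Loss function: binary cross-entropy (clipped to avoid log(0)).
bce <- function(y, p) {
  p <- pmin(pmax(p, 1e-8), 1 - 1e-8)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

lr <- 0.3
loss_start <- bce(y_tr, forward(X_tr)$p)
for (epoch in 1:3000) {
  out <- forward(X_tr)
  m   <- nrow(X_tr)
  # Backpropagation: gradients of the mean loss w.r.t. each parameter.
  dZ2 <- matrix(out$p - y_tr, ncol = 1) / m   # (m x 1)
  dW2 <- t(out$A1) %*% dZ2                    # (h x 1)
  db2 <- sum(dZ2)
  dZ1 <- (dZ2 %*% t(W2)) * (1 - out$A1^2)     # tanh'(z) = 1 - tanh(z)^2
  dW1 <- t(X_tr) %*% dZ1                      # (2 x h)
  db1 <- colSums(dZ1)
  # Gradient-descent update.
  W1 <- W1 - lr * dW1; b1 <- b1 - lr * db1
  W2 <- W2 - lr * dW2; b2 <- b2 - lr * db2
}

# Compare train vs validation performance.
acc  <- function(y, p) mean((p > 0.5) == y)
p_tr <- forward(X_tr)$p
p_va <- forward(X_va)$p
train_loss_end <- bce(y_tr, p_tr)
cat(sprintf("train: loss %.3f (start %.3f), accuracy %.2f\n",
            train_loss_end, loss_start, acc(y_tr, p_tr)))
cat(sprintf("valid: loss %.3f, accuracy %.2f\n", bce(y_va, p_va), acc(y_va, p_va)))
```

A training loss that keeps falling while the validation loss stalls or rises is the usual sign of overfitting.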
Pick one main path and skim the other.
- torch for R: Start here
- Guess the correlation
  - https://torch.mlverse.org/start/guess_the_correlation/
  - End-to-end example with data loading, model definition, training, and evaluation.
- torch technical tutorial series
  - https://torch.mlverse.org/technical/
  - Great for tensors, autograd, modules, and optimizers.
- TensorFlow for R tutorials hub
- Basic Regression
  - https://tensorflow.rstudio.com/tutorials/keras/regression
  - Good first example for tabular features.
- Basic Image Classification
  - https://tensorflow.rstudio.com/tutorials/keras/classification.html
  - Good for learning the standard training pattern.
- Overfit and underfit
  - https://tensorflow.rstudio.com/tutorials/keras/overfit_and_underfit.html
  - Important for dropout, L2 regularization, early stopping, and why bigger is not always better.
- Sequential model guide
  - https://tensorflow.rstudio.com/guides/keras/sequential_model.html
  - Good reference for how to build a simple stack of dense layers.
- Training & evaluation with built-in methods
  - https://tensorflow.rstudio.com/guides/keras/training_with_built_in_methods.html
  - Useful when you start changing callbacks and learning-rate schedules.
Do not spend much time on:
- obscure clinical edge cases,
- memorizing code lists,
- fancy deep architectures,
- squeezing out benchmark performance.
This week is about building intuition and vocabulary.
When you start EDA, check these first:
- sample size and class balance (i.e., prevalence of T2D),
- age distribution,
- counts by gender / race / ethnicity,
- prevalence of each comorbidity feature,
- missingness or odd category values,
- duplicate rows or identifier issues,
- whether any text-like medication column should be dropped for the first model.
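A hedged base-R sketch of these first checks on a small made-up table. The data frame ehr and all of its column names (patient_id, t2d, age, gender, hypertension, med_text) are placeholders, not the actual project schema.

```r
# Made-up stand-in for the modeling dataset; all names and values are placeholders.
ehr <- data.frame(
  patient_id   = c(101, 102, 103, 103, 105, 106),
  t2d          = c(1, 0, 0, 0, 1, 0),
  age          = c(54, 61, NA, NA, 47, 230),        # 230 is a deliberately odd value
  gender       = c("F", "M", "F", "F", "m", "M"),   # note the inconsistent "m"
  hypertension = c(1, 1, 0, 0, 1, 0),
  med_text     = c("metformin 500mg", "", "", "", "insulin glargine", ""),
  stringsAsFactors = FALSE
)

print(nrow(ehr))                                  # sample size
print(prop.table(table(ehr$t2d)))                 # class balance (T2D prevalence)
print(summary(ehr$age))                           # age distribution, NAs, outliers
print(table(ehr$gender, useNA = "ifany"))         # category counts; exposes odd levels
print(colMeans(ehr[, c("t2d", "hypertension")]))  # prevalence of each binary feature

n_missing  <- colSums(is.na(ehr))                 # missing values per column
n_dup_ids  <- sum(duplicated(ehr$patient_id))     # repeated identifiers
n_dup_rows <- sum(duplicated(ehr))                # fully duplicated rows
print(n_missing)
print(c(dup_ids = n_dup_ids, dup_rows = n_dup_rows))

# Drop the free-text medication column for a first model.
model_df <- ehr[, setdiff(names(ehr), "med_text")]
```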
For this rotation, demographics are not the hard concept. They are mostly an EDA / encoding issue. The harder concept is how demographics, comorbidities, and phenotype rules interact.
These R packages are not required for Week 1, but are useful later:
- arrow for parquet files
- tidyverse for data wrangling and visualization
- recipes / rsample / yardstick for preprocessing and evaluation
- torch and luz for neural networks in R
- keras3 for simple deep learning in R
- comorbidity for computing Charlson / Elixhauser scores from ICD data