Electronic Health Record (EHR) data is a rapidly growing source of unstructured biomedical data. This data is rich in clinical detail and often captures a patient’s phenotype. In a clinical context, phenotype refers to the medical conditions, diseases, and disorders of a patient. EHR text can describe these in greater detail than structured encodings such as the International Classification of Diseases (ICD). Traditional methods for extracting phenotypes from this data typically rely on manual review or on rule-based expert systems. Both approaches are time-intensive, depend heavily on human expertise, and scale poorly. This project proposes an automated approach to identifying phenotypes in EHR data using machine learning.
** Data files have been excluded due to size and security constraints. Please contact renzeer@berkeley.edu to request access to the data files. **