Analysis pipeline for modeling and interpreting treatment-resistant late-life depression (TRLLD) using UK Biobank data.

fretbret/ukb-trlld

UK Biobank TRLLD Analysis

This repository contains R and Python scripts used to prepare, model, and interpret the Treatment-Resistant Late-Life Depression (TRLLD) phenotype using UK Biobank data.

The workflow combines phenotype curation, propensity score matching, covariate scoring and imputation, principal component analysis, nested cross-validation modeling, and SHapley Additive exPlanations (SHAP) analysis to identify biomarkers and key predictors of TRLLD.


Repository Structure

R Scripts (/R_scripts)

File                         Purpose
01_stratification.R          Performs participant stratification and phenotype QC.
02_psm.R                     Computes propensity scores and baseline matching variables.
03_covariate_imputation.R    Handles missing data and prepares merged covariate dataset.

Python Scripts (/python_scripts)

File                         Purpose
01_pca.py                    Generates and integrates principal components for biomarker data.
02_preprocessing.py          Cleans dataset, drops higher PCs, defines covariates and biomarkers.
03_model_nested_cv.py        Runs Elastic Net logistic regression with nested cross-validation to evaluate predictive performance.
04_feature_importance.py     Computes average model coefficients and SHAP-based feature importance summaries.
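
The two importance summaries described for 04_feature_importance.py can be illustrated with a minimal sketch. It is not the repository's code: the function names, the per-fold coefficients, and the synthetic data are all hypothetical. It relies on the closed form for linear models, where (treating features as independent) the SHAP value of feature j for one row is coef_j * (x_j - mean_j) in log-odds space. The actual script presumably uses the shap package listed under Environment.

```python
import numpy as np

def average_coefficients(fold_coefs):
    """Average coefficient vectors collected from several fitted models."""
    return np.mean(np.vstack(fold_coefs), axis=0)

def linear_shap_values(coef, X, background):
    """Closed-form SHAP values for a linear model, assuming independent features.

    coef:       (n_features,) coefficients in log-odds space
    X:          (n_samples, n_features) rows to explain
    background: (n_background, n_features) reference data, e.g. the training set
    """
    coef = np.asarray(coef, dtype=float)
    mean = np.asarray(background, dtype=float).mean(axis=0)
    return (np.asarray(X, dtype=float) - mean) * coef

def mean_abs_importance(shap_values):
    """Global importance: mean absolute SHAP value per feature."""
    return np.abs(shap_values).mean(axis=0)

# Synthetic stand-ins (hypothetical): three features, two outer-fold models.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
fold_coefs = [np.array([1.0, 0.0, -0.5]), np.array([0.8, 0.0, -0.7])]

avg_coef = average_coefficients(fold_coefs)
phi = linear_shap_values(avg_coef, X, background=X)
importance = mean_abs_importance(phi)
print("average coefficients:", avg_coef)
print("mean |SHAP| per feature:", importance)
```

A feature whose coefficient Elastic Net shrinks to zero in every fold gets zero importance under both measures, which is why the two summaries tend to agree on which features drop out.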

Execution Order

  1. Run the R scripts (01–03) in order to build the cleaned, merged input dataset.
    This writes combined_pca_covariates.tsv to your local data/processed/ folder.
  2. Run Python scripts (01–04) sequentially to perform data preparation, model training, and interpretation.
  3. Model results (e.g., results/all_models.pkl, results/feature_importance.csv, results/shap_summary.csv) are saved locally when each script completes.
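
The order above could be scripted along the following lines. This is a sketch under assumptions, not something shipped with the repository: it assumes the folders are named R_scripts and python_scripts as in the headings above, that Rscript is on the PATH, and that each script runs non-interactively without arguments. The helper names build_commands and run_pipeline are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the Repository Structure tables.
R_SCRIPTS = ["01_stratification.R", "02_psm.R", "03_covariate_imputation.R"]
PY_SCRIPTS = [
    "01_pca.py",
    "02_preprocessing.py",
    "03_model_nested_cv.py",
    "04_feature_importance.py",
]

def build_commands(repo_root="."):
    """Return the commands in execution order: R scripts first, then Python."""
    root = Path(repo_root)
    commands = [["Rscript", str(root / "R_scripts" / name)] for name in R_SCRIPTS]
    commands += [
        [sys.executable, str(root / "python_scripts" / name)] for name in PY_SCRIPTS
    ]
    return commands

def run_pipeline(repo_root=".", dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for command in build_commands(repo_root):
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(command, check=True)

run_pipeline(dry_run=True)
```

With dry_run=True the function only prints the seven commands, which is a cheap way to confirm paths before starting a long run.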

Environment

R ≥ 4.3
Python ≥ 3.9

Python packages: pandas numpy scikit-learn matplotlib seaborn scipy shap
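
Before running the Python scripts, a quick check like the one below can confirm the interpreter version and the listed packages. It is a hypothetical helper, not part of the repository. Note that scikit-learn is imported as sklearn; the other packages import under their install names.

```python
import importlib.util
import sys

# Import names for the packages listed above (scikit-learn imports as sklearn).
REQUIRED_MODULES = [
    "pandas", "numpy", "sklearn", "matplotlib", "seaborn", "scipy", "shap",
]

def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

python_ok = sys.version_info >= (3, 9)
print("Python >= 3.9:", python_ok)
print("Missing packages:", missing_packages() or "none")
```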


Key Features

  • Nested 5-fold cross-validation for robust model performance estimates
  • Elastic Net regularization for feature selection and multicollinearity control
  • Coefficient- and SHAP-based feature importance interpretation
  • Modular design (each step reproducible independently)
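
The nested cross-validation and Elastic Net features listed above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the code in 03_model_nested_cv.py. The hyperparameter grid, the ROC AUC scoring, the standardization step, and the use of 5 folds in the inner loop are all assumptions; the README only states that the nested procedure is 5-fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the biomarker and covariate matrix (hypothetical).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Elastic Net logistic regression; the saga solver supports this penalty.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
    ),
)

# Assumed grid: regularization strength and L1/L2 mixing ratio.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0],
    "logisticregression__l1_ratio": [0.2, 0.5, 0.8],
}

# The inner loop tunes hyperparameters; the outer loop estimates performance
# on folds that the tuning step never saw.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print("Outer-fold AUCs:", np.round(outer_scores, 3))
print("Mean AUC:", round(float(outer_scores.mean()), 3))
```

Because the scaler sits inside the pipeline, it is refit on each training split, so no information from held-out folds leaks into preprocessing or tuning.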

Notes

  • UK Biobank data are not included due to access restrictions.
  • Results directories and figures are generated locally when scripts are executed.
  • To reproduce, users must supply appropriately preprocessed UK Biobank data.

Ayesha Syeda
Krembil Centre for Neuroinformatics – Whole Person & Population Modelling Lab
