UK Biobank TRLLD Analysis

This repository contains R and Python scripts used to prepare, model, and interpret the Treatment-Resistant Late-Life Depression (TRLLD) phenotype using UK Biobank data.

The workflow combines phenotype curation, propensity score matching, covariate scoring and imputation, principal component analysis, nested cross-validation modeling, and SHapley Additive exPlanations (SHAP) analysis to identify biomarkers and key predictors of TRLLD.

Repository Structure

R Scripts (`/R_scripts`)

| File | Purpose |
| --- | --- |
| `01_stratification.R` | Performs participant stratification and phenotype QC. |
| `02_psm.R` | Computes propensity scores and baseline matching variables. |
| `03_covariate_imputation.R` | Handles missing data and prepares merged covariate dataset. |
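
For readers unfamiliar with propensity score matching, the idea behind `02_psm.R` can be illustrated with a minimal sketch. It is written in Python for consistency with the other examples here and is not taken from the R script: the greedy 1:1 nearest-neighbour strategy, the `caliper` value, the function name `greedy_match`, and the toy scores are all assumptions for illustration.

```python
def greedy_match(case_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement.

    case_scores / control_scores: dict mapping participant id -> propensity score.
    Returns a dict case_id -> control_id. Cases with no control within the
    caliper are left unmatched.
    """
    available = dict(control_scores)  # controls not yet used
    matches = {}
    # Process cases in a fixed order so the result is deterministic
    for case_id in sorted(case_scores):
        if not available:
            break
        score = case_scores[case_id]
        # Closest remaining control; ties are broken by control id
        best_id = min(available, key=lambda cid: (abs(available[cid] - score), cid))
        if abs(available[best_id] - score) <= caliper:
            matches[case_id] = best_id
            del available[best_id]
    return matches


# Toy propensity scores (hypothetical ids and values)
cases = {"c1": 0.30, "c2": 0.62, "c3": 0.95}
controls = {"k1": 0.28, "k2": 0.60, "k3": 0.33, "k4": 0.70}
pairs = greedy_match(cases, controls, caliper=0.05)
print(pairs)
```

Greedy matching depends on the order in which cases are processed, so a real analysis may prefer optimal matching or a dedicated matching package.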

Python Scripts (`/python_scripts`)

| File | Purpose |
| --- | --- |
| `01_pca.py` | Generates and integrates principal components for biomarker data. |
| `02_preprocessing.py` | Cleans dataset, drops higher PCs, defines covariates and biomarkers. |
| `03_model_nested_cv.py` | Runs Elastic Net logistic regression with nested cross-validation to evaluate predictive performance. |
| `04_feature_importance.py` | Computes average model coefficients and SHAP-based feature importance summaries. |
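
As a rough illustration of what nested cross-validation with Elastic Net logistic regression looks like in scikit-learn, here is a minimal sketch. It is not the code in `03_model_nested_cv.py`: the synthetic data from `make_classification`, the hyperparameter grid, the ROC AUC scoring, the 5-fold inner loop, and the standardization step are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (UK Biobank data are not included)
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Elastic Net logistic regression; "saga" is the scikit-learn solver that supports this penalty
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000, random_state=0
    ),
)

# Hypothetical hyperparameter grid, kept small so the example runs quickly
param_grid = {
    "logisticregression__C": [0.1, 1.0],
    "logisticregression__l1_ratio": [0.2, 0.8],
}

# Inner loop tunes hyperparameters; outer loop estimates performance on held-out folds
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="roc_auc")
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Outer-fold AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Because the grid search is refit inside every outer fold, the outer-fold scores are not influenced by hyperparameter tuning on the held-out data.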

Execution Order

1. Run the R scripts (01–03) to build the cleaned and merged input dataset. This writes `combined_pca_covariates.tsv` to your local `data/processed/` folder.
2. Run the Python scripts (01–04) sequentially to perform data preparation, model training, and interpretation.

Model results (e.g., `results/all_models.pkl`, `results/feature_importance.csv`, `results/shap_summary.csv`) are saved locally when each script completes.
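
The run order above could be scripted with a small driver such as the sketch below. It assumes the scripts take no command-line arguments, are launched from the repository root, and that `Rscript` and `python` are on the `PATH`; none of this is stated in the repository, and the helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess
from pathlib import Path

R_SCRIPTS = ["01_stratification.R", "02_psm.R", "03_covariate_imputation.R"]
PY_SCRIPTS = [
    "01_pca.py",
    "02_preprocessing.py",
    "03_model_nested_cv.py",
    "04_feature_importance.py",
]


def build_commands(repo_root="."):
    """Return the commands for the full pipeline in execution order."""
    root = Path(repo_root)
    commands = [["Rscript", str(root / "R_scripts" / name)] for name in R_SCRIPTS]
    commands += [["python", str(root / "python_scripts" / name)] for name in PY_SCRIPTS]
    return commands


def run_pipeline(repo_root=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands(repo_root):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script
            subprocess.run(cmd, check=True)


# Dry run: only lists the commands, nothing is executed
run_pipeline(dry_run=True)
```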

Environment

R ≥ 4.3
Python ≥ 3.9

Python packages: pandas numpy scikit-learn matplotlib seaborn scipy shap
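
A quick way to confirm that the Python side of the environment is in place is sketched below. The helper `missing_packages` is hypothetical and only checks that each package can be found, not its version. Note that scikit-learn is imported as `sklearn`.

```python
import sys
from importlib.util import find_spec

# pip name -> import name (only scikit-learn differs)
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "scikit-learn": "sklearn",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scipy": "scipy",
    "shap": "shap",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


if sys.version_info < (3, 9):
    print("Python 3.9 or newer is required")

missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```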

Key Features

Nested 5-fold cross-validation for robust model performance estimates
Elastic Net regularization for feature selection and multicollinearity control
Coefficient- and SHAP-based feature importance interpretation
Modular design (each step reproducible independently)
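
To show how coefficient-based and SHAP-based importance relate for a linear model, here is a minimal NumPy sketch. It does not use the `shap` package and is not the code in `04_feature_importance.py`. It relies on a known result: for a linear model with features treated as independent, the SHAP value of feature j for one sample is the coefficient times the feature's deviation from its background mean, on the log-odds scale for logistic regression. The coefficients, intercept, and data below are made up for illustration.

```python
import numpy as np


def linear_shap(coef, X, background):
    """SHAP values for a linear model, assuming independent features.

    phi[i, j] = coef[j] * (X[i, j] - mean of feature j in the background data)
    """
    coef = np.asarray(coef, dtype=float)
    X = np.asarray(X, dtype=float)
    background_mean = np.asarray(background, dtype=float).mean(axis=0)
    return (X - background_mean) * coef


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # stand-in for standardized features
coef = np.array([1.5, 0.0, -0.7, 0.2])  # hypothetical fitted coefficients
intercept = -0.3                        # hypothetical fitted intercept

phi = linear_shap(coef, X, background=X)
base_value = intercept + X.mean(axis=0) @ coef  # expected model output (log-odds)
log_odds = intercept + X @ coef                 # model output per sample

# Local accuracy: base value plus the SHAP values reproduces each prediction
assert np.allclose(base_value + phi.sum(axis=1), log_odds)

# Global importance: mean absolute SHAP value per feature
importance = np.abs(phi).mean(axis=0)
print(importance)
```

A feature with a zero coefficient, as Elastic Net can produce, gets a SHAP value of exactly zero for every sample, which is why the two importance views tend to agree on which features were dropped.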

Notes

UK Biobank data are not included due to access restrictions.
Results directories and figures are generated locally when scripts are executed.
To reproduce, users must supply appropriately preprocessed UK Biobank data.

Ayesha Syeda
Krembil Centre for Neuroinformatics – Whole Person & Population Modelling Lab

fretbret/ukb-trlld

Folders and files

Latest commit

History

Repository files navigation

UK Biobank TRLLD Analysis

Repository Structure

R Scripts (/R_scripts)

Python Scripts (/python_scripts)

Execution Order

Environment

Key Features

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

R Scripts (`/R_scripts`)

Python Scripts (`/python_scripts`)

Packages