This repository contains R and Python scripts used to prepare, model, and interpret the Treatment-Resistant Late-Life Depression (TRLLD) phenotype using UK Biobank data.
The workflow integrates phenotype curation, propensity score matching, covariate scoring and imputation, principal component analysis, nested cross-validation modeling, and SHapley Additive exPlanations (SHAP) analysis to identify biomarkers and key predictors of TRLLD.

R scripts:

| File | Purpose |
|---|---|
| 01_stratification.R | Performs participant stratification and phenotype QC. |
| 02_psm.R | Computes propensity scores and baseline matching variables. |
| 03_covariate_imputation.R | Handles missing data and prepares the merged covariate dataset. |

Python scripts:

| File | Purpose |
|---|---|
| 01_pca.py | Generates and integrates principal components for biomarker data. |
| 02_preprocessing.py | Cleans the dataset, drops higher PCs, and defines covariates and biomarkers. |
| 03_model_nested_cv.py | Runs Elastic Net logistic regression with nested cross-validation to evaluate predictive performance. |
| 04_feature_importance.py | Computes average model coefficients and SHAP-based feature importance summaries. |
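
The PCA step in `01_pca.py` and the dropping of higher PCs in `02_preprocessing.py` can be illustrated with a minimal sketch. This is not the repository's code: the function name `pca_scores`, the synthetic data, and the choice to keep three components are assumptions made only for illustration.

```python
import numpy as np


def pca_scores(X, n_components):
    """Return scores and explained-variance ratios for the leading PCs.

    Hypothetical helper, not taken from the repository.
    """
    # Standardize each column so biomarkers on different scales contribute equally.
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized matrix: the rows of Vt are the principal axes,
    # ordered from largest to smallest singular value.
    _, S, Vt = np.linalg.svd(Xz, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    # Project onto the leading axes; the remaining (higher) PCs are dropped.
    scores = Xz @ Vt[:n_components].T
    return scores, explained[:n_components]


# Synthetic stand-in for a biomarker matrix (UK Biobank data are not included).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)  # make two columns correlated

scores, ratio = pca_scores(X, n_components=3)
print(scores.shape, ratio.round(3))
```

The retained score columns are mutually uncorrelated, which is one reason PCs are convenient inputs for a regularized regression model.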
- Run R scripts (01–03) to build the cleaned and merged input dataset → outputs `combined_pca_covariates.tsv` in your local `data/processed/` folder.
- Run Python scripts (01–04) sequentially to perform data preparation, model training, and interpretation.
- Model results (e.g., `results/all_models.pkl`, `results/feature_importance.csv`, `results/shap_summary.csv`) are saved locally when each script completes.
- R ≥ 4.3
- Python ≥ 3.9
- Python packages: `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `scipy`, `shap`
- Nested 5-fold cross-validation for robust model performance estimates
- Elastic Net regularization for feature selection and multicollinearity control
- Coefficient- and SHAP-based feature importance interpretation
- Modular design (each step reproducible independently)
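
As a rough illustration of how nested cross-validation combines with Elastic Net logistic regression, the sketch below uses scikit-learn on synthetic data. It is not the code in `03_model_nested_cv.py`: the hyperparameter grid, the ROC AUC scoring metric, the use of 5 folds in both the inner and outer loops, and the synthetic dataset are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real covariate and biomarker matrix.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which avoids leaking information from the held-out fold.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
)

# Assumed grid: C is the inverse regularization strength,
# l1_ratio mixes the L1 and L2 penalties.
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1.0],
    "logisticregression__l1_ratio": [0.2, 0.5, 0.8],
}

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search on each outer training split.
search = GridSearchCV(model, param_grid, cv=inner_cv, scoring="roc_auc")

# Outer loop: score the tuned model on folds it never saw during tuning.
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print("Outer-fold AUC:", np.round(outer_auc, 3))
print("Mean AUC:", round(float(outer_auc.mean()), 3))
```

Because tuning happens only inside each outer training split, the outer-fold scores estimate the performance of the whole procedure (tuning plus fitting) rather than of one hand-picked model.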
- UK Biobank data are not included due to access restrictions.
- Results directories and figures are generated locally when scripts are executed.
- To reproduce, users must supply appropriately preprocessed UK Biobank data.
Ayesha Syeda
Krembil Centre for Neuroinformatics – Whole Person & Population Modelling Lab