This repository contains a comprehensive analytical pipeline for predicting pediatric heart transplant graft loss using data from the Pediatric Heart Transplant Society (PHTS). The workflow replicates and extends the methodology from Wisotzkey et al. (2023), incorporating multiple survival modeling approaches with robust feature selection and evaluation.
The PHTS Graft Loss Prediction Pipeline is a complete end-to-end analytical framework for:
- Data preprocessing and feature engineering from PHTS registry data
- Feature selection using multiple methods (RSF, CatBoost, AORSF)
- Survival model fitting with multiple algorithms
- Model evaluation using dual C-index calculations (time-dependent and time-independent)
- Interactive risk calculator with web-based dashboard and causal analysis
- Comprehensive reporting with tables, figures, and documentation
```mermaid
graph TB
    ROOT[phts] --> SCRIPTS[scripts/]
    ROOT --> GL[graft-loss]
    ROOT --> CI[concordance_index]
    ROOT --> EDA[eda]
    ROOT --> LMTP[lmtp-workshop]
    ROOT --> DL[survival_analysis_deep_learning_asa]
    SCRIPTS --> SCRIPTS_R[R/ - R scripts]
    SCRIPTS --> SCRIPTS_PY[py/ - Python scripts]
    SCRIPTS --> SCRIPTS_BASH[bash/ - Bash scripts]
    GL --> GL_feat[feature_importance: Global MC-CV]
    GL_feat --> GL_nb[graft_loss_feature_importance_20_MC_CV.ipynb]
    GL_feat --> GL_docs[MC-CV READMEs + outputs]
    GL --> GL_cohort[cohort_analysis: Clinical Cohort Analysis]
    GL_cohort --> GL_cohort_nb[graft_loss_clinical_cohort_analysis.ipynb]
    GL_cohort --> GL_cohort_outputs[cohort outputs: survival + classification]
    GL_cohort --> GL_calc[calculator: Interactive Risk Calculator]
    GL_calc --> GL_calc_workflow[calculator_workflow.ipynb<br/>Dual Model Training]
    GL_calc --> GL_calc_train[train_python_models.py<br/>Parallel MC-CV Training]
    GL_calc --> GL_calc_shap[run_shap_ffa_workflow.py<br/>SHAP + FFA on Test Set]
    GL_calc --> GL_calc_models[models/<br/>Baseline + Extended]
    GL_calc_models --> GL_calc_base[Combined_base/<br/>Base Features Only]
    GL_calc_models --> GL_calc_enhanced[Combined_enhanced/<br/>Base + Recommended]
    GL_calc --> GL_calc_dash[risk_dashboard: Web Dashboard]
    GL_calc_dash --> GL_calc_html[phts_dashboard.html<br/>Dual Model Tabs]
    GL_calc_dash --> GL_calc_lambda[Lambda Function + Docker]
    GL_calc_dash --> GL_calc_deploy[AWS: S3 + Lambda + API Gateway]
```
File Organization:
- Scripts: All executable scripts are in `scripts/`, organized by language (`R/`, `py/`, `bash/`)
- Notebooks: Remain in their respective analysis directories:
  - `graft-loss/feature_importance/` - Global feature importance analysis (MC-CV)
  - `graft-loss/cohort_analysis/` - Clinical cohort analysis with dynamic survival/classification modes (MC-CV)
- Documentation: Centralized in the `docs/` folder, with root READMEs in each workflow directory
- EC2 Compatibility: Structure matches the EC2 file layout for seamless deployment
```mermaid
graph TB
    A[Data Preparation] --> B[Analysis Pipelines]
    B --> C1[1. Global Feature Importance]
    B --> C2[2. Clinical Cohort Analysis]
    B --> C3[3. Interactive Risk Calculator]
    C1 --> C1a[MC-CV: RSF/CatBoost/AORSF]
    C1 --> C1b[3 Time Periods]
    C1 --> C1c[Global Feature Rankings]
    C2 --> C2a[MC-CV: Survival/Classification]
    C2 --> C2b[CHD vs MyoCardio]
    C2 --> C2c[Modifiable Clinical Features]
    C2 --> C2d[Dynamic Mode Selection]
    C3 --> C3a[Web Dashboard: S3 Hosted]
    C3 --> C3b[Lambda API: Docker Container]
    C3 --> C3c[Risk Prediction: Real-time]
    C3 --> C3d[Causal Analysis: Interactive]
    C3 --> C3e[SHAP + FFA: Feature Attribution]
```
| Pipeline | Location | Type | Methods | Key Features |
|---|---|---|---|---|
| 1. Global Feature Importance | `graft-loss/feature_importance/` | MC-CV Notebook | RSF, CatBoost, AORSF | 3 time periods, 100-1000 splits, global feature rankings |
| 2. Clinical Cohort Analysis | `graft-loss/cohort_analysis/` | MC-CV Notebook (Dynamic) | Survival: RSF, AORSF, CatBoost-Cox, XGBoost-Cox; Classification: CatBoost, CatBoost RF, Traditional RF, XGBoost, XGBoost RF | CHD vs MyoCardio, modifiable clinical features |
| 3. Interactive Risk Calculator | `graft-loss/cohort_analysis/calculator/` | Web Dashboard + Lambda API | CatBoost-Cox, XGBoost-Cox, XGBoost-Cox RF | Dual models (Baseline + Extended), parallel training, real-time risk prediction, test set causal analysis, SHAP/FFA attribution, AWS deployment |
Comprehensive Monte Carlo cross-validation feature-importance workflow replicating the original Wisotzkey study and extending it:
- Notebook: `graft_loss_feature_importance_20_MC_CV.ipynb`
  - Runs RSF, CatBoost, and AORSF with stratified 75/25 train/test MC-CV splits.
  - Supports 100-split development runs and 1000-split publication-grade runs.
  - Analyzes three time periods: Original (2010-2019), Full (2010-2024), Full No COVID (2010-2024 excluding 2020-2023).
  - Extracts the top 20 features per method per period.
  - Calculates the C-index with a 95% CI across MC-CV splits (see the sketch after this list).
- Scripts (in `scripts/R/`):
  - `create_visualizations.R`: Creates feature importance heatmaps, C-index heatmaps, and bar charts
  - `replicate_20_features_MC_CV.R`: Monte Carlo cross-validation script
  - `check_variables.R`: Variable validation
  - `check_cpbypass_iqr.R`: CPBYPASS statistics
- Outputs (`graft-loss/feature_importance/outputs/`):
  - `plots/` - Feature importance visualizations
  - `cindex_table.csv` - C-index table with confidence intervals
  - `top_20_features_*.csv` - Top 20 features per method and period
- Documentation:
  - See `graft-loss/feature_importance/README.md` for a quick start
  - Detailed docs in `docs/feature_importance/`
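For reference, the C-index summary above is just a mean and percentile interval over the per-split test-set C-indexes. A minimal Python sketch (the notebook itself is R; `lifelines` and the data layout here are illustrative assumptions):

```python
import numpy as np
from lifelines.utils import concordance_index

def summarize_mc_cv_cindex(split_predictions):
    """Mean Harrell's C and percentile 95% CI across MC-CV test splits.

    `split_predictions` is a list of (event_times, risk_scores, event_observed)
    tuples, one per split -- an illustrative layout, not the notebook's.
    """
    # concordance_index expects higher scores to mean longer survival,
    # so risk scores are negated before scoring.
    c_values = np.array([
        concordance_index(times, -risks, events)
        for times, risks, events in split_predictions
    ])
    return {
        "c_mean": float(c_values.mean()),
        "ci_95": (float(np.percentile(c_values, 2.5)),
                  float(np.percentile(c_values, 97.5))),
    }
```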
Dynamic Analysis Pipeline supporting both survival analysis and event classification with MC-CV:
Key Innovation: Cohort-specific analysis using modifiable clinical features for two distinct etiologic cohorts (CHD vs MyoCardio), enabling targeted clinical interventions.
- Notebook: `graft_loss_clinical_cohort_analysis.ipynb`
  - Mode Selection: Set `ANALYSIS_MODE <- "survival"` or `"classification"` at the top of the notebook
  - Defines two etiologic cohorts:
    - CHD: `primary_etiology == "Congenital HD"`
    - MyoCardio: `primary_etiology %in% c("Cardiomyopathy", "Myocarditis")`
  - Restricts predictors to a curated set of modifiable clinical features (renal, liver, nutrition, respiratory, support devices, immunology).
  - Survival Analysis Mode (`ANALYSIS_MODE = "survival"`):
    - Runs within-cohort MC-CV (75/25 train/test splits, stratified by outcome) with:
      - RSF (ranger)
      - AORSF
      - CatBoost-Cox
      - XGBoost-Cox (boosting)
      - XGBoost-Cox RF mode (many trees via `num_parallel_tree`)
    - Selects the best-C-index model per cohort and reports its top clinical features
    - Evaluation: C-index with 95% CI across MC-CV splits
  - Event Classification Mode (`ANALYSIS_MODE = "classification"`):
    - Runs within-cohort MC-CV (75/25 train/test splits, stratified by outcome) with:
      - CatBoost (classification)
      - CatBoost RF (classification)
      - Traditional RF (classification)
      - XGBoost (classification)
      - XGBoost RF (classification)
    - Target: Binary classification at 1 year (event by 1 year vs no event with follow-up >= 1 year; see the sketch after this list)
    - Evaluation: AUC, Brier Score, Accuracy, Precision, Recall, F1 with 95% CI across MC-CV splits
    - Sources visualization scripts from `scripts/R/create_visualizations_cohort.R`
- Scripts (in `scripts/R/`):
  - `create_visualizations_cohort.R`: Creates cohort-specific visualizations, including Sankey diagrams
  - `classification_helpers.R`: Helper functions for classification analysis
- Outputs (`graft-loss/cohort_analysis/outputs/`):
  - Survival Mode:
    - `cohort_model_cindex_mc_cv_modifiable_clinical.csv` - C-index summary per cohort × model
    - `best_clinical_features_by_cohort_mc_cv.csv` - Top modifiable clinical features for the best model in each cohort
    - `plots/` - Visualizations (heatmaps, bar charts, Sankey diagrams)
  - Classification Mode:
    - `classification_mc_cv/cohort_classification_metrics_mc_cv.csv` - Classification metrics (AUC, Brier, Accuracy, Precision, Recall, F1) per cohort × model
- Documentation:
  - See `graft-loss/cohort_analysis/README.md` for a quick start
  - Detailed docs in `docs/cohort_analysis/`
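As a point of reference, the 1-year classification target described above can be derived from a survival time/status pair roughly as follows; column names are hypothetical and the notebook's R implementation may differ:

```python
import pandas as pd

def one_year_event_labels(df: pd.DataFrame,
                          time_col: str = "time_years",
                          status_col: str = "status") -> pd.DataFrame:
    """Binary 1-year target: event by 1 year vs no event with >= 1 year follow-up.

    Patients censored before 1 year are dropped because their 1-year outcome
    is unknown. Column names are illustrative.
    """
    event_by_1yr = (df[status_col] == 1) & (df[time_col] <= 1.0)
    event_free_1yr = (df[time_col] >= 1.0) & ~event_by_1yr
    labeled = df.loc[event_by_1yr | event_free_1yr].copy()
    labeled["event_1yr"] = event_by_1yr.loc[labeled.index].astype(int)
    return labeled
```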
Production-ready web-based risk calculator with interactive dashboard and causal analysis capabilities:
- Dual Model Architecture:
  - Baseline Model (`Combined_base`): Uses the base calculator features only (~104 features)
  - Extended Model (`Combined_enhanced`): Uses base features plus recommended additional features (~120 features)
  - Both models are trained on the Combined cohort, which pools all etiologies (CHD, Cardiomyopathy, Myocarditis)
  - Users can compare predictions from both models side by side in the dashboard
- Web Dashboard (`risk_dashboard/phts_dashboard.html`):
  - Baseline Model Tab: Real-time risk prediction using base calculator features
  - Extended Model Tab: Real-time risk prediction using base + recommended features
  - Causal Analysis Tab: Interactive exploration of causal factors with dynamic visualizations
  - Documentation Tab: Comprehensive documentation and model details
  - Hosted on AWS S3: `s3://jerome-dixon.io/uva/phts-risk-calculator/`
- Lambda API (`risk_dashboard/phts_lambda_function.py`):
  - Serverless backend deployed as a Docker container on AWS Lambda
  - REST API endpoints: `/risk`, `/causal`, `/metadata`, `/model_features`
  - Model caching for fast inference
  - Risk score normalization (percentile-based, 0-100 scale; see the normalization sketch after this list)
  - Supports both baseline and extended models via model variant selection
- Model Training (`train_python_models.py`):
  - Parallel Processing: Both the baseline and enhanced models parallelize MC-CV training across all CPUs minus one
  - Trains CatBoost-Cox, XGBoost-Cox, and XGBoost-Cox RF models
  - 25 Monte Carlo cross-validation splits with parallel execution
  - Best model selected per variant by C-index (primary), then AU-PRC (tiebreaker)
  - Temporal 80/20 split for final model training: train on earlier years, test on later years (see the temporal-split sketch after this list)
  - Excludes non-modifiable features (e.g., `lscntry`, `prim_dx`)
  - Idempotent Training: Skips retraining if all outputs already exist
- SHAP + FFA Analysis (`run_shap_ffa_workflow.py`):
  - Test Set Application: All causal analysis is performed on the test set (unseen data)
    - Rules are extracted from the trained model (fit on the training set)
    - Rules are applied to test set instances to count actual rule firings
    - SHAP values are computed on the test set only (`txpl_year > cutoff_year`)
    - Rule frequencies are counted from test set rule firings (not from rule definitions)
    - The temporal split cutoff matches training (dynamic 80/20 split, falling back to 2021)
  - SHAP (SHapley Additive exPlanations) for feature importance
  - FFA (Formal Feature Attribution) for causal analysis
  - Extracts the top K causal factors with importance and responsibility scores
  - Generates dashboard data with feature metadata
  - Causal Responsibility Formula: `(rule_frequency_from_test_set / total_rule_firings) × SHAP_importance` (see the responsibility sketch after this list)
- Deployment:
  - Frontend: S3 static website hosting
  - Backend: Lambda function packaged as a Docker container (ECR)
  - API Gateway: REST API with CORS enabled
  - Models: Baked into the Docker image for fast loading
- Key Features:
  - Real-time Risk Prediction: Instant risk scores with percentile normalization
  - Dual Model Comparison: Side-by-side comparison of baseline vs extended model predictions
  - Causal Analysis: Interactive factor adjustment with real-time risk updates
  - Test Set Validation: Causal factors validated on unseen test data for realistic assessment
  - Feature Metadata: Automatic detection of binary vs numeric features
  - Risk Bands: Low/Medium/High/Very High risk classification
  - Multiple Cohorts: CHD, Combined, MyoCardio with cohort-specific models
- Outputs (`graft-loss/cohort_analysis/calculator/outputs/`):
  - `models/` - Trained models (CatBoost, XGBoost, XGBoost RF)
  - `shap_ffa/` - SHAP values, FFA rules, causal factors, dashboard data
  - `risk_distributions/` - Risk score distributions for normalization
- Documentation:
  - See `graft-loss/cohort_analysis/calculator/README.md` for an overview
  - `risk_dashboard/README_MODELS.md` - Model performance and risk calculation
  - `risk_dashboard/README_CAUSAL_ANALYSIS.md` - Causal analysis workflow
  - `risk_dashboard/README_DEPLOYMENT.md` - AWS deployment guide
  - `risk_dashboard/README_ARCHITECTURE.md` - System architecture
  - `README_SHAP_FFA.md` - SHAP and FFA integration details
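Three short sketches follow for the cross-references above. First, the percentile-based 0-100 risk score normalization: the reference distribution stands in for the files under `risk_distributions/`, and the exact scheme in `phts_lambda_function.py` may differ.

```python
import numpy as np

def normalized_risk(raw_score: float, reference_scores) -> float:
    """Map a raw model output to 0-100 as its percentile within a reference
    risk distribution (illustrative, not the exact Lambda logic)."""
    reference = np.sort(np.asarray(reference_scores, dtype=float))
    rank = np.searchsorted(reference, raw_score, side="right")
    return 100.0 * rank / len(reference)
```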
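Second, the temporal 80/20 split used for final model training, assuming a pandas frame with a `txpl_year` column; the cutoff logic in `train_python_models.py` may differ in detail.

```python
import numpy as np
import pandas as pd

def temporal_split(df: pd.DataFrame, year_col: str = "txpl_year",
                   train_frac: float = 0.8):
    """Train on earlier transplant years, hold out later years for testing."""
    cutoff_year = int(np.quantile(df[year_col], train_frac))
    train = df[df[year_col] <= cutoff_year].copy()
    test = df[df[year_col] > cutoff_year].copy()
    return train, test, cutoff_year
```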
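Third, the causal responsibility formula, expressed over per-feature rule-firing counts from the test set and SHAP importances (input structures are illustrative):

```python
def causal_responsibility(rule_firings: dict, shap_importance: dict) -> dict:
    """responsibility = (rule_frequency_from_test_set / total_rule_firings) * SHAP_importance"""
    total_firings = sum(rule_firings.values())
    if total_firings == 0:
        return {feature: 0.0 for feature in rule_firings}
    return {
        feature: (count / total_firings) * shap_importance.get(feature, 0.0)
        for feature, count in rule_firings.items()
    }
```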
Robust C-index calculation with a manual implementation:
- Time-Dependent C-index: Matches `riskRegression::Score()` behavior for direct comparison with the original study
- Time-Independent C-index: Standard Harrell's C-index for general discrimination assessment
- Documentation: Comprehensive README explaining the methodology, known issues, and validation
- Test Files: Extensive testing of `riskRegression::Score()` format requirements
- `scripts/R/`: Helper functions and utilities
- `scripts/py/`: Python scripts for specialized analyses (e.g., FFA)
- `scripts/bash/`: Bash scripts for automation
- Data Source: `graft-loss/data/phts_txpl_ml.sas7bdat` (matches the original study)
- Censoring Implementation: The pipeline reproduces the original study's censoring handling (see the sketch after this list):
  - Sets event times of 0 to 1/365 (prevents invalid zero times for survival analysis)
  - Properly maintains censored observations (status = 0) throughout the analysis
  - Ensures a survival structure consistent with the original Wisotzkey study
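A minimal pandas sketch of this censoring handling, with a generic `time` column name standing in for the actual variable in `phts_txpl_ml.sas7bdat`:

```python
import pandas as pd

def apply_censoring_rules(df: pd.DataFrame, time_col: str = "time") -> pd.DataFrame:
    """Replace zero event times with 1/365; censored rows (status = 0) are kept as-is."""
    out = df.copy()
    out.loc[out[time_col] == 0, time_col] = 1.0 / 365.0
    return out
```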
Data Coverage: 2010-2024 (TXPL_YEAR)
Filtering Options:
- `EXCLUDE_COVID=1`: Excludes 2020-2023 (approximate COVID period)
- `ORIGINAL_STUDY=1`: Restricts to 2010-2019 (original study period)

Variable Processing (applied before modeling; see the sketch after this list):
- CPBYPASS: Removed (not available in all time periods, high missingness)
- DONISCH: Dichotomized (>4 hours = 1, ≤4 hours = 0) for consistency
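A sketch of the period filters and variable recoding above, assuming the flags are read from environment variables and the data sit in a pandas frame with the named columns (the production code lives in the R scripts):

```python
import os
import pandas as pd

def filter_and_recode(df: pd.DataFrame) -> pd.DataFrame:
    """Apply ORIGINAL_STUDY / EXCLUDE_COVID filters, drop CPBYPASS, dichotomize DONISCH."""
    out = df.copy()
    if os.environ.get("ORIGINAL_STUDY") == "1":
        out = out[out["TXPL_YEAR"].between(2010, 2019)].copy()
    elif os.environ.get("EXCLUDE_COVID") == "1":
        out = out[~out["TXPL_YEAR"].between(2020, 2023)].copy()
    out = out.drop(columns=["CPBYPASS"], errors="ignore")   # high missingness across periods
    out["DONISCH"] = (out["DONISCH"] > 4).astype(int)        # >4 hours = 1, <=4 hours = 0
    return out
```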
- RSF Permutation Importance: Matches original Wisotzkey study methodology
- CatBoost Feature Importance: Captures non-linear relationships
- AORSF Feature Importance: Matches original study's final model approach
- Top 20 Selection: Selects top 20 features per method per period
Survival Models:
- RSF: Random Survival Forest with permutation importance
- AORSF: Accelerated Oblique Random Survival Forest (matches original study)
- CatBoost-Cox: Gradient boosting with Cox loss
- XGBoost-Cox: Gradient boosting with Cox loss (boosting and RF modes)
Classification Models:
- CatBoost: Gradient boosting classification
- CatBoost RF: CatBoost configured as Random Forest
- Traditional RF: Classic Random Forest classification
- XGBoost: Gradient boosting classification
- XGBoost RF: XGBoost configured as Random Forest
Evaluation Metrics:
- Time-Dependent C-index: At 1-year horizon (matches original study)
- Time-Independent C-index: Harrell's C-index (general discrimination)
- Classification Metrics: AUC, Brier Score, Accuracy, Precision, Recall, F1
- Calibration: Gronnesby-Borgan test (survival)
- Feature Importance: Multiple methods (permutation, negate, gain-based)
Our feature selection workflow matches the original repository (bcjaeger/graft-loss):
- Feature Selection from ALL Variables: Uses all available variables (not pre-filtered to Wisotzkey variables)
- Recipe Preprocessing: Applies `make_recipe()` → `prep()` → `juice()` with median/mode imputation
- Top 20 Selection: Selects the top 20 features using permutation importance (RSF) or feature importance (CatBoost, AORSF)
- Wisotzkey Identification: After selecting the top 20, identifies which of those features are Wisotzkey variables (15 core variables from the original study; see the sketch after this list)
This workflow ensures:
- Unbiased feature selection: Not constrained to pre-defined variable set
- Reproducibility: Matches original study methodology exactly
- Transparency: Clear identification of Wisotzkey overlap in selected features
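The Wisotzkey-overlap step reduces to a set-membership check after selection; a minimal sketch (the actual 15-variable list lives in the notebook and is not reproduced here):

```python
def flag_wisotzkey_overlap(top_features, wisotzkey_vars):
    """Return (feature, is_wisotzkey) pairs for the selected top-20 features."""
    core = set(wisotzkey_vars)
    return [(feature, feature in core) for feature in top_features]
```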
Key Implementation Details:
- Data Source: Uses `phts_txpl_ml.sas7bdat` (matches the original study) with proper censoring implementation
- Censoring Handling: Event times of 0 are set to 1/365 to prevent invalid survival times
- Excludes outcome/leakage variables (`int_dead`, `int_death`, `graft_loss`, `txgloss`, `death`, `event`)
- Uses `dummy_code = FALSE` for recipe preprocessing (preserves categorical structure)
- Applies the same RSF parameters as the original: `num.trees = 500`, `importance = 'permutation'`, `splitrule = 'extratrees'`
- Method: Random Survival Forest with permutation importance
- Parameters: `num.trees = 500`, `importance = 'permutation'`, `splitrule = 'extratrees'`, `num.random.splits = 10`, `min.node.size = 20`
- Use: Matches the original Wisotzkey study methodology and repository implementation
- Output: Top 20 features ranked by permutation importance
- Method: CatBoost gradient boosting with signed-time labels (see the sketch after these method summaries)
- Parameters: `iterations = 2000`, `depth = 6`, `learning_rate = 0.05`
- Use: Captures non-linear relationships and interactions
- Output: Top 20 features ranked by gain-based importance
- Method: Accelerated Oblique Random Survival Forest (negate method)
- Parameters: `n_tree = 100`, `na_action = 'impute_meanmode'`
- Use: Matches the original study's final model approach
- Output: Top 20 features ranked by negate importance
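The signed-time label encoding referenced for CatBoost is sketched below: event times stay positive and censored times are negated, the convention used by Cox-style boosting objectives. Whether this matches the notebook's exact construction is an assumption.

```python
import numpy as np

def signed_time_labels(time, status):
    """Positive time = observed event, negative time = censored (sketch)."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    return np.where(status == 1, time, -time)
```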
The pipeline calculates both time-dependent and time-independent C-indexes for comprehensive evaluation:
- Method: Matches `riskRegression::Score()` behavior
- Evaluation: At a specific time horizon (default: 1 year)
- Logic: Compares patients with events before horizon vs patients at risk at horizon
- Use: Direct comparison with original study (~0.74)
- Method: Standard Harrell's C-index formula
- Evaluation: Uses all comparable pairs regardless of time
- Logic: Pairwise comparisons where one patient has event and another has later time
- Use: General measure of discrimination across entire follow-up
- Primary: Attempts `riskRegression::Score()` for the time-dependent C-index (matching the original study)
- Fallback: Manual calculation if `Score()` fails
- Always Calculates: Time-independent C-index using a manual Harrell's C (see the sketch after this list)
- Consistency: All three methods (RSF, CatBoost, AORSF) use same approach
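A plain-Python sketch of the two manual calculations; the repository's R implementation (and `riskRegression::Score()`, which also handles censoring weighting) may differ in detail.

```python
import numpy as np

def harrell_c(time, status, risk):
    """Time-independent Harrell's C over all comparable pairs (ties in risk get 0.5)."""
    time, status, risk = (np.asarray(a, dtype=float) for a in (time, status, risk))
    concordant = comparable = 0.0
    for i in np.where(status == 1)[0]:   # anchor each pair on an observed event
        later = time > time[i]           # patients with strictly longer observed times
        comparable += later.sum()
        concordant += (risk[i] > risk[later]).sum() + 0.5 * (risk[i] == risk[later]).sum()
    return concordant / comparable if comparable else float("nan")

def time_dependent_c(time, status, risk, horizon=1.0):
    """Time-dependent C at a horizon: events before the horizon vs patients still at risk."""
    time, status, risk = (np.asarray(a, dtype=float) for a in (time, status, risk))
    cases = np.where((status == 1) & (time <= horizon))[0]
    at_risk = time > horizon
    concordant = comparable = 0.0
    for i in cases:
        comparable += at_risk.sum()
        concordant += (risk[i] > risk[at_risk]).sum() + 0.5 * (risk[i] == risk[at_risk]).sum()
    return concordant / comparable if comparable else float("nan")
```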
See concordance_index/concordance_index_README.md for detailed documentation.
The pipeline supports analysis across multiple time periods:
- Original study period (2010-2019): Set `ORIGINAL_STUDY=1`; matches the original Wisotzkey et al. (2023) publication and is used for direct replication and comparison
- Full period (2010-2024): The default, using all available data for maximum sample size and contemporary analysis
- COVID-excluded period (2010-2024, excluding 2020-2023): Set `EXCLUDE_COVID=1`; sensitivity analysis excluding COVID-affected years
- Navigate to `graft-loss/feature_importance/`
- Open `graft_loss_feature_importance_20_MC_CV.ipynb`
- Set `DEBUG_MODE <- FALSE` for the full analysis
- Set `n_mc_splits <- 100` for development or `1000` for publication
- Run the notebook from top to bottom
- Results are saved to the `outputs/` directory
- Navigate to `graft-loss/cohort_analysis/`
- Open `graft_loss_clinical_cohort_analysis.ipynb`
- Set `ANALYSIS_MODE <- "survival"` or `"classification"`
- Set `DEBUG_MODE <- FALSE` for the full analysis
- Run the notebook from top to bottom
- Results are saved to the `outputs/` directory
- Train Models: Navigate to `graft-loss/cohort_analysis/calculator/` and use the Jupyter workflow:
  - Open `calculator_workflow.ipynb`
  - Baseline Model: Train with the base calculator features only (uses parallel processing)
  - Extended Model: Train with base + recommended features (uses parallel processing)
  - Both models use 25 MC-CV splits with parallel execution for faster training
  - Training is idempotent - it skips if outputs already exist

  Or use the command line:

  ```bash
  # Baseline model (base features only)
  python train_python_models.py --cohort Combined_base --n-jobs <num_cpus>

  # Extended model (base + recommended features)
  python train_python_models.py --cohort Combined_enhanced --n-jobs <num_cpus> --include-recommended-features
  ```

- Run SHAP/FFA Analysis: Generate causal factors and dashboard data (applies rules to the test set):

  ```bash
  # Baseline model
  python run_shap_ffa_workflow.py --cohort Combined_base --top-k 20

  # Extended model
  python run_shap_ffa_workflow.py --cohort Combined_enhanced --top-k 20
  ```

  Note: SHAP/FFA analysis automatically:
  - Computes SHAP values on the test set only (unseen data)
  - Applies rules extracted from the trained model to test set instances
  - Counts rule firings from the test set (not from rule definitions)
  - Ensures the temporal split cutoff matches training

- Deploy Dashboard:
  - Prepare the Lambda directory: `python risk_dashboard/prepare_lambda_dir_phts.py`
  - Build the Docker image: `bash risk_dashboard/docker_build_phts.sh`
  - Upload the HTML to S3: `aws s3 cp risk_dashboard/phts_dashboard.html s3://jerome-dixon.io/uva/phts-risk-calculator/index.html`
  - Update Lambda: `aws lambda update-function-code --function-name phts-risk-calculator --image-uri <ECR_URI>`

- Access Dashboard: Visit `https://jerome-dixon.io/uva/phts-risk-calculator/`

See `graft-loss/cohort_analysis/calculator/risk_dashboard/README_DEPLOYMENT.md` for detailed deployment instructions.
Feature importance outputs (`graft-loss/feature_importance/outputs/`):
- `plots/` - Feature importance visualizations
- `cindex_table.csv` - C-index table with confidence intervals
- `top_20_features_*.csv` - Top 20 features per method and period

Cohort analysis outputs (`graft-loss/cohort_analysis/outputs/`):
- Survival Mode:
  - `cohort_model_cindex_mc_cv_modifiable_clinical.csv` - C-index summary per cohort × model
  - `best_clinical_features_by_cohort_mc_cv.csv` - Top modifiable clinical features per cohort
  - `plots/cohort_clinical_feature_sankey.html` - Sankey diagram of cohort → clinical features
- Classification Mode:
  - `classification_mc_cv/cohort_classification_metrics_mc_cv.csv` - Classification metrics (AUC, Brier, Accuracy, Precision, Recall, F1) per cohort × model

Calculator outputs (`graft-loss/cohort_analysis/calculator/outputs/`):
- `models/` - Trained models (`CatBoost.cbm`, `XGBoost.ubj`, JSON models)
- `shap_ffa/` - SHAP values, FFA rules, causal factors, dashboard data (`dashboard_data.json`, `top_causal_factors.csv`)
- `risk_distributions/` - Risk score distributions for normalization (`risk_distributions.json`)
- `lambda_dir_phts/` - Prepared Lambda deployment directory (models, dashboard data, feature metadata)
- Centralized Documentation: All detailed documentation in the `docs/` folder
- Workflow-Specific: Root READMEs in each workflow directory
- Shared Documentation: Common topics in `docs/shared/`
- Standards: Scripts standards in `docs/scripts/`
- Dual Implementation: Both time-dependent and time-independent C-indexes
- Reliable Fallback: Manual calculation when `riskRegression::Score()` fails
- Comprehensive Documentation: See `concordance_index/concordance_index_README.md`
- RSF: Permutation importance (original study method)
- CatBoost: Gain-based importance
- AORSF: Negate importance (original study's final model)
- Multiple Algorithms: RSF, AORSF, CatBoost, XGBoost, Cox PH (survival); CatBoost, CatBoost RF, Traditional RF, XGBoost, XGBoost RF (classification)
- Multiple Time Periods: Original study, full period, COVID-excluded
- Multiple Metrics: Time-dependent and time-independent C-indexes; AUC, Brier, Accuracy, Precision, Recall, F1
- Monte Carlo Cross-Validation: Robust evaluation with many train/test splits
- Stratified Sampling: Maintains event distribution across splits
- Parallel Processing (see the sketch after this list):
  - R workflows: Fast execution with furrr/future
  - Python calculator: Parallel MC-CV training (uses all CPUs minus 1)
  - Both baseline and enhanced models use parallel processing
- 95% Confidence Intervals: Narrow, precise estimates
- Test Set Validation: Causal analysis validated on unseen test data
- Survival Analysis: Time-to-event analysis with survival models
- Event Classification: Binary classification at 1 year
- Easy Mode Switching: Single configuration flag
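For the Python calculator, the parallel MC-CV referenced in the list above can be expressed with `joblib`; `run_one_split` is a placeholder for fitting and scoring a single split.

```python
import os
from joblib import Parallel, delayed

def run_mc_cv(run_one_split, n_splits: int = 25):
    """Run MC-CV splits in parallel on all CPUs minus one (sketch)."""
    n_jobs = max(1, (os.cpu_count() or 2) - 1)
    return Parallel(n_jobs=n_jobs)(delayed(run_one_split)(i) for i in range(n_splits))
```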
- Main Documentation Index: `docs/README.md`
- Workflow READMEs:
  - `graft-loss/feature_importance/README.md`
  - `graft-loss/cohort_analysis/README.md`
- Scripts Documentation: `scripts/README.md`
- Shared Documentation: `docs/shared/` (validation, leakage, variable mapping)
- Standards: `docs/scripts/README_standards.md` (logging, outputs, script organization)
- Wisotzkey et al. (2023). Risk factors for 1-year allograft loss in pediatric heart transplant. Pediatric Transplantation.
- Original Repository: bcjaeger/graft-loss
For questions or issues, please refer to the documentation in each component directory or review the inline code comments.
Note: The pipeline is modular; each notebook can be run independently. For detailed usage, refer to the README files in each workflow directory and the detailed documentation in docs/.