End-to-end workflow for feature discovery, noise reduction, and causal-oriented modeling using drug exposures, ICD/CPT codes, and classification outcomes.
This project is organized into four main sections:
- Overview - Project structure, components, and high-level workflow
- Data Pipeline - Data processing, cohort creation, and data flow
- Analysis Workflow - Feature importance, pattern mining, and final model development
- Data Visualizations - Visualization approaches, interpretation, and network analysis
```bash
# Install dependencies
pip install -r requirements.txt

# Configure AWS credentials for S3 access
aws configure
```

0_config_and_pipeline.ipynb lets you clear EC2 NVMe and project pipeline output directories for a fresh run, and contains step-by-step instructions for running the pipeline (prerequisites, notebook order, cohorts). Use it to reset local/NVMe data and project outputs; S3 checkpoints are not cleared by default (see the notebook for an optional full reset).
The workflow notebooks are the primary way to run the pipeline. They sync required inputs from S3 to local storage and use S3 checkpoints so steps are skipped when already completed. Run in order: 1_cohort_workflow.ipynb → 2_feature_importance.ipynb → 3_model_train_shap_ffa.ipynb → 4_dashboard_visuals.ipynb → 5_build_and_deploy.ipynb.
Legacy shell scripts and the former combined notebooks (3, 4) are in archived/; use the five notebooks above.
Both cohorts use the full set of age bands: 0-12, 13-24, 25-44, 45-54, 55-64, 65-74, 75-84, 85-114 (last band 85-114 combines former 85-94 and 95-114).
- opioid_ed: Opioid ED cohort (F11.20 target) → all age bands above
- non_opioid_ed: Polypharmacy cohort (HCG ED target) → all age bands above
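The age-band assignment described above can be sketched with `pandas.cut`. This is an illustrative helper, not code from the repo; the column and function names are assumptions.

```python
import pandas as pd

# Age-band edges from the README: 0-12, 13-24, 25-44, 45-54,
# 55-64, 65-74, 75-84, 85-114 (85-114 merges the former 85-94 and 95-114).
AGE_BINS = [0, 13, 25, 45, 55, 65, 75, 85, 115]
AGE_LABELS = ["0-12", "13-24", "25-44", "45-54", "55-64", "65-74", "75-84", "85-114"]

def assign_age_band(ages: pd.Series) -> pd.Series:
    """Map integer ages onto the cohort age bands (left-closed bins)."""
    return pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS, right=False)

ages = pd.Series([5, 13, 44, 85, 100])
print(assign_age_band(ages).tolist())  # ['0-12', '13-24', '25-44', '85-114', '85-114']
```

With `right=False`, each band includes its lower edge (age 13 falls in 13-24, not 0-12), which matches the band labels.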
- Step 1a: APCD input data (1a_apcd_input_data) → bronze → silver → gold.
- Step 1b: Event filter (1b_apcd_event_filter) → Aggregated FI + ICD/administrative code filtering.
- Step 2: Cohort creation (2_create_cohort) → 5:1 target:control cohorts.
- Step 3a: Feature importance (3a_feature_importance) → MC-CV aggregated importances (CatBoost, XGBoost, XGBoost RF).
- Step 3b: Feature Importance EDA (3b_feature_importance_eda) → BupaR post-target analysis, code research; outputs refined cohort_feature_importance.csv.
- Step 3c: Final update to features (2_feature_importance.ipynb) → Strip remaining BupaR-identified leakage from cohort_feature_importance.csv; these CSVs are the only input to Step 4.
- Step 4: Model data (4_model_data) → model_events.parquet from refined features; removes target leakage (events on/after target date) for case events.
- Step 5: PGx feature engineering (5_pgx_analysis).
- Step 6: Final model (6_final_model) → training and selection (Recall / AUC-PR).
- Step 7: SHAP analysis (7_shap_analysis).
- Step 8: FFA analysis (8_ffa_analysis) → XGBoost only, SHAP-prioritized rules.
- Step 9: Risk dashboard (10_risk_dashboard) → deployment (Lambda, dashboard).
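The Step 4 leakage rule (drop case events on/after the target date) could look like the sketch below. The actual logic lives in 4_model_data; the column names (`member_id`, `event_date`, `target_date`) are illustrative assumptions.

```python
import pandas as pd

def remove_target_leakage(events: pd.DataFrame, targets: pd.DataFrame) -> pd.DataFrame:
    """Drop case events on/after the member's target date.

    Controls have no target date, so all of their events are kept.
    """
    merged = events.merge(targets, on="member_id", how="left")
    # Keep events with no target date (controls) or strictly before it (cases).
    keep = merged["target_date"].isna() | (merged["event_date"] < merged["target_date"])
    return merged.loc[keep, events.columns]

events = pd.DataFrame({
    "member_id": [1, 1, 2],
    "event_date": pd.to_datetime(["2023-01-01", "2023-06-01", "2023-03-01"]),
})
targets = pd.DataFrame({
    "member_id": [1],
    "target_date": pd.to_datetime(["2023-05-01"]),
})
# Member 1's 2023-06-01 event is on/after the target date and is removed;
# member 2 is a control and keeps all events.
print(remove_target_leakage(events, targets))
```

Filtering before feature construction ensures no post-outcome information reaches the model.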
The scripts are idempotent and will skip completed steps automatically.
The pipeline is split into five workflow notebooks. Run in order:
| Notebook | Purpose |
|---|---|
| 1_cohort_workflow.ipynb | Steps 1-2: Cohorts (APCD input, event filter, cohort creation). |
| 2_feature_importance.ipynb | Steps 3a-3c: Feature importance (3a MC-CV), EDA (3b BupaR), final feature update (3c). |
| 3_model_train_shap_ffa.ipynb | Model data → PGx → final model → SHAP/FFA → combine. No deploy. |
| 4_dashboard_visuals.ipynb | Dashboard visuals: BupaR, DTW, FP-Growth (SHAP/FFA-driven). |
| 5_build_and_deploy.ipynb | Build and deploy: Lambda dir → Docker → ECR → Lambda → S3 frontend. Run once. |
ICD filtering moved earlier: Administrative/ICD code filtering runs in 1b_apcd_event_filter (before cohort creation). That reduces downstream data volume and ensures feature importance (Step 3a/3b) is computed on the same filtered event set, capturing true predictive features. After moving ICD filtering earlier, feature importances must be rerun once cohorts are rebuilt.
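A minimal sketch of what the Step 1b administrative-code filtering might look like. The real lists and logic live in 1b_apcd_event_filter/filter_protocol_events.py; the prefixes and column name below are hypothetical.

```python
import pandas as pd

# Illustrative administrative/low-signal ICD prefixes to drop; the real
# code lists live in 1b_apcd_event_filter/filter_protocol_events.py.
ADMIN_CODE_PREFIXES = {"Z00", "Z02", "Z76"}

def filter_admin_events(events: pd.DataFrame) -> pd.DataFrame:
    """Remove events whose ICD code starts with an administrative prefix."""
    is_admin = events["icd_code"].str[:3].isin(ADMIN_CODE_PREFIXES)
    return events[~is_admin]

events = pd.DataFrame({"icd_code": ["Z00.00", "F11.20", "Z76.89", "I10"]})
print(filter_admin_events(events)["icd_code"].tolist())  # ['F11.20', 'I10']
```

Running this before cohort creation means every downstream step, including MC-CV importances, sees the same filtered event set.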
```
pgx-analysis/
├── 1a_apcd_input_data/          # Step 1a: APCD data preprocessing (bronze → silver → gold)
├── 1b_apcd_event_filter/        # Step 1b: Event filtering (ICD/administrative codes; runs before cohorts)
├── 2_create_cohort/             # Step 2: Cohort creation and QA (5:1 target:control)
├── 3a_feature_importance/       # Step 3a: MC-CV feature importance (aggregated importances)
├── 3b_feature_importance_eda/   # Step 3b: Feature refinement (BupaR post-target, code research)
├── 4_model_data/                # Step 4: Model-ready event datasets (cases + controls)
├── 5_pgx_analysis/              # Step 5: PGx feature engineering
├── 6_final_model/               # Step 6: Final model training and selection
├── 7_shap_analysis/             # Step 7: SHAP post-model analysis (CatBoost + XGBoost)
├── 8_ffa_analysis/              # Step 8: Formal Feature Attribution (uses SHAP to prioritize rules)
├── 10_risk_dashboard/           # Step 9: Risk dashboard deployment (Lambda, dashboard UI)
├── 0_config_and_pipeline.ipynb  # Config: clear NVMe/project dirs, pipeline run instructions
├── 1_cohort_workflow.ipynb      # Workflow notebook: Steps 1-2 (cohorts)
├── 2_feature_importance.ipynb   # Workflow notebook: Steps 3a-3c (feature importance + final feature update)
├── 3_model_train_shap_ffa.ipynb # Workflow: model data, PGx, final model, SHAP/FFA
├── 4_dashboard_visuals.ipynb    # Workflow: BupaR, DTW, FP-Growth visuals
├── 5_build_and_deploy.ipynb     # Workflow: build and deploy (Lambda, S3)
├── archived/                    # Legacy notebooks (3, 4) and scripts (see archived/README.md if present)
│   ├── 3_pgx_calculator_workflow.ipynb
│   ├── 4_pgx_dashboard_visuals.ipynb
│   ├── utility_scripts/         # Old workflow shell scripts
│   ├── qa/                      # Check/validate/clear/diagnose scripts
│   └── testing/                 # Test scripts
├── py_helpers/                  # Shared Python helper utilities
├── r_helpers/                   # Shared R helper utilities
└── docs/                        # Documentation
```
Execution model: Each workflow notebook syncs required inputs from S3 to NVMe (or local) via aws s3 sync (idempotent) and uses S3 checkpoints so steps are skipped when already completed. Run order: 1 → 2 → 3 → 4 → 5.
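The checkpoint-and-sync pattern described above can be sketched as follows. This is a hypothetical illustration, not the repo's actual helper code; in the real pipeline the checkpoint markers live in S3, while an in-memory set stands in here.

```python
import subprocess

def sync_inputs(s3_prefix: str, local_dir: str) -> None:
    """Idempotent input pull: `aws s3 sync` only copies changed objects."""
    subprocess.run(["aws", "s3", "sync", s3_prefix, local_dir], check=True)

def run_step(step: str, fn, checkpoint_exists, write_checkpoint) -> str:
    """Run fn() unless a checkpoint marker for `step` already exists."""
    if checkpoint_exists(step):
        return f"{step}: checkpoint found, skipping"
    fn()
    write_checkpoint(step)
    return f"{step}: completed"

# An in-memory set stands in for the S3 checkpoint markers.
done = set()
print(run_step("1a_apcd_input_data", lambda: None, done.__contains__, done.add))
print(run_step("1a_apcd_input_data", lambda: None, done.__contains__, done.add))
# First call completes the step; the rerun is skipped.
```

Because both the sync and the checkpoint test are idempotent, a crashed or interrupted run can simply be restarted from the top.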
```mermaid
flowchart TD
    subgraph W1["1_cohort_workflow.ipynb (Steps 1-2)"]
        A1[1a: APCD Input Data] --> A2[Data Cleaning]
        A2 --> A1b[1b: Event Filter ICD/Admin]
        A1b --> A3[2: Cohort Creation]
        A3 --> A4[Quality Assurance]
    end
    subgraph W2["2_feature_importance.ipynb (Steps 3a-3c)"]
        A4 --> B1[3a: Monte Carlo CV]
        B1 --> B2[Aggregated Feature Importance]
        B2 --> B3[Top Features Selection]
        B3 --> B4[3b: BupaR Post-Target + Code Research]
        B4 --> B5[3c: Final update to features]
        B5 --> B6[Refined cohort_feature_importance.csv]
    end
    subgraph W3["3_model_train_shap_ffa.ipynb"]
        B6 --> C1[4: Model Data]
        C1 --> D1[5: PGx]
        D1 --> E1[6: Final Model]
        E1 --> E4[7: SHAP]
        E4 --> F2[8: FFA]
        F2 --> F1[Combine SHAP/FFA]
    end
    subgraph W4["4_dashboard_visuals.ipynb"]
        F1 --> G0[BupaR, DTW, FP-Growth]
    end
    subgraph W5["5_build_and_deploy.ipynb"]
        G0 --> G1[9: Risk Dashboard]
        G1 --> G5[Deploy: S3 + Lambda + API Gateway]
    end
    style A1 fill:#f9f,stroke:#333
    style A1b fill:#e9c,stroke:#333
    style B2 fill:#bbf,stroke:#333
    style C1 fill:#bfb,stroke:#333
    style E4 fill:#fbb,stroke:#333
    style G1 fill:#ffb,stroke:#333
```
- Feature Screening with a focused model ensemble (CatBoost, XGBoost boosted trees, XGBoost RF mode) + Monte Carlo cross-validation
- Feature Refinement (Feature Importance EDA) using BupaR post-target analysis; Step 4 removes target leakage when building model data
- Event filtering (Step 1b) → Aggregated FI + ICD/administrative code filtering (1b_apcd_event_filter)
- Structure Discovery via FP-Growth, process mining (BupaR), and dynamic time warping (DTW) for dashboard visualizations only (Step 9 - not used as model features)
- Final Model Development combining refined feature importances (from Feature Importance EDA) with PGx features for prediction and causal inference
- Model Selection based on Recall (primary) and AUC-PR (secondary) metrics, selecting best model from CatBoost, XGBoost, or XGBoost RF
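The recall-first, AUC-PR-tiebreak selection rule can be sketched with scikit-learn metrics. The function and data layout below are illustrative assumptions, not the repo's actual selection code; a fixed 0.5 threshold is assumed for recall.

```python
from sklearn.metrics import average_precision_score, recall_score

def select_best_model(results):
    """Pick the model with the highest recall, breaking ties on AUC-PR.

    `results` maps model name -> (y_true, y_prob); recall uses a 0.5 cutoff.
    """
    def score(item):
        name, (y_true, y_prob) = item
        y_pred = [int(p >= 0.5) for p in y_prob]
        # Tuple comparison: recall first (primary), AUC-PR second.
        return (recall_score(y_true, y_pred), average_precision_score(y_true, y_prob))
    return max(results.items(), key=score)[0]

results = {
    "catboost": ([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.1]),  # recall 0.5 at 0.5 cutoff
    "xgboost":  ([1, 1, 0, 0], [0.9, 0.8, 0.6, 0.1]),  # recall 1.0 at 0.5 cutoff
}
print(select_best_model(results))  # xgboost
```

Recall is prioritized because, for ED-risk prediction, a missed high-risk member is costlier than a false alarm; AUC-PR then favors the model that ranks cases higher overall.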
- Console output (cross-platform): Avoid non-ASCII characters (for example, Unicode arrows like →) in Python/R scripts that may run on Windows consoles. Use plain ASCII (e.g., ->) in print()/logging messages to prevent encoding errors under cp1252 and similar code pages.
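A small helper in that spirit, shown as a sketch (the function name and replacement table are assumptions, not project code):

```python
def ascii_safe(msg: str) -> str:
    """Replace common non-ASCII symbols, then strip the rest, so log
    messages survive Windows cp1252 consoles."""
    replacements = {"\u2192": "->", "\u2014": "--", "\u2013": "-"}
    for bad, good in replacements.items():
        msg = msg.replace(bad, good)
    return msg.encode("ascii", errors="replace").decode("ascii")

print(ascii_safe("bronze \u2192 silver \u2192 gold"))  # bronze -> silver -> gold
```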
Because full Monte Carlo CV + permutation importance is computationally intensive, the project focuses the heaviest, publication-grade feature-importance analysis on two clinically motivated cohort groups:

- Cohort Group 1 - Opioid ED (opioid_ed)
  - Age bands: <65 (0-12, 13-24, 25-44, 45-54, 55-64).
  - Feature space: drugs + ICD codes + CPT codes + event type.
  - Use case: detailed feature discovery for opioid-related ED visits and opioid use disorder.
- Cohort Group 2 - Polypharmacy ED (non_opioid_ed)
  - Age bands: ≥65 (65-74, 75-84, 85-114).
  - Feature space for MC-CV feature importance: drugs only (polypharmacy focus), with downstream pattern mining and trajectory methods layering on additional structure.

Other cohort/age-band combinations can be explored with lighter configurations, but publication-grade, health-outcomes-oriented modeling is anchored on these two groups.
- 3a_feature_importance/README.md - Feature importance methodology and cohort configuration
- 4_model_data/README_model_data.md - Model-ready events and target vs control extraction
- Event filtering: 1b_apcd_event_filter/filter_protocol_events.py - Aggregated FI + ICD/administrative codes
- 6_final_model/README.md - Final model training and selection
- 5_pgx_analysis/README.md - Pharmacogenomics (PGx) feature engineering
- status/WORKFLOW_STATUS.md - Per-cohort workflow execution status and checkpoints
- status/WORKFLOW_COMPLETE_SUMMARY.md - High-level summary of workflow completion across cohorts and age bands