DAB Website: https://ctn-0094.github.io/Pipeline/
- Purpose: Establish a modular, scalable data pipeline for statistical modeling and machine learning on the CTN-0094 database. The pipeline will support multiple modeling strategies and evaluation metrics to deliver insights into the relationship between demographics and the different target outcomes.
- Team: this work is led by Prof. Laura Brandt (clinical arm) and Prof. Gabriel Odom (computational arm); Mr. Ganesh Jainarain is the primary data scientist and statistical programmer.
- Funding: this work contributes to the project "Towards a framework for algorithmic bias and fairness in predicting treatment outcome for opioid use disorder" (NIH AIM-AHEAD 1OT2OD032581-02-267), with contact PI Laura Brandt, City College of New York.
Run python3 run_pipelineV2.py --help for the list of arguments. The predictions, logs, and evaluations folders are created in the directory specified by -d. To run multiple tests at once, specify a minimum and a maximum seed to loop through with the -l flag. With -l, all outcomes are used by default; to use only a subset of the outcomes, list them after the -o flag.
- Example:
python3 run_pipelineV2.py -d "C:\Users\John\Desktop\Results" -l 5 10
This command loops through all integer seeds between 5 and 10 and saves the results folders in the specified path on the desktop.
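To make the flag behavior concrete, here is a minimal sketch of how a command line like the one above could be parsed. Only the flags -d, -l, and -o come from this README; the function names, the `dest` names, and the assumption that both seed endpoints are inclusive are hypothetical, and the actual parser in run_pipelineV2.py may differ.

```python
import argparse


def build_parser():
    # Hypothetical stand-in for the script's interface; only the flag
    # names (-d, -l, -o) are taken from the README.
    parser = argparse.ArgumentParser(description="Sketch of the pipeline CLI")
    parser.add_argument("-d", dest="directory", required=True,
                        help="directory that receives the results folders")
    parser.add_argument("-l", dest="loop", nargs=2, type=int,
                        metavar=("MIN_SEED", "MAX_SEED"),
                        help="minimum and maximum seed to loop through")
    parser.add_argument("-o", dest="outcomes", nargs="*", default=None,
                        help="subset of outcomes; omitted means all outcomes")
    return parser


def seeds_from_args(args):
    # Assumption: both endpoints are included in the loop.
    if args.loop is None:
        return []
    low, high = args.loop
    return list(range(low, high + 1))


args = build_parser().parse_args(["-d", "Results", "-l", "5", "10"])
print(seeds_from_args(args))
```

In this sketch, leaving out -o keeps `args.outcomes` as `None`, which a caller could interpret as "use every outcome".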
- Task: Keep the "master" dataset that was created by joining tables from the CTN-0094 database.
- Note: This dataset remains unaltered throughout the pipeline.
- Task: Build a dataset of 1000 samples with a specified demographic distribution.
- Methods:
- Random sampling
- Partial matching
- Sophisticated matching (to be implemented)
- Task: Perform standard feature engineering and data preparation.
- Note: This pre-processing script will likely remain consistent.
- Task: Join a chosen target variable (dependent variable) from the list of 11 in the Latin Square setup with the processed independent variables.
- Selection: Choose from 11 target variables identified by Laura and the team.
- Task: Select a machine learning model suited for the target variable and use it to predict target values.
- Task: Evaluate the machine learning model using various metrics:
- AUC
- F1 score
- RMSE
- Fairness (currently undefined)
- Output: Return a tuple containing the demographic makeup, target variable, machine learning model, and metrics.
- Task: Repeat Steps 1-5 over various design points to explore the tuple space.
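Two of the steps above, drawing a sample with a specified demographic distribution and scoring predictions, can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's code: the function names, the record layout (a list of dicts with a `group` field), and the toy cohort are all hypothetical, and only simple random sampling within each group is shown (partial matching is not).

```python
import math
import random
from collections import defaultdict


def sample_by_distribution(records, group_key, distribution, n, seed):
    """Randomly draw about n records so each group's share matches `distribution`."""
    rng = random.Random(seed)  # seeded so a design point can be reproduced
    by_group = defaultdict(list)
    for record in records:
        by_group[record[group_key]].append(record)

    sample = []
    for group, share in distribution.items():
        k = round(n * share)  # rounding may shift the total slightly from n
        pool = by_group.get(group, [])
        if k > len(pool):
            raise ValueError(f"not enough records in group {group!r}")
        sample.extend(rng.sample(pool, k))
    return sample


def f1_score(y_true, y_pred):
    """F1 for binary labels coded as 0/1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error for numeric outcomes."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Toy cohort: 100 records in group "A" and 100 in group "B" (made-up data).
cohort = [{"id": i, "group": "A" if i % 2 else "B"} for i in range(200)]
subset = sample_by_distribution(cohort, "group", {"A": 0.7, "B": 0.3},
                                n=100, seed=5)
```

Passing the seed into the sampler, rather than relying on global random state, is one way to keep each (demographic makeup, target, model, metrics) tuple reproducible across seeds.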
- Binary
- Count (with a fixed max)
- Proportion
The statistical models will be implemented according to the following priority:
- Logistic LASSO (Binary Outcomes)
- Port existing code into the pipeline.
- Save the .job file trained on the full cohort for later.
- Negative Binomial Regression (Count Outcomes)
- Sigmoidal Regression (Proportion Outcomes)
- Beta Regression (Proportion Outcomes)
The immediate objective is to build a proof-of-concept pipeline with logistic LASSO, then extend it to incorporate random forests. Other models can be implemented as required in the future.
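One way to keep the pipeline modular as models are added is a small registry that maps each outcome type to its candidate models in priority order. The sketch below is only an illustration: the identifiers `MODEL_PRIORITY`, `models_for`, and the model keys are hypothetical names, not part of the existing code.

```python
# Priority-ordered list of (model key, outcome type), following the
# implementation order described above. All names are hypothetical.
MODEL_PRIORITY = [
    ("logistic_lasso", "binary"),
    ("negative_binomial", "count"),
    ("sigmoidal", "proportion"),
    ("beta", "proportion"),
]


def models_for(outcome_type):
    """Return the model keys for an outcome type, highest priority first."""
    return [name for name, kind in MODEL_PRIORITY if kind == outcome_type]


print(models_for("proportion"))
```

Under this layout, adding a later model such as a random forest would mean appending one entry to the list rather than changing the lookup function.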
Luo SX, Feaster DJ, Liu Y et al. Individual‑Level Risk Prediction of Return to Use During Opioid Use Disorder Treatment. JAMA Psychiatry. 2024;81(1):45–56. doi:10.1001/jamapsychiatry.2023.3596
https://jamanetwork.com/journals/jamapsychiatry/fullarticle/2810311
Multicenter decision‑analytic prediction model using CTN trial data.