Running the RCDML Code

Once the Python environment is set up, you can run the RCDML pipeline. The pipeline supports multiple run configurations; select one by setting the parameters that best fit your experiment in the configuration file.

Configuration File

Here are the configuration file options and a description of what they do:

run_name: Name of the directory where the results will be stored
result_path: Path to where the results will be stored
dataset_path: Path where the drug response and RNA-seq count data is stored
drug_name: Name of the drug whose response the model will classify
project: If set to BeatAML, the pipeline uses the drug response and RNA-seq count data for that project. Otherwise, it uses the dataset and labels available in dataset_path
normalization: For the BeatAML project, either cpm or rpkm must be provided to choose the normalization applied to the RNA-seq data.
feature_selection: Feature selection techniques. Options available: [shap, pca, dge, random, swap, none]
- shap: SHAP Value Feature Selection Technique
- pca: Principal Component Analysis
- dge: Differential Gene Expression Analysis (requires the path to the dge files, set with the dge_path option)
- random: Randomly selects features
- none: No feature selection
- swap: Swapping feature selection experiment (requires the feature list paths and the swapped drug name)
Multiple techniques can be passed at once by separating them with commas (e.g. shap,pca,random).
feature_size: Number of features/genes that will be selected by the feature selection technique
classifiers: Classifiers. Options available: [rf, gdb]
- rf: Random Forest
- gdb: Gradient Boosting (XGBOOST)
Multiple classifiers can be passed at once by separating them with commas (e.g. rf,gdb).
validation: Validation modes. Options available: [cv, loo, bootstrap]
- cv: Cross-Validation. Breaks the dataset into folds based on the number of validation_iterations
- loo: Leave-One-Out. Iterates through the dataset, always leaving one sample for testing.
- bootstrap: Bootstrapping. Breaks the dataset into a training and test set based on the train_test_split value
validation_iterations: Number of iterations used by the validation mode. Ignored when the validation mode is loo
train_test_split: If the validation option is bootstrap, a number between 0 and 1 that determines the size of the train/test dataset split
debug: Debug verbosity. Options available: [0, 1, 2, 3, 4]
- 0: No debug mode.
- 1: Saves the input and output of the dataset splitting process.
- 2: Saves the input and output of the feature selection technique process.
- 3: Saves the input and output of the classifiers process.
- 4: Saves the input and output of the feature counter option.
Each debug level also performs the actions of all lower levels.
feature_counter: If set to 1, creates a feature counter that shows how many times a feature/gene was selected during the pipeline run
dge_path: If feature selection option is dge, this is the path to where the differential gene expression files are stored
drug_feature_path: If feature selection option is swap, this is the path to where the feature list for the drug model is stored
swapped_label: If feature selection option is swap, this is the name of the drug used to swap the feature list
swapped_path: If feature selection option is swap, this is the path to where the feature list for the swapped drug is stored
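
The three validation modes described above can be illustrated with a small, dependency-free sketch. This is not the pipeline's own code: the function names are hypothetical, the bootstrap mode is modelled as repeated random train/test splits (as the option description suggests), and train_test_split is assumed to be the fraction of samples used for training.

```python
import random


def cv_splits(n_samples, n_folds):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        # Every n_folds-th sample, starting at `fold`, forms the test fold.
        test = indices[fold::n_folds]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def loo_splits(n_samples):
    """Yield (train, test) index lists, holding out one sample each time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]


def bootstrap_splits(n_samples, n_iterations, train_fraction, seed=0):
    """Yield one random train/test split per iteration.

    Assumption: train_fraction is the share of samples used for training.
    """
    rng = random.Random(seed)
    n_train = int(round(n_samples * train_fraction))
    for _ in range(n_iterations):
        shuffled = rng.sample(range(n_samples), n_samples)
        yield shuffled[:n_train], shuffled[n_train:]
```

Note that loo_splits takes no iteration count, which mirrors why validation_iterations has no effect in loo mode: the number of splits is fixed by the number of samples.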

Preset configuration files are found in the setup/config/ directory.
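
Before launching a long run, it can help to sanity-check the option values. The sketch below shows one way comma-separated values and cumulative debug levels could be interpreted. It is an illustration only: parse_options, debug_stages, and the two lookup tables are hypothetical names, and the pipeline may validate its configuration differently.

```python
# Allowed values, copied from the option descriptions above.
ALLOWED = {
    "feature_selection": {"shap", "pca", "dge", "random", "swap", "none"},
    "classifiers": {"rf", "gdb"},
}

# What each debug level saves, in increasing order of verbosity.
DEBUG_STAGES = {
    1: "dataset splitting",
    2: "feature selection",
    3: "classifiers",
    4: "feature counter",
}


def parse_options(option, raw_value):
    """Split a comma-separated value and reject unknown entries."""
    values = [v.strip() for v in raw_value.split(",") if v.strip()]
    unknown = [v for v in values if v not in ALLOWED[option]]
    if unknown:
        raise ValueError(f"Unknown {option} value(s): {', '.join(unknown)}")
    return values


def debug_stages(level):
    """Return the stages saved at a debug level (levels are cumulative)."""
    if level not in range(0, 5):
        raise ValueError("debug must be an integer from 0 to 4")
    return [DEBUG_STAGES[i] for i in range(1, level + 1)]
```

For example, parse_options("feature_selection", "shap,pca") yields the two technique names as a list, while debug_stages(2) lists both dataset splitting and feature selection because each level includes the lower ones.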

Running the Pipeline

After the configuration file has the desired parameters, run the pipeline using the command: python main.py -f <configuration file name>

If no file is provided, the pipeline uses parameters.cfg by default.
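
The behaviour of the -f flag and its default can be pictured with a minimal argparse sketch. This is an assumption about how such a flag is commonly handled, not a copy of main.py; the parse_args function and the config_file attribute name are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; fall back to parameters.cfg when -f is absent."""
    parser = argparse.ArgumentParser(description="Run the pipeline (sketch).")
    parser.add_argument(
        "-f",
        dest="config_file",        # hypothetical attribute name
        default="parameters.cfg",  # default named in the text above
        help="configuration file name",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(f"Using configuration file: {args.config_file}")
```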
