ARDaC ETL (Extract Transform Load) workflows for DCC data release version 2.0.0. Currently, the workflows transform observational and clinical data stored in CSV files into ARDaC node files in TSV format.
The Python workflow is implemented and tested with Python 3.13.2. The Nextflow workflow scripts use Anaconda Python virtual environments; Anaconda is chosen over Python's built-in virtual environment manager because Nextflow has strong built-in support for Anaconda environments.
If Anaconda is not already installed on your system, the minimal Anaconda installer, Miniconda, can be installed easily using the following installation instructions.
The Anaconda base installation should be available in your shell's search path after installation. Open a new terminal window so that you are using the updated environment.
To show your current search path:

```shell
echo $PATH
```

The path should contain the installation location of Anaconda. If not, you should seek help installing Anaconda from your system administrator.
The next task is to create the virtual environment using Anaconda.
- If the `base` environment is not activated, run `conda activate base`. If you are in another virtual environment, deactivate it first.
- Change to the Python directory: `cd /path/to/ardac-etl/python`
- Run `conda env create -p ../venv -f conda.yml` to create a virtual environment at `/path/to/ardac-etl/venv` with Python and packages installed according to `conda.yml`.
The Python virtual environment created at `/path/to/ardac-etl/venv` should be used for development of the
ardac-etl project and with the Nextflow workflow. If you are using VSCode as your IDE, set the
following configuration fields in the VSCode settings accordingly:
Python: Venv Path = ./venv
Python: Conda Path = /path/to/your/conda/bin/conda
You may also need to set Python: Locator = js if the IDE cycles indefinitely on Reactivating Terminals.
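As a sketch, the same fields can be set directly in a workspace `settings.json`; the key names below are taken from the VSCode Python extension's settings and should be checked against your extension version:

```json
{
    "python.venvPath": "./venv",
    "python.condaPath": "/path/to/your/conda/bin/conda",
    "python.locator": "js"
}
```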
Follow the instructions here to install Nextflow
The mapper workflow is responsible for transforming the observational and clinical data stored in CSV format into ARDaC node files that conform to the ARDaC common data model (CDM).
The ARDaC mapper workflow is implemented using Python scripts that generate the individual node files, and a Nextflow script that manages the execution of those scripts and any dependencies needed by each Python process. The workflow can be configured to read its input from, and write its output to, a file system approved for sensitive clinical data. It can also be configured to store any intermediate data generated by the workflow on the same approved file system. Logs for each process in the workflow are likewise created on the approved file system, to reduce the chance of spilling sensitive information to an unapproved file system.
The workflow configuration is set by the `nextflow.config` file; a template is provided in `nextflow.config.template`. In this configuration file you specify the location of the input observational and clinical data and the output directory that will contain the ARDaC CDM nodes and process log files.
The input directory, given by the parameter `params.input_directory` (which must be a full path), must contain three subdirectories:

- one given by the parameter `params.obs_input_directory`, containing the observational data
- one given by `params.rct_input_directory`, containing the clinical data
- one given by the parameter `params.node_templates_directory`, containing the ARDaC node template TSV file

The output directory, given by the parameter `params.output_directory` (which must also be a full path), must contain a node subdirectory given by `params.ardac_nodes_directory`, where the ARDaC node files generated by the workflow will be saved. Detailed comments are provided for each parameter in the configuration file.
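A minimal sketch of how these parameters might look in `nextflow.config` — the parameter names come from the text above, but the paths and subdirectory names are illustrative assumptions only:

```groovy
params {
    // Full path to the input data root (assumption: layout shown is illustrative)
    input_directory          = '/path/to/input'
    obs_input_directory      = 'observational'
    rct_input_directory      = 'clinical'
    node_templates_directory = 'node_templates'

    // Full path to the output root and the node subdirectory within it
    output_directory         = '/path/to/output'
    ardac_nodes_directory    = 'ardac_nodes'
}
```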
The Nextflow workflow that generates the ARDaC nodes from the observational and clinical data sets is launched by the `run_observational_workflow.bash` and `run_clinical_workflow.bash` scripts in the `nextflow` subdirectory. These scripts can be run directly from that subdirectory. A hidden log file, `.nextflow.log`, will be generated describing the run and any problems that may have occurred.
The workflow can be executed in HPC batch processing environments under the control of a job scheduler like SLURM. The workflow can run on a workstation or laptop, a single node in an HPC cluster, or on multiple nodes in an HPC cluster. The jobs may be interactive jobs used for debugging and development, or they may be queued to run the workflow in production. If you have access to an HPC resource, you can use that system for development and production execution of the workflow.
The Git project should be cloned and the appropriate branch checked out for testing before execution. The Nextflow configuration file will need to be updated so that the input and output data locations can be specified. The following subsections show the steps for running in different modes on HPC systems governed by the SLURM scheduler.
To execute the workflow as an interactive job:
- Change to the `nextflow` directory.
- Create an interactive job and log into the interactive node:

  ```shell
  srun -A xxxxxx -p debug -N 1 --time=01:00:00 --pty bash
  ```

  where `--time` indicates how long the node will be reserved for use, and `xxxxxx` is the account the run time will be charged to.
- Execute one of the workflow scripts, `run_clinical_workflow.bash` or `run_observational_workflow.bash`.
The workflow script should be executed through a SLURM batch submission script that specifies how the job is to be run. This is an ordinary shell script, with added directives telling SLURM how to select resources for the job. Below is an example batch submission script:
```shell
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -p <queue_name>
#SBATCH -o %x_%j.txt
#SBATCH -e %x_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:15:00
#SBATCH --mem=8G
#SBATCH -A <project_account>

conda activate ../venv
./run_clinical_workflow.bash
```

The directives to SLURM are given as bash shell comments:

- `-J` specifies the name assigned to the job.
- `-p` specifies the name of the queue the job will be submitted to.
- `-o` and `-e` name the files to which the job's standard output and standard error will be redirected.
- The mail directives specify which job statuses trigger an email notification and the e-mail account the notifications are sent to.
- `--nodes` specifies that a single node will be used for executing the job.
- `--ntasks-per-node` specifies how many instances of Nextflow will be executed on each node.
- `--cpus-per-task` specifies how many cores each Nextflow instance can utilize.
- `--time` sets the maximum amount of time the job can run before it is terminated by the scheduler.
- `--mem` specifies the maximum amount of RAM that can be used at any time by the workflow; this includes the RAM consumed by the Nextflow Java Virtual Machine and any Python processes.
- `-A` specifies the project account the total node-hours will be charged to.
To submit the job, execute the command:
```shell
sbatch <submission_script>
```

It is important to note that the total cores allocated per node will be the number of SLURM tasks per node (given by `--ntasks-per-node`) multiplied by the number of cores made available to each SLURM task (given by `--cpus-per-task`). For single-node execution of the workflow, it makes sense to run only a single Nextflow instance on the node and provide as many cores as the workflow can utilize. Currently, each Python process utilizes only a single core, and given the current structure of the workflow, only three of the Python processes can run concurrently, so it makes sense to provide no more than three cores to the single Nextflow instance.
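The allocation arithmetic can be checked with a short shell calculation, using the values from the example submission script above:

```shell
# Total cores per node = (SLURM tasks per node) x (cores per task)
ntasks_per_node=1
cpus_per_task=3
total_cores=$((ntasks_per_node * cpus_per_task))
echo "$total_cores"   # prints 3
```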
The multinode workflow deployment is launched similarly to the single-node deployment; however, there are some important differences in how the workflow executes in this mode. The batch script starts a Nextflow instance on a single node, which then launches additional single-node jobs that execute each workflow task. The multinode workflow can be deployed with the following submission script:
```shell
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -p <queue_name>
#SBATCH -o %x_%j.txt
#SBATCH -e %x_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:15:00
#SBATCH --mem=8G
#SBATCH -A <project_account>

nextflow -C nextflow.config run ardac_etl_workflow.nf -profile conda,hpc_cluster -with-trace -with-timeline ardac_etl_timeline.html --subjects_type clinical
```

To submit the job, execute the command:
```shell
sbatch <submission_script>
```

Note that only one core is needed for the Nextflow instance to manage the launching of the batch jobs that execute the workflow tasks.
Performance profiling is provided in the form of traces and timelines. Traces provide process-level resource utilization and runtime performance in tab-separated text format. Timelines are HTML graphical representations of when processes are executed.
The following is a description of the trace option from the Nextflow documentation:
Nextflow creates an execution tracing file that contains some useful information about each process executed in your pipeline script, including: submission time, start time, completion time, cpu and memory used.
In order to create the execution trace file add the -with-trace command line option when launching the pipeline execution. For example:
```shell
nextflow run <pipeline> -with-trace
```

These options are enabled by the ARDaC shell scripts for running the workflows. The trace file created will have a name of the form `trace-*-*.txt`.
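As a sketch, individual columns can be pulled out of a trace file with `awk`, assuming the default tab-separated layout; the file contents below are a fabricated stand-in, and real trace files have more columns (check the header line of your own trace file):

```shell
# Create a tiny stand-in trace file for illustration only.
printf 'task_id\tname\trealtime\n1\tclinical_subjects\t2m 10s\n' > trace-demo.txt

# Print the process name and wall-clock time for each task, skipping the header.
awk -F'\t' 'NR > 1 { print $2 "\t" $3 }' trace-demo.txt
```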
An execution timeline can be produced showing when a process executes and for how long. The ARDaC shell scripts for running the workflow are configured to produce a timeline file named ardac_etl_timeline.html. The following is taken from the Nextflow documentation and describes the timeline chart:
Each bar represents a process run in the pipeline execution. The bar length represents the task duration time (wall-time). The colored area in each bar represents the real execution time. The grey area to the left of the colored area represents the task scheduling wait time. The grey area to the right of the colored area represents the task termination time (clean-up and file un-staging). The numbers on the x-axis represent the time in absolute units e.g. minutes, hours, etc.
Each bar displays two numbers: the task duration time and the virtual memory size peak.
As each process can spawn many tasks, colors are used to identify those tasks belonging to the same process. To enable the creation of the execution timeline add the -with-timeline command line option when launching the pipeline execution. For example:
```shell
nextflow run <pipeline> -with-timeline [file name]
```

The Python scripts perform the mapping from the observational and clinical Data Coordinating Center (DCC) format to the ARDaC CDM node format. The DCC format currently supported by the mappers is set in the `_constants.py` file in the `python/ardac` script directory under the parameter `__dcc_data_release__`. The parameter `__mapping_version__` sets the version of the mapping software which implements the mapping for the current DCC release. If the mapping for a particular DCC data release is to be updated, then the mapping version should be increased. If support for a new DCC data release is to be implemented, then the DCC release version should be increased and the mapping version reset to 1.0.0.
The Nextflow workflow script is built for a particular DCC data release. The DCC data release version expected by the Nextflow workflow is set in the nextflow.config file under the params.dcc_release configuration parameter.
The main branch is expected to contain the latest tested mapping implementation for production use. Git tags should be used to indicate which main-branch version corresponds to a particular DCC release and mapping version. Development for a new mapping version should occur in the develop branch. New features should be developed in branches off develop and merged back into develop as they are completed. The develop branch should also be used for integrating multiple new features by merging the feature branches into develop. Once the develop branch is fully tested, the version parameters can be updated accordingly and the branch merged into main and tagged for production release. This process is detailed in the development.md file in the docs subdirectory.
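The release flow above might look like the following command sequence. The demo runs in a throwaway repository so it is self-contained; in practice you would run the branch, merge, and tag commands in the ardac-etl clone, and the tag name here is illustrative only:

```shell
set -e
# Set up a throwaway repository for demonstration purposes.
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git config user.email dev@example.com
git config user.name "ARDaC Dev"
git commit -q --allow-empty -m "initial"

# Feature work happens on branches off develop and merges back into develop.
git checkout -q -b develop
git commit -q --allow-empty -m "feature: new node mapper merged"

# Once develop is fully tested, merge into main and tag the production release.
git checkout -q main
git merge -q develop
git tag dcc-2.0.0-mapping-1.0.0   # tag naming convention is an assumption
git tag --list
```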