ARDaC ETL (Extract Transform Load) workflows for DCC data release version 2.0.0. Currently, the workflows transform observational and clinical data stored in CSV files into ARDaC node files in TSV format.
The Python workflow is implemented and tested with Python 3.13.2. The Nextflow workflow scripts use Anaconda Python virtual environments; Anaconda is chosen over Python's built-in virtual environment manager because Nextflow has strong built-in support for Anaconda environments.
If Anaconda is not already installed on your system, the minimal Anaconda installer, Miniconda, can be installed easily using the following installation instructions.
The Anaconda base installation should be available in your shell's search path after installation. Open a new terminal window so that you are using the updated environment.
To show your current search path:

```shell
echo $PATH
```

The path should contain the installation location of Anaconda. If not, you should seek help installing Anaconda from your system administrator.
The next task is to create the virtual environment using Anaconda.
- If the `base` environment is not activated, run `conda activate base`. If you are in another virtual environment, deactivate it first.
- Change to the Python directory: `cd /path/to/ardac-etl/python`
- Run `conda env create -p ../venv -f conda.yml` to create a virtual environment at `/path/to/ardac-etl/venv` with Python and packages installed according to `conda.yml`.
The Python virtual environment created at `/path/to/ardac-etl/venv` should be used for development of the
ardac-etl project and with the Nextflow workflow. If you are using VSCode as your IDE, set the
following configuration fields in the VSCode settings accordingly:
Python: Venv Path = ./venv
Python: Conda Path = /path/to/your/conda/bin/conda
You may also need to set Python: Locator = js if the IDE cycles indefinitely on Reactivating Terminals.
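As a sketch, the same fields can be set directly in a workspace `settings.json`; the key names below are taken from the VSCode Python extension's settings and should be checked against your extension version:

```json
{
    "python.venvPath": "./venv",
    "python.condaPath": "/path/to/your/conda/bin/conda",
    "python.locator": "js"
}
```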
Follow the instructions here to install Nextflow
The mapper workflow is responsible for transforming the observational and clinical data stored in CSV format into ARDaC node files that conform to the ARDaC common data model (CDM).
The ARDaC mapper workflow is implemented using Python scripts that generate the individual node files, and a Nextflow script that manages the execution of those scripts and any dependencies needed by each Python process. The workflow can be configured to read its input from, and write its output to, a file system approved for sensitive clinical data. It can also be configured to store any intermediate data generated by the workflow on the same approved file system. Logs for each process in the workflow are likewise created on the approved file system, to reduce the chance of spilling sensitive information to an unapproved file system.
The workflow configuration is set by the `nextflow.config` file; a template is provided in `nextflow.config.template`. In this configuration file you specify the location of the input observational and clinical data and the output directory that will contain the ARDaC CDM nodes and process log files.
The input directory, given by the parameter `params.input_directory` (which must be a full path), must contain three subdirectories:

- one given by the parameter `params.obs_input_directory`, containing the observational data
- one given by `params.rct_input_directory`, containing the clinical data
- one given by the parameter `params.node_templates_directory`, containing the ARDaC node template TSV file

The output directory, given by the parameter `params.output_directory` (which must also be a full path), must contain a node subdirectory given by `params.ardac_nodes_directory`, where the ARDaC node files generated by the workflow will be saved. Detailed comments are provided for each parameter in the configuration file.
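A minimal sketch of how these parameters might look in `nextflow.config` — the parameter names come from the text above, but the paths and subdirectory names are illustrative assumptions only:

```groovy
params {
    // Full path to the input data root (assumption: layout shown is illustrative)
    input_directory          = '/path/to/input'
    obs_input_directory      = 'observational'
    rct_input_directory      = 'clinical'
    node_templates_directory = 'node_templates'

    // Full path to the output root and the node subdirectory within it
    output_directory         = '/path/to/output'
    ardac_nodes_directory    = 'ardac_nodes'
}
```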
The Nextflow workflow that generates the ARDaC nodes from the observational and clinical data sets is launched by the `run_observational_workflow.bash` and `run_clinical_workflow.bash` scripts in the `nextflow` subdirectory. These scripts can be run directly from that subdirectory. A hidden log file, `.nextflow.log`, will be generated describing the run and any problems that may have occurred.
The workflow can be executed in HPC batch processing environments under the control of a job scheduler like SLURM. The workflow can run on a workstation or laptop, a single node in an HPC cluster, or on multiple nodes in an HPC cluster. The jobs may be interactive jobs used for debugging and development, or they may be queued to run the workflow in production. If you have access to an HPC resource, you can use that system for development and production execution of the workflow.
The Git project should be cloned and the appropriate branch checked out for testing before execution. The Nextflow configuration file will need to be updated so that the input and output data locations can be specified. The following subsections show the steps for running in different modes on HPC systems governed by the SLURM scheduler.
To execute the workflow as an interactive job:
- Change to the `nextflow` directory.
- Create an interactive job and log into the interactive node:

  ```shell
  srun -A xxxxxx -p debug -N 1 --time=01:00:00 --pty bash
  ```

  where `--time` indicates how long the node will be reserved for use, and `xxxxxx` is the account the run time will be charged to.
- Execute one of the workflow scripts, `run_clinical_workflow.bash` or `run_observational_workflow.bash`.
The workflow script should be executed through a SLURM batch submission script that specifies how the job is to be run. This is an ordinary shell script, with added directives telling SLURM how to select resources for the job. Below is an example batch submission script:
```shell
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -p <queue_name>
#SBATCH -o %x_%j.txt
#SBATCH -e %x_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=3
#SBATCH --time=00:15:00
#SBATCH --mem=8G
#SBATCH -A <project_account>

conda activate ../venv
./run_clinical_workflow.bash
```

The directives to SLURM are given as bash shell comments:

- `-J` specifies the name assigned to the job.
- `-p` specifies the name of the queue the job will be submitted to.
- `-o` and `-e` name the files to which the job's standard output and standard error will be redirected.
- The mail directives specify which job statuses trigger an email notification and the e-mail account the notifications are sent to.
- `--nodes` specifies that a single node will be used for executing the job.
- `--ntasks-per-node` specifies how many instances of Nextflow will be executed on each node.
- `--cpus-per-task` specifies how many cores each Nextflow instance can utilize.
- `--time` sets the maximum amount of time the job can run before it is terminated by the scheduler.
- `--mem` specifies the maximum amount of RAM that can be used at any time by the workflow; this includes the RAM consumed by the Nextflow Java Virtual Machine and any Python processes.
- `-A` specifies the project account the total node-hours will be charged to.
To submit the job, execute the command:
```shell
sbatch <submission_script>
```

It is important to note that the total cores allocated per node will be the number of SLURM tasks per node (given by `--ntasks-per-node`) multiplied by the number of cores made available to each SLURM task (given by `--cpus-per-task`). For single-node execution of the workflow, it makes sense to run only a single Nextflow instance on the node and provide as many cores as the workflow can utilize. Currently, each Python process utilizes only a single core, and given the current structure of the workflow, only three of the Python processes can run concurrently, so it makes sense to provide no more than three cores to the single Nextflow instance.
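The allocation arithmetic can be checked with a short shell calculation, using the values from the example submission script above:

```shell
# Total cores per node = (SLURM tasks per node) x (cores per task)
ntasks_per_node=1
cpus_per_task=3
total_cores=$((ntasks_per_node * cpus_per_task))
echo "$total_cores"   # prints 3
```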
The multinode workflow deployment is launched similarly to the single-node deployment; however, there are some important differences in how the workflow executes in this mode. The batch script starts a Nextflow instance on a single node, which then launches additional single-node jobs that execute each workflow task. The multinode workflow can be deployed with the following submission script:
```shell
#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -p <queue_name>
#SBATCH -o %x_%j.txt
#SBATCH -e %x_%j.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email_address>
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:15:00
#SBATCH --mem=8G
#SBATCH -A <project_account>

nextflow -C nextflow.config run ardac_etl_workflow.nf -profile conda,hpc_cluster -with-trace -with-timeline ardac_etl_timeline.html --subjects_type clinical
```

To submit the job, execute the command:
```shell
sbatch <submission_script>
```

Note that only one core is needed for the Nextflow instance to manage the launching of the batch jobs that execute the workflow tasks.
Performance profiling is provided in the form of traces and timelines. Traces provide process-level resource utilization and runtime performance in tab-separated text format. Timelines are HTML graphical representations of when processes are executed.
The following is a description of the trace option from the Nextflow documentation:
Nextflow creates an execution tracing file that contains some useful information about each process executed in your pipeline script, including: submission time, start time, completion time, cpu and memory used.
In order to create the execution trace file add the -with-trace command line option when launching the pipeline execution. For example:
```shell
nextflow run <pipeline> -with-trace
```

These options are enabled by the ARDaC shell scripts for running the workflows. The trace file created will have a name of the form `trace-*-*.txt`.
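As a sketch, individual columns can be pulled out of a trace file with `awk`, assuming the default tab-separated layout; the file contents below are a fabricated stand-in, and real trace files have more columns (check the header line of your own trace file):

```shell
# Create a tiny stand-in trace file for illustration only.
printf 'task_id\tname\trealtime\n1\tclinical_subjects\t2m 10s\n' > trace-demo.txt

# Print the process name and wall-clock time for each task, skipping the header.
awk -F'\t' 'NR > 1 { print $2 "\t" $3 }' trace-demo.txt
```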
An execution timeline can be produced showing when a process executes and for how long. The ARDaC shell scripts for running the workflow are configured to produce a timeline file named ardac_etl_timeline.html. The following is taken from the Nextflow documentation and describes the timeline chart:
Each bar represents a process run in the pipeline execution. The bar length represents the task duration time (wall-time). The colored area in each bar represents the real execution time. The grey area to the left of the colored area represents the task scheduling wait time. The grey area to the right of the colored area represents the task termination time (clean-up and file un-staging). The numbers on the x-axis represent the time in absolute units e.g. minutes, hours, etc.
Each bar displays two numbers: the task duration time and the virtual memory size peak.
As each process can spawn many tasks, colors are used to identify those tasks belonging to the same process. To enable the creation of the execution timeline add the -with-timeline command line option when launching the pipeline execution. For example:
```shell
nextflow run <pipeline> -with-timeline [file name]
```

The Python scripts perform the mapping from the observational and clinical Data Coordinating Center (DCC) format to the ARDaC CDM node format. The DCC format currently supported by the mappers is set in the `_constants.py` file in the `python/ardac` script directory under the parameter `__dcc_data_release__`. The parameter `__mapping_version__` sets the version of the mapping software which implements the mapping for the current DCC release. If the mapping for a particular DCC data release is to be updated, then the mapping version should be increased. If support for a new DCC data release is to be implemented, then the DCC release version should be increased and the mapping version reset to 1.0.0.
The Nextflow workflow script is built for a particular DCC data release. The DCC data release version expected by the Nextflow workflow is set in the nextflow.config file under the params.dcc_release configuration parameter.
The main branch is expected to contain the latest tested mapping implementation for production use. Git tags should be used to indicate which main-branch version corresponds to a particular DCC release and mapping version. Development for a new mapping version should occur in the develop branch. New features should be developed in branches off develop and merged back into develop as they are completed. The develop branch should also be used for integrating multiple new features by merging the feature branches into develop. Once the develop branch is fully tested, the version parameters can be updated accordingly and the branch merged into main and tagged for production release. This process is detailed in the development.md file in the docs subdirectory.
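The release flow above might look like the following command sequence. The demo runs in a throwaway repository so it is self-contained; in practice you would run the branch, merge, and tag commands in the ardac-etl clone, and the tag name here is illustrative only:

```shell
set -e
# Set up a throwaway repository for demonstration purposes.
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git config user.email dev@example.com
git config user.name "ARDaC Dev"
git commit -q --allow-empty -m "initial"

# Feature work happens on branches off develop and merges back into develop.
git checkout -q -b develop
git commit -q --allow-empty -m "feature: new node mapper merged"

# Once develop is fully tested, merge into main and tag the production release.
git checkout -q main
git merge -q develop
git tag dcc-2.0.0-mapping-1.0.0   # tag naming convention is an assumption
git tag --list
```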